The dataset I chose covers crime in San Francisco.

I filtered the dataset, which is large at 350 MB, down to only theft incidents in 2014 by running these commands on the file:

egrep "2014" SFPD_Incidents_-_from_1_January_2003.csv > SFPD_2014.csv
egrep "LARCENY/THEFT" SFPD_2014.csv > SFPD_2014_theft.csv
cat col_format.csv SFPD_2014_theft.csv > SFPD_2014_theft_col.csv
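
A substring match such as `egrep "2014"` keeps any line that contains those characters anywhere, which can include a 2013 record whose incident number happens to contain "2014". As a hedged alternative, here is a minimal Python sketch that filters on specific columns instead. The column names (`Date`, `Category`, and so on) and the `MM/DD/YYYY` date format are assumptions about the file layout, and the sample rows are made up for illustration.

```python
import csv
import io


def filter_incidents(rows, year, category=None):
    """Keep rows whose Date ends in the given year and, optionally,
    whose Category matches exactly."""
    suffix = "/" + str(year)  # assumes dates look like MM/DD/YYYY
    kept = []
    for row in rows:
        if not row["Date"].endswith(suffix):
            continue
        if category is not None and row["Category"] != category:
            continue
        kept.append(row)
    return kept


# Made-up sample in the assumed layout (not real records).
SAMPLE = """IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution
140120146,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Monday,02/10/2014,18:30,SOUTHERN,NONE
130120141,LARCENY/THEFT,PETTY THEFT FROM A BUILDING,Friday,02/08/2013,12:00,MISSION,NONE
140120147,ASSAULT,BATTERY,Sunday,02/09/2014,23:50,MISSION,"ARREST, BOOKED"
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
thefts_2014 = filter_incidents(rows, 2014, "LARCENY/THEFT")
print(len(thefts_2014), "theft row(s) from 2014")
```

Reading with `csv.DictReader` also handles quoted fields that contain commas, which a plain text split would break apart.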

Dashboards follow each hypothesis.


  1. My first hypothesis is that most thefts occur at night in residential neighborhoods. Given that many people have to park their cars outside at night, I assume that most crime occurs when people are asleep.

Looking into this data, I decided it would be more interesting to look at it day by day rather than just cumulatively. Evening is by far the most likely time for thefts to occur: weekdays frequently peak around 6PM to 8PM, while the weekends see a jump in thefts around midnight as well. Sunday was by far the calmest day - something I might want to look at later.

After completing that histogram of just days and times, I needed to dive deeper into the data to better understand how time relates to particular neighborhoods.
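
The day-by-day view can be approximated outside a dashboard by counting incidents per day of week and hour. This is a minimal sketch, assuming each row has a `DayOfWeek` column and a `Time` column in `HH:MM` form; both names and the sample rows are illustrative assumptions.

```python
from collections import Counter


def thefts_by_day_and_hour(rows):
    """Count incidents per (day of week, hour of day)."""
    counts = Counter()
    for row in rows:
        hour = int(row["Time"].split(":")[0])  # "18:30" -> 18
        counts[(row["DayOfWeek"], hour)] += 1
    return counts


# Made-up rows for illustration only.
rows = [
    {"DayOfWeek": "Monday", "Time": "18:30"},
    {"DayOfWeek": "Monday", "Time": "18:05"},
    {"DayOfWeek": "Saturday", "Time": "00:10"},
]
counts = thefts_by_day_and_hour(rows)
for (day, hour), n in sorted(counts.items()):
    print(day, hour, n)
```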

Then I created a new parameter that labels each police district as "Residential" or "Non-Residential".


  • Southern
  • Central
  • Bayview (kind of)


  • Mission
  • Northern
  • Park
  • Richmond
  • Ingleside
  • Taraval
  • Tenderloin

I then got the resulting box and whisker plot.
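
A classification like this can be expressed as a simple lookup. The sketch below assumes that the three-district group (Southern, Central, Bayview) is the non-residential one; the lists above are not labeled, so that reading is an assumption. The `PdDistrict` and `DayOfWeek` column names and the uppercase district values are also assumed.

```python
from collections import Counter

# Assumption: the shorter list is the non-residential group.
NON_RESIDENTIAL = {"SOUTHERN", "CENTRAL", "BAYVIEW"}
RESIDENTIAL = {
    "MISSION", "NORTHERN", "PARK", "RICHMOND",
    "INGLESIDE", "TARAVAL", "TENDERLOIN",
}


def classify(district):
    """Map a police district name to the unofficial class label."""
    name = district.strip().upper()
    if name in NON_RESIDENTIAL:
        return "Non-Residential"
    if name in RESIDENTIAL:
        return "Residential"
    return "Unknown"


def thefts_by_class_and_day(rows):
    """Count incidents per (class label, day of week)."""
    counts = Counter()
    for row in rows:
        counts[(classify(row["PdDistrict"]), row["DayOfWeek"])] += 1
    return counts


# Made-up rows for illustration only.
rows = [
    {"PdDistrict": "SOUTHERN", "DayOfWeek": "Friday"},
    {"PdDistrict": "MISSION", "DayOfWeek": "Friday"},
    {"PdDistrict": "RICHMOND", "DayOfWeek": "Friday"},
]
print(thefts_by_class_and_day(rows))
```

Because the residential group contains more districts, comparing raw totals between the two classes favors the larger group; per-district averages would be one way to adjust for that.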

With that in place, I was able to confirm my hypothesis: it seems that more crime occurs in residential neighborhoods than in non-residential neighborhoods on any given day of the week. The one important thing to keep in mind is that I used my personal knowledge of the districts (as well as maps from each police station) to classify each one as residential or non-residential, so this is an unofficial classification. As for the timing part of my hypothesis, it is not so clear that most crime occurs at "night" - dusk seems to be the time at which most crime occurs.

  2. Given my previous data, I was able to see that the Southern District has the most theft. My hypothesis is that it also has the most unresolved crime. I also hypothesize that more crimes are resolved in the richest neighborhoods than in poor neighborhoods. Specifically, I believe that the Richmond and Northern districts have the highest proportion of solved crimes per crime committed.

I created a new variable which is the number of thefts resolved. A case is unresolved if the Resolution field is NONE.
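
The resolved/unresolved rule can be sketched as a per-district percentage. This assumes `PdDistrict` and `Resolution` column names and treats any value other than `NONE` as resolved, as described above; the sample rows are made up.

```python
from collections import defaultdict


def percent_resolved(rows):
    """Return {district: percent of incidents whose Resolution is not NONE}."""
    totals = defaultdict(int)
    resolved = defaultdict(int)
    for row in rows:
        district = row["PdDistrict"]
        totals[district] += 1
        if row["Resolution"].strip().upper() != "NONE":
            resolved[district] += 1
    return {d: 100.0 * resolved[d] / totals[d] for d in totals}


# Made-up rows for illustration only.
rows = [
    {"PdDistrict": "TENDERLOIN", "Resolution": "ARREST, BOOKED"},
    {"PdDistrict": "TENDERLOIN", "Resolution": "NONE"},
    {"PdDistrict": "TENDERLOIN", "Resolution": "NONE"},
    {"PdDistrict": "TENDERLOIN", "Resolution": "NONE"},
    {"PdDistrict": "NORTHERN", "Resolution": "NONE"},
]
rates = percent_resolved(rows)
print(rates)
```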

This exploration yielded some interesting results, and definitely not what I expected. We can see from the percent resolved that nearly 25% of the crimes in the Tenderloin are resolved. I think this might be because it's such a small area that many people are caught doing relatively petty things and don't get very far from where they committed the crime. Interestingly, the Northern District had the smallest percentage resolved, yet also had a lot of crime. This makes me question my Residential/Non-Residential classifications from earlier, although they may still be valid. The Richmond seems to have a low crime resolution rate as well, which is not what I would have expected.

Looking at the resolution field gave me a lot of interesting ideas to explore - what makes a theft something the DA doesn't want to prosecute? Where is the highest number of juvenile bookings?

Unfortunately I realized that my dataset was a bit constrained when I just focused on thefts. So I loaded in all crimes in 2014 to see what else I might be able to dig up.

cat col_format.csv SFPD_2014.csv > SFPD_2014_col.csv

This gives me all data from 2014.

  3. I hypothesize that thefts follow a distribution similar to that of crime overall. By this I mean that thefts will be a representative sample of the population: all crimes will occur in roughly the same locations, at the same times, and in the same proportions.

Looking at the dashboard, this seems to be roughly true. The two distributions look more or less the same. The totals are obviously thrown off, but this brought me to a new visual: what are the most common crimes? (They are ranked on the right side of this dashboard.) It's not by any means a perfect match, but it was close enough visually that it's fair to say theft is a relatively accurate representation of where crime occurs in the city.

  4. Now that I had a better idea of how the data was made up, I thought it would be good to move to a map perspective. Looking at assaults seemed like a good next step, since they're the 4th most common category of crime in 2014. I hypothesize that thefts and assaults are related and occur in more or less the same locations. By that I mean that their ratio is the same everywhere: if assault makes up X% of total crime in a given district, theft will make up 2X% of total crime there.
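
Ranking the most common crime categories is a straightforward count. A minimal sketch, assuming a `Category` column; the sample counts are made up and do not reflect the real 2014 ranking.

```python
from collections import Counter


def rank_categories(rows):
    """Return (category, count) pairs, most common first."""
    return Counter(row["Category"] for row in rows).most_common()


# Made-up rows for illustration only.
rows = (
    [{"Category": "LARCENY/THEFT"}] * 5
    + [{"Category": "OTHER OFFENSES"}] * 3
    + [{"Category": "NON-CRIMINAL"}] * 2
    + [{"Category": "ASSAULT"}] * 1
)
ranking = rank_categories(rows)
for position, (category, n) in enumerate(ranking, start=1):
    print(position, category, n)
```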

I visualized this in two ways. The first was a map on which I plotted only thefts and assaults at locations that were reported more than 20 times, filtering out locations where incidents occurred only once or twice. This gave me a general idea of where thefts and assaults are likely to occur. We can immediately see that some hotspots are shared and some are different. Assaults clearly occur in the Mission much more than thefts. What's interesting about the map view is that we can see these differences, and where in the city they fall, much more quickly. I noticed immediately that there seems to be a fair amount of theft at one specific location in Golden Gate Park. Definitely worth thinking about and exploring further.

Looking at the table tells an even more interesting story. I calculated each category's percent of total crime in a given district, and we can see that in the Mission theft is not as common as assault. What does this mean? In SOMA there seems to be a lot more theft, possibly from tourists, while in the Mission it seems that more fights may be breaking out, leading to assaults instead of thefts.

Quite simply, I could reject my hypothesis: theft does not seem to occur at a consistent multiple of assault across the city's districts.
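
The "more than 20 reports" filter can be sketched as a count per location. This assumes an `Address` column identifies a location and that the threshold applies per location and category; both are assumptions, as are the sample rows.

```python
from collections import Counter


def frequent_locations(rows, categories=("LARCENY/THEFT", "ASSAULT"), min_reports=20):
    """Return {(address, category): count} for counts above min_reports."""
    counts = Counter(
        (row["Address"], row["Category"])
        for row in rows
        if row["Category"] in categories
    )
    return {key: n for key, n in counts.items() if n > min_reports}


# Made-up rows for illustration only.
rows = (
    [{"Address": "800 Block of MARKET ST", "Category": "LARCENY/THEFT"}] * 21
    + [{"Address": "100 Block of HAIGHT ST", "Category": "ASSAULT"}] * 2
    + [{"Address": "800 Block of MARKET ST", "Category": "VANDALISM"}] * 30
)
hotspots = frequent_locations(rows)
print(hotspots)
```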
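
The percent-of-total comparison can be sketched as follows. If the 2X hypothesis held, the theft-to-assault ratio would be close to 2 in every district. The column names (`PdDistrict`, `Category`) are assumptions and the sample rows are made up.

```python
from collections import Counter, defaultdict

THEFT = "LARCENY/THEFT"
ASSAULT = "ASSAULT"


def category_share_by_district(rows, categories=(THEFT, ASSAULT)):
    """Return {district: {category: percent of that district's total crime}}."""
    totals = Counter()
    by_category = defaultdict(Counter)
    for row in rows:
        district = row["PdDistrict"]
        totals[district] += 1
        if row["Category"] in categories:
            by_category[district][row["Category"]] += 1
    return {
        d: {c: 100.0 * by_category[d][c] / totals[d] for c in categories}
        for d in totals
    }


def theft_to_assault_ratio(shares):
    """Return {district: theft share / assault share}, or None if no assaults."""
    return {
        d: (s[THEFT] / s[ASSAULT] if s[ASSAULT] else None)
        for d, s in shares.items()
    }


def make(district, category, n):
    """Build n made-up rows for illustration."""
    return [{"PdDistrict": district, "Category": category}] * n


rows = (
    make("SOUTHERN", THEFT, 6) + make("SOUTHERN", ASSAULT, 2) + make("SOUTHERN", "VANDALISM", 2)
    + make("MISSION", THEFT, 2) + make("MISSION", ASSAULT, 4) + make("MISSION", "VANDALISM", 4)
)
shares = category_share_by_district(rows)
ratios = theft_to_assault_ratio(shares)
print(shares)
print(ratios)
```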

Download the workbook here.
