The dataset I chose covers crime in San Francisco.
The full file is large (about 350 MB), so I filtered it down to only theft incidents in 2014. I did this by running the following commands on that file:
egrep "2014" SFPD_Incidents_-_from_1_January_2003.csv > SFPD_2014.csv
egrep "LARCENY/THEFT" SFPD_2014.csv > SFPD_2014_theft.csv
cat col_format.csv SFPD_2014_theft.csv > SFPD_2014_theft_col.csv
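One caveat with a plain text match is that `egrep "2014"` keeps any line containing that substring, which may include rows from other years whose incident number happens to contain "2014". As a rough sketch of a stricter, column-based filter, the Python below checks the year and category fields directly. The column names (`IncidntNum`, `Category`, `Date`, `Resolution`), the `MM/DD/YYYY` date format, and the sample rows are all assumptions made for illustration, not taken from the actual file.

```python
import csv
import io


def filter_incidents(lines, year="2014", category=None):
    """Yield rows (as dicts) from the given year, optionally limited to one category."""
    for row in csv.DictReader(lines):
        # Assumed date format: MM/DD/YYYY, so the year is the suffix.
        if not row["Date"].endswith("/" + year):
            continue
        if category is not None and row["Category"] != category:
            continue
        yield row


# Made-up sample rows; the first has "2014" in its incident number but is from 2013.
sample = io.StringIO(
    "IncidntNum,Category,Date,Resolution\n"
    "120142014,ASSAULT,05/01/2013,NONE\n"
    "140000001,LARCENY/THEFT,01/15/2014,NONE\n"
    '140000002,ASSAULT,02/20/2014,"ARREST, BOOKED"\n'
)
thefts_2014 = list(filter_incidents(sample, category="LARCENY/THEFT"))
print(len(thefts_2014))
```

Because the filter reads the header row itself, it would not need a separate step to prepend column names.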
Dashboards follow each hypothesis.
- My first hypothesis is that most thefts occur at night in residential neighborhoods. Given that many people have to park their cars outside at night, I assume that most crime occurs when people are asleep.
Looking into this data, I decided that it would likely be more interesting to look at it day by day (rather than just cumulatively). We can see that the evening is by far the most likely time for thefts to occur, with weekdays frequently peaking around 6 PM to 8 PM, while the weekends see a jump in thefts around midnight as well. Sunday was by far the calmest day - something I might want to look at further along.
After completing that histogram of just dates and times, I needed to dive deeper into the data and better understand how time relates to certain neighborhoods.
Then I created a new parameter with the values "Residential" and "Non-Residential".
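The day-by-day view described above comes down to counting incidents per day of week and hour of day. Here is a minimal sketch of that count; the field names `DayOfWeek` and `Time`, the `HH:MM` time format, and the sample rows are assumptions for illustration rather than values from the real data.

```python
from collections import Counter


def thefts_by_day_and_hour(rows):
    """Count incidents per (day of week, hour of day) pair."""
    counts = Counter()
    for row in rows:
        hour = int(row["Time"].split(":")[0])  # assumed HH:MM format
        counts[(row["DayOfWeek"], hour)] += 1
    return counts


# Made-up rows, only to show the shape of the input.
rows = [
    {"DayOfWeek": "Monday", "Time": "18:30"},
    {"DayOfWeek": "Monday", "Time": "18:45"},
    {"DayOfWeek": "Saturday", "Time": "00:10"},
]
counts = thefts_by_day_and_hour(rows)
busiest = counts.most_common(1)[0]
print(busiest)
```

Each key in `counts` corresponds to one bar in a per-day histogram.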
- Bayview (kind of)
I then got the resulting box and whisker plot.
Once I had that plot, I was able to confirm part of my hypothesis: more crime seems to occur in residential neighborhoods than in non-residential neighborhoods on any given day of the week. One important thing to keep in mind is that I used my personal knowledge of the districts (as well as maps from each police station) to classify each one as residential or non-residential, so this is an unofficial classification. As for the rest of my hypothesis, it is not so clear that most crime occurs at "night" - dusk seems to be the time at which most crime occurs.
- Given my previous data, I was able to see that the Southern District has the most theft. My hypothesis is that it also has the most unresolved crime. I also hypothesize that more crimes are resolved in the richest neighborhoods than in poor neighborhoods. Specifically, I believe that the Richmond and Northern districts have the highest proportion of solved crimes per crime committed.
I created a new variable which is the number of thefts resolved. A case is unresolved if the Resolution field is NONE.
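To make the resolved/unresolved rule concrete, here is a small sketch that computes the percent of incidents resolved per district, treating a `Resolution` of NONE as unresolved. The field name `PdDistrict`, the district labels, and the sample rows are hypothetical and only illustrate the calculation.

```python
from collections import defaultdict


def percent_resolved(rows):
    """Return {district: percent of incidents whose Resolution is not NONE}."""
    totals = defaultdict(int)
    resolved = defaultdict(int)
    for row in rows:
        district = row["PdDistrict"]  # assumed column name
        totals[district] += 1
        if row["Resolution"].strip().upper() != "NONE":
            resolved[district] += 1
    return {d: 100.0 * resolved[d] / totals[d] for d in totals}


# Made-up rows for two placeholder districts.
rows = [
    {"PdDistrict": "DISTRICT_A", "Resolution": "NONE"},
    {"PdDistrict": "DISTRICT_A", "Resolution": "ARREST, BOOKED"},
    {"PdDistrict": "DISTRICT_B", "Resolution": "NONE"},
]
rates = percent_resolved(rows)
print(rates)
```

Dividing by each district's own total is what makes small and large districts comparable.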
This exploration yielded some interesting results, and definitely not what I expected. We can see from the percent resolved that nearly 25% of the crimes in the Tenderloin are resolved. I think this might be because it's such a small area that many people are caught doing relatively petty things and don't get very far from where they committed the crime. Interestingly, the Northern District had the smallest percentage resolved, yet also had a high number of crimes. This makes me question my Residential/Non-Residential classifications from earlier, although they may still be valid. The Richmond seems to have a low crime resolution rate as well - not what I would have expected.
Looking at the resolution field gave me a lot of interesting ideas to explore - what makes a theft something the DA doesn't want to prosecute? Where is the highest number of juvenile bookings?
Unfortunately I realized that my dataset was a bit constrained when I just focused on thefts. So I loaded in all crimes in 2014 to see what else I might be able to dig up.
cat col_format.csv SFPD_2014.csv > SFPD_2014_col.csv
This gives me all the data from 2014.
- I hypothesize that thefts follow a similar distribution to crime in general. By this I mean thefts will be a representative sample of the population - that all crimes will occur in roughly the same locations, at the same times, and in the same proportions.
Looking at the dashboard, this seems to be roughly true. The distributions are more or less the same - obviously the totals are thrown off - but this brought me to a new visual: what are the most common crimes? (These are ranked on the right side of this dashboard.) It's not by any means a perfect representation, but it was close enough (visually) that it's fair to say that thefts are a relatively accurate representation of where crime occurs in the city.
- Now that I had a better idea of how the data might be made up - I thought it would be good to jump into a map perspective. I thought looking at assaults might be a good next step - they're the 4th most common category of crime in 2014. I hypothesize that thefts and assaults are related and occur in more or less the same locations. By that I mean that their ratios are the same. If assault occurs as X% of total crime in a given district theft will occur at 2X% of total crime.
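One way to check the "theft is 2X% when assault is X%" idea is to compute each category's share of total crime per district and then take the ratio of the two shares. The sketch below does that; the `ASSAULT` category label, the `PdDistrict` field name, and the sample rows are assumptions for illustration. If the hypothesis held, the ratio would be close to the same value in every district.

```python
from collections import Counter


def category_share(rows, category):
    """Return {district: fraction of that district's incidents in `category`}."""
    totals = Counter()
    hits = Counter()
    for row in rows:
        totals[row["PdDistrict"]] += 1
        if row["Category"] == category:
            hits[row["PdDistrict"]] += 1
    return {d: hits[d] / totals[d] for d in totals}


def theft_to_assault_ratio(rows):
    """Return {district: theft share / assault share}, skipping districts with no assaults."""
    theft = category_share(rows, "LARCENY/THEFT")
    assault = category_share(rows, "ASSAULT")  # assumed category label
    return {d: theft[d] / assault[d] for d in theft if assault[d] > 0}


def make_rows(district, category, n):
    return [{"PdDistrict": district, "Category": category} for _ in range(n)]


# Made-up counts for two placeholder districts.
rows = (
    make_rows("DISTRICT_A", "LARCENY/THEFT", 4)
    + make_rows("DISTRICT_A", "ASSAULT", 2)
    + make_rows("DISTRICT_A", "OTHER", 2)
    + make_rows("DISTRICT_B", "LARCENY/THEFT", 1)
    + make_rows("DISTRICT_B", "ASSAULT", 3)
)
ratios = theft_to_assault_ratio(rows)
print(ratios)
```

In this made-up sample the two districts have very different ratios, which is the kind of pattern that would argue against a fixed citywide multiple.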
I visualized this through two mediums. The first is a map where I plotted only thefts and assaults at locations that were reported more than 20 times. Filtering out the locations where incidents occurred only once or twice gave me a general idea of where thefts and assaults are likely to occur. Looking at this, we can immediately see that some hotspots are shared and some are different. Assaults clearly occur in the Mission much more than thefts. What's interesting about the map view is that we can see these differences much more quickly (in relation to where in the city they are likely to occur). I noticed immediately that there seems to be a fair amount of theft in a specific location in Golden Gate Park. Definitely worth thinking about and exploring further.
Looking at the table tells an even more interesting story. I did some work to look at each category as a percent of total crime in a given district. We can see that in the Mission, theft is not as common as assault. What does this mean? In SOMA there seems to be a lot more theft, possibly from tourists, while in the Mission it seems that more fights may be breaking out, leading to assaults instead of thefts.
Quite simply, I could reject my hypothesis - theft does not seem to occur at a consistent multiple of the assault rate across the city.
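The "more than 20 reports" filter behind the map can be sketched as a count per location and category, keeping only the pairs above the threshold. The `Location` field name and the sample values are assumptions for illustration; the real data might be keyed on coordinates or an address instead.

```python
from collections import Counter


def hotspots(rows, min_reports=20):
    """Return {(location, category): count} for pairs reported more than `min_reports` times."""
    counts = Counter((row["Location"], row["Category"]) for row in rows)
    return {key: n for key, n in counts.items() if n > min_reports}


# Made-up rows: one busy location and one that appears only twice.
rows = (
    [{"Location": "LOC_1", "Category": "LARCENY/THEFT"}] * 21
    + [{"Location": "LOC_2", "Category": "ASSAULT"}] * 2
)
busy = hotspots(rows)
print(busy)
```

Only the pairs that survive this filter would be plotted as points on the map.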
One of the challenges with optionality is volatility.
Show me your bad work.
At a previous startup there were times of stress and challenge. Bickering, indecision, lack of perspective. It happens at most startups at one point or another.
The mind games. Life is just mind games. Whether it's team dynamics. Whether it's just you vs you. It's all mind games.
When working on my Scrappy Startup project using Vue 3, I encountered a need to render markdown. This markdown could either be fetched from a database or written inline within my application. Markdown, with its ease of writing and readability, serves as an excellent format for managing text-based content, especially when you have a considerable amount of textual data to handle.
Sniplet.xyz is a tool that allows you to search deep into podcasts for relevant snippets or podcasts that you might want to listen to.
Sometimes, creating a Press Release / FAQ can be a bit heavyweight. I wrote this template to write punchier proposals that allow for more testing and iteration. The goal is to prove or disprove ideas and document my process for doing so.
The following is the template for Press Release - FAQs as popularized by Amazon. This template is here as a resource for others to use.
Note: See the accompanying GitHub repo for this blogpost here.
The following is a memo that David Henke wrote in 1998. It was a formative article for me and has helped me make serious decisions about my career and where I chose to work. I asked him if I could reproduce it, since I couldn't find it online, and he obliged. Here's what he had to say about it...
This post will be subject to change and evolution. It represents the starting point for me 'starting up'.
You decide that you're going to make a trip, a business trip. You're going to visit some customers, so you hop onto whatever search engine and reserve a car, maybe through National. You get to the destination airport, stroll off the aircraft, grab your bag and walk to the car rental counters, only to realize that Enterprise and National share the same desk.
Recently, there's been a renewed focus on monitoring and understanding company (or product) growth, especially when it comes to SaaS products. Werner Vogels recently mentioned something quite similar in a blog post: "People often ask me if developing for the cloud is any different from developing on-premises software. It really is." I couldn't agree more. It's awesome for understanding products, how users are using them, and what you can do to improve them.
After having sparktutorials.net up for several years, it's time to shut it down. I haven't written for the site in years at this point and it's not doing me any good now that I have The Definitive Guide published.
This was my second time reading The Black Swan by Nassim Taleb although admittedly I think I was a bit young the first time to fully absorb the content. That is not to say that I didn't get the TL;DR of "hey sometimes stuff happens that you can't predict that's meaningful", but what I missed was a lot of the nuance in the actual application of the principles to my life.
As of February 6th, 2018, Spark: The Definitive Guide has gone to print. This was the most intensive project and process that I've ever undertaken in my life. It was filled with frustrations and anticipations, excitements and fears. I must extend thanks to those that encouraged me to lead the writing of the book, namely Ion Stoica, Patrick Wendell, Ali Ghodsi, and (somewhat obviously) Matei Zaharia. These folks were the ones that recommended that I take the lead on the book, and I am forever grateful to them for granting me such an opportunity.
Lately I've been playing around with Spark for data processing. It provides some really amazing features like MLlib and Spark SQL, and there's no better way to learn something than to use it. I've attended a couple of meetups about Spark and its related tools, including the famous AMP Camp put on by the developers of Spark, and although I'm not an expert, I thought it would be good to consolidate my knowledge and teach others.
I've recently launched a website called SparkTutorials.net. Spark Tutorials aims to educate the general public about the utility of Spark as a tool for data science. I would encourage you to read more on the website and learn something new!
Recently I took it upon myself to dive into Scala. This post describes what my reaction was after writing a link shortener service using it. For those only interested in the code, check out my GitHub.
Recently I took it upon myself to dive into Clojure. This post describes what my reaction was after writing a link shortener service using it. For those only interested in the code, check out my GitHub.
During the World Series, especially during the Giants win, there was mass rioting and looting. For our data visualization class, a classmate, John Semerdjian, and I made an interactive visualization of the crime in the city during each game.
08 May 2015
This was a post that I did for Plotly covering the basics of plotting Spark DataFrames with plotly.
This was built for a class project in my Information Visualization class.
This notebook walks through an example of KMeans clustering crime data with alcohol license locations. This clustering is performed solely based on the Lat/Long locations of stores and crimes. The tools I use are
This was a post that I did for Plotly covering the basics of the tool with Salesforce.
This past weekend I was at the wise.io data science hack day and had a great time. The team is clearly intelligent and I really enjoy working and learning in that kind of environment.
This post is meant as a summary of many of the concepts that I learned in Marti Hearst's Natural Language Processing class at the UC Berkeley School of Information. I wanted to record the concepts and approaches that I had learned with quick overviews of the code you need to get it working. I figured that it could help some other people get a handle on the goals and code to get things done.
Leada has recently set out to email out new datasets every week with a couple of interesting questions. I thought that this week's challenge posed some interesting questions that provide great examples of ways to use Python's pandas library.
This is a two-part post; you can see part 1 here. Please read that post (if you haven't already) before continuing, or just check out the code in this gist.
This is a two-part post; you can see part 2 here.
Contemporary notions of privacy are complex, and it is common to hear commentators calling the current state of privacy, or lack thereof, unprecedented. I would challenge the notion of an unprecedented violation of privacy on the basis of historical relativity. In absolute terms there is little question that the world we live in challenges any notions of privacy that have ever existed. However, in relative terms, from a certain level of privacy to another, the rise of newspapers and the telegraph are interesting to compare to the modern era. In this paper I will revisit several key cultural and legal landmarks that have guided us to our current construct of privacy and look at future privacy implications of technologies like Amazon Echo and services like Facebook.
First, I'd like to introduce the California Civic Data Coalition. They are self-described as a loosely coupled team from the Los Angeles Times Data Desk, The Center for Investigative Reporting and Stanford's Computational Journalism Lab.
This document will be a simple introduction to static site generators. We'll go over the basics of what they are, why you should use them, which one you should use and finally how to get started.
DataKindSF just got their start and the reception was incredible. There was a huge turnout of people wanting to contribute by using high impact skills for greater good. I found out about the program through their meetup.
Wow, hackathons are an experience. Firstly, I was amazed by the turnout: realistically, probably 40 teams competed in a 12 hour hackathon for Evernote at the Computer Science Department at Berkeley. I found the atmosphere to be supportive and fiercely competitive at the same time. Hackathons are a strange creation and I've struggled to come up with a parallel in history. But that's for another post.
Several weeks ago I sent a review of a feature in an app I use called Timeful. This is my letter to that company where I tried to get a better understanding of their motivations for the user experience of part of their application.
SurpriseHaiku was an experiment that parsed random tweets to see if they followed the haiku cadence of 5 / 7 / 5 (syllables). I ran this experiment during the 2014 Olympics to try and focus on Olympics-related tweets. This was an application that I built to learn more about the Twitter API and dip my toes into the world of Natural Language Processing, or NLP.