K-Means Clustering - Liquor & Assaults in San Francisco

31 March 2015

This notebook walks through an example of K-Means clustering of crime data together with alcohol license locations. The clustering is performed solely on the Lat/Long locations of stores and crimes. The tools I use are pandas, scikit-learn, and matplotlib.

The most basic question being answered is:

Given Lat/Long coordinates, can we draw some association between liquor store centroids and crime centroids (overall, or for a particular type of crime)? Put another way: do groups of crimes overlap with groups of liquor stores?

The data we're using is from SFGOV as well as the Alcoholic Beverage Control.

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# pd.options.display.mpl_style was removed from pandas; plt.style.use is the modern equivalent.
plt.style.use('ggplot')

alc = pd.read_csv("data/alcohol_licenses_locations.csv")
crime = pd.read_csv("data/Map__Crime_Incidents_-_from_1_Jan_2003_REDUCED.csv")
alc.columns
Index([u'Unnamed: 0', u'Join_Count', u'Status', u'Score', u'Match_type', u'Side', u'X', u'Y', u'Match_addr', u'ARC_Street', u'Entry_no', u'Owner_name', u'street', u'city', u'state', u'zip', u'Entry_no_1', u'License_Nu', u'Status_1', u'License_Ty', u'Orig_Iss_D', u'Expir_Date', u'Census_tra', u'Business_N', u'Mailing_Ad', u'Geo_Code', u'Tract2010', u'coords.x1', u'coords.x2'], dtype='object')
crime.columns
Index([u'IncidntNum', u'Category', u'Descript', u'DayOfWeek', u'Date', u'Time', u'PdDistrict', u'Resolution', u'Address', u'X', u'Y', u'Location'], dtype='object')

This is an outer join combining the reduced crime set with the alcohol license location data. It joins on the X and Y columns (longitude and latitude).

combo = pd.merge(alc, crime, on=['X','Y'], how='outer')
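A minimal sketch of how that outer join behaves, on toy stand-ins for the two tables (the column values below are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the alcohol and crime tables, sharing X/Y coordinate columns.
alc_toy = pd.DataFrame({'X': [-122.41, -122.42], 'Y': [37.77, 37.78],
                        'License_Ty': [20, 21]})
crime_toy = pd.DataFrame({'X': [-122.41, -122.43], 'Y': [37.77, 37.79],
                          'Category': ['ASSAULT', 'TRESPASS']})

combo_toy = pd.merge(alc_toy, crime_toy, on=['X', 'Y'], how='outer')
# Rows matching on both X and Y are combined; unmatched rows from either
# side are kept, with NaN filling the other side's columns.
print(combo_toy)
```

With `how='outer'`, nothing is dropped: one row matches on coordinates, and each table contributes one unmatched row.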

At this point I am reducing the license types to just 20 and 21 - these are the off-sale (off-site consumption) types.

Reference dictionary here: http://www.abc.ca.gov/datport/SubAnnStatRep.pdf

features = ['X','Y']

K-Means

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

from sklearn.cluster import KMeans
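Before running it on the real coordinates, here is a minimal sketch of the fit-then-read-centroids workflow on toy 2-D points (the points and cluster count are arbitrary, just to show the API):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs of 2-D points.
pts = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(pts)
print(km.cluster_centers_)  # one centroid per blob, shape (2, 2)
print(km.labels_)           # cluster assignment for each input point
```

`cluster_centers_` is what gets overlaid on the scatter plots below, and `labels_` maps each row back to its cluster.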

Clustering Liquor Stores

I'm looking at just the 20 and 21 license types because these are for off-site sales.

print(len(alc))
alc = alc[(alc['License_Ty'] == 20) | (alc['License_Ty'] == 21)]
print(len(alc))

3635
809
alc_X = alc[features]


for num_clusters in range(10, 75, 5):
    # Fit K-Means, then overlay the fitted centroids (black) on the store locations.
    km = KMeans(n_clusters=num_clusters)
    km_fit = km.fit(alc_X)
    ax = alc_X.plot(kind='scatter', x='X', y='Y', legend=str(num_clusters), figsize=(8, 6))
    pd.DataFrame(km_fit.cluster_centers_).plot(kind='scatter', x=0, y=1, color='k', ax=ax)
    ax.set_title(str(num_clusters) + " License 20 & 21 Clusters")

(Scatter plots of liquor store locations with black centroid overlays as the cluster count increases.)

A thoroughly unscientific analysis: 55 clusters for the alcohol stores jumped out at me as an approximately correct measure - it seems to strike a decent balance across different spots on the map.
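A slightly less eyeball-driven way to pick the cluster count is the elbow method: track KMeans `inertia_` (the within-cluster sum of squared distances) across values of k and look for where the drop flattens out. A sketch on synthetic points - swap in the real `alc_X` coordinates, which aren't loaded here:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the (X, Y) coordinate frame; use alc_X for the real data.
rng = np.random.RandomState(0)
pts = rng.normal(size=(300, 2))

ks = range(2, 12)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(pts)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# Inertia always shrinks as k grows; the "elbow" where the decline
# flattens is a reasonable k to proceed with.
print(list(zip(ks, inertias)))
```

On the real store coordinates this would give a more defensible k than eyeballing the scatter plots, though 55 may well survive the test.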

Proceeding with 55 clusters - feel free to change this as you see fit.

num_clusters = 55
liq_km = KMeans(num_clusters)
liq_km_fit = liq_km.fit(alc_X)
liq_ax = alc_X.plot(kind='scatter',x='X',y='Y', legend=str(num_clusters), figsize=(8, 6))
pd.DataFrame(liq_km_fit.cluster_centers_).plot(kind='scatter',x=0,y=1,color='k',ax=liq_ax)
liq_ax.set_title(str(num_clusters) + " License 20 & 21 Clusters")

(Scatter plot: liquor store locations with the 55 fitted centroids in black.)

Now we're going to repeat the process for crime locations. The goal is to see whether there are any location overlaps between the stores and crimes above. Eventually we'll move into categories of crimes.

Clustering Crime Categories

print(crime.Category.unique())
# pd.to_datetime is vectorized over a Series; no need for .apply here.
crime.Date = pd.to_datetime(crime.Date)

['ASSAULT' 'OTHER OFFENSES' 'NON-CRIMINAL' 'SEX OFFENSES, FORCIBLE'
 'SUSPICIOUS OCC' 'DRUG/NARCOTIC' 'WEAPON LAWS' 'VANDALISM' 'TRESPASS'
 'SECONDARY CODES' 'DRIVING UNDER THE INFLUENCE' 'FAMILY OFFENSES'
 'DRUNKENNESS' 'LOITERING' 'PROSTITUTION' 'LIQUOR LAWS'
 'DISORDERLY CONDUCT' 'SUICIDE' 'SEX OFFENSES, NON FORCIBLE'
 'PORNOGRAPHY/OBSCENE MAT']
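Before picking a category to cluster, it can help to see how the incidents distribute across them; `value_counts` does the tally. A sketch on a toy frame standing in for `crime` (the real call is `crime['Category'].value_counts()`):

```python
import pandas as pd

# Toy stand-in for the crime frame's Category column.
toy = pd.DataFrame({'Category': ['ASSAULT', 'ASSAULT', 'TRESPASS',
                                 'ASSAULT', 'VANDALISM']})
counts = toy['Category'].value_counts()
# Sorted most-frequent first, so the dominant category tops the list.
print(counts)
```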


I looked at just assaults in order to dive a bit deeper into the data itself.

print(len(crime))
sub_crime = crime[crime['Category'] == "ASSAULT"]
print(len(sub_crime))
# Combine the two date conditions into a single boolean mask; chaining
# sub_crime[a][b] triggers a reindexing UserWarning and is fragile.
sub_crime = sub_crime[(sub_crime.Date > '2013-01-01') & (sub_crime.Date < '2014-01-01')].reset_index()
crime_X = sub_crime[features]
print(len(crime_X))

404080
64033
12588



for num_clusters in range(10, 75, 5):
    # Same sweep as before, now over the assault locations.
    km = KMeans(n_clusters=num_clusters)
    km_fit = km.fit(crime_X)
    ax = crime_X.plot(kind='scatter', x='X', y='Y', legend=str(num_clusters), figsize=(8, 6))
    pd.DataFrame(km_fit.cluster_centers_).plot(kind='scatter', x=0, y=1, color='k', ax=ax)
    ax.set_title(str(num_clusters) + " Assault Clusters")

(Scatter plots of assault locations with black centroid overlays as the cluster count increases.)

num_clusters = 55
crime_km = KMeans(num_clusters)
crime_km_fit = crime_km.fit(crime_X)
crime_ax = crime_X.plot(kind='scatter',x='X',y='Y', legend=str(num_clusters), figsize=(8, 6))
pd.DataFrame(crime_km_fit.cluster_centers_).plot(kind='scatter',x=0,y=1,color='k',ax=crime_ax)
crime_ax.set_title(str(num_clusters) + " Assault Clusters")

(Scatter plot: assault locations with the 55 fitted centroids in black.)

  • Blue: the underlying liquor stores
  • Red: assault centroids
  • Black: liquor store centroids

alc_base = alc_X.plot(kind='scatter',x='X',y='Y', legend=str(num_clusters), figsize=(10, 8))
pd.DataFrame(crime_km_fit.cluster_centers_).plot(kind='scatter',x=0,y=1,color='r', ax=alc_base)
pd.DataFrame(liq_km_fit.cluster_centers_).plot(kind='scatter',x=0,y=1,color='k',ax=alc_base)
alc_base.set_title("Clustering of Assaults and Liquor Stores")

(Scatter plot: liquor store locations with assault centroids in red and liquor store centroids in black.)
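To put a rough number on the visual overlap, one could measure, for each assault centroid, the distance to the nearest liquor-store centroid. A sketch on toy centroid arrays - swap in `crime_km_fit.cluster_centers_` and `liq_km_fit.cluster_centers_` for the real ones:

```python
import numpy as np

# Toy stand-ins for the two fitted centroid arrays.
assault_centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
liquor_centroids = np.array([[0.1, 0.0], [9.0, 9.0]])

# Pairwise Euclidean distances via broadcasting, then the nearest
# liquor centroid for each assault centroid.
diffs = assault_centroids[:, None, :] - liquor_centroids[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=2))
nearest = dists.min(axis=1)
print(nearest)  # small values mean an assault cluster sits near a liquor cluster
```

One caveat: on real Lat/Long data, Euclidean distance in degrees is only a rough proxy for ground distance, though at city scale it's fine for ranking.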

Leaflet

I also decided to plot this data in Leaflet, a JavaScript library for interactive maps, to exercise my JavaScript skills a bit.

  • Red: Assault (Centroids being larger)
  • Black: Liquor Stores (Centroids being larger)

Conclusion

It appears that there is at least some basic association between assaults and liquor store locations. This will obviously vary with the type of crime, but it is worth exploring further. This was not intended to be a scientific analysis - much more of an exploration. Given any number of possible biases, you cannot derive explicit relationships from this at face value. Mostly, I wanted to play around with a visual display of k-means and scikit-learn.