Data Challenge - Rebalancing Bike Terminals in SF
Leada has recently started emailing out new datasets every week along with a couple of interesting questions. I thought that this week's challenge posed some interesting questions that provide great examples of ways to use Python's pandas library.
Here's their email.
> You just made your 2015 New Year's resolution to improve your data analysis skills and become data literate! Every two weeks, we will e-mail you a dataset and some problems to work on. You can use any tool you'd like, although we suggest using R or Python. R is easy to install and if you are new to programming, the Introduction to R lesson is free in our courses! Here's the first dataset and problem! Answers will be released at a later date. Tweet your solution @LeadaHQ!
>
> Dataset Description: A CSV of bicycle rental transactions for a rental company in California.
>
> Problem 1: What was the average total time (in minutes) used by a bicycle in the data?
>
> Problem 2: What was the most popular day by trip frequency in this dataset?
>
> Problem 3 (harder): Assuming there are 30 bikes per station, find what date and time the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent.
>
> Problem 3 (easier): Assuming there are 30 bikes per station, find what date the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent. Do this ignoring "Start.Date" and "End.Date" columns.
So let's get started on answering those questions. Here are the first couple of lines of code, standard imports and getting a sense of our columns.
```python
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.read_csv('data/bike_trip_data.csv').head()
```
| | Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscription Type | Zip Code |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4576 | 63 | 8/29/13 14:13 | South Van Ness at Market | 66 | 8/29/13 14:14 | South Van Ness at Market | 66 | 520 | Subscriber | 94127 |
| 1 | 4607 | 70 | 8/29/13 14:42 | San Jose City Hall | 10 | 8/29/13 14:43 | San Jose City Hall | 10 | 661 | Subscriber | 95138 |
| 2 | 4130 | 71 | 8/29/13 10:16 | Mountain View City Hall | 27 | 8/29/13 10:17 | Mountain View City Hall | 27 | 48 | Subscriber | 97214 |
| 3 | 4251 | 77 | 8/29/13 11:29 | San Jose City Hall | 10 | 8/29/13 11:30 | San Jose City Hall | 10 | 26 | Subscriber | 95060 |
| 4 | 4299 | 83 | 8/29/13 12:02 | South Van Ness at Market | 66 | 8/29/13 12:04 | Market at 10th | 67 | 319 | Subscriber | 94103 |
Now that we know which columns are dates we can perform our import. Pretty straightforward stuff - we use `parse_dates` to turn those columns into datetimes. Then we can check our data types to make sure that everything is correct.
```python
df = pd.read_csv('data/bike_trip_data.csv', parse_dates=['Start Date', 'End Date'])
print df.dtypes
```
```
Trip ID                       int64
Duration                      int64
Start Date           datetime64[ns]
Start Station                object
Start Terminal                int64
End Date             datetime64[ns]
End Station                  object
End Terminal                  int64
Bike #                        int64
Subscription Type            object
Zip Code                     object
dtype: object
```
Problem 1: What was the average total time (in minutes) used by a bicycle in the data?
Well, we've got to get the total time each bicycle (each one individually) was used, then take the mean across bicycles and divide it by 60 to get minutes.
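The original post doesn't show the code for this step, so here's a minimal sketch. It uses a tiny made-up DataFrame standing in for `bike_trip_data.csv` (so the number it prints is illustrative, not the real answer); the `Bike #` and `Duration` columns match the dataset, with `Duration` in seconds.

```python
import pandas as pd

# Made-up trips standing in for bike_trip_data.csv
df = pd.DataFrame({
    'Bike #':   [520, 520, 661, 48],
    'Duration': [3600, 1800, 600, 1200],  # seconds per trip
})

# Sum usage per bike, average across bikes, then convert seconds -> minutes
avg_minutes = df.groupby('Bike #')['Duration'].sum().mean() / 60
print(avg_minutes)  # (5400 + 600 + 1200) / 3 bikes / 60 = 40.0 minutes
```

On the real CSV you'd build `df` with `pd.read_csv` as above and run the same `groupby` line.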
So on average they are used about 70 hours. This was a strange question though, and I'm not sure what we're trying to get out of it.
Problem 2: What was the most popular day by trip frequency in this dataset?
This question was a bit more interesting. While we could just try to get the value counts of the start dates, those include times as well. So we've got to map each timestamp down to just its date with a lambda. Using `np.vectorize` would make this more efficient; however, the execution time was not slow enough to warrant further optimization.
```python
df['Start Date'].map(lambda x: "%d-%d-%d" % (x.year, x.month, x.day)).value_counts()[:5]
```

```
2013-9-25     1264
2013-10-17    1227
2013-11-7     1197
2013-10-15    1173
2013-11-8     1172
dtype: int64
```
```python
df['End Date'].map(lambda x: "%d-%d-%d" % (x.year, x.month, x.day)).value_counts()[:5]
```

```
2013-9-25     1266
2013-10-17    1225
2013-11-7     1196
2013-11-8     1173
2013-10-15    1172
dtype: int64
```
So we see that September 25th was the most popular day, followed by the 17th of October and then a couple of days in November.
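As an aside, pandas' `.dt` accessor can do this without the per-row lambda: `.dt.normalize()` truncates every timestamp to midnight in one vectorized call. A sketch on a few made-up timestamps:

```python
import pandas as pd

# Made-up timestamps standing in for df['Start Date']
dates = pd.Series(pd.to_datetime([
    '2013-09-25 14:13',
    '2013-09-25 09:01',
    '2013-10-17 08:00',
]))

# Truncate each timestamp to its date (midnight), then count trips per day
counts = dates.dt.normalize().value_counts()
print(counts)  # 2013-09-25 appears twice, 2013-10-17 once
```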
Problem 3 (harder): Assuming there are 30 bikes per station, find what date and time the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent.
I found this question to be a lot of fun because you really have to think about what you're trying to do. Basically we're trying to understand how bikes are flowing out of stations and how they're flowing into stations. Additionally different stations are made available at different times so you've got to find a scalable way to separate the stations and get the flows for each.
```python
start_stations = df.groupby('Start Station')
end_stations = df.groupby('End Station')
```
First we perform a quick `groupby` to line up the stations so that we can find the flows for each one: trips leaving and trips arriving. Once we have in and out flows for each station, we can merge them and get the answer - which station needs to be rebalanced first.
```python
def get_flow_data(df_start, df_end):
    df_start.sort(inplace=True)
    df_end.sort(inplace=True)
    # Count down from 29: the first departure leaves 29 of the 30 bikes behind
    start_counts = pd.Series(np.arange(29, 29 - len(df_start.index), -1),
                             index=df_start.index, name='start', dtype=np.int64)
    # Count up, one per arrival
    end_counts = pd.Series(np.arange(len(df_end.index)) + 1,
                           index=df_end.index, name='end', dtype=np.int64)
    merged = start_counts.to_frame().join(end_counts.to_frame(), how='outer')
    merged['start'] = merged.start.fillna(method='ffill').fillna(method='bfill').astype(np.int64)
    merged['end'] = merged.end.fillna(method='ffill').fillna(value=0).astype(np.int64)
    merged['combo'] = merged.start + merged.end
    return merged
```
Let's walk through this line by line.
First we need to sort the dates to make sure that our index (of dates) is lined up correctly.
```python
start_counts = pd.Series(np.arange(29, 29 - len(df_start.index), -1),
                         index=df_start.index, name='start', dtype=np.int64)
end_counts = pd.Series(np.arange(len(df_end.index)) + 1,
                       index=df_end.index, name='end', dtype=np.int64)
```
Then I created a new pandas `Series` for the start and end dates.
The start_counts series starts at 29 (the first departure leaves 29 of the 30 bikes behind) and goes down by one for each trip that originates at that station.
The end_counts series simply counts up, one per bike trip ending at that station.
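To make the two counters concrete, here's a toy run with 5 bikes instead of 30 (the timestamps are made up):

```python
import numpy as np
import pandas as pd

# Three departures and two arrivals at invented times
starts = pd.DatetimeIndex(['2013-09-01 08:00', '2013-09-01 09:00', '2013-09-01 10:00'])
ends = pd.DatetimeIndex(['2013-09-01 08:30', '2013-09-01 09:30'])

# With 5 bikes, the first departure leaves 4 behind, so count down from 4
start_counts = pd.Series(np.arange(4, 4 - len(starts), -1), index=starts, name='start')
# Arrivals simply count up: 1, 2, ...
end_counts = pd.Series(np.arange(len(ends)) + 1, index=ends, name='end')

print(start_counts.tolist())  # [4, 3, 2]
print(end_counts.tolist())    # [1, 2]
```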
We then merge them into one data frame.
```python
merged = start_counts.to_frame().join(end_counts.to_frame(), how='outer')
```
We then fill in the missing values by forward filling and back filling the bike counts, so that at every timestamp we know both the outflow of bikes leaving the station and the inflow of bikes arriving.
```python
merged['start'] = merged.start.fillna(method='ffill').fillna(method='bfill').astype(np.int64)
merged['end'] = merged.end.fillna(method='ffill').fillna(value=0).astype(np.int64)
merged['combo'] = merged.start + merged.end
```
Finally we add the two columns together, which gives us the number of bikes at the station at each point in time. The first time that count hits zero is when the bike station needs to be rebalanced.
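Putting the pieces together, here's the whole flow run on a toy station stocked with 2 bikes (made-up timestamps), small enough to check by hand:

```python
import numpy as np
import pandas as pd

# Three departures and one arrival at a station stocked with 2 bikes
starts = pd.DatetimeIndex(['2013-09-01 08:00', '2013-09-01 09:00', '2013-09-01 10:00'])
ends = pd.DatetimeIndex(['2013-09-01 09:30'])

# The first departure leaves 1 bike behind, so the countdown starts at 1
start_counts = pd.Series(np.arange(1, 1 - len(starts), -1), index=starts, name='start')
end_counts = pd.Series(np.arange(len(ends)) + 1, index=ends, name='end')

merged = start_counts.to_frame().join(end_counts.to_frame(), how='outer')
merged['start'] = merged.start.ffill().bfill().astype(np.int64)
merged['end'] = merged.end.ffill().fillna(0).astype(np.int64)
merged['combo'] = merged.start + merged.end

print(merged.combo.tolist())                     # [1, 0, 1, 0]
print(merged.combo[merged.combo <= 0].index[0])  # 2013-09-01 09:00
```

The station empties at 09:00 (second departure), is refilled by the 09:30 return, and empties again at 10:00 - so the first rebalancing time is 09:00.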
```python
data = []
for name, group in start_stations:
    print name
    m = get_flow_data(group.set_index('Start Date'),
                      end_stations.get_group(name).set_index('End Date'))
    if (m.combo <= 0).any():
        d = m.combo[m.combo <= 0].index[0]
        data.append((d, name))
```
However, some stations never had to be rebalanced, so we had to test whether each one did.
Here are the first 5 stations that have to be rebalanced, along with the point at which each one first runs out of bikes.
```
[(Timestamp('2013-09-02 15:03:00'), 'Embarcadero at Bryant'),
 (Timestamp('2013-09-04 07:56:00'), 'Market at Sansome'),
 (Timestamp('2013-09-05 07:56:00'), '2nd at Folsom'),
 (Timestamp('2013-09-05 18:56:00'), 'Temporary Transbay Terminal (Howard at Beale)'),
 (Timestamp('2013-09-10 12:36:00'), 'San Francisco City Hall')]
```
Here's the final code for anyone who is interested. I hope I get to keep tackling these problems - they're meaty enough to sink your teeth into, but not so difficult that they can't be done in a short amount of time.