Data Challenge - Rebalancing Bike Terminals in SF
Leada has recently started emailing out new datasets every couple of weeks along with some interesting questions. I thought this week’s challenge posed some questions that make for great examples of ways to use Python’s pandas library.
Here’s their email.
You just made your 2015 New Year's resolution to improve your data analysis skills and become data literate!
Every two weeks, we will e-mail you a dataset and some problems to work on. You can use any tool you'd like, although we suggest using R or Python. R is easy to install and if you are new to programming, the Introduction to R lesson is free in our courses!
Here's the first dataset and problem! Answers will be released at a later date. Tweet your solution @LeadaHQ!
====================
Dataset Description: A CSV of bicycle rental transactions for a rental company in California.
Dataset Download
Problem 1: What was the average total time (in minutes) used by a bicycle in the data?
Problem 2: What was the most popular day by trip frequency in this dataset?
Problem 3 (harder): Assuming there are 30 bikes per station, find what date and time the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent.
Problem 3 (easier): Assuming there are 30 bikes per station, find what date the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent. Do this ignoring "Start.Date" and "End.Date" columns.
====================
So let’s get started on answering those questions. Here are the first few lines of code: the standard imports and a first look at our columns.
```python
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```
```python
pd.read_csv('data/bike_trip_data.csv').head()
```
| | Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscription Type | Zip Code |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4576 | 63 | 8/29/13 14:13 | South Van Ness at Market | 66 | 8/29/13 14:14 | South Van Ness at Market | 66 | 520 | Subscriber | 94127 |
| 1 | 4607 | 70 | 8/29/13 14:42 | San Jose City Hall | 10 | 8/29/13 14:43 | San Jose City Hall | 10 | 661 | Subscriber | 95138 |
| 2 | 4130 | 71 | 8/29/13 10:16 | Mountain View City Hall | 27 | 8/29/13 10:17 | Mountain View City Hall | 27 | 48 | Subscriber | 97214 |
| 3 | 4251 | 77 | 8/29/13 11:29 | San Jose City Hall | 10 | 8/29/13 11:30 | San Jose City Hall | 10 | 26 | Subscriber | 95060 |
| 4 | 4299 | 83 | 8/29/13 12:02 | South Van Ness at Market | 66 | 8/29/13 12:04 | Market at 10th | 67 | 319 | Subscriber | 94103 |
Now that we know which columns are dates, we can perform our import. Pretty straightforward stuff - we use parse_dates to turn those columns into datetimes. Then we check the data types to make sure everything came in correctly.
```python
df = pd.read_csv('data/bike_trip_data.csv', parse_dates=['Start Date', 'End Date'])
print df.dtypes
```
```
Trip ID                        int64
Duration                       int64
Start Date            datetime64[ns]
Start Station                 object
Start Terminal                 int64
End Date              datetime64[ns]
End Station                   object
End Terminal                   int64
Bike #                         int64
Subscription Type             object
Zip Code                      object
dtype: object
```
Problem 1: What was the average total time (in minutes) used by a bicycle in the data?
Well, we’ve got to sum up the total time each bicycle was used (each one individually), take the mean of those per-bike totals, and divide by 60 to convert the seconds into minutes.
```python
df.groupby('Bike #').Duration.sum().mean() / 60
```

```
4288.08717948718
```
So on average each bike has been used for about 4,288 minutes, or roughly 71 hours. This was a strange question though, and I’m not sure what we’re meant to get out of it.
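As a quick sanity check (my own addition, not part of the original challenge, and assuming the Duration column is in seconds), we can look at the distribution of the per-bike totals in hours:

```python
# Sketch of a sanity check: total usage per bike, converted from
# seconds to hours, summarized with describe(). Uses the df loaded above.
(df.groupby('Bike #').Duration.sum() / 3600.0).describe()
```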
Problem 2: What was the most popular day by trip frequency in this dataset?
This question was a bit more interesting. While we could just take the value counts of the start dates, those include times as well, so we’ve got to map each timestamp down to its calendar day with a lambda. Using np.vectorize would make this more efficient, but the execution time was not slow enough to warrant further optimization.
```python
df['Start Date'].map(lambda x: "%d-%d-%d" % (x.year, x.month, x.day)).value_counts()[:5]
```
```
2013-9-25     1264
2013-10-17    1227
2013-11-7     1197
2013-10-15    1173
2013-11-8     1172
dtype: int64
```
```python
df['End Date'].map(lambda x: "%d-%d-%d" % (x.year, x.month, x.day)).value_counts()[:5]
```
```
2013-9-25     1266
2013-10-17    1225
2013-11-7     1196
2013-11-8     1173
2013-10-15    1172
dtype: int64
```
So we see that September 25th was the most popular day, followed by October 17th and then a couple of days in November and October.
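As an aside, here’s a sketch of the vectorized route hinted at above. This is my own addition, and it assumes a pandas version that has the .dt accessor on datetime columns (0.15 or later); it sidesteps the Python-level lambda entirely:

```python
# Alternative sketch: .dt.date truncates each timestamp to its calendar
# day, so value_counts() then tallies trips per day directly.
df['Start Date'].dt.date.value_counts()[:5]
```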
Problem 3 (harder): Assuming there are 30 bikes per station, find what date and time the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent.
I found this question to be a lot of fun because you really have to think about what you’re trying to do. Basically, we’re trying to understand how bikes flow out of stations and how they flow in. Additionally, different stations come online at different times, so you’ve got to find a scalable way to separate the stations and compute the flows for each.
```python
start_stations = df.groupby('Start Station')
end_stations = df.groupby('End Station')
```
First we perform a quick groupby to line up the stations so that we can compute the flows for each one: trips leaving and trips arriving. Once we have the in and out flows for each station, we can merge them and find the answer - which station needs to be rebalanced first.
```python
def get_flow_data(df_start, df_end):
    # Sort both frames by their datetime index.
    df_start.sort(inplace=True)
    df_end.sort(inplace=True)
    # Bikes remaining after each departure: 29, 28, 27, ...
    start_counts = pd.Series(np.arange(29, 29 + (-1 * len(df_start.index)), -1),
                             index=df_start.index, name='start', dtype=np.int64)
    # Running count of arrivals: 1, 2, 3, ...
    end_counts = pd.Series(np.arange(len(df_end.index)) + 1,
                           index=df_end.index, name='end', dtype=np.int64)
    # Outer join on the union of departure and arrival timestamps.
    merged = start_counts.to_frame().join(end_counts.to_frame(), how='outer')
    merged['start'] = merged.start.fillna(method='ffill').fillna(method='bfill').astype(np.int64)
    merged['end'] = merged.end.fillna(method='ffill').fillna(value=0).astype(np.int64)
    # Bikes at the station at each point in time.
    merged['combo'] = merged.start + merged.end
    return merged
```
Let’s walk through this line by line.
```python
df_start.sort(inplace=True)
df_end.sort(inplace=True)
```
First we need to sort the dates to make sure that our index (of dates) is lined up correctly.
```python
start_counts = pd.Series(np.arange(29, 29 + (-1 * len(df_start.index)), -1),
                         index=df_start.index, name='start', dtype=np.int64)
end_counts = pd.Series(np.arange(len(df_end.index)) + 1,
                       index=df_end.index, name='end', dtype=np.int64)
```
Then I created a new pandas Series for the start and end dates.
The start_counts series starts at 29 (30 bikes minus the one that just left) and counts down by one for each trip that departs the station.
The end_counts series simply counts up, 1, 2, 3, and so on, over the trips ending at that station.
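To make this concrete, here’s a toy illustration of what the two series look like. This is my own mini-example with made-up timestamps: a station with three departures and two arrivals.

```python
import numpy as np
import pandas as pd

# Hypothetical timestamps: three departures and two arrivals.
dep_times = pd.to_datetime(['2013-08-29 10:00', '2013-08-29 10:05',
                            '2013-08-29 10:20'])
arr_times = pd.to_datetime(['2013-08-29 10:10', '2013-08-29 10:30'])

# Bikes remaining after each departure: 29, 28, 27
start_counts = pd.Series(np.arange(29, 29 - len(dep_times), -1),
                         index=dep_times, name='start')
# Running arrival count: 1, 2
end_counts = pd.Series(np.arange(len(arr_times)) + 1,
                       index=arr_times, name='end')
```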
We then merge them into one data frame.
```python
merged = start_counts.to_frame().join(end_counts.to_frame(), how='outer')
```
We then fill in the missing values, forward-filling and back-filling the counts, so that every timestamp carries both the running count of bikes that have left the station and the running count of bikes that have arrived.
```python
merged['start'] = merged.start.fillna(method='ffill').fillna(method='bfill').astype(np.int64)
merged['end'] = merged.end.fillna(method='ffill').fillna(value=0).astype(np.int64)
merged['combo'] = merged.start + merged.end
```
Finally we add the two columns together, and combo gives us the number of bikes at the station at each point in time. The moment that count hits zero is when the station needs to be rebalanced.
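Continuing the toy example from above (again, my own made-up numbers), the filled and summed frame works out like this:

```python
# Outer join on the union of timestamps, then fill the gaps.
merged = start_counts.to_frame().join(end_counts.to_frame(), how='outer')
merged['start'] = merged.start.fillna(method='ffill').fillna(method='bfill')
merged['end'] = merged.end.fillna(method='ffill').fillna(value=0)
merged['combo'] = merged.start + merged.end
# combo over the five timestamps: 29, 28, 29, 28, 29 --
# that is, 30 bikes, minus departures so far, plus arrivals so far.
```

Back on the real data, we loop over every station and record the first moment its count drops to zero: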
```python
data = []
for name, group in start_stations:
    print name
    m = get_flow_data(group.set_index('Start Date'),
                      end_stations.get_group(name).set_index('End Date'))
    if (m.combo <= 0).any():
        d = m.combo[m.combo <= 0].index[0]
        data.append((d, name))
```
However, some stations never needed to be rebalanced, so we test whether each one did before recording it.
```python
print sorted(data)[:5]
```
Here are the first five stations that need to be rebalanced, along with the moment each one first runs out of bikes.
```
[(Timestamp('2013-09-02 15:03:00'), 'Embarcadero at Bryant'),
 (Timestamp('2013-09-04 07:56:00'), 'Market at Sansome'),
 (Timestamp('2013-09-05 07:56:00'), '2nd at Folsom'),
 (Timestamp('2013-09-05 18:56:00'), 'Temporary Transbay Terminal (Howard at Beale)'),
 (Timestamp('2013-09-10 12:36:00'), 'San Francisco City Hall')]
```
Here’s the final code for anyone who is interested. I hope I get to keep tackling these challenges - they’re meaty enough to sink your teeth into, but not so difficult that they can’t be done in a short amount of time.