Leada has recently started emailing out a new dataset every couple of weeks along with some interesting questions. I thought this week's challenge posed some fun problems that make for great examples of ways to use Python's pandas library.
Here's their email.
You just made your 2015 New Year's resolution to improve your data analysis
skills and become data literate!
Every two weeks, we will e-mail you a dataset and some problems to work on. You
can use any tool you'd like, although we suggest using R or Python. R is easy to
install and if you are new to programming, the Introduction to R lesson is free
in our courses!
Here's the first dataset and problem! Answers will be released at a later date.
Tweet your solution @LeadaHQ!
====================
Dataset Description: A CSV of bicycle rental transactions for a rental company
in California.
Dataset Download
Problem 1: What was the average total time (in minutes) used by a bicycle in the
data?
Problem 2: What was the most popular day by trip frequency in this dataset?
Problem 3 (harder): Assuming there are 30 bikes per station, find what date and
time the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a
terminal for a customer to rent.
Problem 3 (easier): Assuming there are 30 bikes per station, find what date the
bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a
customer to rent. Do this ignoring "Start.Date" and "End.Date" columns.
====================
So let's get started on answering those questions. Here are the first couple of lines of code: the standard imports and a peek at our columns.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.read_csv('data/bike_trip_data.csv').head()
| | Trip ID | Duration | Start Date | Start Station | Start Terminal | End Date | End Station | End Terminal | Bike # | Subscription Type | Zip Code |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4576 | 63 | 8/29/13 14:13 | South Van Ness at Market | 66 | 8/29/13 14:14 | South Van Ness at Market | 66 | 520 | Subscriber | 94127 |
| 1 | 4607 | 70 | 8/29/13 14:42 | San Jose City Hall | 10 | 8/29/13 14:43 | San Jose City Hall | 10 | 661 | Subscriber | 95138 |
| 2 | 4130 | 71 | 8/29/13 10:16 | Mountain View City Hall | 27 | 8/29/13 10:17 | Mountain View City Hall | 27 | 48 | Subscriber | 97214 |
| 3 | 4251 | 77 | 8/29/13 11:29 | San Jose City Hall | 10 | 8/29/13 11:30 | San Jose City Hall | 10 | 26 | Subscriber | 95060 |
| 4 | 4299 | 83 | 8/29/13 12:02 | South Van Ness at Market | 66 | 8/29/13 12:04 | Market at 10th | 67 | 319 | Subscriber | 94103 |
Now that we know which columns contain dates, we can perform our import. Pretty straightforward stuff - we use parse_dates to turn those columns into datetimes, then check the dtypes to make sure everything came through correctly.
df = pd.read_csv('data/bike_trip_data.csv', parse_dates=['Start Date', 'End Date'])
print df.dtypes
Trip ID int64
Duration int64
Start Date datetime64[ns]
Start Station object
Start Terminal int64
End Date datetime64[ns]
End Station object
End Terminal int64
Bike # int64
Subscription Type object
Zip Code object
dtype: object
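One thing worth sanity-checking before Problem 1 is the unit of the Duration column. The first row runs from 14:13 to 14:14 with a Duration of 63, which strongly suggests seconds. A quick check (just a sketch against the df we loaded above):
# Duration appears to be in seconds: compare it to the actual elapsed time.
print (df['End Date'] - df['Start Date']).head()
print (df.Duration / 60.0).head()  # Duration converted to minutes
The two roughly agree (the timestamps only have minute resolution), so dividing by 60 gives minutes.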
Problem 1: What was the average total time (in minutes) used by a bicycle in the data?
Well, we've got to sum up the total time used per bicycle (each one individually), take the mean of those per-bike totals, and divide by 60 to convert seconds to minutes.
df.groupby('Bike #').Duration.sum().mean()/60
4288.08717948718
So on average each bike was used for about 4,288 minutes, or roughly 70 hours. This was a strange question though, and I'm not sure what we're meant to get out of it.
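If the question instead means the average duration of a single trip, that's a one-liner (this is just my guess at the intended reading; again, seconds divided by 60 gives minutes):
# Alternative reading: average duration of an individual trip, in minutes.
df.Duration.mean() / 60
Either way, the groupby version above answers the "per bicycle" phrasing more literally.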
Problem 2: What was the most popular day by trip frequency in this dataset?
This question was a bit more interesting. While we could just take the value counts of the start dates, those include times as well, so we map each timestamp to a year-month-day string with a lambda. Using np.vectorize might speed this up, but execution wasn't slow enough to warrant further optimization.
df['Start Date'].map(lambda x: "%d-%d-%d" % (x.year, x.month, x.day)).value_counts()[:5]
2013-9-25 1264
2013-10-17 1227
2013-11-7 1197
2013-10-15 1173
2013-11-8 1172
dtype: int64
df['End Date'].map(lambda x: "%d-%d-%d" % (x.year, x.month, x.day)).value_counts()[:5]
2013-9-25 1266
2013-10-17 1225
2013-11-7 1196
2013-11-8 1173
2013-10-15 1172
dtype: int64
So we see that September 25th was the most popular day, followed by October 17th and then a couple of days in November.
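As an aside, newer versions of pandas (0.15 and up) have a .dt accessor on datetime columns that makes this a bit cleaner; a sketch, assuming the same df as above:
# Truncate each timestamp to its date, then count - no lambda needed.
df['Start Date'].dt.date.value_counts()[:5]
It produces the same tally without the string formatting.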
Problem 3 (harder): Assuming there are 30 bikes per station, find what date and time the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent.
I found this question to be a lot of fun because you really have to think about what you're trying to do. Basically, we're trying to understand how bikes flow out of and into each station. Additionally, different stations came online at different times, so you've got to find a scalable way to separate the stations and compute the flows for each.
start_stations = df.groupby('Start Station')
end_stations = df.groupby('End Station')
First we perform a quick groupby to split the data by station so that we can find the flows for each one: trips leaving and trips arriving. Once we have in- and out-flows for each station, we can merge them and get the answer - which station needs to be rebalanced first.
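As a quick peek at what the split gives us (a sketch using the groupbys above), each group holds every trip touching one station:
# Number of trips leaving each of the first few stations.
print start_stations.size()[:5]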
def get_flow_data(df_start, df_end):
    # Sort both frames by their datetime index so events line up chronologically.
    df_start.sort(inplace=True)
    df_end.sort(inplace=True)
    # One value per departure: bikes remaining after each trip leaves (29, 28, ...).
    start_counts = pd.Series(np.arange(29, 29 + (-1 * len(df_start.index)), -1),
                             index=df_start.index, name='start', dtype=np.int64)
    # One value per arrival: the cumulative count of bikes returned (1, 2, ...).
    end_counts = pd.Series(np.arange(len(df_end.index)) + 1,
                           index=df_end.index, name='end', dtype=np.int64)
    # An outer join lines departures and arrivals up on a single timeline.
    merged = start_counts.to_frame().join(end_counts.to_frame(), how='outer')
    # Carry the last known values forward (and back, before the first event).
    merged['start'] = merged.start.fillna(method='ffill').fillna(method='bfill').astype(np.int64)
    merged['end'] = merged.end.fillna(method='ffill').fillna(value=0).astype(np.int64)
    # Bikes currently at the station: 30 - departures + arrivals.
    merged['combo'] = merged.start + merged.end
    return merged
Let's walk through this line by line.
df_start.sort(inplace=True)
df_end.sort(inplace=True)
First we need to sort by the date index to make sure events are lined up chronologically.
start_counts = pd.Series(np.arange(29, 29 + (-1 * len(df_start.index)), -1), index=df_start.index, name='start', dtype=np.int64)
end_counts = pd.Series(np.arange(len(df_end.index)) + 1, index=df_end.index, name='end', dtype=np.int64)
Then I create a pandas Series for the start dates and another for the end dates.
The start_counts series starts at 29 and goes down by one for each trip leaving the station, so each value is the number of bikes remaining after that departure.
The end_counts series counts up by one for each trip ending at the station, giving the cumulative number of bikes returned.
We then merge them into one data frame.
merged = start_counts.to_frame().join(end_counts.to_frame(), how='outer')
We then fill in the gaps by forward filling (and back filling at the start) the bike counts, so that at every event timestamp we know both how many bikes have left the station and how many have arrived.
merged['start'] = merged.start.fillna(method='ffill').fillna(method='bfill').astype(np.int64)
merged['end'] = merged.end.fillna(method='ffill').fillna(value=0).astype(np.int64)
merged['combo'] = merged.start + merged.end
Finally we add the two columns together, which gives the number of bikes at the station at each point in time. When that count hits zero, the station needs to be rebalanced.
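Before turning this loose on the real data, here's a toy check of the counting scheme on a few made-up timestamps (not from the dataset): three departures and one arrival at a station that starts with 30 bikes.
# Hypothetical events: bikes leave at 8:00, 8:05, and 8:10; one returns at 8:07.
departures = pd.DataFrame(index=pd.to_datetime(
    ['2013-09-01 08:00', '2013-09-01 08:05', '2013-09-01 08:10']))
arrivals = pd.DataFrame(index=pd.to_datetime(['2013-09-01 08:07']))
print get_flow_data(departures, arrivals)
The combo column comes out 29, 28, 29, 28 - two bikes out, one back in, one out again - which is exactly the running count we want. Now we can run the function over every station.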
data = []
for name, group in start_stations:
    print name
    m = get_flow_data(group.set_index('Start Date'),
                      end_stations.get_group(name).set_index('End Date'))
    # Record the first timestamp at which the station runs dry.
    if (m.combo <= 0).any():
        d = m.combo[m.combo <= 0].index[0]
        data.append((d, name))
However, some stations never run out of bikes, so we have to test each one before recording a rebalancing time.
print sorted(data)[:5]
Here are the first 5 stations that have to be rebalanced, along with the point at which each first runs out.
[(Timestamp('2013-09-02 15:03:00'), 'Embarcadero at Bryant'),
(Timestamp('2013-09-04 07:56:00'), 'Market at Sansome'),
(Timestamp('2013-09-05 07:56:00'), '2nd at Folsom'),
(Timestamp('2013-09-05 18:56:00'),
'Temporary Transbay Terminal (Howard at Beale)'),
(Timestamp('2013-09-10 12:36:00'), 'San Francisco City Hall')]
Here's the final code for anyone who is interested. I hope I get to keep tackling these problems - they're meaty enough to sink your teeth into, but not so difficult that you can't finish them in a short amount of time.