Data Challenge - Rebalancing Bike Terminals in SF

08 January 2015

Leada has recently set out to email out new datasets every week with a couple of interesting questions. I thought that this week's challenge posed some interesting questions that provide great examples of ways to use Python's pandas library.

Here's their email.

You just made your 2015 New Year's resolution to improve your data analysis
skills and become data literate!

Every two weeks, we will e-mail you a dataset and some problems to work on. You
can use any tool you'd like, although we suggest using R or Python. R is easy to
install and if you are new to programming, the Introduction to R lesson is free
in our courses!

Here's the first dataset and problem! Answers will be released at a later date.
Tweet your solution @LeadaHQ!

====================

Dataset Description: A CSV of bicycle rental transactions for a rental company
in California.

Dataset Download

Problem 1: What was the average total time (in minutes) used by a bicycle in the
data?

Problem 2: What was the most popular day by trip frequency in this dataset?

Problem 3 (harder): Assuming there are 30 bikes per station, find what date and
time the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a
terminal for a customer to rent.

Problem 3 (easier): Assuming there are 30 bikes per station, find what date the
bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a
customer to rent. Do this ignoring "Start.Date" and "End.Date" columns.

====================

So let's get started on answering those questions. Here are the first couple of lines of code, standard imports and getting a sense of our columns.

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.read_csv('data/bike_trip_data.csv').head()
Trip ID Duration Start Date Start Station Start Terminal End Date End Station End Terminal Bike # Subscription Type Zip Code
0 4576 63 8/29/13 14:13 South Van Ness at Market 66 8/29/13 14:14 South Van Ness at Market 66 520 Subscriber 94127
1 4607 70 8/29/13 14:42 San Jose City Hall 10 8/29/13 14:43 San Jose City Hall 10 661 Subscriber 95138
2 4130 71 8/29/13 10:16 Mountain View City Hall 27 8/29/13 10:17 Mountain View City Hall 27 48 Subscriber 97214
3 4251 77 8/29/13 11:29 San Jose City Hall 10 8/29/13 11:30 San Jose City Hall 10 26 Subscriber 95060
4 4299 83 8/29/13 12:02 South Van Ness at Market 66 8/29/13 12:04 Market at 10th 67 319 Subscriber 94103

Now that we know what columns are dates we can perform our import. Pretty straightforward stuff - we use parse_dates in order to turn our dates in date times. Then we can get our data types to make sure that everything is correct.

df = pd.read_csv('data/bike_trip_data.csv',parse_dates=['Start Date', 'End Date'])

print df.dtypes
Trip ID                       int64
Duration                      int64
Start Date           datetime64[ns]
Start Station                object
Start Terminal                int64
End Date             datetime64[ns]
End Station                  object
End Terminal                  int64
Bike #                        int64
Subscription Type            object
Zip Code                     object
dtype: object

Problem 1: What was the average total time (in minutes) used by a bicycle in the data?

Well we've got to get the sum of all the times used by bicycles (each one individually), then get the mean of that and divide it by 60 to get minutes.

df.groupby('Bike #').Duration.sum().mean()/60
4288.08717948718

So on average they are used about 70 hours. This was a strange question though and I'm not sure what we'are trying to get out of it.

Problem 2: What was the most popular day by trip frequency in this dataset?

This question was a bit more interesting. While we could just try and get the value counts of the start dates, those include times as well. So we've got to map them to lambda values. Using np.vectorize would make this more efficient however the execution time was not slow enough to warrant further optimization.

df['Start Date'].map(lambda x: "%d-%d-%d" % (x.year, x.month, x.day)).value_counts()[:5]
2013-9-25     1264
2013-10-17    1227
2013-11-7     1197
2013-10-15    1173
2013-11-8     1172
dtype: int64
df['End Date'].map(lambda x: "%d-%d-%d" % (x.year, x.month, x.day)).value_counts()[:5]
2013-9-25     1266
2013-10-17    1225
2013-11-7     1196
2013-11-8     1173
2013-10-15    1172
dtype: int64

So we see that 9-25 was the most popular day followed by the 17th of October then a couple of days in December.

Problem 3 (harder): Assuming there are 30 bikes per station, find what date and time the bikes FIRST need to be rebalanced. As in, there are 0 bikes at a terminal for a customer to rent.

I found this question to be a lot of fun because you really have to think about what you're trying to do. Basically we're trying to understand how bikes are flowing out of stations and how they're flowing into stations. Additionally different stations are made available at different times so you've got to find a scalable way to separate the stations and get the flows for each.

start_stations = df.groupby('Start Station')
end_stations = df.groupby('End Station')

First we perform a quick groupby to line up the stations and make it so that we can find the flows for each station. Trips leaving and trips arriving. Once we have in and out flows for each station, we can merge them and get the answer - which station needs to be balanced first.

def get_flow_data(df_start, df_end):
    df_start.sort(inplace=True)
    df_end.sort(inplace=True)
    start_counts = pd.Series(np.arange(30, 30 + (-1 * len(df_start.index)),-1), index=df_start.index,name='start', dtype=np.int64)
    end_counts = pd.Series(np.arange(len(df_end.index))+1, index=df_end.index, name='end', dtype=np.int64)
    merged = start_counts.to_frame().join(end_counts.to_frame(), how='outer')
    merged['start'] = merged.start.fillna(method='ffill').fillna(method='bfill').astype(np.int64)
    merged['end'] = merged.end.fillna(method='ffill').fillna(value=0).astype(np.int64)
    merged['combo'] = merged.start + merged.end
    return merged

Let's walk through this line by line.

df_start.sort(inplace=True)
df_end.sort(inplace=True)

First we need to sort the dates to make sure that our index (of dates) is lined up correctly.

start_counts = pd.Series(np.arange(29, 29 + (-1 * len(df_start.index)),-1), index=df_start.index,name='start', dtype=np.int64)
end_counts = pd.Series(np.arange(len(df_end.index))+1, index=df_end.index, name='end', dtype=np.int64)

Then I created a new pandas Series for start and end dates.

The start_counts series starts at 29 and goes down by one until it reaches the length of the entire index. This is basically how many bike trips originate at that station.

The end_counts series goes the entire length of the count of bike trips ending at that station.

We then merge them into one data frame.

merged = start_counts.to_frame().join(end_counts.to_frame(), how='outer')

We then fill in some values by using interpolation to back fill and forward fill the bike counts so that basically we get the flow of bikes coming into the station and those leaving the station.

merged['start'] = merged.start.fillna(method='ffill').fillna(method='bfill').astype(np.int64)
merged['end'] = merged.end.fillna(method='ffill').fillna(value=0).astype(np.int64)
merged['combo'] = merged.start + merged.end

Finally we merge those together and we have the number of bikes at the station. When that crosses the zero mark is when the bike station needs to be rebalanced.

data = []
for name, group in start_stations:
    print name
    m = get_flow_data(group.set_index('Start Date'), end_stations.get_group(name).set_index('End Date'))
    if m.combo[m.combo <= 0].any():
        d = m.combo[m.combo <= 0].index[0]
        data.append((d, name))

However some stations did not have to be rebalanced so we had to test whether or not it did.

print sorted(data)[:5]

Here are the first 5 stations that have to be rebalanced as well as at which point they have to be.

[(Timestamp('2013-09-02 15:03:00'), 'Embarcadero at Bryant'),
 (Timestamp('2013-09-04 07:56:00'), 'Market at Sansome'),
 (Timestamp('2013-09-05 07:56:00'), '2nd at Folsom'),
 (Timestamp('2013-09-05 18:56:00'),
  'Temporary Transbay Terminal (Howard at Beale)'),
 (Timestamp('2013-09-10 12:36:00'), 'San Francisco City Hall')]

Here's the final code for anyone that is interested. I hope that I get to continue tackling these problems - they're enough to sink your teeth into but certainly aren't too difficult to not be able to do in a short amount of time.