Predicting Bike Rentals in Washington D.C.


Introduction

On-demand bike rentals have become a major trend in cities across the U.S.: within a mile or two of wherever a resident might be, and especially near city centers, they will likely find a self-service bike rental kiosk. A company or a city must weigh several considerations to ensure that bicycle demand is met where and when bikes are needed most. In this project we will attempt to predict this demand using linear regression and decision trees.

Washington D.C. Bike Rental Dataset Overview

Our dataset focuses on Washington D.C., which collected detailed data on bike rentals from 2011 to 2012. The dataset was compiled in CSV format by Hadi Fanaee-T and can be downloaded from the University of California, Irvine's Machine Learning Repository website.

In [1]:
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
# read in the data
rentals = pd.read_csv("bike_rental_hour.csv")
In [3]:
rentals.head(2)
Out[3]:
instant dteday season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
0 1 2011-01-01 1 0 1 0 0 6 0 1 0.24 0.2879 0.81 0.0 3 13 16
1 2 2011-01-01 1 0 1 1 0 6 0 1 0.22 0.2727 0.80 0.0 8 32 40
In [4]:
rentals.tail(2)
Out[4]:
instant dteday season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
17377 17378 2012-12-31 1 1 12 22 0 1 1 1 0.26 0.2727 0.56 0.1343 13 48 61
17378 17379 2012-12-31 1 1 12 23 0 1 1 1 0.26 0.2727 0.65 0.1343 12 37 49
In [5]:
rentals.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     17379 non-null  int64  
 1   dteday      17379 non-null  object 
 2   season      17379 non-null  int64  
 3   yr          17379 non-null  int64  
 4   mnth        17379 non-null  int64  
 5   hr          17379 non-null  int64  
 6   holiday     17379 non-null  int64  
 7   weekday     17379 non-null  int64  
 8   workingday  17379 non-null  int64  
 9   weathersit  17379 non-null  int64  
 10  temp        17379 non-null  float64
 11  atemp       17379 non-null  float64
 12  hum         17379 non-null  float64
 13  windspeed   17379 non-null  float64
 14  casual      17379 non-null  int64  
 15  registered  17379 non-null  int64  
 16  cnt         17379 non-null  int64  
dtypes: float64(4), int64(12), object(1)
memory usage: 2.3+ MB

Our data consists of 17 columns and 17,379 rows. There are 16 numeric columns and 1 string (object) column, with no apparent missing values. Each row in the dataset is a one-hour snapshot, and the data spans two years, starting 1/1/2011 and ending 12/31/2012.

The columns follow the following descriptions:

  • instant - A unique sequential ID number for each row
  • dteday - The date of the rentals
  • season - The season in which the rentals occurred
  • yr - The year the rentals occurred
  • mnth - The month the rentals occurred
  • hr - The hour the rentals occurred
  • holiday - Whether or not the day was a holiday
  • weekday - The day of the week (as a number, 0 to 6)
  • workingday - Whether or not the day was a working day
  • weathersit - The weather (as a categorical variable)
    1: clear or few clouds
    2: mist or cloudy
    3: light rain, light snow, thunderstorm
    4: heavy rain, snow, ice pellets, fog
  • temp - The temperature, on a 0-1 scale
  • atemp - The adjusted (feels-like) temperature, on a 0-1 scale
  • hum - The humidity, on a 0-1 scale
  • windspeed - The wind speed, on a 0-1 scale
  • casual - The number of casual riders (people who hadn't previously signed up with the bike sharing program)
  • registered - The number of registered riders (people who had already signed up)
  • cnt - The total number of bike rentals (casual + registered)

The primary target for our predictive models will be the cnt column, the total number of bike rentals in a given hour. The casual and registered columns will also be considered as secondary targets.

Transforming Time Metrics

Let's do some clean-up on dates so that we can more easily slice and run time-series analysis on our data. The related columns that we will evaluate are dteday, yr, mnth, and hr. Rather than converting each of these to datetime, we can build a single datetime column from dteday and hr so that we can easily access the underlying time data with datetime logic. Let's combine the dteday and hr columns now.

In [6]:
# combine dteday and hr into a single string in a format datetime understands
rentals['datetime'] = rentals['dteday'] + ' ' + rentals['hr'].astype(str) + ':00:00'

# transform the new column to datetime
rentals['datetime'] = pd.to_datetime(rentals['datetime'])
In [7]:
rentals['datetime']
Out[7]:
0       2011-01-01 00:00:00
1       2011-01-01 01:00:00
2       2011-01-01 02:00:00
3       2011-01-01 03:00:00
4       2011-01-01 04:00:00
                ...        
17374   2012-12-31 19:00:00
17375   2012-12-31 20:00:00
17376   2012-12-31 21:00:00
17377   2012-12-31 22:00:00
17378   2012-12-31 23:00:00
Name: datetime, Length: 17379, dtype: datetime64[ns]

Let's check the completeness of our dataset. There should be two years of hour by hour data.

In [8]:
two_yr_hr = 2*365*24
length_rentals = len(rentals)
print('dataset:', length_rentals)
print('expected:', two_yr_hr)
print('difference:', two_yr_hr - length_rentals)
print('missing days:', (two_yr_hr-length_rentals)/24)
dataset: 17379
expected: 17520
difference: 141
missing days: 5.875

We're missing just under six days of data. (Note that 2012 was a leap year, so the true expected count is 17,544 hours and we are actually missing closer to seven days.) We will accept this discrepancy, as it is not pervasive enough to have a strong adverse effect on predictions.
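To see exactly which hours are absent (a quick sketch, not part of the original analysis), we could compare the datetime column against a complete hourly range:

# build the full expected hourly index (including the 2012 leap day) and diff it against our data
expected_hours = pd.date_range('2011-01-01 00:00', '2012-12-31 23:00', freq = 'H')
missing_hours = expected_hours.difference(rentals['datetime'])
print(len(missing_hours), 'missing hours')
missing_hours.to_series().dt.date.value_counts().head()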

Exploration

Let's complete some basic analysis to understand our time series; we're specifically interested in our primary target, cnt.

In [9]:
plt.figure(figsize = (12, 4))
plt.plot(rentals['datetime'], rentals['cnt'])
plt.title('Count of Rentals over Time')
plt.show()

A first impression of the distribution is that there are more rentals on average in 2012 than in 2011. Some research shows that Washington expanded its bikeshare program at the end of 2011, which makes sense logically: with more locations and bikes in service, there are more potential users for the program. This may introduce an undesirable influence on our predictions; however, we may be able to overcome it by min-max scaling the count values within their respective years.
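If we go that route, a minimal sketch of per-year min-max scaling might look like the following (illustrative only; we have not committed to this transformation yet):

# scale cnt to a 0-1 range separately within 2011 (yr = 0) and 2012 (yr = 1)
def min_max_scale(series):
    return (series - series.min()) / (series.max() - series.min())

rentals['cnt_scaled'] = rentals.groupby('yr')['cnt'].transform(min_max_scale)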

Further, there are indications that seasonality plays a role, but at the hour-day-year granularity it is difficult to see what exactly is going on. Let's generate some bar charts bucketed by month and by day of week.

In [10]:
plt.bar(x = rentals['mnth'], height = rentals['cnt'])
plt.title('Bikeshare Usage by Month')
plt.show()
In [11]:
plt.bar(x = rentals['weekday'], height = rentals['cnt'])
plt.title('Bikeshare Usage by Day of Week')
plt.show()

Our distributions indicate that there is a seasonal effect on bikeshare usage. The spring, summer, and fall months have relatively stable usage, but the winter months, November through February, show a drastic dropoff.

The day of week also has an impact on usage: weekdays have high and stable usage, while the weekend days, Saturday and Sunday, see less use.

Let's recreate the above bar charts, but further split by registered and casual. This will indicate whether casual or registered users are more sensitive to seasonality.

In [24]:
fig, ax = plt.subplots()
ax.bar(rentals['mnth'], rentals['cnt'], label = 'total')
ax.bar(rentals['mnth'], rentals['registered'], label = 'registered')
ax.bar(rentals['mnth'], rentals['casual'], label = 'casual')
plt.ylim(0,1000)
plt.title('Bikeshare Usage by Month and Usergroup')
plt.legend(loc = 'best')
plt.show()
In [12]:
plt.bar(x = rentals['mnth'], height = rentals['cnt'], label = 'total')
plt.bar(x = rentals['mnth'], height = rentals['registered'], label = 'registered')
plt.bar(x = rentals['mnth'], height = rentals['casual'],label = 'casual')
plt.title('Bikeshare Usage by Month and Usergroup')
plt.legend(loc = 'best')
plt.show()
In [13]:
plt.bar(x = rentals['weekday'], height = rentals['cnt'], label = 'total')
plt.bar(x = rentals['weekday'], height = rentals['registered'], label = 'registered')
plt.bar(x = rentals['weekday'], height = rentals['casual'],label = 'casual')
plt.title('Bikeshare Usage by Day of Week and Usergroup')
plt.legend()
plt.show()
In [14]:
plt.bar(x = rentals['hr'], height = rentals['cnt'], label = 'total')
plt.bar(x = rentals['hr'], height = rentals['registered'], label = 'registered')
plt.bar(x = rentals['hr'], height = rentals['casual'],label = 'casual')
plt.title('Bikeshare Usage by Time of Day and Usergroup')
plt.legend()
plt.show()

The behavior of the bike rental user groups differs when cut along time of day and seasonality. Because of this, we may be able to get more accurate results if we predict the usage of these cohorts separately rather than aiming only at the total usage.

The cnt column is supposed to be the sum of registered and casual users, but the bar charts suggest there may be an underlying issue with the cnt column, in that registered plus casual appears to fall short of the total. Let's validate that the cnt column is the sum of these two values.

Validating cnt

Our validation process will be to add the registered and casual columns into a new column and subtract cnt from it. If we get any value other than zero, we know there is a discrepancy.

In [15]:
rentals['cnt_validation'] = (rentals['registered'] + rentals['casual']) - rentals['cnt']
rentals['cnt_validation'].value_counts()
Out[15]:
0    17379
Name: cnt_validation, dtype: int64
In [16]:
rentals
Out[16]:
instant dteday season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt datetime cnt_validation
0 1 2011-01-01 1 0 1 0 0 6 0 1 0.24 0.2879 0.81 0.0000 3 13 16 2011-01-01 00:00:00 0
1 2 2011-01-01 1 0 1 1 0 6 0 1 0.22 0.2727 0.80 0.0000 8 32 40 2011-01-01 01:00:00 0
2 3 2011-01-01 1 0 1 2 0 6 0 1 0.22 0.2727 0.80 0.0000 5 27 32 2011-01-01 02:00:00 0
3 4 2011-01-01 1 0 1 3 0 6 0 1 0.24 0.2879 0.75 0.0000 3 10 13 2011-01-01 03:00:00 0
4 5 2011-01-01 1 0 1 4 0 6 0 1 0.24 0.2879 0.75 0.0000 0 1 1 2011-01-01 04:00:00 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
17374 17375 2012-12-31 1 1 12 19 0 1 1 2 0.26 0.2576 0.60 0.1642 11 108 119 2012-12-31 19:00:00 0
17375 17376 2012-12-31 1 1 12 20 0 1 1 2 0.26 0.2576 0.60 0.1642 8 81 89 2012-12-31 20:00:00 0
17376 17377 2012-12-31 1 1 12 21 0 1 1 1 0.26 0.2576 0.60 0.1642 7 83 90 2012-12-31 21:00:00 0
17377 17378 2012-12-31 1 1 12 22 0 1 1 1 0.26 0.2727 0.56 0.1343 13 48 61 2012-12-31 22:00:00 0
17378 17379 2012-12-31 1 1 12 23 0 1 1 1 0.26 0.2727 0.65 0.1343 12 37 49 2012-12-31 23:00:00 0

17379 rows × 19 columns

We have confirmed that cnt is exactly the sum of registered and casual in every row; the apparent bar chart discrepancy is an illusion caused by how the overlapping bars are drawn.

Feature Selection

We will move forward with the plan to generate separate predictions for cnt, registered and casual. We will need to repeat the feature selection process for each category.

Feature Creation

  • sunrise/sunset (is the sun out)

Adding Sunrise/Sunset Data

Sunrise and sunset occur at different hours of the day depending on the season. This causes inconsistencies when considering the hr column alone: the same hour can be daylight in summer and dark in winter, so a flag indicating whether the sun is out should help.

Sunrise/sunset data source: aa.usno.navy.mil/data/docs/RS_OneYear.php

In [ ]:
# latitude and longitude of Washington D.C.
washington_dc_latlong = [38.895, -77.0366]
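The plan above is to pull sunrise and sunset times from the Naval Observatory page. As an alternative sketch (assuming the third-party astral and pytz packages, which are not part of the original notebook), the flag could be derived locally from the coordinates:

import pytz
from astral import LocationInfo
from astral.sun import sun

# build an astral location from the coordinates above (sketch; assumes `pip install astral pytz`)
tz = pytz.timezone('US/Eastern')
dc = LocationInfo('Washington D.C.', 'USA', 'US/Eastern',
                  washington_dc_latlong[0], washington_dc_latlong[1])

def sun_is_out(ts):
    # sunrise and sunset for the row's date, stripped to naive local time for comparison
    times = sun(dc.observer, date = ts.date(), tzinfo = tz)
    sunrise = times['sunrise'].replace(tzinfo = None)
    sunset = times['sunset'].replace(tzinfo = None)
    return int(sunrise <= ts <= sunset)

rentals['sun_out'] = rentals['datetime'].apply(sun_is_out)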

Target Correlations

One basic way to identify which columns will be good feature candidates for our machine learning algorithms is to look at their correlations with the target metric. Too many features can result in an overfit model, so fewer, well-correlated features are usually desirable.

In [ ]:
cnt_feature_correlations = abs(rentals.corr()['cnt'].drop(['cnt', 'instant', 'registered', 'casual', 'cnt_validation'])).sort_values(ascending = False)
cnt_feature_correlations
In [ ]:
cas_feature_correlations = abs(rentals.corr()['casual'].drop(['cnt', 'instant', 'registered', 'casual', 'cnt_validation'])).sort_values(ascending = False)
cas_feature_correlations
In [ ]:
reg_feature_correlations = abs(rentals.corr()['registered'].drop(['cnt', 'instant', 'registered', 'casual', 'cnt_validation'])).sort_values(ascending = False)
reg_feature_correlations

We removed columns from consideration that aren't viable features: instant is a sequential code similar to an index, and cnt, registered, and casual are the prediction targets (we also drop the cnt_validation helper column created above).

Our hypothesis that the behavior of the registered and casual sub-groups differs is somewhat confirmed. As one might guess, registered users are less affected by enjoyment-driven factors such as weather and are most correlated with hr. Thinking about what it means to be a registered user, these people are likely habitual riders who may rely on this form of transportation for things like commuting to work. Conversely, a casual user might shy away from riding a bicycle for enjoyment if the weather is too hot or too cold, and may habitually use other forms of transportation to get to work during the week.
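A quick way to eyeball that commuting story (an illustrative sketch, not part of the original analysis) is to plot the average hourly rentals for each user group on working and non-working days:

# average rentals per hour for each user group, split by working vs. non-working days
hourly = rentals.groupby(['workingday', 'hr'])[['registered', 'casual']].mean()

fig, axes = plt.subplots(1, 2, figsize = (12, 4), sharey = True)
for ax, working in zip(axes, [1, 0]):
    profile = hourly.loc[working]
    ax.plot(profile.index, profile['registered'], label = 'registered')
    ax.plot(profile.index, profile['casual'], label = 'casual')
    ax.set_title('Working day' if working else 'Non-working day')
    ax.set_xlabel('hr')
axes[0].legend()
plt.show()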

There are no hard rules for determining the cut-off for correlations, but a rule of thumb is that a correlation should be at least 0.3 to be considered. Thinking about the values above for a moment, it makes sense that temperature and humidity are more important than the month or season: just because it's a winter month doesn't mean a person won't take advantage of a nice day. Therefore we can apply the 0.3 cutoff with reasonable confidence that we are not discarding important information.

Let's create a new dataframe that contains the features whose correlations are above 0.3.
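As a sketch of that filtering step (using the cnt_feature_correlations Series computed above; the exact columns that survive depend on the values it returns), we could do something like:

# keep only features whose absolute correlation with cnt exceeds 0.3
cnt_features = cnt_feature_correlations[cnt_feature_correlations > 0.3].index.tolist()
rentals_cnt = rentals[cnt_features + ['cnt']]
rentals_cnt.head()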

Collinearity

Collinearity occurs when feature candidates are linearly correlated with one another, essentially creating a redundancy in the features that can hurt the accuracy of the model.

In [ ]:
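One way to inspect for collinearity (a sketch, not the notebook's final code; the candidate column list below is illustrative) is a heatmap of the pairwise correlations among the feature candidates, using the seaborn library imported earlier:

# pairwise correlations among an illustrative set of candidate features
candidate_cols = ['temp', 'atemp', 'hum', 'windspeed', 'hr', 'season', 'mnth', 'weathersit']
plt.figure(figsize = (8, 6))
sns.heatmap(rentals[candidate_cols].corr(), annot = True, cmap = 'coolwarm', center = 0)
plt.title('Pairwise Correlation of Feature Candidates')
plt.show()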