In this project we'll build a machine learning model to predict the number of bicycles rented each day in Washington D.C. in 2011 and 2012. Many factors influence daily rental counts, and those factors often influence each other, so a linear regression model is unlikely to deliver good results in this scenario.
We'll start with a simple linear regression model, just to confirm the hunch that it won't perform well. Then we'll move on to a Random Forest Regressor and run a basic grid search to tune it. After that we'll test other models, including gradient boosting models and neural networks. Having tested quite a few models, we'll move on to combining them.
!pip install wikipedia
import wikipedia
import textwrap as tr
Successfully installed wikipedia-1.4.0
my_str = wikipedia.summary("Bicycle-sharing system")
for line in tr.wrap(my_str, width=110):
    print(line)
print('source: Wikipedia')
A bicycle-sharing system, bike share program, public bicycle scheme, or public bike share (PBS) scheme, is a shared transport service in which bicycles are made available for shared use to individuals on a short term basis for a price or free. Many bike share systems allow people to borrow a bike from a "dock" and return it at another dock belonging to the same system. Docks are special bike racks that lock the bike, and only release it by computer control. The user enters payment information, and the computer unlocks a bike. The user returns the bike by placing it in the dock, which locks it in place. Other systems are dockless. In recent years, an increasing number of cities across the world have started to offer both mechanical bike share and electric bicycle sharing systems, such as Dubai, New York, Paris, Montreal and Barcelona. For many systems, smartphone mapping apps show nearby available bikes and open docks. In July 2020, Google Maps began including bike shares in its route recommendations.
source: Wikipedia
import pandas as pd
import numpy as np
import random
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
# from sklearn.impute import SimpleImputer
# from sklearn.compose import ColumnTransformer
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
import missingno as msno
df = pd.read_csv('../input/bike-sharing-dataset/hour.csv')
df.head()
| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 |
2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32 |
3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |
df.isnull().sum().sum()
0
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
instant | 17379.0 | 8690.000000 | 5017.029500 | 1.00 | 4345.5000 | 8690.0000 | 13034.5000 | 17379.0000 |
season | 17379.0 | 2.501640 | 1.106918 | 1.00 | 2.0000 | 3.0000 | 3.0000 | 4.0000 |
yr | 17379.0 | 0.502561 | 0.500008 | 0.00 | 0.0000 | 1.0000 | 1.0000 | 1.0000 |
mnth | 17379.0 | 6.537775 | 3.438776 | 1.00 | 4.0000 | 7.0000 | 10.0000 | 12.0000 |
hr | 17379.0 | 11.546752 | 6.914405 | 0.00 | 6.0000 | 12.0000 | 18.0000 | 23.0000 |
holiday | 17379.0 | 0.028770 | 0.167165 | 0.00 | 0.0000 | 0.0000 | 0.0000 | 1.0000 |
weekday | 17379.0 | 3.003683 | 2.005771 | 0.00 | 1.0000 | 3.0000 | 5.0000 | 6.0000 |
workingday | 17379.0 | 0.682721 | 0.465431 | 0.00 | 0.0000 | 1.0000 | 1.0000 | 1.0000 |
weathersit | 17379.0 | 1.425283 | 0.639357 | 1.00 | 1.0000 | 1.0000 | 2.0000 | 4.0000 |
temp | 17379.0 | 0.496987 | 0.192556 | 0.02 | 0.3400 | 0.5000 | 0.6600 | 1.0000 |
atemp | 17379.0 | 0.475775 | 0.171850 | 0.00 | 0.3333 | 0.4848 | 0.6212 | 1.0000 |
hum | 17379.0 | 0.627229 | 0.192930 | 0.00 | 0.4800 | 0.6300 | 0.7800 | 1.0000 |
windspeed | 17379.0 | 0.190098 | 0.122340 | 0.00 | 0.1045 | 0.1940 | 0.2537 | 0.8507 |
casual | 17379.0 | 35.676218 | 49.305030 | 0.00 | 4.0000 | 17.0000 | 48.0000 | 367.0000 |
registered | 17379.0 | 153.786869 | 151.357286 | 0.00 | 34.0000 | 115.0000 | 220.0000 | 886.0000 |
cnt | 17379.0 | 189.463088 | 181.387599 | 1.00 | 40.0000 | 142.0000 | 281.0000 | 977.0000 |
# let's generate a correlation heatmap:
mask = np.triu(np.ones_like(df.corr(), dtype=bool))
fig, ax = plt.subplots(figsize=(16,16))
sns.heatmap(abs(df.corr()), square=True, cmap='BrBG', mask=mask)
plt.show()
# we'll use the usual spines function to reduce plot code:
def spines(ax, yl='Rental counts', xl='', title=''):
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.spines['left'].set_linewidth(2)
    ax.spines['bottom'].set_linewidth(2)
    ax.set_ylabel(yl)
    ax.set_xlabel(xl)
    ax.set_title(title)
fig, ax = plt.subplots(figsize=(16,16))
ax = plt.subplot(221)
plt.pie([df['registered'].sum(), df['casual'].sum()], labels=['registered','casual'])
ax = plt.subplot(222)
df.groupby('hr')['cnt'].sum().plot.bar()
spines(ax, title='Combined')
ax = plt.subplot(223)
df.groupby('hr')['registered'].sum().plot.bar()
spines(ax, title='Registered')
ax = plt.subplot(224)
df.groupby('hr')['casual'].sum().plot.bar()
spines(ax, title='Casual')
plt.show()
Observations:
fig, ax = plt.subplots(figsize=(16,8))
sns.stripplot(x = df.groupby('dteday')['cnt'].sum().index, y = df.groupby('dteday')['cnt'].sum(), alpha=0.3, color='black', size=6, jitter=0.4, zorder=2)
spines(ax, title='Counts per date', xl='Date')
X_TICKS = 50
plt.xticks(range(0, len(df['dteday'].sort_values().unique()), X_TICKS), df['dteday'].sort_values().unique()[::X_TICKS], rotation = 30)
plt.show()
fig, ax = plt.subplots(figsize=(16,8))
sns.stripplot(x = df['hr'], y = df['cnt'], alpha=0.3, color='black', size=6, jitter=0.4, zorder=2)
plt.plot(df['hr'].unique(),df.groupby('hr')['cnt'].mean(), color='red', linewidth=6, zorder=3, alpha=0.6)
plt.legend(['avg','counts'])
spines(ax, xl='Hour', title='Rental counts per hour')
plt.show()
fig, ax = plt.subplots(figsize=(16,8))
sns.stripplot(x = df['weekday'], y = df['cnt'], alpha=0.3, color='black', size=6, jitter=0.4, zorder=2)
plt.plot(df['weekday'].sort_values().unique(),df.groupby('weekday')['cnt'].mean(), color='red', linewidth=6, zorder=3, alpha=0.6)
plt.legend(['avg','counts'])
spines(ax, xl='day', title='Rental counts per day')
plt.show()
fig, ax = plt.subplots(figsize=(16,8))
sns.stripplot(x = df['mnth'], y = df['cnt'], alpha=0.3, color='black', size=6, jitter=0.4, zorder=2)
plt.plot(df['mnth'].unique()-1,df.groupby('mnth')['cnt'].mean(), color='red', linewidth=6, zorder=3, alpha=0.6)
# handles, labels = ax.get_legend_handles_labels()
plt.legend(['avg','counts'])
spines(ax, xl='mnth', title='Rental counts per month')
plt.show()
fig, ax = plt.subplots(figsize=(16,8))
sns.stripplot(x=df['hum'],y=df['cnt'], alpha=0.3, color='black', jitter=0.4)
plt.legend(['counts'])
spines(ax, xl='hum')
X_TICKS = 5
plt.xticks(range(0, len(df['hum'].sort_values().unique()), X_TICKS), df['hum'].sort_values().unique()[::X_TICKS], rotation = 0)
plt.show()
fig, ax = plt.subplots(figsize=(16,8))
sns.stripplot(x=df['temp'],y=df['cnt'], alpha=0.3, color='black', jitter=0.4)
plt.legend(['counts'])
spines(ax, xl='temp')
X_TICKS = 5
plt.xticks(range(0, len(df['temp'].sort_values().unique()), X_TICKS), df['temp'].sort_values().unique()[::X_TICKS], rotation = 0)
plt.show()
fig, ax = plt.subplots(figsize=(16,8))
sns.stripplot(x=df['windspeed'],y=df['cnt'], alpha=0.3, color='black', jitter=0.4)
plt.legend(['counts'])
spines(ax, xl='windspeed')
X_TICKS = 2
plt.xticks(range(0, len(df['windspeed'].sort_values().unique()), X_TICKS), df['windspeed'].sort_values().unique()[::X_TICKS], rotation = 0)
plt.show()
Temperature and windspeed show a clear correlation with rental counts; the relationship with humidity is less obvious.
We'll start with a basic linear regression (we don't expect good results from that model), then we'll move on to a Random Forest Regressor.
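The visual impression can be quantified by ranking features by their Pearson correlation with the target; on the real frame this would be `df.corr()['cnt'].drop('cnt').sort_values()`. A minimal self-contained sketch on toy data (column names only mirror the dataset, values are synthetic):

```python
import numpy as np
import pandas as pd

# Toy sketch of ranking features by correlation with the target:
# 'cnt' is built mostly from 'temp', so 'temp' should rank first.
rng = np.random.RandomState(1)
df_toy = pd.DataFrame({'temp': rng.rand(500), 'windspeed': rng.rand(500)})
df_toy['cnt'] = 5 * df_toy['temp'] + rng.normal(0, 0.5, 500)
corr = df_toy.corr()['cnt'].drop('cnt').sort_values(ascending=False)
print(corr)
```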
categorical_cols = ['season', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit', 'yr', 'hr']
numeric_cols = ['temp', 'hum', 'windspeed']
def select_split(df):
    X = df[categorical_cols + numeric_cols]
    y = df['cnt']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    return X_train, X_test, y_train, y_test
def fit_model(model):
    # note: relies on the global X_train, X_test, y_train, y_test
    model.fit(X_train, y_train)
    predictions_train = model.predict(X_train)
    predictions_test = model.predict(X_test)
    return r2_score(y_train, predictions_train), r2_score(y_test, predictions_test), predictions_test
lr = linear_model.LinearRegression()
X_train, X_test, y_train, y_test = select_split(df)
r2_score_train,r2_score_test, predictions1 = fit_model(lr)
print('train set r2 score: ',r2_score_train,' test set r2 score: ', r2_score_test)
train set r2 score: 0.3849208346739741 test set r2 score: 0.39852904131877387
def scatter(y_test, predictions):
    results = pd.DataFrame({'counts': y_test, 'predictions': predictions})
    results = results.sort_values('counts').reset_index(drop=True)
    # predictions vs. actual counts, sorted by count:
    fig, ax = plt.subplots(figsize=(16,8))
    plt.scatter(results.index, results['predictions'], alpha=0.3, c='black', label='predictions')
    plt.scatter(results.index, results['counts'], alpha=0.3, c='red', label='counts')
    plt.legend()
    spines(ax, xl='index')
    # residuals plot:
    fig2, ax2 = plt.subplots(figsize=(16,8))
    plt.scatter(results['counts'], results['predictions'] - results['counts'], alpha=0.35)
    plt.axhline(0, color='black')
    spines(ax2, xl='Rental counts', yl='Error')
scatter(y_test,predictions1)
plt.show()
From the start, the linear regression model fails to deliver predictions within a decent margin of error. On top of that, as the rental count increases, the model fails to keep up with the growth in rental counts.
Let's try a basic random forest model:
rfg = RandomForestRegressor(random_state=1)
X_train, X_test, y_train, y_test = select_split(df)
r2_score_train,r2_score_test, predictions2 = fit_model(rfg)
print('train set r2 score: ',r2_score_train,' test set r2 score: ', r2_score_test)
train set r2 score: 0.9920127960437292 test set r2 score: 0.9465913744945162
# plot feature importance for random forest model:
fig, ax = plt.subplots(figsize=(16,8))
importance = rfg.feature_importances_
plt.bar(X_train.columns, importance)
spines(ax, title='Feature importance for Random Forest model')
plt.show()
Clearly time of the day has the strongest influence on our model, followed by temperature and year.
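For a numeric view of the same ranking, the importances can be paired with the column names and sorted (on the real model that is simply `pd.Series(rfg.feature_importances_, index=X_train.columns).sort_values()`). A self-contained sketch on synthetic data, where the column names are only illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic sketch: the target depends mostly on 'hr', so it should
# dominate feature_importances_ (toy data, not the bike-sharing frame).
rng = np.random.RandomState(1)
X_toy = pd.DataFrame(rng.rand(500, 3), columns=['hr', 'temp', 'noise'])
y_toy = 10 * X_toy['hr'] + 2 * X_toy['temp'] + rng.rand(500)
model = RandomForestRegressor(random_state=1).fit(X_toy, y_toy)
ranking = pd.Series(model.feature_importances_, index=X_toy.columns).sort_values(ascending=False)
print(ranking)
```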
# we'll set up a very basic grid search for tuning the model:
rfg = RandomForestRegressor(random_state=1,n_estimators=100, max_features='auto', max_depth=None)
tuned_parameters = {
'n_estimators': [100, 150, 200],
'max_features': ['auto', 'sqrt', 'log2'],
}
clf = GridSearchCV(rfg, tuned_parameters, cv=5,
n_jobs=-1, verbose=1)
clf.fit(X_train, y_train)
clf.best_params_
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 45 out of 45 | elapsed: 1.1min finished
{'max_features': 'auto', 'n_estimators': 200}
rfg = RandomForestRegressor(random_state=1, n_estimators=200, max_features='auto')
r2_score_train,r2_score_test, predictions = fit_model(rfg)
# print('train set r2 score: ',r2_score_train,' test set r2 score: ', r2_score_test)
# 0.9467762482052581
print('train set r2 score: ',r2_score_train)
print('test set r2 score: ', r2_score_test)
print('test set rmse score: ', (mean_squared_error(predictions,y_test))**(1/2))
train set r2 score: 0.9922013663669105
test set r2 score: 0.9467820922480089
test set rmse score: 42.13610620954608
Temperature is a very important feature for our model, but we're missing precipitation data. We'll import precipitation and pressure data; hopefully adding this information will improve our results. Unfortunately the weather station for Washington doesn't record sunshine duration.
!pip install meteostat
import calendar
from datetime import datetime
from meteostat import Hourly
from datetime import datetime, timedelta
from meteostat import Stations
# Get nearby weather stations
stations = Stations()
# Coordinates of Washington D.C.:
stations = stations.nearby(38.9072, -77.0369)
station = stations.fetch(1)
# Print DataFrame
print(station)
Successfully installed meteostat-1.5.10
| id | name | country | region | wmo | icao | latitude | longitude | elevation | timezone | hourly_start | hourly_end | daily_start | daily_end | monthly_start | monthly_end | distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 72405 | Washington National Airport | US | DC | 72405 | KDCA | 38.85 | -77.0333 | 5.0 | America/New_York | 1936-09-01 | 2021-12-06 | 1936-09-01 | 2021-12-04 | 1936-01-01 | 2021-01-01 | 6367.979298 |
# merge date and hour columns and convert into a datetime object:
def timeit(x):
    return datetime.strptime(x, '%Y-%m-%d')

def hourit(x):
    return timedelta(hours=x)
df['dt'] = df['dteday'].apply(timeit) + df['hr'].apply(hourit)
start = df['dt'].min()
end = df['dt'].max()
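The same `dt` column can also be built with pandas' own vectorized converters instead of the two `apply` calls above; a minimal sketch on a toy frame with the same column names:

```python
import pandas as pd

# Toy frame mirroring the 'dteday'/'hr' columns; pd.to_datetime plus
# pd.to_timedelta builds the combined timestamp in one vectorized step.
toy = pd.DataFrame({'dteday': ['2011-01-01', '2011-01-01'], 'hr': [0, 5]})
toy['dt'] = pd.to_datetime(toy['dteday']) + pd.to_timedelta(toy['hr'], unit='h')
print(toy['dt'].tolist())
```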
# import weather data:
def get_weather(start, end):
    data = Hourly('72405', start, end)
    data = data.fetch()
    data['dt'] = data.index
    # select only the date, precipitation and pressure columns:
    data = data[['dt', 'prcp', 'pres']]
    return data
data = get_weather(start, end)
df_new = pd.merge(df,data,on='dt', how='left')
data.isnull().sum()
dt         0
prcp    1219
pres     218
dtype: int64
msno.matrix(data)
plt.show()
Unfortunately the data from meteostat has some missing values. We don't want to remove the rows with missing values; we're going to fill them instead. Rather than the basic approach of filling nulls with an average value, we'll use the 'pad' method, which copies the value from the previous row into the row with the missing value. In essence: if it was raining an hour ago, we assume it is still raining now.
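A minimal illustration of what 'pad' does to a series with gaps (`Series.ffill()` is the same forward-fill operation under its modern name):

```python
import numpy as np
import pandas as pd

# Each NaN takes the last observed value before it; a leading NaN would
# stay NaN, which is why the first row is handled separately below.
s = pd.Series([1.2, np.nan, np.nan, 0.0, np.nan])
print(s.ffill().tolist())  # → [1.2, 1.2, 1.2, 0.0, 0.0]
```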
# fill the missing values:
df_new['prcp'] = df_new['prcp'].fillna(method='pad')
df_new['pres'] = df_new['pres'].fillna(method='pad')
# first row fillna:
df_new['prcp'] = df_new['prcp'].fillna(df_new['prcp'].mean())
# plot pressure:
fig, ax = plt.subplots(figsize=(16,8))
ax = plt.subplot(121)
sns.stripplot(x = df_new['pres'], y = df_new['cnt'], alpha=0.3, color='black', size=6, jitter=0.4, zorder=2)
plt.legend(['counts'])
spines(ax, xl='pres', title='Pressure')
X_TICKS = 50
plt.xticks(range(0, len(df_new['pres'].sort_values().unique()), X_TICKS), df_new['pres'].sort_values().unique()[::X_TICKS], rotation = 0)
# plot precipitation:
ax2 = plt.subplot(122)
sns.stripplot(x = df_new['prcp'], y = df_new['cnt'], alpha=0.3, color='black', size=6, jitter=0.4, zorder=2)
spines(ax2, xl='prcp', title='Precipitation')
X_TICKS = 5
plt.xticks(range(0, len(df_new['prcp'].sort_values().unique()), X_TICKS), df_new['prcp'].sort_values().unique()[::X_TICKS], rotation = 0)
plt.legend(['counts'])
plt.show()
# prep dataframe for testing on multiple models:
# add more columns:
categorical_cols = ['season', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit', 'hr', 'yr']
numeric_cols = ['temp', 'hum', 'windspeed', 'prcp', 'pres']
X = df_new[categorical_cols + numeric_cols]
y = df_new['cnt']
# One Hot Encoding important categorical columns:
enc = OneHotEncoder(handle_unknown='ignore')
for col in ['workingday', 'weathersit', 'season']:
    enc_df = pd.DataFrame(enc.fit_transform(X[[col]]).toarray())
    enc_df = enc_df.add_prefix(col)
    X = X.join(enc_df)
    X = X.drop(columns=col)
# Scaling numerical values:
X[['windspeed']] = MinMaxScaler().fit_transform(X[['windspeed']])
X[['temp']] = MinMaxScaler().fit_transform(X[['temp']])
X[['prcp']] = MinMaxScaler().fit_transform(X[['prcp']])
X[['hum']] = MinMaxScaler().fit_transform(X[['hum']])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
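To see what the one-hot encoding loop above produces, here is the same pattern on a toy 'season' column (values invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Three distinct season values become three indicator columns,
# prefixed with the original column name via add_prefix, as above.
toy = pd.DataFrame({'season': [1, 2, 2, 4]})
enc = OneHotEncoder(handle_unknown='ignore')
encoded = pd.DataFrame(enc.fit_transform(toy[['season']]).toarray()).add_prefix('season')
print(encoded)
```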
# import more libraries:
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb
from sklearn.neural_network import MLPRegressor
lasso = Lasso(random_state=1)
ENet = ElasticNet(random_state=1)
KRR = KernelRidge()
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
max_depth=4, max_features='sqrt',
min_samples_leaf=15, min_samples_split=10,
loss='huber', random_state =1)
model_xgb = xgb.XGBRegressor(random_state =1)
model_lgb = lgb.LGBMRegressor()
mlp = MLPRegressor(max_iter=3000, random_state=1)
# try different models:
models = {"RandomForest": rfg,
"Lasso": lasso,
"ElasticNet": ENet,
"KernelRidge": KRR,
"GradientBoostingRegressor": GBoost,
"XGBRegressor": model_xgb,
"LGBMRegressor": model_lgb,
"MLPRegressor": mlp
}
list_predictions = []
r2_train_list = []
r2_test_list = []
for model in models:
    r2_score_train, r2_score_test, predictions = fit_model(models[model])
    list_predictions.append([model, predictions])
    r2_train_list.append(r2_score_train)
    r2_test_list.append(r2_score_test)
results_df = pd.DataFrame({ 'train_r2':r2_train_list, 'test_r2':r2_test_list}, index=models.keys(),)
results_df
| | train_r2 | test_r2 |
|---|---|---|
RandomForest | 0.992620 | 0.946928 |
Lasso | 0.396623 | 0.408620 |
ElasticNet | 0.252784 | 0.264971 |
KernelRidge | 0.399994 | 0.412594 |
GradientBoostingRegressor | 0.957949 | 0.942770 |
XGBRegressor | 0.978292 | 0.952426 |
LGBMRegressor | 0.961015 | 0.950310 |
MLPRegressor | 0.619765 | 0.633098 |
n = 0
fig = plt.figure(figsize=(16,26))
for predictions in list_predictions:
    results = pd.DataFrame({'counts': y_test, 'predictions': predictions[1]})
    results = results.sort_values('counts').reset_index(drop=True)
    ax = fig.add_subplot(4, 2, n + 1)
    n += 1
    plt.scatter(results.index, results['predictions'], alpha=0.3, c='black', label='predictions')
    plt.scatter(results.index, results['counts'], alpha=0.3, c='red', label='counts')
    plt.legend()
    spines(ax, xl='index', title=predictions[0])
plt.show()
So far we've relied on a single model to deliver predictions. It's time to try a different approach: combining multiple models can improve our results.
We'll use four of the best models we've come across so far: GradientBoostingRegressor, XGBRegressor, LGBMRegressor and RandomForestRegressor. Instead of relying on one model to predict the values, we'll run all four and use the average of their predictions. The functions below are adapted from a Kaggle notebook on the subject.
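The averaging idea in miniature: with toy numbers, three prediction vectors that over- and under-shoot in different places average out to a lower RMSE than any single one of them.

```python
import numpy as np

# Toy illustration of model averaging: the errors of the three "models"
# partly cancel, so the mean prediction is closest to the truth.
truth = np.array([100.0, 200.0, 300.0])
preds = np.array([[110.0, 190.0, 310.0],   # model 1
                  [ 95.0, 205.0, 295.0],   # model 2
                  [105.0, 195.0, 305.0]])  # model 3
rmse = lambda p: float(np.sqrt(((p - truth) ** 2).mean()))
avg = preds.mean(axis=0)
print([round(rmse(p), 2) for p in preds], round(rmse(avg), 2))
```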
# https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard#Modelling
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models

    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        # train cloned base models
        for model in self.models_:
            model.fit(X, y)
        return self

    # now we do the predictions for the cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)
averaged_models = AveragingModels(models = (rfg, GBoost, model_xgb, model_lgb))
r2_score_train,r2_score_test, predictions = fit_model(averaged_models)
print('train set r2 score: ',r2_score_train)
print('test set r2 score: ', r2_score_test)
print('test set rmse score: ', (mean_squared_error(predictions,y_test))**(1/2))
train set r2 score: 0.9789654170709211
test set r2 score: 0.9566152559293124
test set rmse score: 38.04468755646464
# https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard#Modelling
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    # we again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        # train cloned base models, then create the out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X.iloc[train_index], y.iloc[train_index])
                y_pred = instance.predict(X.iloc[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
        # now train the cloned meta-model using the out-of-fold predictions as new features
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    # do the predictions of all base models on the test data and use the averaged
    # predictions as meta-features for the final prediction by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_])
        return self.meta_model_.predict(meta_features)
stacked_averaged_models = StackingAveragedModels(base_models = (rfg, GBoost, model_xgb, model_lgb),
meta_model = mlp)
r2_score_train,r2_score_test, predictions = fit_model(stacked_averaged_models)
print('train set r2 score: ',r2_score_train)
print('test set r2 score: ', r2_score_test)
print('test set rmse score: ', (mean_squared_error(predictions,y_test))**(1/2))
train set r2 score: 0.9798188561949155
test set r2 score: 0.9563798161210616
test set rmse score: 38.14777810989409
We'll use a simple loop to find the best hidden_layer_sizes parameter.
layers_list = [(125,111,17,5,4), (75,20,10,5,2), (475,220,110,55,5), (220,220,110,55,5), (300,200,100)]
predictions_list = []
r2_list = []
for layer_setup in layers_list:
    mlp = MLPRegressor(max_iter=3000, hidden_layer_sizes=layer_setup, random_state=1)
    stacked_averaged_models = StackingAveragedModels(base_models=(rfg, GBoost, model_xgb, model_lgb), meta_model=mlp)
    r2_score_train, r2_score_test, predictions = fit_model(stacked_averaged_models)
    predictions_list.append(predictions)
    r2_list.append(r2_score_test)
    print(layer_setup, ' train set r2 score: ', r2_score_train, ' test set r2 score: ', r2_score_test)
(125, 111, 17, 5, 4)  train set r2 score:  0.9795518671546777  test set r2 score:  0.9576782413922301
(75, 20, 10, 5, 2)  train set r2 score:  0.9803801615320383  test set r2 score:  0.9577976335408065
(475, 220, 110, 55, 5)  train set r2 score:  0.9803606913767285  test set r2 score:  0.9582019271312862
(220, 220, 110, 55, 5)  train set r2 score:  0.9804204405014428  test set r2 score:  0.9575601132653637
(300, 200, 100)  train set r2 score:  0.9809245286989323  test set r2 score:  0.9576903371921955
max_value = max(r2_list)
max_index = r2_list.index(max_value)
scatter(y_test,predictions_list[max_index])
plt.show()
mse = mean_squared_error(predictions_list[max_index],y_test)
rmse = mse**(1/2)
rmse
37.34252077152997