In this project we'll build a machine learning model to predict the number of bicycles rented each day in Washington D.C. in 2011 and 2012. Many factors influence daily rental counts, and those factors often influence each other, so a linear regression model is unlikely to deliver good results in this scenario.
We'll start with a simple linear regression model, just to confirm the hunch that it won't perform well. Then we'll move on to a Random Forest Regressor and run a basic grid search to tune it. After that we'll test other models, including gradient boosting models and neural networks. Having tested quite a few models, we'll move on to combining them.
!pip install wikipedia
import wikipedia
import textwrap as tr
Successfully installed wikipedia-1.4.0
my_str = wikipedia.summary("Bicycle-sharing system")
for line in tr.wrap(my_str, width=110):
    print(line)
print('source: Wikipedia')
A bicycle-sharing system, bike share program, public bicycle scheme, or public bike share (PBS) scheme, is a shared transport service in which bicycles are made available for shared use to individuals on a short term basis for a price or free. Many bike share systems allow people to borrow a bike from a "dock" and return it at another dock belonging to the same system. Docks are special bike racks that lock the bike, and only release it by computer control. The user enters payment information, and the computer unlocks a bike. The user returns the bike by placing it in the dock, which locks it in place. Other systems are dockless. In recent years, an increasing number of cities across the world have started to offer both mechanical bike share and electric bicycle sharing systems, such as Dubai, New York, Paris, Montreal and Barcelona. For many systems, smartphone mapping apps show nearby available bikes and open docks. In July 2020, Google Maps began including bike shares in its route recommendations.
source: Wikipedia
import pandas as pd
import numpy as np
import random
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
# from sklearn.impute import SimpleImputer
# from sklearn.compose import ColumnTransformer
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
import missingno as msno
df = pd.read_csv('../input/bike-sharing-dataset/hour.csv')
df.head()
| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 8 | 32 | 40 |
2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.80 | 0.0 | 5 | 27 | 32 |
3 | 4 | 2011-01-01 | 1 | 0 | 1 | 3 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 3 | 10 | 13 |
4 | 5 | 2011-01-01 | 1 | 0 | 1 | 4 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.75 | 0.0 | 0 | 1 | 1 |
df.isnull().sum().sum()
0
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
instant | 17379.0 | 8690.000000 | 5017.029500 | 1.00 | 4345.5000 | 8690.0000 | 13034.5000 | 17379.0000 |
season | 17379.0 | 2.501640 | 1.106918 | 1.00 | 2.0000 | 3.0000 | 3.0000 | 4.0000 |
yr | 17379.0 | 0.502561 | 0.500008 | 0.00 | 0.0000 | 1.0000 | 1.0000 | 1.0000 |
mnth | 17379.0 | 6.537775 | 3.438776 | 1.00 | 4.0000 | 7.0000 | 10.0000 | 12.0000 |
hr | 17379.0 | 11.546752 | 6.914405 | 0.00 | 6.0000 | 12.0000 | 18.0000 | 23.0000 |
holiday | 17379.0 | 0.028770 | 0.167165 | 0.00 | 0.0000 | 0.0000 | 0.0000 | 1.0000 |
weekday | 17379.0 | 3.003683 | 2.005771 | 0.00 | 1.0000 | 3.0000 | 5.0000 | 6.0000 |
workingday | 17379.0 | 0.682721 | 0.465431 | 0.00 | 0.0000 | 1.0000 | 1.0000 | 1.0000 |
weathersit | 17379.0 | 1.425283 | 0.639357 | 1.00 | 1.0000 | 1.0000 | 2.0000 | 4.0000 |
temp | 17379.0 | 0.496987 | 0.192556 | 0.02 | 0.3400 | 0.5000 | 0.6600 | 1.0000 |
atemp | 17379.0 | 0.475775 | 0.171850 | 0.00 | 0.3333 | 0.4848 | 0.6212 | 1.0000 |
hum | 17379.0 | 0.627229 | 0.192930 | 0.00 | 0.4800 | 0.6300 | 0.7800 | 1.0000 |
windspeed | 17379.0 | 0.190098 | 0.122340 | 0.00 | 0.1045 | 0.1940 | 0.2537 | 0.8507 |
casual | 17379.0 | 35.676218 | 49.305030 | 0.00 | 4.0000 | 17.0000 | 48.0000 | 367.0000 |
registered | 17379.0 | 153.786869 | 151.357286 | 0.00 | 34.0000 | 115.0000 | 220.0000 | 886.0000 |
cnt | 17379.0 | 189.463088 | 181.387599 | 1.00 | 40.0000 | 142.0000 | 281.0000 | 977.0000 |
# let's generate a correlation heatmap:
mask = np.triu(np.ones_like(df.corr(), dtype=bool))
fig, ax = plt.subplots(figsize=(16,16))
sns.heatmap(abs(df.corr()), square=True, cmap='BrBG', mask=mask)
plt.show()
# we'll use the usual spines function to reduce plot code:
def spines(ax, yl='Rental counts', xl='', title=''):
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.spines['left'].set_linewidth(2)
    ax.spines['bottom'].set_linewidth(2)
    ax.set_ylabel(yl)
    ax.set_xlabel(xl)
    ax.set_title(title)
fig, ax = plt.subplots(figsize=(16,16))
ax = plt.subplot(221)
plt.pie([df['registered'].sum(), df['casual'].sum()], labels=['registered','casual'])
ax = plt.subplot(222)
df.groupby('hr')['cnt'].sum().plot.bar()
spines(ax, title='Combined')
ax = plt.subplot(223)
df.groupby('hr')['registered'].sum().plot.bar()
spines(ax, title='Registered')
ax = plt.subplot(224)
df.groupby('hr')['casual'].sum().plot.bar()
spines(ax, title='Casual')
plt.show()
Observations:
fig, ax = plt.subplots(figsize=(16,8))
sns.stripplot(x = df.groupby('dteday')['cnt'].sum().index, y = df.groupby('dteday')['cnt'].sum(), alpha=0.3, color='black', size=6, jitter=0.4, zorder=2)
spines(ax, title='Counts per date', xl='Date')
X_TICKS = 50
plt.xticks(range(0, len(df['dteday'].sort_values().unique()), X_TICKS), df['dteday'].sort_values().unique()[::X_TICKS], rotation = 30)
plt.show()
fig, ax = plt.subplots(figsize=(16,8))
sns.stripplot(x = df['hr'], y = df['cnt'], alpha=0.3, color='black', size=6, jitter=0.4, zorder=2)
plt.plot(df['hr'].unique(),df.groupby('hr')['cnt'].mean(), color='red', linewidth=6, zorder=3, alpha=0.6)
plt.legend(['avg','counts'])
spines(ax, xl='Hour', title='Rental counts per hour')
plt.show()
fig, ax = plt.subplots(figsize=(16,8))
sns.stripplot(x = df['weekday'], y = df['cnt'], alpha=0.3, color='black', size=6, jitter=0.4, zorder=2)
plt.plot(df['weekday'].sort_values().unique(),df.groupby('weekday')['cnt'].mean(), color='red', linewidth=6, zorder=3, alpha=0.6)
plt.legend(['avg','counts'])
spines(ax, xl='day', title='Rental counts per day')
plt.show()
fig, ax = plt.subplots(figsize=(16,8))
sns.stripplot(x = df['mnth'], y = df['cnt'], alpha=0.3, color='black', size=6, jitter=0.4, zorder=2)
plt.plot(df['mnth'].unique()-1,df.groupby('mnth')['cnt'].mean(), color='red', linewidth=6, zorder=3, alpha=0.6)
# handles, labels = ax.get_legend_handles_labels()
plt.legend(['avg','counts'])
spines(ax, xl='mnth', title='Rental counts per month')
plt.show()
fig, ax = plt.subplots(figsize=(16,8))
sns.stripplot(x=df['hum'],y=df['cnt'], alpha=0.3, color='black', jitter=0.4)
plt.legend(['counts'])
spines(ax, xl='hum')
X_TICKS = 5
plt.xticks(range(0, len(df['hum'].sort_values().unique()), X_TICKS), df['hum'].sort_values().unique()[::X_TICKS], rotation = 0)
plt.show()
fig, ax = plt.subplots(figsize=(16,8))
sns.stripplot(x=df['temp'],y=df['cnt'], alpha=0.3, color='black', jitter=0.4)
plt.legend(['counts'])
spines(ax, xl='temp')
X_TICKS = 5
plt.xticks(range(0, len(df['temp'].sort_values().unique()), X_TICKS), df['temp'].sort_values().unique()[::X_TICKS], rotation = 0)
plt.show()
fig, ax = plt.subplots(figsize=(16,8))
sns.stripplot(x=df['windspeed'],y=df['cnt'], alpha=0.3, color='black', jitter=0.4)
plt.legend(['counts'])
spines(ax, xl='windspeed')
X_TICKS = 2
plt.xticks(range(0, len(df['windspeed'].sort_values().unique()), X_TICKS), df['windspeed'].sort_values().unique()[::X_TICKS], rotation = 0)
plt.show()
Temperature and windspeed show a clear correlation with rental counts; the relationship with humidity is less obvious.
We'll start with a basic linear regression (we don't expect good results from that model), then we'll move on to a Random Forest Regressor.
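The visual impression can be quantified by ranking features by their Pearson correlation with the target; on the real frame this would be `df.corr()['cnt'].drop('cnt').sort_values()`. A minimal self-contained sketch on toy data (column names only mirror the dataset, values are synthetic):

```python
import numpy as np
import pandas as pd

# Toy sketch of ranking features by correlation with the target:
# 'cnt' is built mostly from 'temp', so 'temp' should rank first.
rng = np.random.RandomState(1)
df_toy = pd.DataFrame({'temp': rng.rand(500), 'windspeed': rng.rand(500)})
df_toy['cnt'] = 5 * df_toy['temp'] + rng.normal(0, 0.5, 500)
corr = df_toy.corr()['cnt'].drop('cnt').sort_values(ascending=False)
print(corr)
```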
categorical_cols = ['season', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit', 'yr', 'hr']
numeric_cols = ['temp', 'hum', 'windspeed']
def select_split(df):
    X = df[categorical_cols + numeric_cols]
    y = df['cnt']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    return X_train, X_test, y_train, y_test
def fit_model(model):
    # note: relies on the global X_train, X_test, y_train, y_test
    model.fit(X_train, y_train)
    predictions_train = model.predict(X_train)
    predictions_test = model.predict(X_test)
    return r2_score(y_train, predictions_train), r2_score(y_test, predictions_test), predictions_test
lr = linear_model.LinearRegression()
X_train, X_test, y_train, y_test = select_split(df)
r2_score_train,r2_score_test, predictions1 = fit_model(lr)
print('train set r2 score: ',r2_score_train,' test set r2 score: ', r2_score_test)
train set r2 score: 0.3849208346739741 test set r2 score: 0.39852904131877387
def scatter(y_test, predictions):
    results = pd.DataFrame({'counts': y_test, 'predictions': predictions})
    results = results.sort_values('counts').reset_index(drop=True)
    # predictions vs. actual counts, sorted by count:
    fig, ax = plt.subplots(figsize=(16,8))
    plt.scatter(results.index, results['predictions'], alpha=0.3, c='black', label='predictions')
    plt.scatter(results.index, results['counts'], alpha=0.3, c='red', label='counts')
    plt.legend()
    spines(ax, xl='index')
    # residuals plot:
    fig2, ax2 = plt.subplots(figsize=(16,8))
    plt.scatter(results['counts'], results['predictions'] - results['counts'], alpha=0.35)
    plt.axhline(0, color='black')
    spines(ax2, xl='Rental counts', yl='Error')
scatter(y_test,predictions1)
plt.show()
From the start, the linear regression model fails to deliver predictions within a decent margin of error. On top of that, as the rental count increases, the model fails to keep up with the growth in rental counts.
Let's try a basic random forest model:
rfg = RandomForestRegressor(random_state=1)
X_train, X_test, y_train, y_test = select_split(df)
r2_score_train,r2_score_test, predictions2 = fit_model(rfg)
print('train set r2 score: ',r2_score_train,' test set r2 score: ', r2_score_test)
train set r2 score: 0.9920127960437292 test set r2 score: 0.9465913744945162
# plot feature importance for random forest model:
fig, ax = plt.subplots(figsize=(16,8))
importance = rfg.feature_importances_
plt.bar(X_train.columns, importance)
spines(ax, title='Feature importance for Random Forest model')
plt.show()
Clearly time of the day has the strongest influence on our model, followed by temperature and year.
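For a numeric view of the same ranking, the importances can be paired with the column names and sorted (on the real model that is simply `pd.Series(rfg.feature_importances_, index=X_train.columns).sort_values()`). A self-contained sketch on synthetic data, where the column names are only illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic sketch: the target depends mostly on 'hr', so it should
# dominate feature_importances_ (toy data, not the bike-sharing frame).
rng = np.random.RandomState(1)
X_toy = pd.DataFrame(rng.rand(500, 3), columns=['hr', 'temp', 'noise'])
y_toy = 10 * X_toy['hr'] + 2 * X_toy['temp'] + rng.rand(500)
model = RandomForestRegressor(random_state=1).fit(X_toy, y_toy)
ranking = pd.Series(model.feature_importances_, index=X_toy.columns).sort_values(ascending=False)
print(ranking)
```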
# we'll set up a very basic grid search for tuning the model:
rfg = RandomForestRegressor(random_state=1,n_estimators=100, max_features='auto', max_depth=None)
tuned_parameters = {
'n_estimators': [100, 150, 200],
'max_features': ['auto', 'sqrt', 'log2'],
}
clf = GridSearchCV(rfg, tuned_parameters, cv=5,
n_jobs=-1, verbose=1)
clf.fit(X_train, y_train)
clf.best_params_
Fitting 5 folds for each of 9 candidates, totalling 45 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=-1)]: Done 45 out of 45 | elapsed: 1.1min finished
{'max_features': 'auto', 'n_estimators': 200}
rfg = RandomForestRegressor(random_state=1, n_estimators=200, max_features='auto')
r2_score_train,r2_score_test, predictions = fit_model(rfg)
# print('train set r2 score: ',r2_score_train,' test set r2 score: ', r2_score_test)
# 0.9467762482052581
print('train set r2 score: ',r2_score_train)
print('test set r2 score: ', r2_score_test)
print('test set rmse score: ', (mean_squared_error(predictions,y_test))**(1/2))
train set r2 score: 0.9922013663669105
test set r2 score: 0.9467820922480089
test set rmse score: 42.13610620954608
Temperature is a very important feature for our model, but we're missing precipitation data. We'll import precipitation and pressure data; hopefully adding this information will improve our results. Unfortunately the weather station for Washington doesn't record sunshine duration.
!pip install meteostat
import calendar
from datetime import datetime
from meteostat import Hourly
from datetime import datetime, timedelta
from meteostat import Stations
# Get nearby weather stations
stations = Stations()
# Coordinates of Washington D.C.:
stations = stations.nearby(38.9072, -77.0369)
station = stations.fetch(1)
# Print DataFrame
print(station)
Successfully installed meteostat-1.5.10
| id | name | country | region | wmo | icao | latitude | longitude | elevation | timezone | hourly_start | hourly_end | daily_start | daily_end | monthly_start | monthly_end | distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 72405 | Washington National Airport | US | DC | 72405 | KDCA | 38.85 | -77.0333 | 5.0 | America/New_York | 1936-09-01 | 2021-12-06 | 1936-09-01 | 2021-12-04 | 1936-01-01 | 2021-01-01 | 6367.979298 |
# merge date and hour columns and convert into a datetime object:
def timeit(x):
    return datetime.strptime(x, '%Y-%m-%d')

def hourit(x):
    return timedelta(hours=x)
df['dt'] = df['dteday'].apply(timeit) + df['hr'].apply(hourit)
start = df['dt'].min()
end = df['dt'].max()
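The same `dt` column can also be built with pandas' own vectorized converters instead of the two `apply` calls above; a minimal sketch on a toy frame with the same column names:

```python
import pandas as pd

# Toy frame mirroring the 'dteday'/'hr' columns; pd.to_datetime plus
# pd.to_timedelta builds the combined timestamp in one vectorized step.
toy = pd.DataFrame({'dteday': ['2011-01-01', '2011-01-01'], 'hr': [0, 5]})
toy['dt'] = pd.to_datetime(toy['dteday']) + pd.to_timedelta(toy['hr'], unit='h')
print(toy['dt'].tolist())
```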
# import weather data:
def get_weather(start, end):
    data = Hourly('72405', start, end)
    data = data.fetch()
    data['dt'] = data.index
    # select only the date, precipitation and pressure columns:
    data = data[['dt', 'prcp', 'pres']]
    return data
data = get_weather(start, end)
df_new = pd.merge(df,data,on='dt', how='left')
data.isnull().sum()
dt         0
prcp    1219
pres     218
dtype: int64
msno.matrix(data)
plt.show()
Unfortunately the data from meteostat has some missing values. We don't want to remove the rows with missing values; we're going to fill them instead. Rather than the basic approach of filling nulls with an average value, we'll use the 'pad' method, which copies the value from the previous row into the row with the missing value. In essence: if it was raining an hour ago, we assume it is still raining now.
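A minimal illustration of what 'pad' does to a series with gaps (`Series.ffill()` is the same forward-fill operation under its modern name):

```python
import numpy as np
import pandas as pd

# Each NaN takes the last observed value before it; a leading NaN would
# stay NaN, which is why the first row is handled separately below.
s = pd.Series([1.2, np.nan, np.nan, 0.0, np.nan])
print(s.ffill().tolist())  # → [1.2, 1.2, 1.2, 0.0, 0.0]
```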
# fill the missing values:
df_new['prcp'] = df_new['prcp'].fillna(method='pad')
df_new['pres'] = df_new['pres'].fillna(method='pad')
# first row fillna:
df_new['prcp'] = df_new['prcp'].fillna(df_new['prcp'].mean())
# plot pressure:
fig, ax = plt.subplots(figsize=(16,8))
ax = plt.subplot(121)
sns.stripplot(x = df_new['pres'], y = df_new['cnt'], alpha=0.3, color='black', size=6, jitter=0.4, zorder=2)
plt.legend(['counts'])
spines(ax, xl='pres', title='Pressure')
X_TICKS = 50
plt.xticks(range(0, len(df_new['pres'].sort_values().unique()), X_TICKS), df_new['pres'].sort_values().unique()[::X_TICKS], rotation = 0)
# plot precipitation:
ax2 = plt.subplot(122)
sns.stripplot(x = df_new['prcp'], y = df_new['cnt'], alpha=0.3, color='black', size=6, jitter=0.4, zorder=2)
spines(ax2, xl='prcp', title='Precipitation')
X_TICKS = 5
plt.xticks(range(0, len(df_new['prcp'].sort_values().unique()), X_TICKS), df_new['prcp'].sort_values().unique()[::X_TICKS], rotation = 0)
plt.legend(['counts'])
plt.show()
# prep dataframe for testing on multiple models:
# add more columns:
categorical_cols = ['season', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit', 'hr', 'yr']
numeric_cols = ['temp', 'hum', 'windspeed', 'prcp', 'pres']
X = df_new[categorical_cols + numeric_cols]
y = df_new['cnt']
# One Hot Encoding important categorical columns:
enc = OneHotEncoder(handle_unknown='ignore')
for col in ['workingday', 'weathersit', 'season']:
    enc_df = pd.DataFrame(enc.fit_transform(X[[col]]).toarray())
    enc_df = enc_df.add_prefix(col)
    X = X.join(enc_df)
    X = X.drop(columns=col)
# Scaling numerical values:
X[['windspeed']] = MinMaxScaler().fit_transform(X[['windspeed']])
X[['temp']] = MinMaxScaler().fit_transform(X[['temp']])
X[['prcp']] = MinMaxScaler().fit_transform(X[['prcp']])
X[['hum']] = MinMaxScaler().fit_transform(X[['hum']])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
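To see what the one-hot encoding loop above produces, here is the same pattern on a toy 'season' column (values invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Three distinct season values become three indicator columns,
# prefixed with the original column name via add_prefix, as above.
toy = pd.DataFrame({'season': [1, 2, 2, 4]})
enc = OneHotEncoder(handle_unknown='ignore')
encoded = pd.DataFrame(enc.fit_transform(toy[['season']]).toarray()).add_prefix('season')
print(encoded)
```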
# import more libraries:
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb
from sklearn.neural_network import MLPRegressor
lasso = Lasso(random_state=1)
ENet = ElasticNet(random_state=1)
KRR = KernelRidge()
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
max_depth=4, max_features='sqrt',
min_samples_leaf=15, min_samples_split=10,
loss='huber', random_state =1)
model_xgb = xgb.XGBRegressor(random_state =1)
model_lgb = lgb.LGBMRegressor()
mlp = MLPRegressor(max_iter=3000, random_state=1)
# try different models:
models = {"RandomForest": rfg,
"Lasso": lasso,
"ElasticNet": ENet,
"KernelRidge": KRR,
"GradientBoostingRegressor": GBoost,
"XGBRegressor": model_xgb,
"LGBMRegressor": model_lgb,
"MLPRegressor": mlp
}
list_predictions = []
r2_train_list = []
r2_test_list = []
for model in models:
    r2_score_train, r2_score_test, predictions = fit_model(models[model])
    list_predictions.append([model, predictions])
    r2_train_list.append(r2_score_train)
    r2_test_list.append(r2_score_test)
results_df = pd.DataFrame({ 'train_r2':r2_train_list, 'test_r2':r2_test_list}, index=models.keys(),)
results_df
| | train_r2 | test_r2 |
|---|---|---|
RandomForest | 0.992620 | 0.946928 |
Lasso | 0.396623 | 0.408620 |
ElasticNet | 0.252784 | 0.264971 |
KernelRidge | 0.399994 | 0.412594 |
GradientBoostingRegressor | 0.957949 | 0.942770 |
XGBRegressor | 0.978292 | 0.952426 |
LGBMRegressor | 0.961015 | 0.950310 |
MLPRegressor | 0.619765 | 0.633098 |
n = 0
fig = plt.figure(figsize=(16,26))
for predictions in list_predictions:
    results = pd.DataFrame({'counts': y_test, 'predictions': predictions[1]})
    results = results.sort_values('counts').reset_index(drop=True)
    ax = fig.add_subplot(4, 2, n + 1)
    n += 1
    plt.scatter(results.index, results['predictions'], alpha=0.3, c='black', label='predictions')
    plt.scatter(results.index, results['counts'], alpha=0.3, c='red', label='counts')
    plt.legend()
    spines(ax, xl='index', title=predictions[0])
plt.show()
So far we've relied on a single model to deliver predictions. It's time to try a different approach: combining multiple models can improve our results.
We'll use four of the best models we've come across so far: GradientBoostingRegressor, XGBRegressor, LGBMRegressor and RandomForestRegressor. Instead of relying on one model to predict the values, we'll run all four and use the average of their predictions. The functions below are adapted from a Kaggle notebook on the subject.
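The averaging idea in miniature: with toy numbers, three prediction vectors that over- and under-shoot in different places average out to a lower RMSE than any single one of them.

```python
import numpy as np

# Toy illustration of model averaging: the errors of the three "models"
# partly cancel, so the mean prediction is closest to the truth.
truth = np.array([100.0, 200.0, 300.0])
preds = np.array([[110.0, 190.0, 310.0],   # model 1
                  [ 95.0, 205.0, 295.0],   # model 2
                  [105.0, 195.0, 305.0]])  # model 3
rmse = lambda p: float(np.sqrt(((p - truth) ** 2).mean()))
avg = preds.mean(axis=0)
print([round(rmse(p), 2) for p in preds], round(rmse(avg), 2))
```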
# https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard#Modelling
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models

    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        # train cloned base models
        for model in self.models_:
            model.fit(X, y)
        return self

    # now we do the predictions for the cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)
averaged_models = AveragingModels(models = (rfg, GBoost, model_xgb, model_lgb))
r2_score_train,r2_score_test, predictions = fit_model(averaged_models)
print('train set r2 score: ',r2_score_train)
print('test set r2 score: ', r2_score_test)
print('test set rmse score: ', (mean_squared_error(predictions,y_test))**(1/2))
train set r2 score: 0.9789654170709211
test set r2 score: 0.9566152559293124
test set rmse score: 38.04468755646464
# https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard#Modelling
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    # we again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
        # train cloned base models, then create the out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X.iloc[train_index], y.iloc[train_index])
                y_pred = instance.predict(X.iloc[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred
        # now train the cloned meta-model using the out-of-fold predictions as new features
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    # do the predictions of all base models on the test data and use the averaged
    # predictions as meta-features for the final prediction by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_])
        return self.meta_model_.predict(meta_features)
stacked_averaged_models = StackingAveragedModels(base_models = (rfg, GBoost, model_xgb, model_lgb),
meta_model = mlp)
r2_score_train,r2_score_test, predictions = fit_model(stacked_averaged_models)
print('train set r2 score: ',r2_score_train)
print('test set r2 score: ', r2_score_test)
print('test set rmse score: ', (mean_squared_error(predictions,y_test))**(1/2))
train set r2 score: 0.9798188561949155
test set r2 score: 0.9563798161210616
test set rmse score: 38.14777810989409
We'll use a simple loop to find the best hidden_layer_sizes parameter.
layers_list = [(125,111,17,5,4), (75,20,10,5,2), (475,220,110,55,5), (220,220,110,55,5), (300,200,100)]
predictions_list = []
r2_list = []
for layer_setup in layers_list:
    mlp = MLPRegressor(max_iter=3000, hidden_layer_sizes=layer_setup, random_state=1)
    stacked_averaged_models = StackingAveragedModels(base_models=(rfg, GBoost, model_xgb, model_lgb), meta_model=mlp)
    r2_score_train, r2_score_test, predictions = fit_model(stacked_averaged_models)
    predictions_list.append(predictions)
    r2_list.append(r2_score_test)
    print(layer_setup, ' train set r2 score: ', r2_score_train, ' test set r2 score: ', r2_score_test)
(125, 111, 17, 5, 4)  train set r2 score:  0.9795518671546777  test set r2 score:  0.9576782413922301
(75, 20, 10, 5, 2)  train set r2 score:  0.9803801615320383  test set r2 score:  0.9577976335408065
(475, 220, 110, 55, 5)  train set r2 score:  0.9803606913767285  test set r2 score:  0.9582019271312862
(220, 220, 110, 55, 5)  train set r2 score:  0.9804204405014428  test set r2 score:  0.9575601132653637
(300, 200, 100)  train set r2 score:  0.9809245286989323  test set r2 score:  0.9576903371921955
max_value = max(r2_list)
max_index = r2_list.index(max_value)
scatter(y_test,predictions_list[max_index])
plt.show()
mse = mean_squared_error(predictions_list[max_index],y_test)
rmse = mse**(1/2)
rmse
37.34252077152997