Many cities worldwide have communal bike sharing stations where you can rent bicycles by the hour or day. Washington, D.C. is one of these cities. The District collects detailed data on the number of bicycles people rent by the hour and day.

Hadi Fanaee-T at the University of Porto compiled this data into a CSV file, which you'll work with in this project. The file contains `17379` rows, with each row representing the number of bike rentals for a single hour of a single day. You can download the data from the University of California, Irvine's website.

Our goal for this project was to predict the number of bike rentals for each hour using machine learning algorithms, given the available set of features. In the end, the random forest regressor turned out to be the most accurate algorithm for the task.

In [1]:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
```

In [2]:

```
bikerent = pd.read_csv('bike_rental_hour.csv', index_col='instant')
bikerent
```

Out[2]:

Here are the descriptions for the relevant columns:

- instant - A unique sequential ID number for each row
- dteday - The date of the rentals
- season - The season in which the rentals occurred
- yr - The year the rentals occurred
- mnth - The month the rentals occurred
- hr - The hour the rentals occurred
- holiday - Whether or not the day was a holiday
- weekday - The day of the week (as a number, 0 to 6)
- workingday - Whether or not the day was a working day
- weathersit - The weather (as a categorical variable)
- temp - The temperature, on a 0-1 scale
- atemp - The adjusted temperature
- hum - The humidity, on a 0-1 scale
- windspeed - The wind speed, on a 0-1 scale
- casual - The number of casual riders (people who hadn't previously signed up with the bike sharing program)
- registered - The number of registered riders (people who had already signed up)
- cnt - The total number of bike rentals (casual + registered)

In this project, we'll try to predict the total number of bikes people rented in a given hour. We'll predict the `cnt` column using all of the other columns, except for `casual` and `registered` (since they leak information about `cnt`). To accomplish this, we'll create a few different machine learning models and evaluate their performance.
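As a quick sanity check of this leakage claim, we can confirm that `cnt` is exactly `casual + registered`. A minimal sketch on a hypothetical mini-frame (not the real data):

```python
import pandas as pd

# Hypothetical values mimicking the rental columns (not the real dataset)
sample = pd.DataFrame({
    'casual': [3, 8, 5],
    'registered': [13, 32, 27],
    'cnt': [16, 40, 32],
})

# cnt is the exact sum of casual and registered, so both columns leak the target
leak_free = (sample['casual'] + sample['registered'] == sample['cnt']).all()
print(leak_free)  # True
```

The same check run on the full `bikerent` frame would justify dropping the two columns before training.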

In [3]:

```
bikerent.info()
```

All the data is numerical and there is no missing data.

In [4]:

```
plt.hist(bikerent['cnt'])
plt.show()
```

In [5]:

```
bikerent.corr()['cnt']
```

Out[5]:

In [6]:

```
bikerent.drop(columns=['casual', 'registered'], inplace=True)
```

In [7]:

```
bikerent.corr()['cnt'].sort_values()
```

Out[7]:

The features most correlated with `cnt` are:

- Temperature
- Adjusted temperature
- Humidity

`weathersit` and `season` are two columns that should be one-hot encoded, since their values are nominal rather than ordinal and their numeric codes carry no meaningful order.

In [8]:

```
# Define a function to bin hours into four parts of the day
def hours_trans(hour):
    if 6 <= hour < 12:
        return 1  # Morning
    elif 12 <= hour <= 18:
        return 2  # Day
    elif 18 < hour < 24:
        return 3  # Evening
    else:
        return 4  # Night

bikerent['hr_label'] = bikerent['hr'].apply(hours_trans)
bikerent['hr_label'].value_counts(dropna=False)  # Check
```

Out[8]:

In [9]:

```
bikerent.info()
```

In [10]:

```
bikerent.drop(columns=['dteday', 'atemp'], inplace=True)
```

In [11]:

```
bikerent.head()
```

Out[11]:

In [12]:

```
from sklearn.model_selection import train_test_split
train_bike, test_bike = train_test_split(bikerent, test_size=0.2, random_state=1)
train_bike_x = train_bike.drop(columns=['cnt'])
train_bike_y = train_bike['cnt']
test_bike_x = test_bike.drop(columns=['cnt'])
test_bike_y = test_bike['cnt']
```

We will train 4 models here:

- Linear regression
- Decision Trees
- Random Forests
- Support Vector Machine (SVM) Regressor

Our error metrics will be MAE (mean absolute error), since it shows directly how far, on average, our predictions deviate from the actual values, and the r2 score.
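To make the two metrics concrete, here is a toy calculation on hypothetical counts:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical true hourly counts vs. predictions (not the real data)
y_true = np.array([100, 150, 200, 250])
y_pred = np.array([110, 140, 190, 260])

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|: (10+10+10+10)/4 = 10
r2 = r2_score(y_true, y_pred)              # 1 - SS_res/SS_tot = 1 - 400/12500 = 0.968
print(mae, r2)
```

MAE is in the same units as `cnt` (rentals per hour), while r2 measures the fraction of variance in the target that the model explains.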

In [13]:

```
%%time
lr = LinearRegression()
lr.fit(train_bike_x, train_bike_y)
predictions_train = lr.predict(train_bike_x)
predictions_test = lr.predict(test_bike_x)
mae_train = mean_absolute_error(train_bike_y, predictions_train)
mae_test = mean_absolute_error(test_bike_y, predictions_test)
r2_train = r2_score(train_bike_y, predictions_train)
r2_test = r2_score(test_bike_y, predictions_test)
print(f"Train MAE for linear regression:{mae_train}")
print(f"Test MAE for linear regression:{mae_test}")
print(f"Train r2 score for linear regression:{r2_train}")
print(f"Test r2 score for linear regression:{r2_test}")
```

Linear regression with these features is not able to predict the rental count with adequate accuracy.

In [14]:

```
%%time
def tree_pred(df, min_samples_split=2, min_samples_leaf=1, max_depth=None, return_values=False):
    train_bike, test_bike = train_test_split(df, test_size=0.2, random_state=1)
    train_bike_x = train_bike.drop(columns=['cnt'])
    train_bike_y = train_bike['cnt']
    test_bike_x = test_bike.drop(columns=['cnt'])
    test_bike_y = test_bike['cnt']
    clf = DecisionTreeRegressor(criterion='absolute_error', max_depth=max_depth,
                                min_samples_split=min_samples_split,
                                min_samples_leaf=min_samples_leaf, random_state=1)
    clf.fit(train_bike_x, train_bike_y)
    predictions_train = clf.predict(train_bike_x)
    predictions_test = clf.predict(test_bike_x)
    mae_train = mean_absolute_error(train_bike_y, predictions_train)
    mae_test = mean_absolute_error(test_bike_y, predictions_test)
    r2_train = r2_score(train_bike_y, predictions_train)
    r2_test = r2_score(test_bike_y, predictions_test)
    if not return_values:
        print(f"Train MAE for decision tree: {mae_train}")
        print(f"Test MAE for decision tree: {mae_test}")
        print(f"Train r2 score for decision tree: {r2_train}")
        print(f"Test r2 score for decision tree: {r2_test}")
        print(f"The tree depth is: {clf.get_depth()}")
        print(f"The number of leaves is: {clf.get_n_leaves()}")
    else:
        return mae_train, mae_test, r2_train, r2_test

tree_pred(bikerent)
```

We clearly see overfitting when we use the default parameters, which is expected with decision trees.

There are three main ways to combat overfitting:

- "Prune" the tree after we build it to remove unnecessary leaves.
- Use ensembling to blend the predictions of many trees.
- Restrict the depth of the tree while we're building it.
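The first option, pruning, is not explored below; as a sketch, scikit-learn exposes cost-complexity pruning through the `ccp_alpha` parameter (shown here on synthetic data, not the bike set):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: a sine signal with noise (hypothetical stand-in data)
rng = np.random.default_rng(1)
X = rng.uniform(0, 24, size=(500, 1))
y = np.sin(X[:, 0] / 24 * 2 * np.pi) * 100 + rng.normal(0, 10, 500)

full = DecisionTreeRegressor(random_state=1).fit(X, y)
pruned = DecisionTreeRegressor(ccp_alpha=5.0, random_state=1).fit(X, y)

# A positive ccp_alpha collapses subtrees whose complexity isn't worth the impurity reduction
print(full.get_n_leaves(), pruned.get_n_leaves())
```

The unpruned tree fits the noise with far more leaves than the pruned one; tuning `ccp_alpha` on a validation set is an alternative to restricting depth up front.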

Let's start with the last approach and increase the `min_samples_split` parameter, which is the minimum number of samples required to split an internal node.

In [15]:

```
%%time
tree_pred(bikerent, min_samples_split=10)
```

In [16]:

```
x = np.arange(2, 40, 5)
r2_trains = []
r2_tests = []
for split in x:
    mae_train, mae_test, r2_train, r2_test = tree_pred(bikerent, min_samples_split=split, return_values=True)
    r2_trains.append(r2_train)
    r2_tests.append(r2_test)
```

In [17]:

```
plt.figure(figsize=(12,8))
plt.plot(x, r2_trains, label='Training score', color='red')
plt.plot(x, r2_tests, label='Testing score', color='green')
plt.legend()
plt.title('Training and testing r2 scores tuning min_samples_split', fontsize=16)
plt.xlabel('The minimum number of samples required to split an internal node')
plt.ylabel('r2 score')
```

Out[17]:

We can clearly see that increasing `min_samples_split` beyond 15 does not further improve test accuracy and only lowers train accuracy (i.e., we start to underfit).
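As an aside, sweeps like the one above can also be done with scikit-learn's `validation_curve`, which cross-validates each candidate value instead of relying on a single train/test split. A sketch on synthetic stand-in data (not the bike features):

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in features and target (the notebook sweeps the real data instead)
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = X @ np.array([5.0, -2.0, 1.0]) + rng.normal(0, 0.5, 300)

split_values = np.arange(2, 40, 5)
train_scores, test_scores = validation_curve(
    DecisionTreeRegressor(random_state=1), X, y,
    param_name='min_samples_split', param_range=split_values,
    scoring='r2', cv=5,
)
# One row per candidate value, one column per CV fold
print(train_scores.mean(axis=1))
print(test_scores.mean(axis=1))
```

Averaging over folds gives a less noisy picture of where under- and overfitting begin than a single split does.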

Now, let's try to reduce tree depth directly.

In [18]:

```
for depth in np.arange(34, 20, -3):
    tree_pred(bikerent, max_depth=depth)
```

So, reducing the tree depth without increasing the minimum number of samples required to split an internal node (`min_samples_split`) does not reduce the total number of leaves much, and consequently does not improve the test r2 and MAE scores. Tuning this hyperparameter alone will not help us overcome overfitting.

Finally, let's try to tune `min_samples_leaf`, which is the minimum number of samples required to be at a leaf node.

In [19]:

```
x = np.arange(1, 10, 2)
r2_trains = []
r2_tests = []
for split_leaf in x:
    mae_train, mae_test, r2_train, r2_test = tree_pred(bikerent, min_samples_split=15, min_samples_leaf=split_leaf, return_values=True)
    r2_trains.append(r2_train)
    r2_tests.append(r2_test)
```

In [20]:

```
plt.figure(figsize=(12,8))
plt.plot(x, r2_trains, label='Training score', color='red')
plt.plot(x, r2_tests, label='Testing score', color='green')
plt.legend()
plt.title('Training and testing r2 scores tuning min_samples_leaf', fontsize=16)
plt.xlabel('The minimum number of samples required to be at a leaf node')
plt.ylabel('r2 score')
```

Out[20]:

We can see that the highest test accuracy is reached when `min_samples_leaf` is equal to 5.

As a final hyperparameter to tune, let's revisit `max_depth`, the maximum depth of the tree. The other hyperparameters will be fixed.

In [21]:

```
# Check the current tree depth given the other hyperparameters fixed.
tree_pred(bikerent, min_samples_split=15, min_samples_leaf=5, return_values=False)
```

In [22]:

```
x = np.arange(25, 6, -3)
r2_trains = []
r2_tests = []
for depth in x:
    mae_train, mae_test, r2_train, r2_test = tree_pred(bikerent, max_depth=depth, min_samples_split=15, min_samples_leaf=5, return_values=True)
    r2_trains.append(r2_train)
    r2_tests.append(r2_test)
```

In [23]:

```
plt.figure(figsize=(12,8))
plt.plot(x, r2_trains, label='Training score', color='red')
plt.plot(x, r2_tests, label='Testing score', color='green')
plt.legend()
plt.title('Training and testing r2 scores tuning max_depth', fontsize=16)
plt.xlabel('The depth of the tree')
plt.ylabel('r2 score')
```

Out[23]:

We can clearly see that decreasing the depth of the tree does not help us reach a better balance between training and testing accuracy. Therefore, we will leave the depth unrestricted (it comes out to 25 once the other hyperparameters are tuned).

In [24]:

```
%%time
def rf_tree_pred(df, n_estimators=10, min_samples_split=2, min_samples_leaf=1, max_depth=None, return_values=False):
    train_bike, test_bike = train_test_split(df, test_size=0.2, random_state=1)
    train_bike_x = train_bike.drop(columns=['cnt'])
    train_bike_y = train_bike['cnt']
    test_bike_x = test_bike.drop(columns=['cnt'])
    test_bike_y = test_bike['cnt']
    clf = RandomForestRegressor(criterion='absolute_error', n_estimators=n_estimators,
                                max_depth=max_depth, min_samples_split=min_samples_split,
                                min_samples_leaf=min_samples_leaf, random_state=1)
    clf.fit(train_bike_x, train_bike_y)
    predictions_train = clf.predict(train_bike_x)
    predictions_test = clf.predict(test_bike_x)
    mae_train = mean_absolute_error(train_bike_y, predictions_train)
    mae_test = mean_absolute_error(test_bike_y, predictions_test)
    r2_train = r2_score(train_bike_y, predictions_train)
    r2_test = r2_score(test_bike_y, predictions_test)
    if not return_values:
        print(f"Train MAE for random forest: {mae_train}")
        print(f"Test MAE for random forest: {mae_test}")
        print(f"Train r2 score for random forest: {r2_train}")
        print(f"Test r2 score for random forest: {r2_test}")
    else:
        return mae_train, mae_test, r2_train, r2_test

rf_tree_pred(bikerent)
```

The test accuracy has increased compared to the decision tree model. Now, let's apply the hyperparameters obtained from the decision tree model.

In [25]:

```
rf_tree_pred(bikerent, n_estimators=10, min_samples_split=15, min_samples_leaf=5)
```

In [26]:

```
x = np.arange(10, 31, 4)
r2_trains = []
r2_tests = []
for nestim in x:
    mae_train, mae_test, r2_train, r2_test = rf_tree_pred(bikerent, n_estimators=nestim, min_samples_split=15, min_samples_leaf=5, return_values=True)
    r2_trains.append(r2_train)
    r2_tests.append(r2_test)
```

In [27]:

```
plt.figure(figsize=(12,8))
plt.plot(x, r2_trains, label='Training score', color='red')
plt.plot(x, r2_tests, label='Testing score', color='green')
plt.legend()
plt.title('Training and testing r2 scores for random forest model tuning n_estimators', fontsize=16)
plt.xlabel('The number of trees in the forest')
plt.ylabel('r2 score')
```

Out[27]:

As we can see, increasing the number of trees in the forest does not lead to higher accuracy. Therefore, we can keep the default value (10 trees) for faster execution.

Finally, let's use Support Vector Machine Regressor (SVR).

In [28]:

```
%%time
from sklearn.svm import SVR
SVR_rbf = SVR(kernel="rbf")
SVR_rbf.fit(train_bike_x, train_bike_y)
predictions_train = SVR_rbf.predict(train_bike_x)
predictions_test = SVR_rbf.predict(test_bike_x)
mae_train = mean_absolute_error(train_bike_y, predictions_train)
mae_test = mean_absolute_error(test_bike_y, predictions_test)
r2_train = r2_score(train_bike_y, predictions_train)
r2_test = r2_score(test_bike_y, predictions_test)
print(f"Train MAE for SVR:{mae_train}")
print(f"Test MAE for SVR:{mae_test}")
print(f"Train r2 score for SVR:{r2_train}")
print(f"Test r2 score for SVR:{r2_test}")
```

The MAE is lower than for linear regression (even though the r2 score is worse). Overall, the accuracy of SVR is much worse than that of the decision tree and random forest models. However, support vector machines are very sensitive to feature scaling, so let's use a `Pipeline` from the sklearn library and see if the accuracy improves.

In [29]:

```
%%time
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
SVR_rbf = Pipeline([
    ("scaler", StandardScaler()),
    ("svr", SVR(kernel="rbf")),
])
SVR_rbf.fit(train_bike_x, train_bike_y)
predictions_train = SVR_rbf.predict(train_bike_x)
predictions_test = SVR_rbf.predict(test_bike_x)
mae_train = mean_absolute_error(train_bike_y, predictions_train)
mae_test = mean_absolute_error(test_bike_y, predictions_test)
r2_train = r2_score(train_bike_y, predictions_train)
r2_test = r2_score(test_bike_y, predictions_test)
print(f"Train MAE for SVR:{mae_train}")
print(f"Test MAE for SVR:{mae_test}")
print(f"Train r2 score for SVR:{r2_train}")
print(f"Test r2 score for SVR:{r2_test}")
```

The accuracy has improved, but not significantly. Finally, let's try a linear SVM model.

In [30]:

```
%%time
from sklearn.svm import LinearSVR
SVR_linear = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svr", LinearSVR(C=1)),
])
SVR_linear.fit(train_bike_x, train_bike_y)
predictions_train = SVR_linear.predict(train_bike_x)
predictions_test = SVR_linear.predict(test_bike_x)
mae_train = mean_absolute_error(train_bike_y, predictions_train)
mae_test = mean_absolute_error(test_bike_y, predictions_test)
r2_train = r2_score(train_bike_y, predictions_train)
r2_test = r2_score(test_bike_y, predictions_test)
print(f"Train MAE for linear SVR: {mae_train}")
print(f"Test MAE for linear SVR: {mae_test}")
print(f"Train r2 score for linear SVR: {r2_train}")
print(f"Test r2 score for linear SVR: {r2_test}")
```

As expected, the linear model did not give better results than the model with the rbf kernel. In conclusion, we can say that SVM is not the best algorithm for this particular data.

In this project, we tried to predict the number of bike rentals for each hour of the day, given information about various calendar and weather features. After some short data processing and feature engineering, we trained four regression models:

- Linear regression
- Decision tree regressor
- Random forests regressor
- SVM regressor

As a result of the analysis, we can conclude that the random forest algorithm is the most accurate for the given task and data. The results of the best model (10 trees) are the following:

- 27.44 in terms of MAE (mean absolute error)
- 0.937 r2 score