%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_graphviz
# read the data and set "datetime" as the index
bikes = pd.read_csv('../datasets/bikeshare.csv', index_col='datetime', parse_dates=True)
# "count" is a method, so it's best to rename that column
bikes.rename(columns={'count':'total'}, inplace=True)
# create "hour" as its own feature
bikes['hour'] = bikes.index.hour
bikes.head()
| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | total | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 | 0 |
| 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 | 1 |
| 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 | 2 |
| 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 | 3 |
| 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 | 4 |
bikes.tail()
| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | total | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2012-12-19 19:00:00 | 4 | 0 | 1 | 1 | 15.58 | 19.695 | 50 | 26.0027 | 7 | 329 | 336 | 19 |
| 2012-12-19 20:00:00 | 4 | 0 | 1 | 1 | 14.76 | 17.425 | 57 | 15.0013 | 10 | 231 | 241 | 20 |
| 2012-12-19 21:00:00 | 4 | 0 | 1 | 1 | 13.94 | 15.910 | 61 | 15.0013 | 4 | 164 | 168 | 21 |
| 2012-12-19 22:00:00 | 4 | 0 | 1 | 1 | 13.94 | 17.425 | 61 | 6.0032 | 12 | 117 | 129 | 22 |
| 2012-12-19 23:00:00 | 4 | 0 | 1 | 1 | 13.12 | 16.665 | 66 | 8.9981 | 4 | 84 | 88 | 23 |
Run these two `groupby` statements and figure out what they tell you about the data.
# mean rentals for each value of "workingday"
bikes.groupby('workingday').total.mean()
workingday
0    188.506621
1    193.011873
Name: total, dtype: float64
# mean rentals for each value of "hour"
bikes.groupby('hour').total.mean()
hour
0      55.138462
1      33.859031
2      22.899554
3      11.757506
4       6.407240
5      19.767699
6      76.259341
7     213.116484
8     362.769231
9     221.780220
10    175.092308
11    210.674725
12    256.508772
13    257.787281
14    243.442982
15    254.298246
16    316.372807
17    468.765351
18    430.859649
19    315.278509
20    228.517544
21    173.370614
22    133.576754
23     89.508772
Name: total, dtype: float64
Run this plotting code, and make sure you understand the output. Then, separate this plot into two separate plots conditioned on "workingday". (In other words, one plot should display the hourly trend for "workingday=0", and the other should display the hourly trend for "workingday=1".)
# mean rentals for each value of "hour"
bikes.groupby('hour').total.mean().plot()
Plot for workingday == 0 and workingday == 1
# hourly rental trend for "workingday=0"
bikes[bikes.workingday==0].groupby('hour').total.mean().plot()
# hourly rental trend for "workingday=1"
bikes[bikes.workingday==1].groupby('hour').total.mean().plot()
# combine the two plots
bikes.groupby(['hour', 'workingday']).total.mean().unstack().plot()
Write about your findings
Fit a linear regression model to the entire dataset, using "total" as the response and "hour" and "workingday" as the only features. Then, print the coefficients and interpret them. What are the limitations of linear regression in this instance?
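A minimal sketch of this fit, using a small synthetic frame in place of the real `bikes` DataFrame so it runs standalone (the values in the toy frame are illustrative, not the actual data):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# toy stand-in for the bikes DataFrame; the notebook uses the real data
bikes = pd.DataFrame({
    'hour':       [0, 6, 8, 12, 17, 22],
    'workingday': [0, 1, 1, 0, 1, 0],
    'total':      [16, 76, 363, 257, 469, 134],
})

feature_cols = ['hour', 'workingday']
linreg = LinearRegression()
linreg.fit(bikes[feature_cols], bikes['total'])

# one coefficient per feature: the estimated change in "total" for a
# one-unit increase in that feature, holding the other constant
print(dict(zip(feature_cols, linreg.coef_)))
print(linreg.intercept_)
```

A key limitation to note in your interpretation: linear regression forces a single monotone relationship between "hour" and "total", so it cannot capture the two rush-hour peaks visible in the plots above.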
Create a decision tree to forecast "total" by manually splitting on the features "hour" and "workingday". The tree must have at least 6 leaf nodes.
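One way to sketch such a manual tree is a nested `if/else` that returns the mean of each group; the split points and predicted values below are illustrative placeholders, not the group means computed from the real data:

```python
def predict_total(hour, workingday):
    # hand-built tree with 6 leaves; replace the placeholder values with
    # the actual group means from bikes.groupby(...)
    if workingday == 1:
        if hour < 7:
            return 30      # leaf 1: working day, early morning
        elif hour < 10:
            return 350     # leaf 2: working day, morning rush
        elif hour < 16:
            return 230     # leaf 3: working day, midday
        else:
            return 400     # leaf 4: working day, evening rush
    else:
        if hour < 10:
            return 60      # leaf 5: non-working day, morning
        else:
            return 250     # leaf 6: non-working day, afternoon

print(predict_total(8, 1))
```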
Train a decision tree using scikit-learn and comment on the performance of the two models.
Predicting if a news story is going to be popular
df = pd.read_csv('../datasets/mashable.csv', index_col=0)
df.head()
| | url | timedelta | n_tokens_title | n_tokens_content | n_unique_tokens | n_non_stop_words | n_non_stop_unique_tokens | num_hrefs | num_self_hrefs | num_imgs | ... | min_positive_polarity | max_positive_polarity | avg_negative_polarity | min_negative_polarity | max_negative_polarity | title_subjectivity | title_sentiment_polarity | abs_title_subjectivity | abs_title_sentiment_polarity | Popular |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://mashable.com/2014/12/10/cia-torture-rep... | 28.0 | 9.0 | 188.0 | 0.732620 | 1.0 | 0.844262 | 5.0 | 1.0 | 1.0 | ... | 0.200000 | 0.80 | -0.487500 | -0.60 | -0.250000 | 0.9 | 0.8 | 0.4 | 0.8 | 1 |
| 1 | http://mashable.com/2013/10/18/bitlock-kicksta... | 447.0 | 7.0 | 297.0 | 0.653199 | 1.0 | 0.815789 | 9.0 | 4.0 | 1.0 | ... | 0.160000 | 0.50 | -0.135340 | -0.40 | -0.050000 | 0.1 | -0.1 | 0.4 | 0.1 | 0 |
| 2 | http://mashable.com/2013/07/24/google-glass-po... | 533.0 | 11.0 | 181.0 | 0.660377 | 1.0 | 0.775701 | 4.0 | 3.0 | 1.0 | ... | 0.136364 | 1.00 | 0.000000 | 0.00 | 0.000000 | 0.3 | 1.0 | 0.2 | 1.0 | 0 |
| 3 | http://mashable.com/2013/11/21/these-are-the-m... | 413.0 | 12.0 | 781.0 | 0.497409 | 1.0 | 0.677350 | 10.0 | 3.0 | 1.0 | ... | 0.100000 | 1.00 | -0.195701 | -0.40 | -0.071429 | 0.0 | 0.0 | 0.5 | 0.0 | 0 |
| 4 | http://mashable.com/2014/02/11/parking-ticket-... | 331.0 | 8.0 | 177.0 | 0.685714 | 1.0 | 0.830357 | 3.0 | 2.0 | 1.0 | ... | 0.100000 | 0.55 | -0.175000 | -0.25 | -0.100000 | 0.0 | 0.0 | 0.5 | 0.0 | 0 |
5 rows × 61 columns
df.shape
(6000, 61)
X = df.drop(['url', 'Popular'], axis=1)
y = df['Popular']
y.mean()
0.5
# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
Estimate a Decision Tree Classifier and a Logistic Regression
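A minimal sketch of the two estimators, using synthetic data in place of the mashable `X_train`/`X_test` split so it runs standalone (hyperparameters such as `max_depth=5` are illustrative choices, not prescribed by the exercise):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the mashable features and binary response
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(400, 5)),
                 columns=[f'f{i}' for i in range(5)])
y = (X['f0'] + rng.normal(0, 0.5, 400) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

tree = DecisionTreeClassifier(max_depth=5, random_state=1)
tree.fit(X_train, y_train)
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# held-out accuracy for each model
print(tree.score(X_test, y_test), logreg.score(X_test, y_test))
```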
Evaluate using the following metrics:
Estimate an ensemble of 300 bagged models (one per bootstrap sample)
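A sketch of the bagging step on synthetic data (the default base estimator of `BaggingClassifier` is a decision tree); `oob_score=True` records the out-of-bag accuracy, which the weighted-voting step below can reuse:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# synthetic stand-in for the mashable training set
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# 300 bootstrap samples, one tree fitted per sample
bag = BaggingClassifier(n_estimators=300, bootstrap=True,
                        oob_score=True, random_state=1)
bag.fit(X, y)
print(len(bag.estimators_), round(bag.oob_score_, 3))
```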
Estimate the following set of classifiers:
Estimate the probability as the percentage of models that predict positive
Modify the probability threshold and select the one that maximizes the F1-Score
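The two steps above can be sketched together: the probability is the fraction of bagged trees voting positive, and a grid of thresholds is swept to find the one with the best F1-Score (synthetic data stands in for the real split; the threshold grid is an illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

bag = BaggingClassifier(n_estimators=50, random_state=1)
bag.fit(X_train, y_train)

# fraction of the bagged trees that vote positive, per test row
votes = np.mean([est.predict(X_test) for est in bag.estimators_], axis=0)

# sweep thresholds and keep the one with the best F1-Score
thresholds = np.arange(0.1, 0.9, 0.05)
f1s = [f1_score(y_test, (votes >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]
print(round(float(best_t), 2), round(max(f1s), 3))
```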
Ensemble the classifiers using weighted voting, with weights derived from the OOB error
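One way to sketch the weighted vote: weight each model's positive-class probability by its out-of-bag accuracy (only bootstrap-based models expose `oob_score_`; the choice of two models here is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=1)

models = [
    BaggingClassifier(n_estimators=50, oob_score=True, random_state=1),
    RandomForestClassifier(n_estimators=50, oob_score=True, random_state=1),
]
for m in models:
    m.fit(X, y)

# weight each model's positive-class probability by its OOB accuracy
weights = np.array([m.oob_score_ for m in models])
probs = np.array([m.predict_proba(X)[:, 1] for m in models])
weighted = (weights[:, None] * probs).sum(axis=0) / weights.sum()
preds = (weighted >= 0.5).astype(int)
print(preds[:10])
```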
Evaluate using the following metrics:
Estimate the probability of the weighted voting
Modify the probability threshold and select the one that maximizes the F1-Score
Estimate a logistic regression using the estimated classifiers' predictions as input
Modify the probability threshold and select the one that maximizes the F1-Score
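These last two steps amount to stacking: base classifiers' probabilities become the features of a logistic-regression meta-model, whose threshold is then tuned for F1. The sketch below fits the meta-model on training predictions for brevity, which leaks information; in practice the base predictions should come from cross-validation (scikit-learn's `StackingClassifier` does this internally). All data here is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# base classifiers; their probabilities become inputs to the meta-model
base = [DecisionTreeClassifier(max_depth=5, random_state=1),
        RandomForestClassifier(n_estimators=50, random_state=1)]
for m in base:
    m.fit(X_train, y_train)

meta_train = np.column_stack([m.predict_proba(X_train)[:, 1] for m in base])
meta_test = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base])
stacker = LogisticRegression().fit(meta_train, y_train)

# choose the probability threshold that maximizes the F1-Score
probs = stacker.predict_proba(meta_test)[:, 1]
thresholds = np.arange(0.1, 0.9, 0.05)
f1s = [f1_score(y_test, (probs >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]
print(round(float(best_t), 2), round(max(f1s), 3))
```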