%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_graphviz
# read the data and set "datetime" as the index
bikes = pd.read_csv('../datasets/bikeshare.csv', index_col='datetime', parse_dates=True)
# "count" is a method, so it's best to rename that column
bikes.rename(columns={'count':'total'}, inplace=True)
# create "hour" as its own feature
bikes['hour'] = bikes.index.hour
bikes.head()
| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | total | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 3 | 13 | 16 | 0 |
| 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 8 | 32 | 40 | 1 |
| 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 5 | 27 | 32 | 2 |
| 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 3 | 10 | 13 | 3 |
| 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 0 | 1 | 1 | 4 |
bikes.tail()
| datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | casual | registered | total | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2012-12-19 19:00:00 | 4 | 0 | 1 | 1 | 15.58 | 19.695 | 50 | 26.0027 | 7 | 329 | 336 | 19 |
| 2012-12-19 20:00:00 | 4 | 0 | 1 | 1 | 14.76 | 17.425 | 57 | 15.0013 | 10 | 231 | 241 | 20 |
| 2012-12-19 21:00:00 | 4 | 0 | 1 | 1 | 13.94 | 15.910 | 61 | 15.0013 | 4 | 164 | 168 | 21 |
| 2012-12-19 22:00:00 | 4 | 0 | 1 | 1 | 13.94 | 17.425 | 61 | 6.0032 | 12 | 117 | 129 | 22 |
| 2012-12-19 23:00:00 | 4 | 0 | 1 | 1 | 13.12 | 16.665 | 66 | 8.9981 | 4 | 84 | 88 | 23 |
Run these two `groupby` statements and figure out what they tell you about the data.
# mean rentals for each value of "workingday"
bikes.groupby('workingday').total.mean()
workingday
0    188.506621
1    193.011873
Name: total, dtype: float64
# mean rentals for each value of "hour"
bikes.groupby('hour').total.mean()
hour
0      55.138462
1      33.859031
2      22.899554
3      11.757506
4       6.407240
5      19.767699
6      76.259341
7     213.116484
8     362.769231
9     221.780220
10    175.092308
11    210.674725
12    256.508772
13    257.787281
14    243.442982
15    254.298246
16    316.372807
17    468.765351
18    430.859649
19    315.278509
20    228.517544
21    173.370614
22    133.576754
23     89.508772
Name: total, dtype: float64
Run this plotting code, and make sure you understand the output. Then, separate this plot into two separate plots conditioned on "workingday". (In other words, one plot should display the hourly trend for "workingday=0", and the other should display the hourly trend for "workingday=1".)
# mean rentals for each value of "hour"
bikes.groupby('hour').total.mean().plot()
Plot for workingday == 0 and workingday == 1
# hourly rental trend for "workingday=0"
bikes[bikes.workingday==0].groupby('hour').total.mean().plot()
# hourly rental trend for "workingday=1"
bikes[bikes.workingday==1].groupby('hour').total.mean().plot()
# combine the two plots
bikes.groupby(['hour', 'workingday']).total.mean().unstack().plot()
Write about your findings
Fit a linear regression model to the entire dataset, using "total" as the response and "hour" and "workingday" as the only features. Then, print the coefficients and interpret them. What are the limitations of linear regression in this instance?
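A minimal sketch of this fit, using a small synthetic frame in place of the real `bikes` DataFrame so it runs standalone (the values in the toy frame are illustrative, not the actual data):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# toy stand-in for the bikes DataFrame; the notebook uses the real data
bikes = pd.DataFrame({
    'hour':       [0, 6, 8, 12, 17, 22],
    'workingday': [0, 1, 1, 0, 1, 0],
    'total':      [16, 76, 363, 257, 469, 134],
})

feature_cols = ['hour', 'workingday']
linreg = LinearRegression()
linreg.fit(bikes[feature_cols], bikes['total'])

# one coefficient per feature: the estimated change in "total" for a
# one-unit increase in that feature, holding the other constant
print(dict(zip(feature_cols, linreg.coef_)))
print(linreg.intercept_)
```

A key limitation to note in your interpretation: linear regression forces a single monotone relationship between "hour" and "total", so it cannot capture the two rush-hour peaks visible in the plots above.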
Create a decision tree to forecast "total" by manually splitting on the features "hour" and "workingday". The tree must have at least 6 leaf nodes.
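One way to sketch such a manual tree is a nested `if/else` that returns the mean of each group; the split points and predicted values below are illustrative placeholders, not the group means computed from the real data:

```python
def predict_total(hour, workingday):
    # hand-built tree with 6 leaves; replace the placeholder values with
    # the actual group means from bikes.groupby(...)
    if workingday == 1:
        if hour < 7:
            return 30      # leaf 1: working day, early morning
        elif hour < 10:
            return 350     # leaf 2: working day, morning rush
        elif hour < 16:
            return 230     # leaf 3: working day, midday
        else:
            return 400     # leaf 4: working day, evening rush
    else:
        if hour < 10:
            return 60      # leaf 5: non-working day, morning
        else:
            return 250     # leaf 6: non-working day, afternoon

print(predict_total(8, 1))
```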
Train a decision tree using scikit-learn and comment on the performance of the two models.
Predicting if a news story is going to be popular
df = pd.read_csv('../datasets/mashable.csv', index_col=0)
df.head()
| | url | timedelta | n_tokens_title | n_tokens_content | n_unique_tokens | n_non_stop_words | n_non_stop_unique_tokens | num_hrefs | num_self_hrefs | num_imgs | ... | min_positive_polarity | max_positive_polarity | avg_negative_polarity | min_negative_polarity | max_negative_polarity | title_subjectivity | title_sentiment_polarity | abs_title_subjectivity | abs_title_sentiment_polarity | Popular |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://mashable.com/2014/12/10/cia-torture-rep... | 28.0 | 9.0 | 188.0 | 0.732620 | 1.0 | 0.844262 | 5.0 | 1.0 | 1.0 | ... | 0.200000 | 0.80 | -0.487500 | -0.60 | -0.250000 | 0.9 | 0.8 | 0.4 | 0.8 | 1 |
| 1 | http://mashable.com/2013/10/18/bitlock-kicksta... | 447.0 | 7.0 | 297.0 | 0.653199 | 1.0 | 0.815789 | 9.0 | 4.0 | 1.0 | ... | 0.160000 | 0.50 | -0.135340 | -0.40 | -0.050000 | 0.1 | -0.1 | 0.4 | 0.1 | 0 |
| 2 | http://mashable.com/2013/07/24/google-glass-po... | 533.0 | 11.0 | 181.0 | 0.660377 | 1.0 | 0.775701 | 4.0 | 3.0 | 1.0 | ... | 0.136364 | 1.00 | 0.000000 | 0.00 | 0.000000 | 0.3 | 1.0 | 0.2 | 1.0 | 0 |
| 3 | http://mashable.com/2013/11/21/these-are-the-m... | 413.0 | 12.0 | 781.0 | 0.497409 | 1.0 | 0.677350 | 10.0 | 3.0 | 1.0 | ... | 0.100000 | 1.00 | -0.195701 | -0.40 | -0.071429 | 0.0 | 0.0 | 0.5 | 0.0 | 0 |
| 4 | http://mashable.com/2014/02/11/parking-ticket-... | 331.0 | 8.0 | 177.0 | 0.685714 | 1.0 | 0.830357 | 3.0 | 2.0 | 1.0 | ... | 0.100000 | 0.55 | -0.175000 | -0.25 | -0.100000 | 0.0 | 0.0 | 0.5 | 0.0 | 0 |
5 rows × 61 columns
df.shape
(6000, 61)
X = df.drop(['url', 'Popular'], axis=1)
y = df['Popular']
y.mean()
0.5
# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
Estimate a Decision Tree Classifier and a Logistic Regression
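A minimal sketch of the two estimators, using synthetic data in place of the mashable `X_train`/`X_test` split so it runs standalone (hyperparameters such as `max_depth=5` are illustrative choices, not prescribed by the exercise):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the mashable features and binary response
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(400, 5)),
                 columns=[f'f{i}' for i in range(5)])
y = (X['f0'] + rng.normal(0, 0.5, 400) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

tree = DecisionTreeClassifier(max_depth=5, random_state=1)
tree.fit(X_train, y_train)
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# held-out accuracy for each model
print(tree.score(X_test, y_test), logreg.score(X_test, y_test))
```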
Evaluate using the following metrics:
Estimate an ensemble of 300 bagged models (one per bootstrap sample)
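A sketch of the bagging step on synthetic data (the default base estimator of `BaggingClassifier` is a decision tree); `oob_score=True` records the out-of-bag accuracy, which the weighted-voting step below can reuse:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# synthetic stand-in for the mashable training set
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# 300 bootstrap samples, one tree fitted per sample
bag = BaggingClassifier(n_estimators=300, bootstrap=True,
                        oob_score=True, random_state=1)
bag.fit(X, y)
print(len(bag.estimators_), round(bag.oob_score_, 3))
```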
Estimate the following set of classifiers:
Estimate the probability as the percentage of models that predict positive
Modify the probability threshold and select the one that maximizes the F1-Score
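The two steps above can be sketched together: the probability is the fraction of bagged trees voting positive, and a grid of thresholds is swept to find the one with the best F1-Score (synthetic data stands in for the real split; the threshold grid is an illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

bag = BaggingClassifier(n_estimators=50, random_state=1)
bag.fit(X_train, y_train)

# fraction of the bagged trees that vote positive, per test row
votes = np.mean([est.predict(X_test) for est in bag.estimators_], axis=0)

# sweep thresholds and keep the one with the best F1-Score
thresholds = np.arange(0.1, 0.9, 0.05)
f1s = [f1_score(y_test, (votes >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]
print(round(float(best_t), 2), round(max(f1s), 3))
```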
Ensemble the classifiers using weighted voting, with weights derived from the OOB error
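One way to sketch the weighted vote: weight each model's positive-class probability by its out-of-bag accuracy (only bootstrap-based models expose `oob_score_`; the choice of two models here is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=1)

models = [
    BaggingClassifier(n_estimators=50, oob_score=True, random_state=1),
    RandomForestClassifier(n_estimators=50, oob_score=True, random_state=1),
]
for m in models:
    m.fit(X, y)

# weight each model's positive-class probability by its OOB accuracy
weights = np.array([m.oob_score_ for m in models])
probs = np.array([m.predict_proba(X)[:, 1] for m in models])
weighted = (weights[:, None] * probs).sum(axis=0) / weights.sum()
preds = (weighted >= 0.5).astype(int)
print(preds[:10])
```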
Evaluate using the following metrics:
Estimate the probability of the weighted voting
Modify the probability threshold and select the one that maximizes the F1-Score
Estimate a logistic regression using the estimated classifiers' predictions as input
Modify the probability threshold and select the one that maximizes the F1-Score
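These last two steps amount to stacking: base classifiers' probabilities become the features of a logistic-regression meta-model, whose threshold is then tuned for F1. The sketch below fits the meta-model on training predictions for brevity, which leaks information; in practice the base predictions should come from cross-validation (scikit-learn's `StackingClassifier` does this internally). All data here is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# base classifiers; their probabilities become inputs to the meta-model
base = [DecisionTreeClassifier(max_depth=5, random_state=1),
        RandomForestClassifier(n_estimators=50, random_state=1)]
for m in base:
    m.fit(X_train, y_train)

meta_train = np.column_stack([m.predict_proba(X_train)[:, 1] for m in base])
meta_test = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base])
stacker = LogisticRegression().fit(meta_train, y_train)

# choose the probability threshold that maximizes the F1-Score
probs = stacker.predict_proba(meta_test)[:, 1]
thresholds = np.arange(0.1, 0.9, 0.05)
f1s = [f1_score(y_test, (probs >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]
print(round(float(best_t), 2), round(max(f1s), 3))
```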