Capstone project "Identification of Internet Users", Week 4
Week 4. Comparison of classification algorithms
Now we will finally approach the training of classification models, compare several algorithms on cross-validation, and figure out which session length parameters (session_length and window_size) are better to use. Also, for the selected algorithm, we will construct validation curves (how the classification quality depends on one of the hyperparameters of the algorithm) and learning curves (how the classification quality depends on the sample size).
Plan for week 4:
In this part of the project, video recordings of the following lectures of the course "Learning from marked data" may be useful:
# pip install watermark
#%load_ext watermark
#%watermark -v -m -p numpy,scipy,pandas,matplotlib,statsmodels,sklearn -g
from __future__ import division, print_function
# disable any Anaconda warnings
import warnings
warnings.filterwarnings('ignore')
from time import time
import itertools
import os
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
import pickle
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score, f1_score
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# set my own path to the data
PATH_TO_DATA = '/content/drive/MyDrive/DATA/Stepik/capstone_user_identification'
Let's load the previously serialized objects X_sparse_10users and y_10users, corresponding to the training sample for 10 users.
with open(os.path.join(PATH_TO_DATA,
                       'X_sparse_10users.pkl'), 'rb') as X_sparse_10users_pkl:
    X_sparse_10users = pickle.load(X_sparse_10users_pkl)
with open(os.path.join(PATH_TO_DATA,
                       'y_10users.pkl'), 'rb') as y_10users_pkl:
    y_10users = pickle.load(y_10users_pkl)
There are more than 14 thousand sessions and almost 5 thousand unique visited sites.
X_sparse_10users.shape
(14061, 4913)
Let's split the sample into 2 parts. On one we will carry out cross-validation, on the second we will evaluate the model trained after cross-validation.
# Due to a training error on sparse matrices
# X_sparse_10users = X_sparse_10users.todense()
X_train, X_valid, y_train, y_valid = train_test_split(X_sparse_10users, y_10users,
test_size=0.3,
random_state=17, stratify=y_10users)
Let's set the type of cross-validation in advance: 3-fold, with shuffling, and random_state=17 for reproducibility.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)
Auxiliary function for drawing validation curves after running GridSearchCV (or RandomizedSearchCV).
def plot_validation_curves(param_values, grid_cv_results_):
    # mean and standard deviation of the train and validation scores across folds
    train_mu, train_std = grid_cv_results_['mean_train_score'], grid_cv_results_['std_train_score']
    valid_mu, valid_std = grid_cv_results_['mean_test_score'], grid_cv_results_['std_test_score']
    train_line = plt.plot(param_values, train_mu, '-', label='train', color='green')
    valid_line = plt.plot(param_values, valid_mu, '-', label='test', color='red')
    # shaded bands show +/- one standard deviation around each curve
    plt.fill_between(param_values, train_mu - train_std, train_mu + train_std, edgecolor='none',
                     facecolor=train_line[0].get_color(), alpha=0.2)
    plt.fill_between(param_values, valid_mu - valid_std, valid_mu + valid_std, edgecolor='none',
                     facecolor=valid_line[0].get_color(), alpha=0.2)
    plt.legend()
1. Let's train a KNeighborsClassifier with 100 nearest neighbors (leaving the other parameters at their defaults, except n_jobs=-1 for parallelization) and look at the accuracy on 3-fold cross-validation (for reproducibility, we use the StratifiedKFold object skf) on the sample (X_train, y_train) and separately on the sample (X_valid, y_valid).
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from time import time
knn = KNeighborsClassifier(n_neighbors=100)
t_start = time()
knn_cv_score = cross_val_score(knn, X_train, y_train, cv=skf, n_jobs=-1)
print("CV scores:", knn_cv_score)
print("mean:", np.mean(knn_cv_score))
print("std:", np.std(knn_cv_score))
print("Time elapsed: ", time()-t_start)
CV scores: [0.56598598 0.55409936 0.55792683]
mean: 0.5593373897012363
std: 0.004954135909104914
Time elapsed: 3.559767246246338
t_start = time()
scores = cross_val_score(knn, X_valid, y_valid, cv=skf)
print("CV scores:", scores)
print("mean:", np.mean(scores))
print("std:", np.std(scores))
print("Time elapsed: ", time()-t_start)
CV scores: [0.50817342 0.51351351 0.48435277]
mean: 0.5020132353203838
std: 0.012676699846532288
Time elapsed: 0.5638549327850342
Important: I had a problem with my code at this point: cross_val_score either failed and returned NaN or, when I converted the matrix with todense, ran for too long.
Question 1. Let's calculate the accuracy of the KNeighborsClassifier on cross-validation and on the hold-out sample, rounding each to 3 decimal places.
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_valid)
print(f"KNN Cross-Validation Score: {knn_cv_score.mean():.3f}")
print(f"KNN Validation Score: {accuracy_score(y_valid, knn_pred):.3f}")
KNN Cross-Validation Score: 0.559
KNN Validation Score: 0.584
2. Let's train a random forest (RandomForestClassifier) of 100 trees (random_state=17 for reproducibility). Let's look at the OOB score (for which we set oob_score=True right away) and the accuracy on the sample (X_valid, y_valid). For parallelization, set n_jobs=-1.
Question 2. Calculate the accuracy of RandomForestClassifier under out-of-bag evaluation and on the hold-out sample.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=17, oob_score=True, n_jobs=-1)
t_start = time()
clf.fit(X_train, y_train)
print('Score: ', clf.score(X_train, y_train))
print("Time elapsed: ", time()-t_start)
Score: 0.9759195285511075
Time elapsed: 11.684211492538452
print(clf.oob_score_)
0.7172322698638488
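Some context on why the OOB score can stand in for a validation score: with the default bootstrap settings, each tree is fit on a sample drawn with replacement from the training rows, so it never sees roughly 1/e (about 36.8%) of them, and those unseen rows are what the tree is scored on. A minimal NumPy sketch of that fraction follows; the sample size and seed are hypothetical and unrelated to the project data.

```python
import numpy as np


def oob_fraction(n_samples, seed=0):
    """Fraction of rows that one bootstrap sample of size n_samples never draws."""
    rng = np.random.default_rng(seed)
    # draw n_samples row indices with replacement, as bagging does
    drawn = rng.integers(0, n_samples, size=n_samples)
    in_bag = np.zeros(n_samples, dtype=bool)
    in_bag[drawn] = True
    return 1.0 - in_bag.mean()


frac = oob_fraction(10_000)
print(f"out-of-bag fraction: {frac:.3f} (theory: {np.exp(-1):.3f})")
```

Because every row is out-of-bag for about a third of the trees, the forest can score each training row using only the trees that did not train on it.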
t_start = time()
clf_pred = clf.predict(X_valid)
print(accuracy_score(y_valid, clf_pred))
print("Time elapsed: ", time()-t_start)
0.7312159279450107
Time elapsed: 0.20986366271972656
3. Let's train Logistic Regression with the default parameter C and random_state=17 (for reproducibility). Let's look at the accuracy on cross-validation (using the skf object created earlier) and on the sample (X_valid, y_valid). For parallelization, set n_jobs=-1.
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
logit = LogisticRegression(random_state=17, n_jobs=-1)
t_start = time()
logit_cv_score = cross_val_score(logit, X_train, y_train, cv=skf)
logit.fit(X_train, y_train)
logit_y_pred = logit.predict(X_valid)
logit_val_score = accuracy_score(y_valid, logit_y_pred)
print(f"LogReg Cross-Validation Score: {logit_cv_score.mean():.3f}")
print(f"LogReg Validation Score: {logit_val_score:.3f}")
print("Time elapsed: ", time()-t_start)
LogReg Cross-Validation Score: 0.761
LogReg Validation Score: 0.777
Time elapsed: 10.484177112579346
Using LogisticRegressionCV, we will select the parameter C for Logistic Regression first in a wide range: 10 values from 1e-4 to 1e2, use logspace from NumPy. Specify the LogisticRegressionCV parameters multi_class='multinomial' and random_state=17. For cross-validation, we use the skf object created earlier. For parallelization, set n_jobs= -1.
At the end, we will draw validation curves for parameter C.
t_start = time()
logit_c_values1 = np.logspace(-4, 2, 10)
logit_grid_searcher1 = LogisticRegressionCV(Cs=logit_c_values1, cv=skf,
multi_class="multinomial",
random_state=17)
logit_grid_searcher1.fit(X_train, y_train)
print("Time elapsed: ", time()-t_start)
Time elapsed: 91.80030512809753
The mean cross-validation accuracy for each of the 10 values of the parameter C.
logit_mean_cv_scores1 = np.array(
list(logit_grid_searcher1.scores_.values())).mean(axis=(0, 1))
logit_mean_cv_scores1
array([0.31954964, 0.47307397, 0.55202236, 0.64875035, 0.71438846, 0.75177962, 0.76092382, 0.75848551, 0.749849 , 0.74029823])
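The mean(axis=(0, 1)) call above works because, as the scikit-learn documentation describes it, scores_ is a dict keyed by class whose values are arrays of shape (n_folds, n_Cs), and with multi_class='multinomial' the same scores are repeated for every class. The sketch below mimics that layout with made-up numbers (the class names and scores are hypothetical) to show that averaging over classes and folds leaves one mean score per value of C.

```python
import numpy as np

n_folds, n_cs = 3, 4
rng = np.random.default_rng(17)
# hypothetical per-fold accuracies for each candidate C
fold_scores = rng.uniform(0.3, 0.8, size=(n_folds, n_cs))

# multinomial case: the same (n_folds, n_cs) grid is stored under every class key
fake_scores = {label: fold_scores.copy() for label in ["user_a", "user_b", "user_c"]}

# stack into (n_classes, n_folds, n_cs), then average out classes and folds
stacked = np.array(list(fake_scores.values()))
mean_per_c = stacked.mean(axis=(0, 1))

print(stacked.shape, mean_per_c.shape)
```

Since the per-class grids are identical here, the result equals the plain per-fold mean, which is why the index of the maximum can be used directly to look up the best C in Cs_.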
We will output the best cross-validation accuracy and the corresponding value of C.
best_score1 = np.max(logit_mean_cv_scores1)
best_C1 = logit_grid_searcher1.Cs_[np.argmax(logit_mean_cv_scores1)]
print(f"Best Score: {best_score1}")
print(f"Best C: {best_C1}")
Best Score: 0.7609238210638737
Best C: 1.0
Let's plot the cross-validation accuracy as a function of C.
plt.plot(logit_c_values1, logit_mean_cv_scores1);
Now the same thing, only the values of parameter 'C' are iterated over in the range np.linspace(0.1, 7, 20). Let's draw validation curves again, determine the maximum value of the proportion of correct answers on cross-validation.
t_start = time()
logit_c_values2 = np.linspace(0.1, 7, 20)
logit_grid_searcher2 = LogisticRegressionCV(Cs=logit_c_values2, cv=skf, multi_class='multinomial', random_state=17, n_jobs=-1)
logit_grid_searcher2.fit(X_train, y_train)
print("Time elapsed: ", time()-t_start)
Time elapsed: 105.8310170173645
The mean cross-validation accuracy for each of the 20 values of the parameter C.
logit_mean_cv_scores2 = np.array(
list(logit_grid_searcher2.scores_.values())).mean(axis=(0, 1))
logit_mean_cv_scores2
array([0.73481117, 0.75919655, 0.76102545, 0.76082216, 0.76133023, 0.76143192, 0.75990775, 0.75929811, 0.76000937, 0.75939977, 0.75919661, 0.75868861, 0.757571 , 0.75736787, 0.75716462, 0.75614861, 0.75553901, 0.75513262, 0.75401502, 0.75350692])
We will output the best cross-validation accuracy and the corresponding value of C.
best_score2 = np.max(logit_mean_cv_scores2)
best_C2 = logit_grid_searcher2.Cs_[np.argmax(logit_mean_cv_scores2)]
print(f"Best Score: {best_score2}")
print(f"Best C: {best_C2}")
Best Score: 0.761431920171076
Best C: 1.9157894736842107
Let's plot the cross-validation accuracy as a function of C.
plt.plot(logit_c_values2, logit_mean_cv_scores2);
We output the accuracy on the sample (X_valid, y_valid) for logistic regression with the best value of C found.
t_start = time()
logit = LogisticRegression(C=best_C2, n_jobs=-1, random_state=17)
logit_cv_score = cross_val_score(logit, X_train, y_train, cv=skf, n_jobs=-1)
logit.fit(X_train, y_train)
logit_y_pred = logit.predict(X_valid)
logit_val_score = accuracy_score(y_valid, logit_y_pred)
print("Time elapsed: ", time()-t_start)
Time elapsed: 9.804409980773926
Question 3. Let's calculate the accuracy for logit_grid_searcher2 on cross-validation for the best value of the parameter C and on the hold-out sample. Round each to 3 decimal places and print them separated by a space.
print(f"LogReg Cross-Validation Score: {logit_cv_score.mean():.3f}")
print(f"LogReg Validation Score: {logit_val_score:.3f}")
LogReg Cross-Validation Score: 0.762
LogReg Validation Score: 0.782
4. Let's train a linear SVM (LinearSVC) with C=1 and random_state=17 (for reproducibility). Let's look at the accuracy on cross-validation (using the skf object created earlier) and on the sample (X_valid, y_valid).
from sklearn.svm import LinearSVC
t_start = time()
svm = LinearSVC(C=1, random_state=17)
scores_svm = cross_val_score(svm, X_train, y_train, cv=skf, n_jobs=-1)
print("CV scores:", scores_svm)
print("mean:", np.mean(scores_svm))
print("std:", np.std(scores_svm))
print("Time elapsed: ", time()-t_start)
CV scores: [0.75068577 0.73270344 0.7695122 ]
mean: 0.7509671352428245
std: 0.015028426724668621
Time elapsed: 3.3960418701171875
t_start = time()
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_valid)
print(accuracy_score(y_valid, svm_pred))
print("Time elapsed: ", time()-t_start)
0.7769613652524295
Time elapsed: 2.0046310424804688
Using GridSearchCV, we will select the parameter C for the SVM, first over a wide range: 10 values from 1e-4 to 1e4, generated with linspace from NumPy. Let's draw validation curves.
%%time
svm_params1 = {
"C": np.linspace(1e-4, 1e4, 10)
}
svm_grid_searcher1 = GridSearchCV(estimator=svm, cv=skf, param_grid=svm_params1,
return_train_score=True)
svm_grid_searcher1.fit(X_train, y_train)
CPU times: user 49.9 s, sys: 34.8 ms, total: 50 s Wall time: 49.7 s
t_start = time()
svm_params1 = {'C': np.linspace(1e-4, 1e4, 10)}
svm_grid_searcher1 = GridSearchCV(svm, param_grid=svm_params1, cv=skf, return_train_score=True, n_jobs=-1)
svm_grid_searcher1.fit(X_train, y_train)
print("Time elapsed: ", time()-t_start)
Time elapsed: 44.00320863723755
We will output the best cross-validation accuracy and the corresponding value of C.
svm_grid_searcher1.best_params_
{'C': 5555.555600000001}
Let's plot the cross-validation accuracy as a function of C.
plot_validation_curves(svm_params1['C'], svm_grid_searcher1.cv_results_)
But recall that with the default regularization parameter (C=1) the cross-validation accuracy is higher. This is a (not uncommon) case where one can go wrong by searching over the wrong range: we took a uniform grid over a large interval and skipped the region where the genuinely good values of C lie. Here it makes much more sense to select C in the vicinity of 1; besides, the model trains faster this way than with large C.
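To see concretely why the uniform grid misses the useful region, the sketch below counts how many grid points fall between 1e-3 and 10. The logarithmic grid and the counting helper are illustrative assumptions, not part of the original search.

```python
import numpy as np

uniform_grid = np.linspace(1e-4, 1e4, 10)  # the grid used in the first SVM search
log_grid = np.logspace(-4, 4, 10)          # hypothetical alternative over the same interval


def count_in_range(grid, low=1e-3, high=10.0):
    """Number of grid points inside [low, high]."""
    return int(np.sum((grid >= low) & (grid <= high)))


print("uniform grid:", np.round(uniform_grid, 4))
print("log grid:    ", log_grid)
print("points in [1e-3, 10]:",
      count_in_range(uniform_grid), "(uniform) vs", count_in_range(log_grid), "(log)")
```

The uniform grid jumps from 1e-4 straight to about 1111, so no candidate lands anywhere near C=1, whereas a logarithmic grid spreads its points evenly across orders of magnitude.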
Using GridSearchCV, we will select the parameter C for the SVM in the range (1e-3, 1): 30 values generated with linspace from NumPy. Let's draw validation curves.
%%time
svm_params2 = {'C': np.linspace(1e-3, 1, 30)}
svm_grid_searcher2 = GridSearchCV(svm, param_grid=svm_params2, cv=skf, return_train_score=True, n_jobs=-1)
svm_grid_searcher2.fit(X_train, y_train)
CPU times: user 1.77 s, sys: 150 ms, total: 1.92 s Wall time: 1min 18s
Output the best cross-validation accuracy and the corresponding value of C.
best_score = svm_grid_searcher2.best_score_
best_params = svm_grid_searcher2.best_params_
print(f"Best Score: {best_score}")
print(f"Best Params: {best_params}")
Best Score: 0.7670206386611259
Best Params: {'C': 0.10434482758620689}
Let's plot the cross-validation accuracy as a function of C.
plot_validation_curves(svm_params2['C'], svm_grid_searcher2.cv_results_)
Output the accuracy on the sample (X_valid, y_valid) for LinearSVC with the best value of C found.
%%time
svm = LinearSVC(**best_params, random_state=17)
svm_cv_score = cross_val_score(svm, X_train, y_train, cv=skf, n_jobs=-1)
svm.fit(X_train, y_train)
svm_y_pred = svm.predict(X_valid)
svm_val_score = accuracy_score(y_valid, svm_y_pred)
CPU times: user 683 ms, sys: 8.26 ms, total: 691 ms Wall time: 1.74 s
Question 4. Let's calculate the accuracy for svm_grid_searcher2 on cross-validation for the best value of the parameter C and on the hold-out sample. Round each to 3 decimal places and print them separated by a space.
print(f"SVC Cross-Validation Score: {svm_cv_score.mean():.3f}")
print(f"SVC Validation Score: {svm_val_score:.3f}")
SVC Cross-Validation Score: 0.767
SVC Validation Score: 0.781
Let's take LinearSVC, which showed the best cross-validation quality in part 1, and check it on 8 more samples for 10 users (with different combinations of the parameters session_length and window_size). Since there is much more computation here, we will not re-select the regularization parameter C each time.
Let's define the function model_assessment, documented in its docstring below. The split of the sample with train_test_split should be stratified.
def model_assessment(estimator, path_to_X_pickle, path_to_y_pickle, cv, random_state=17, test_size=0.3):
    '''
    Estimates CV-accuracy for (1 - test_size) share of (X_sparse, y)
    loaded from path_to_X_pickle and path_to_y_pickle and holdout accuracy for (test_size) share of (X_sparse, y).
    The split is made with stratified train_test_split with params random_state and test_size.
    :param estimator – Scikit-learn estimator (classifier or regressor)
    :param path_to_X_pickle – path to pickled sparse X (instances and their features)
    :param path_to_y_pickle – path to pickled y (responses)
    :param cv – cross-validation as in cross_val_score (use StratifiedKFold here)
    :param random_state – for train_test_split
    :param test_size – for train_test_split
    :returns mean CV-accuracy for (X_train, y_train) and accuracy for (X_valid, y_valid) where (X_train, y_train)
    and (X_valid, y_valid) are (1 - test_size) and (test_size) shares of (X_sparse, y).
    '''
    with open(path_to_X_pickle, 'rb') as X_pkl:
        X_sparse = pickle.load(X_pkl)
    with open(path_to_y_pickle, 'rb') as y_pkl:
        y = pickle.load(y_pkl)
    # stratified split that honours the test_size and random_state arguments
    X_train, X_valid, y_train, y_valid = train_test_split(X_sparse, y,
                                                          test_size=test_size,
                                                          random_state=random_state, stratify=y)
    # cross-validate with the splitter passed as cv rather than the global skf
    cv_scores = cross_val_score(estimator, X_train, y_train, cv=cv, n_jobs=-1)
    # the timer covers only the final fit and the hold-out prediction
    t_start = time()
    estimator.fit(X_train, y_train)
    valid_pred = estimator.predict(X_valid)
    return (np.mean(cv_scores), accuracy_score(y_valid, valid_pred), " Time elapsed: ", time() - t_start)
Let's make sure that the function works.
model_assessment(svm_grid_searcher2.best_estimator_,
os.path.join(PATH_TO_DATA, 'X_sparse_10users.pkl'),
os.path.join(PATH_TO_DATA, 'y_10users.pkl'), skf, random_state=17, test_size=0.3)
(0.7670206386611259, 0.7807537331121118, ' Time elapsed: ', 0.6428191661834717)
Let's apply the model_assessment function to the best algorithm from the previous part (namely, svm_grid_searcher2.best_estimator_) and to 9 samples with different combinations of the parameters session_length and window_size for 10 users. In the loop we print window_size and session_length, followed by the result of model_assessment.
Here, for convenience, it is worth creating copies of previously created pickle files X_sparse_10users.pkl, X_sparse_150users.pkl, y_10users.pkl and y_150users.pkl, adding s10_w10 to their names, which means the session length is 10 and the window width is 10.
!cp $PATH_TO_DATA/X_sparse_10users.pkl $PATH_TO_DATA/X_sparse_10users_s10_w10.pkl
!cp $PATH_TO_DATA/X_sparse_150users.pkl $PATH_TO_DATA/X_sparse_150users_s10_w10.pkl
!cp $PATH_TO_DATA/y_10users.pkl $PATH_TO_DATA/y_10users_s10_w10.pkl
!cp $PATH_TO_DATA/y_150users.pkl $PATH_TO_DATA/y_150users_s10_w10.pkl
#for 10 users
%%time
estimator = svm_grid_searcher2.best_estimator_
for window_size, session_length in itertools.product([10, 7, 5], [15, 10, 7, 5]):
    if window_size <= session_length:
        path_to_X_pkl = os.path.join(
            PATH_TO_DATA, f"X_sparse_10users_s{session_length}_w{window_size}.pkl")
        path_to_y_pkl = os.path.join(
            PATH_TO_DATA, f"y_10users_s{session_length}_w{window_size}.pkl")
        print(window_size, session_length,
              model_assessment(estimator=estimator,
                               path_to_X_pickle=path_to_X_pkl,
                               path_to_y_pickle=path_to_y_pkl,
                               cv=skf))
10 15 (0.8243252292702751, 0.8404835269021095, ' Time elapsed: ', 1.0584142208099365)
10 10 (0.7670206386611259, 0.7807537331121118, ' Time elapsed: ', 0.6238193511962891)
7 15 (0.8495024256089474, 0.8543222166915547, ' Time elapsed: ', 1.623363971710205)
7 10 (0.7983645917156946, 0.8073668491786958, ' Time elapsed: ', 0.974440336227417)
7 7 (0.754765400423003, 0.7617388418782147, ' Time elapsed: ', 0.5486903190612793)
5 15 (0.8670355547005402, 0.8752963489805595, ' Time elapsed: ', 2.1419732570648193)
5 10 (0.8177520250854086, 0.8245614035087719, ' Time elapsed: ', 1.218752145767212)
5 7 (0.772939529035208, 0.7853247984826932, ' Time elapsed: ', 0.761998176574707)
5 5 (0.7254849424351582, 0.7362494073020389, ' Time elapsed: ', 0.501572847366333)
CPU times: user 11.2 s, sys: 107 ms, total: 11.3 s
Wall time: 33.2 s
#for 150 users
%%time
estimator = svm_grid_searcher2.best_estimator_
for window_size, session_length in itertools.product([10, 7, 5], [15, 10, 7, 5]):
    if window_size <= session_length:
        path_to_X_pkl = os.path.join(
            PATH_TO_DATA, f"X_sparse_150users_s{session_length}_w{window_size}.pkl")
        path_to_y_pkl = os.path.join(
            PATH_TO_DATA, f"y_150users_s{session_length}_w{window_size}.pkl")
        print(window_size, session_length,
              model_assessment(estimator=estimator,
                               path_to_X_pickle=path_to_X_pkl,
                               path_to_y_pickle=path_to_y_pkl,
                               cv=skf))
10 15 (0.5488098589346596, 0.5751471804602735, ' Time elapsed: ', 223.8541784286499)
10 10 (0.46308633866107823, 0.4836276942538802, ' Time elapsed: ', 136.27440857887268)
7 15 (0.5828479247872232, 0.6084920121265797, ' Time elapsed: ', 309.6582760810852)
7 10 (0.5015547672228792, 0.5239295568348264, ' Time elapsed: ', 213.65414190292358)
7 7 (0.43694798464211154, 0.45307763054808053, ' Time elapsed: ', 149.04103302955627)
5 15 (0.6139887051608968, 0.6360295906945053, ' Time elapsed: ', 467.08626675605774)
5 10 (0.5265866745928696, 0.5458826106000876, ' Time elapsed: ', 317.86096453666687)
5 7 (0.46509602699080665, 0.48189516717769015, ' Time elapsed: ', 195.35350489616394)
5 5 (0.4084080325808655, 0.4217282328320436, ' Time elapsed: ', 142.6359510421753)
CPU times: user 36min 40s, sys: 7.97 s, total: 36min 48s
Wall time: 1h 29min 29s
Question 5. Let's calculate the accuracy for LinearSVC with the tuned parameter C on the sample X_sparse_10users_s15_w5. We report the accuracy on cross-validation and on the hold-out sample. Round each to 3 decimal places and print them separated by a space.
%%time
estimator = svm_grid_searcher2.best_estimator_
path_to_X_pkl = os.path.join(PATH_TO_DATA, "X_sparse_10users_s15_w5.pkl")
path_to_y_pkl = os.path.join(PATH_TO_DATA, "y_10users_s15_w5.pkl")
with open(path_to_X_pkl, 'rb') as X_sparse_10users_pkl:
    X_sparse_10users = pickle.load(X_sparse_10users_pkl)
with open(path_to_y_pkl, 'rb') as y_10users_pkl:
    y_10users = pickle.load(y_10users_pkl)
X_train, X_valid, y_train, y_valid = train_test_split(X_sparse_10users, y_10users,
test_size=0.3,
random_state=17, stratify=y_10users)
scores_svm = cross_val_score(estimator, X_train, y_train, cv=skf, n_jobs=-1)
estimator.fit(X_train, y_train)
svm_pred = estimator.predict(X_valid)
print(f"SVC Cross-Validation Score: {np.mean(scores_svm):.3f}")
print(f"SVC Validation Score: {accuracy_score(y_valid, svm_pred):.3f}")
SVC Cross-Validation Score: 0.867
SVC Validation Score: 0.875
CPU times: user 2.39 s, sys: 25.6 ms, total: 2.41 s
Wall time: 6.38 s
Make a conclusion about how the quality of classification depends on the length of the session and the width of the window.
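One way to support that conclusion is to rank the 10-user results and check the trend within each window size. In the sketch below, the cross-validation accuracies are copied from the 10-user output above and rounded to 4 decimals; the variable and helper names are illustrative.

```python
# CV accuracy for 10 users keyed by (window_size, session_length),
# copied from the printed results and rounded to 4 decimals
cv_accuracy = {
    (10, 15): 0.8243, (10, 10): 0.7670,
    (7, 15): 0.8495, (7, 10): 0.7984, (7, 7): 0.7548,
    (5, 15): 0.8670, (5, 10): 0.8178, (5, 7): 0.7729, (5, 5): 0.7255,
}

# rank the configurations from best to worst
ranked = sorted(cv_accuracy.items(), key=lambda item: item[1], reverse=True)
for (window_size, session_length), acc in ranked:
    print(f"window={window_size:>2} session={session_length:>2} cv_acc={acc:.4f}")

best_window, best_session = ranked[0][0]


def longer_sessions_help(results):
    """True if, for every window size, accuracy grows with session length."""
    for w in {window for window, _ in results}:
        by_session = sorted((s, a) for (ww, s), a in results.items() if ww == w)
        accs = [a for _, a in by_session]
        if accs != sorted(accs):
            return False
    return True


print("longer sessions always help:", longer_sessions_help(cv_accuracy))
```

On these numbers, accuracy rises with session length for every window size, and for a fixed session length a narrower window (which produces more, overlapping sessions) scores higher.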
%%time
estimator = svm_grid_searcher2.best_estimator_
for window_size, session_length in [(5, 5), (7, 7), (10, 10)]:
    path_to_X_pkl = os.path.join(
        PATH_TO_DATA, f"X_sparse_150users_s{session_length}_w{window_size}.pkl")
    path_to_y_pkl = os.path.join(
        PATH_TO_DATA, f"y_150users_s{session_length}_w{window_size}.pkl")
    print(window_size, session_length,
          model_assessment(estimator=estimator,
                           path_to_X_pickle=path_to_X_pkl,
                           path_to_y_pickle=path_to_y_pkl,
                           cv=skf))
5 5 (0.4084080325808655, 0.4217282328320436, ' Time elapsed: ', 133.36306834220886)
7 7 (0.43694798464211154, 0.45307763054808053, ' Time elapsed: ', 120.4077639579773)
10 10 (0.46308633866107823, 0.4836276942538802, ' Time elapsed: ', 114.10968589782715)
CPU times: user 6min 16s, sys: 1.8 s, total: 6min 17s
Wall time: 16min 33s
Question 6. Calculate the accuracy for LinearSVC with the tuned parameter C on the sample X_sparse_150users. Report the accuracy on cross-validation and on the hold-out sample. Round each to 3 decimal places and print them separated by a space.
%%time
estimator = svm_grid_searcher2.best_estimator_
path_to_X_pkl = os.path.join(PATH_TO_DATA, "X_sparse_150users.pkl")
path_to_y_pkl = os.path.join(PATH_TO_DATA, "y_150users.pkl")
with open(path_to_X_pkl, 'rb') as X_sparse_150users_pkl:
    X_sparse_150users = pickle.load(X_sparse_150users_pkl)
with open(path_to_y_pkl, 'rb') as y_150users_pkl:
    y_150users = pickle.load(y_150users_pkl)
X_train, X_valid, y_train, y_valid = train_test_split(X_sparse_150users, y_150users,
test_size=0.3,
random_state=17, stratify=y_150users)
scores_svm = cross_val_score(estimator, X_train, y_train, cv=skf, n_jobs=-1)
estimator.fit(X_train, y_train)
svm_pred = estimator.predict(X_valid)
print(f"SVC Cross-Validation Score: {np.mean(scores_svm):.3f}")
print(f"SVC Validation Score: {accuracy_score(y_valid, svm_pred):.3f}")
SVC Cross-Validation Score: 0.463
SVC Validation Score: 0.484
CPU times: user 1min 46s, sys: 339 ms, total: 1min 47s
Wall time: 4min 37s
The multiclass accuracy on the 150-user sample is low, which may be disappointing, so let's take comfort in the fact that a particular user can be identified reasonably well.
Let's load the previously serialized objects X_sparse_150users and y_150users, corresponding to the training sample for 150 users with parameters (session_length, window_size) = (10, 10), and split them 70%/30% in exactly the same way as before.
with open(os.path.join(PATH_TO_DATA, 'X_sparse_150users.pkl'), 'rb') as X_sparse_150users_pkl:
    X_sparse_150users = pickle.load(X_sparse_150users_pkl)
with open(os.path.join(PATH_TO_DATA, 'y_150users.pkl'), 'rb') as y_150users_pkl:
    y_150users = pickle.load(y_150users_pkl)
X_train_150, X_valid_150, y_train_150, y_valid_150 = train_test_split(X_sparse_150users,
y_150users, test_size=0.3,
random_state=17, stratify=y_150users)
Let's train LogisticRegressionCV for a single value of the parameter C (the best one found on cross-validation in part 1; use the exact value, not an eyeballed one). We will now solve 150 one-vs-all problems, so we specify multi_class='ovr'. As always, where possible, set n_jobs=-1 and random_state=17.
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
best_C2_tmp = 1.9157894736842107
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)
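# strip the 4-character prefix from each label and keep the numeric user ID
y_train_150_tmp = []
for i in y_train_150:
    y_train_150_tmp.append(int(i[4:]))
# convert to an integer array (np.int was removed in NumPy 1.24; the builtin int is equivalent)
y_train_150_work = np.array(y_train_150_tmp, dtype=int)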
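The slice i[4:] above assumes that every label is a 4-character prefix followed by digits. The sketch below uses hypothetical labels of the form 'user0006' (the actual label format is not shown in this section) and a hypothetical helper that fails loudly if a label does not match, which can be safer than a silent bad slice.

```python
import numpy as np


def label_to_user_id(label, prefix_len=4):
    """Drop a fixed-length prefix and parse the rest as an integer user ID."""
    digits = label[prefix_len:]
    if not digits.isdigit():
        raise ValueError(f"unexpected label format: {label!r}")
    return int(digits)


# hypothetical labels, for illustration only
example_labels = ["user0006", "user0013", "user2902"]
user_ids = np.array([label_to_user_id(s) for s in example_labels], dtype=int)
print(user_ids)
```

Integer labels are needed here mainly because np.bincount, used further on to compute the class distribution, only accepts non-negative integers.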
%%time
logit_cv_150users = LogisticRegressionCV(Cs=[best_C2_tmp], cv=skf, multi_class="ovr",
n_jobs=-1, random_state=17)
logit_cv_150users.fit(X_train_150, y_train_150_work)
CPU times: user 6min 55s, sys: 7min 30s, total: 14min 25s Wall time: 15min 25s
Look at the mean cross-validation accuracy for the task of identifying each user individually.
cv_scores_by_user = logit_cv_150users.scores_
for user_id in logit_cv_150users.scores_:
    print(f"User {user_id}, CV score: {cv_scores_by_user[user_id].mean()}")
User 6, CV score: 0.996058928403866 User 13, CV score: 0.9963091551718745 User 15, CV score: 0.9952352652925046 User 16, CV score: 0.9918676300397236 User 28, CV score: 0.9903766955470061 User 31, CV score: 0.9943907499504759 User 33, CV score: 0.9937651830304546 User 39, CV score: 0.9858621876075193 User 46, CV score: 0.9980398903172666 User 49, CV score: 0.9952039869465036 User 50, CV score: 0.9943386193738076 User 53, CV score: 0.9937130524537862 User 65, CV score: 0.9969451482072295 User 66, CV score: 0.9948077945638234 User 82, CV score: 0.996361285748543 User 85, CV score: 0.9963717118638766 User 89, CV score: 0.9908667229676894 User 92, CV score: 0.994422028296477 User 100, CV score: 0.9944741588731455 User 102, CV score: 0.9911586541970326 User 103, CV score: 0.980565721018006 User 105, CV score: 0.9969034437458948 User 106, CV score: 0.9948494990251583 User 118, CV score: 0.990950131890359 User 119, CV score: 0.9965906602858841 User 120, CV score: 0.994286488797139 User 126, CV score: 0.995058021331832 User 127, CV score: 0.9915965510410477 User 128, CV score: 0.9846944626901463 User 138, CV score: 0.9970598354759 User 158, CV score: 0.9970598354759 User 160, CV score: 0.9968200348232252 User 165, CV score: 0.997362192820577 User 172, CV score: 0.9964863991325471 User 177, CV score: 0.9967783303618906 User 203, CV score: 0.9975707151272508 User 207, CV score: 0.9877805928289178 User 223, CV score: 0.9965489558245494 User 233, CV score: 0.9963091551718746 User 235, CV score: 0.9966636430932199 User 236, CV score: 0.9900117815103271 User 237, CV score: 0.9893862145903057 User 238, CV score: 0.9962570245952062 User 240, CV score: 0.9957461449438553 User 241, CV score: 0.9959650933658629 User 242, CV score: 0.9951414302545015 User 245, CV score: 0.9960067978271976 User 246, CV score: 0.9970181310145653 User 249, CV score: 0.9950058907551634 User 252, CV score: 0.9965072513632145 User 254, CV score: 0.9920135956543952 User 256, CV score: 0.9961006328652008 User 
258, CV score: 0.9959338150198618 User 259, CV score: 0.9949120557171603 User 260, CV score: 0.9974038972819117 User 261, CV score: 0.9897511286269848 User 263, CV score: 0.9927955543044217 User 264, CV score: 0.9966010864012178 User 269, CV score: 0.9871341736782293 User 270, CV score: 0.9894279190516405 User 273, CV score: 0.9944011760658097 User 287, CV score: 0.9901681732403323 User 294, CV score: 0.9957461449438553 User 298, CV score: 0.9912941936963707 User 301, CV score: 0.9972058010905717 User 308, CV score: 0.9957044404825206 User 315, CV score: 0.9975498628965833 User 318, CV score: 0.9958921105585269 User 327, CV score: 0.9966427908625525 User 332, CV score: 0.9968096087078915 User 333, CV score: 0.9962674507105397 User 339, CV score: 0.9971223921679022 User 340, CV score: 0.9967157736698883 User 342, CV score: 0.99225339630707 User 344, CV score: 0.9966532169778862 User 351, CV score: 0.99245149249841 User 356, CV score: 0.9976019934732517 User 361, CV score: 0.9965802341705504 User 363, CV score: 0.9964863991325471 User 411, CV score: 0.9912212108890349 User 417, CV score: 0.9967261997852219 User 425, CV score: 0.9942239321051369 User 430, CV score: 0.9962674507105399 User 435, CV score: 0.9970389832452327 User 436, CV score: 0.9951831347158362 User 440, CV score: 0.9970389832452327 User 444, CV score: 0.9978105157799256 User 475, CV score: 0.9892506750909679 User 476, CV score: 0.9969868526685642 User 486, CV score: 0.9954125092531774 User 515, CV score: 0.9942135059898033 User 533, CV score: 0.9937964613764558 User 561, CV score: 0.9845276448448073 User 563, CV score: 0.9968304609385589 User 564, CV score: 0.9956835882518532 User 568, CV score: 0.991784221117054 User 569, CV score: 0.9893028056676362 User 570, CV score: 0.9982901170852752 User 573, CV score: 0.9907624618143527 User 575, CV score: 0.9900534859716618 User 576, CV score: 0.9941613754131349 User 580, CV score: 0.9867484074108828 User 583, CV score: 0.9808368000166817 User 584, CV score: 
0.9811912879380271 User 600, CV score: 0.9915756988103803 User 603, CV score: 0.9957044404825206 User 605, CV score: 0.9975290106659159 User 640, CV score: 0.9972579316672402 User 647, CV score: 0.9976436979345866 User 653, CV score: 0.9973830450512443 User 664, CV score: 0.9952144130618373 User 665, CV score: 0.9969138698612284 User 677, CV score: 0.9966323647472187 User 692, CV score: 0.9969347220918957 User 697, CV score: 0.9959963717118638 User 705, CV score: 0.9964342685558788 User 722, CV score: 0.9947035334104865 User 740, CV score: 0.996694921439221 User 741, CV score: 0.9968513131692264 User 756, CV score: 0.9955793270985164 User 780, CV score: 0.9965489558245494 User 784, CV score: 0.9966532169778862 User 785, CV score: 0.9969555743225631 User 797, CV score: 0.995756571059189 User 812, CV score: 0.9949224818324941 User 844, CV score: 0.9970285571298989 User 859, CV score: 0.9981337253552699 User 868, CV score: 0.9965489558245494 User 875, CV score: 0.9957148665978544 User 932, CV score: 0.990512235046344 User 996, CV score: 0.9933168600711061 User 1014, CV score: 0.9971328182832359 User 1040, CV score: 0.9970389832452327 User 1054, CV score: 0.9964655469018799 User 1248, CV score: 0.9977375329725898 User 1267, CV score: 0.9973309144745759 User 1299, CV score: 0.996924295976562 User 1371, CV score: 0.9934419734551104 User 1797, CV score: 0.994891203486493 User 1798, CV score: 0.9966740692085535 User 1993, CV score: 0.9967991825925578 User 2118, CV score: 0.9978522202412603 User 2174, CV score: 0.995860832212526 User 2191, CV score: 0.9952665436385058 User 2250, CV score: 0.9973413405899096 User 2355, CV score: 0.995860832212526 User 2408, CV score: 0.9937547569151209 User 2493, CV score: 0.9966115125165516 User 2625, CV score: 0.9961423373265355 User 2902, CV score: 0.9971223921679022
The results look impressive, but we may be forgetting about class imbalance: a high accuracy can be obtained by a constant prediction. For each user, let's compute the difference between the cross-validation accuracy (just obtained with LogisticRegressionCV) and the share of labels in y_train_150 that differ from this user's ID (this is the accuracy of a classifier that always "says" that the session does not belong to user i in the i-vs-All classification task).
class_distr = np.bincount(y_train_150_work)
acc_diff_vs_constant = []
for user_id in np.unique(y_train_150_work):
    # accuracy of the constant "not this user" prediction
    val = (class_distr.sum() - class_distr[user_id]) / class_distr.sum()
    print(user_id)
    # gain of the cross-validated model over the constant prediction
    diff = cv_scores_by_user[user_id].mean() - val
    acc_diff_vs_constant.append(diff)
    print(f"User: {user_id} Val: {val:.3f} Diff: {diff}")
6 User: 6 Val: 0.984 Diff: 0.011656396943062974 13 User: 13 Val: 0.996 Diff: 0.000604714689353858 15 User: 15 Val: 0.994 Diff: 0.0008340892266949229 16 User: 16 Val: 0.985 Diff: 0.007152315118909902 28 User: 28 Val: 0.988 Diff: 0.0024292848727492933 31 User: 31 Val: 0.994 Diff: -6.255669200216918e-05 33 User: 33 Val: 0.993 Diff: 0.0012198554940413553 39 User: 39 Val: 0.984 Diff: 0.0019496835673996626 46 User: 46 Val: 0.997 Diff: 0.0009174981493643708 49 User: 49 Val: 0.994 Diff: 0.0013762472240467227 50 User: 50 Val: 0.994 Diff: 0.00018767007600650754 53 User: 53 Val: 0.992 Diff: 0.0016681784533899569 65 User: 65 Val: 0.997 Diff: 2.0852230667389726e-05 66 User: 66 Val: 0.995 Diff: -5.213057666852983e-05 82 User: 82 Val: 0.996 Diff: 1.0426115333750374e-05 85 User: 85 Val: 0.996 Diff: 0.00017724396067264614 89 User: 89 Val: 0.990 Diff: 0.0007923847653602545 92 User: 92 Val: 0.994 Diff: 0.0002710789986758444 100 User: 100 Val: 0.995 Diff: -0.0002710789986758444 102 User: 102 Val: 0.990 Diff: 0.0007194019580243349 103 User: 103 Val: 0.977 Diff: 0.0035657314441213117 105 User: 105 Val: 0.996 Diff: 0.0008862198033635638 106 User: 106 Val: 0.987 Diff: 0.007631916424259533 118 User: 118 Val: 0.990 Diff: 0.0009487764953656219 119 User: 119 Val: 0.996 Diff: 0.0006464191506886374 120 User: 120 Val: 0.994 Diff: 0.0006255669200211367 126 User: 126 Val: 0.994 Diff: 0.0010009070720340407 127 User: 127 Val: 0.988 Diff: 0.004087037210805722 128 User: 128 Val: 0.980 Diff: 0.005098370398173291 138 User: 138 Val: 0.997 Diff: -5.2130576668418804e-05 158 User: 158 Val: 0.997 Diff: 0.00028150511400959477 160 User: 160 Val: 0.997 Diff: 0.00028150511400959477 165 User: 165 Val: 0.997 Diff: 0.00028150511400959477 172 User: 172 Val: 0.996 Diff: 0.00022937453734106494 177 User: 177 Val: 0.996 Diff: 0.0002815051140097058 203 User: 203 Val: 0.996 Diff: 0.0013866733393805841 207 User: 207 Val: 0.986 Diff: 0.0014179516853815022 223 User: 223 Val: 0.996 Diff: 8.34089226695589e-05 233 User: 233 
Val: 0.996 Diff: 7.298280733591955e-05 235 User: 235 Val: 0.997 Diff: -8.34089226695589e-05 236 User: 236 Val: 0.989 Diff: 0.0012302816093752167 237 User: 237 Val: 0.988 Diff: 0.0013762472240467227 238 User: 238 Val: 0.996 Diff: -1.0426115333639352e-05 240 User: 240 Val: 0.996 Diff: -3.12783460011401e-05 241 User: 241 Val: 0.996 Diff: -0.0001355394993378667 242 User: 242 Val: 0.995 Diff: 0.00047960130534974166 245 User: 245 Val: 0.996 Diff: 0.00023980065267481532 246 User: 246 Val: 0.997 Diff: -0.00010426115333694863 249 User: 249 Val: 0.995 Diff: -9.383503800330928e-05 252 User: 252 Val: 0.996 Diff: 0.00020852230667367522 254 User: 254 Val: 0.990 Diff: 0.0020956491820712797 256 User: 256 Val: 0.996 Diff: 0.0005421579973517998 258 User: 258 Val: 0.996 Diff: 0.00017724396067275716 259 User: 259 Val: 0.995 Diff: -0.0002710789986758444 260 User: 260 Val: 0.997 Diff: 0.0003232095753442632 261 User: 261 Val: 0.989 Diff: 0.0006464191506885264 263 User: 263 Val: 0.992 Diff: 0.0011990032633740766 264 User: 264 Val: 0.996 Diff: 0.00039619238268018275 269 User: 269 Val: 0.986 Diff: 0.0013866733393805841 270 User: 270 Val: 0.985 Diff: 0.004305985632813036 273 User: 273 Val: 0.994 Diff: 0.0006672713813560271 287 User: 287 Val: 0.988 Diff: 0.001928831336732162 294 User: 294 Val: 0.996 Diff: -0.00020852230667389726 298 User: 298 Val: 0.990 Diff: 0.0015222128387184508 301 User: 301 Val: 0.995 Diff: 0.001730735145392126 308 User: 308 Val: 0.995 Diff: 0.0007298280733580853 315 User: 315 Val: 0.997 Diff: 0.0004378968440148512 318 User: 318 Val: 0.995 Diff: 0.0005525841126853281 327 User: 327 Val: 0.997 Diff: 1.0426115333639352e-05 332 User: 332 Val: 0.997 Diff: -1.0426115333750374e-05 333 User: 333 Val: 0.995 Diff: 0.0012928383013771638 339 User: 339 Val: 0.996 Diff: 0.0011990032633741876 340 User: 340 Val: 0.997 Diff: 0.00020852230667367522 342 User: 342 Val: 0.992 Diff: 0.00034406180601176395 344 User: 344 Val: 0.997 Diff: -0.00012511338400422733 351 User: 351 Val: 0.991 Diff: 
0.0014179516853815022 356 User: 356 Val: 0.997 Diff: 0.0003857662673464324 361 User: 361 Val: 0.997 Diff: -5.2130576668418804e-05 363 User: 363 Val: 0.995 Diff: 0.0011885771480403262 411 User: 411 Val: 0.989 Diff: 0.001991388028734442 417 User: 417 Val: 0.997 Diff: 9.383503800308723e-05 425 User: 425 Val: 0.994 Diff: 6.255669200205816e-05 430 User: 430 Val: 0.995 Diff: 0.0010217593027015415 435 User: 435 Val: 0.997 Diff: -0.00011468726867058798 436 User: 436 Val: 0.995 Diff: 0.0001668178453390068 440 User: 440 Val: 0.997 Diff: -5.2130576668418804e-05 444 User: 444 Val: 0.997 Diff: 0.0007923847653603655 475 User: 475 Val: 0.988 Diff: 0.0012615599553762458 476 User: 476 Val: 0.996 Diff: 0.0008549414573624237 486 User: 486 Val: 0.995 Diff: 0.0001668178453390068 515 User: 515 Val: 0.994 Diff: 0.00022937453734117597 533 User: 533 Val: 0.993 Diff: 0.0011572988020394082 561 User: 561 Val: 0.981 Diff: 0.0035553053287875613 563 User: 563 Val: 0.996 Diff: 0.0005213057666844101 564 User: 564 Val: 0.995 Diff: 0.0005525841126854392 568 User: 568 Val: 0.992 Diff: -4.170446133477945e-05 569 User: 569 Val: 0.985 Diff: 0.0042538550561445065 570 User: 570 Val: 0.996 Diff: 0.0025856766027545497 573 User: 573 Val: 0.991 Diff: -0.00019809619133992484 575 User: 575 Val: 0.989 Diff: 0.0014492300313825313 576 User: 576 Val: 0.994 Diff: -0.00022937453734106494 580 User: 580 Val: 0.980 Diff: 0.006266095315546338 583 User: 583 Val: 0.966 Diff: 0.014794657658502963 584 User: 584 Val: 0.978 Diff: 0.0027524944480935565 600 User: 600 Val: 0.991 Diff: 0.000980054841366762 603 User: 603 Val: 0.995 Diff: 0.0006464191506886374 605 User: 605 Val: 0.995 Diff: 0.0024605632187502113 640 User: 640 Val: 0.997 Diff: 0.00021894842200753661 647 User: 647 Val: 0.996 Diff: 0.0016264739920553994 653 User: 653 Val: 0.997 Diff: 0.0001355394993378667 664 User: 664 Val: 0.995 Diff: 0.0006359930353549981 665 User: 665 Val: 0.997 Diff: 0.00023980065267481532 677 User: 677 Val: 0.996 Diff: 0.0007298280733580853 692 
User: 692 Val: 0.997 Diff: 6.255669200205816e-05 697 User: 697 Val: 0.995 Diff: 0.000750680304025475 705 User: 705 Val: 0.996 Diff: 0.0002502267680085657 722 User: 722 Val: 0.994 Diff: 0.0010843159947033776 740 User: 740 Val: 0.996 Diff: 0.0008340892266950339 741 User: 741 Val: 0.997 Diff: 7.298280733591955e-05 756 User: 756 Val: 0.995 Diff: 0.0002710789986758444 780 User: 780 Val: 0.996 Diff: 0.0007819586500266151 784 User: 784 Val: 0.995 Diff: 0.001699456799391097 785 User: 785 Val: 0.997 Diff: -8.34089226695589e-05 797 User: 797 Val: 0.995 Diff: 0.0005421579973516888 812 User: 812 Val: 0.990 Diff: 0.005421579973517776 844 User: 844 Val: 0.997 Diff: 6.255669200205816e-05 859 User: 859 Val: 0.997 Diff: 0.0012198554940414663 868 User: 868 Val: 0.997 Diff: -0.00022937453734106494 875 User: 875 Val: 0.990 Diff: 0.0055779717035230325 932 User: 932 Val: 0.988 Diff: 0.0024084326420817925 996 User: 996 Val: 0.990 Diff: 0.0028359033707631154 1014 User: 1014 Val: 0.996 Diff: 0.0013449688780456936 1040 User: 1040 Val: 0.995 Diff: 0.0018662746447301037 1054 User: 1054 Val: 0.996 Diff: 0.000114687268670699 1248 User: 1248 Val: 0.997 Diff: 0.00040661849801393313 1267 User: 1267 Val: 0.997 Diff: 0.0003336356906780136 1299 User: 1299 Val: 0.997 Diff: 9.383503800308723e-05 1371 User: 1371 Val: 0.989 Diff: 0.0047126041308269695 1797 User: 1797 Val: 0.992 Diff: 0.0031486868307737392 1798 User: 1798 Val: 0.995 Diff: 0.001240707724708745 1993 User: 1993 Val: 0.996 Diff: 0.0008757936880298134 2118 User: 2118 Val: 0.996 Diff: 0.0015117867233847004 2174 User: 2174 Val: 0.995 Diff: 0.0005108796513507707 2191 User: 2191 Val: 0.995 Diff: 0.0004378968440149622 2250 User: 2250 Val: 0.997 Diff: -9.383503800319826e-05 2355 User: 2355 Val: 0.987 Diff: 0.008424301189619787 2408 User: 2408 Val: 0.992 Diff: 0.0013345427627119433 2493 User: 2493 Val: 0.996 Diff: 0.0003023573446770955 2625 User: 2625 Val: 0.995 Diff: 0.0006776974966896665 2902 User: 2902 Val: 0.995 Diff: 0.00207479695140389
num_better_than_default = (np.array(acc_diff_vs_constant) > 0).sum()
num_better_than_default
127
Question 7. Calculate the proportion of users for whom logistic regression on cross-validation gives a better prediction than the constant one. Round it to 3 decimal places.
better = num_better_than_default / len(acc_diff_vs_constant)
print(better)
0.8466666666666667
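The comparison with a constant prediction can be checked on a toy example. The sketch below is not part of the original notebook: the data are synthetic, the names X_toy, y_toy and skf_toy are hypothetical, and scikit-learn's DummyClassifier with strategy='most_frequent' is assumed here as a stand-in for the classifier that always answers "not this user".

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical toy data: 1000 sessions, 20 of which belong to the target user.
rng = np.random.RandomState(17)
X_toy = rng.rand(1000, 5)
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

# Accuracy of always answering "not this user", computed from the class share.
constant_acc = 1 - y_toy.mean()

# The same baseline obtained through stratified cross-validation.
skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
dummy = DummyClassifier(strategy='most_frequent')
dummy_cv_acc = cross_val_score(dummy, X_toy, y_toy,
                               cv=skf_toy, scoring='accuracy').mean()

print(f"constant: {constant_acc:.3f}, dummy CV: {dummy_cv_acc:.3f}")
```

With classes this imbalanced, a model is only useful if its accuracy exceeds this baseline, which is why the Diff values above are so small even though the raw CV scores are close to 1.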
Next, we will build learning curves for a specific user, for example, user 128. Let's create a new binary vector based on y_150users: its values are 1 or 0, depending on whether the user ID is 128.
y_binary_128 = y_150users == 'user0128'
y_binary_128.astype("int")
array([0, 0, 0, ..., 0, 0, 0])
from sklearn.model_selection import learning_curve
def plot_learning_curve(val_train, val_test, train_sizes,
                        xlabel='Training Set Size', ylabel='score'):
    def plot_with_err(x, data, **kwargs):
        mu, std = data.mean(1), data.std(1)
        lines = plt.plot(x, mu, '-', **kwargs)
        plt.fill_between(x, mu - std, mu + std, edgecolor='none',
                         facecolor=lines[0].get_color(), alpha=0.2)
    plot_with_err(train_sizes, val_train, label='train')
    plot_with_err(train_sizes, val_test, label='valid')
    plt.xlabel(xlabel); plt.ylabel(ylabel)
    plt.legend(loc='lower right');
Let's compute the cross-validation accuracy in the "user128-vs-All" classification task as a function of the training sample size.
%%time
train_sizes = np.linspace(0.25, 1, 20)
estimator = svm_grid_searcher2.best_estimator_
n_train, val_train, val_test = learning_curve(
    estimator=estimator,
    X=X_sparse_150users,
    y=y_binary_128,
    train_sizes=train_sizes,
    cv=skf,
    n_jobs=-1,
    # random_state only takes effect when shuffle=True
    random_state=17)
CPU times: user 630 ms, sys: 148 ms, total: 777 ms Wall time: 20.5 s
plot_learning_curve(val_train, val_test, n_train,
xlabel='train_size', ylabel='accuracy')
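The curve can also be read numerically, by comparing the mean training and validation scores at each training size. The sketch below is an illustration on synthetic data, not the notebook's experiment: make_classification stands in for the session matrix, LogisticRegression is assumed in place of svm_grid_searcher2.best_estimator_, and the names X_demo, y_demo and skf_demo are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve

# Synthetic imbalanced binary problem (about 10% positives).
X_demo, y_demo = make_classification(n_samples=600, n_features=20,
                                     weights=[0.9, 0.1], random_state=17)
skf_demo = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)

# Scores have shape (n_train_sizes, n_folds).
sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    train_sizes=np.linspace(0.25, 1, 4), cv=skf_demo, scoring='accuracy')

# A shrinking gap between the two curves suggests that more data helps less and less.
for n, tr, va in zip(sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"n_train={n}: train={tr:.3f}, valid={va:.3f}, gap={tr - va:.3f}")
```

If the validation curve is still rising at the largest training size, collecting more sessions is likely to help; if both curves have flattened and converged, a more expressive model or better features would be the next thing to try.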
Next week we will recall linear models trained by stochastic gradient descent and enjoy how much faster they work. We will also make our first (or not the first) submissions to the Kaggle Inclass [competition](https://inclass.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2).