
Graduation project "Identification of Internet users" - Comparison of classification algorithms. Programming Assignment

Graduation project "Identification of Internet users", Week 4

toc: true
branch: master
badges: true
comments: true
author: Zmey56
categories: [graduation project, machine learning, stepik, yandex, english]

Week 4. Comparison of classification algorithms

Now we will finally approach the training of classification models, compare several algorithms on cross-validation, and figure out which session length parameters (session_length and window_size) are better to use. Also, for the selected algorithm, we will construct validation curves (how the classification quality depends on one of the hyperparameters of the algorithm) and learning curves (how the classification quality depends on the sample size).

Week 4 plan:

Part 1. Comparison of several algorithms in sessions of 10 sites
Part 2. Selection of parameters - session length and window width
Part 3. Identification of a specific user and learning curves

In this part of the project, video recordings of the following lectures of the course "Learning from labeled data" may be useful:

In [ ]:

# pip install watermark
#%load_ext watermark

In [ ]:

#%watermark -v -m -p numpy,scipy,pandas,matplotlib,statsmodels,sklearn -g

In [ ]:

from __future__ import division, print_function
# disable any Anaconda warnings
import warnings
warnings.filterwarnings('ignore')
from time import time
import itertools
import os
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
import pickle
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score, f1_score

In [ ]:

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

In [ ]:

# set my own path to the data
PATH_TO_DATA = '/content/drive/MyDrive/DATA/Stepik/capstone_user_identification'

Part 1. Comparison of several algorithms in sessions of 10 sites

Let's load the previously serialized objects X_sparse_10users and y_10users, corresponding to the training sample for 10 users.

In [ ]:

with open(os.path.join(PATH_TO_DATA, 
         'X_sparse_10users.pkl'), 'rb') as X_sparse_10users_pkl:
    X_sparse_10users = pickle.load(X_sparse_10users_pkl)
with open(os.path.join(PATH_TO_DATA, 
                       'y_10users.pkl'), 'rb') as y_10users_pkl:
    y_10users = pickle.load(y_10users_pkl)

There are more than 14 thousand sessions and almost 5 thousand unique visited sites.

In [ ]:

X_sparse_10users.shape

Out[ ]:

(14061, 4913)

Let's split the sample into 2 parts. On one we will carry out cross-validation, on the second we will evaluate the model trained after cross-validation.

In [ ]:

# Due to a training error on sparse matrices
# X_sparse_10users = X_sparse_10users.todense()

In [ ]:

X_train, X_valid, y_train, y_valid = train_test_split(X_sparse_10users, y_10users, 
                                                      test_size=0.3, 
                                                     random_state=17, stratify=y_10users)

Let's fix the cross-validation scheme in advance: 3-fold with shuffling, and random_state=17 for reproducibility.

In [ ]:

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)
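
To make the effect of stratification concrete, here is a minimal sketch on toy data. The labels and the names y_toy, X_toy and skf_toy are assumptions made for illustration and are not part of the project data. It shows that every test fold keeps the same class proportions as the full sample.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (hypothetical): 90 sessions of user "a" and 30 of user "b".
y_toy = np.array(["a"] * 90 + ["b"] * 30)
X_toy = np.zeros((len(y_toy), 1))  # features do not affect the split itself

skf_toy = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)

# Count the classes that land in the test part of each fold.
fold_counts = []
for train_idx, test_idx in skf_toy.split(X_toy, y_toy):
    labels, counts = np.unique(y_toy[test_idx], return_counts=True)
    fold_counts.append(dict(zip(labels.tolist(), counts.tolist())))

print(fold_counts)
```

Each fold keeps the 3:1 ratio of the toy labels, which is why stratified folds give more stable accuracy estimates when some users have far fewer sessions than others.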

Auxiliary function for drawing validation curves after running GridSearchCV (or RandomizedSearchCV).

In [ ]:

def plot_validation_curves(param_values, grid_cv_results_):
    train_mu, train_std = grid_cv_results_['mean_train_score'], grid_cv_results_['std_train_score']
    valid_mu, valid_std = grid_cv_results_['mean_test_score'], grid_cv_results_['std_test_score']
    train_line = plt.plot(param_values, train_mu, '-', label='train', color='green')
    valid_line = plt.plot(param_values, valid_mu, '-', label='test', color='red')
    plt.fill_between(param_values, train_mu - train_std, train_mu + train_std, edgecolor='none',
                     facecolor=train_line[0].get_color(), alpha=0.2)
    plt.fill_between(param_values, valid_mu - valid_std, valid_mu + valid_std, edgecolor='none',
                     facecolor=valid_line[0].get_color(), alpha=0.2)
    plt.legend()

1. Let's train a KNeighborsClassifier with 100 nearest neighbors (leaving the other parameters at their defaults, except n_jobs=-1 for parallelization) and look at the proportion of correct answers on 3-fold cross-validation (for reproducibility, we use the StratifiedKFold object skf) on the sample (X_train, y_train) and, separately, on the sample (X_valid, y_valid).

In [ ]:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from time import time

In [ ]:

knn = KNeighborsClassifier(n_neighbors=100, n_jobs=-1)

In [ ]:

t_start = time()
knn_cv_score = cross_val_score(knn, X_train, y_train, cv=skf, n_jobs=-1)
print("CV scores:", knn_cv_score)
print("mean:", np.mean(knn_cv_score))
print("std:", np.std(knn_cv_score))
print("Time elapsed: ", time()-t_start)

CV scores: [0.56598598 0.55409936 0.55792683]
mean: 0.5593373897012363
std: 0.004954135909104914
Time elapsed:  3.559767246246338

In [ ]:

t_start = time()
scores = cross_val_score(knn, X_valid, y_valid, cv=skf)
print("CV scores:", scores)
print("mean:", np.mean(scores))
print("std:", np.std(scores))
print("Time elapsed: ", time()-t_start)

CV scores: [0.50817342 0.51351351 0.48435277]
mean: 0.5020132353203838
std: 0.012676699846532288
Time elapsed:  0.5638549327850342

Important: I had a problem with my code at this point: cross_val_score either returned NaN or, when I converted the matrix with todense, ran for too long.

Question 1. Let's calculate the proportion of correct answers for KNeighborsClassifier on cross-validation and on the hold-out sample, rounding each to 3 decimal places.

In [ ]:

knn.fit(X_train, y_train)
knn_pred = knn.predict(X_valid)

In [ ]:

print(f"KNN Cross-Validation Score: {knn_cv_score.mean():.3f}")
print(f"KNN Validation Score: {accuracy_score(y_valid, knn_pred):.3f}")

KNN Cross-Validation Score: 0.559
KNN Validation Score: 0.584

2. Train a random forest (RandomForestClassifier) of 100 trees (for reproducibility random_state=17). Let's look at the OOB score (for which we will immediately set oob_score=True) and the proportion of correct answers in the sample (X_valid, y_valid). For parallelization, set n_jobs= -1.

Question 2. Calculate the proportions of correct answers for RandomForestClassifier on the Out-of-Bag estimate and on the hold-out sample.

In [ ]:

from sklearn.ensemble import RandomForestClassifier

In [ ]:

clf = RandomForestClassifier(n_estimators=100, random_state=17, oob_score=True, n_jobs=-1)

In [ ]:

t_start = time()
clf.fit(X_train, y_train)
print('Score: ', clf.score(X_train, y_train))
print("Time elapsed: ", time()-t_start)

Score:  0.9759195285511075
Time elapsed:  11.684211492538452

In [ ]:

print(clf.oob_score_)

0.7172322698638488
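
The OOB estimate above is possible because each tree is fit on a bootstrap sample and therefore never sees part of the training set. The following is a minimal sketch of that effect in plain Python, not the scikit-learn implementation; the sample size and seed are arbitrary assumptions.

```python
import random

random.seed(17)
n_samples = 10_000  # arbitrary toy size, not the size of X_train

# A bootstrap sample draws n_samples indices with replacement.
bootstrap = [random.randrange(n_samples) for _ in range(n_samples)]

# Indices that were never drawn are "out of bag" for this tree.
oob_indices = set(range(n_samples)) - set(bootstrap)
oob_fraction = len(oob_indices) / n_samples

# The fraction is expected to be close to 1/e, i.e. roughly 0.37.
print(f"Out-of-bag fraction: {oob_fraction:.3f}")
```

Since roughly a third of the sessions are unseen by any given tree, the forest can be scored on them without setting aside a separate validation fold.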

In [ ]:

t_start = time()
clf_pred = clf.predict(X_valid)
print(accuracy_score(y_valid, clf_pred))
print("Time elapsed: ", time()-t_start)

0.7312159279450107
Time elapsed:  0.20986366271972656

3. Let's train Logistic Regression with the default parameter C and random_state=17 (for reproducibility). Let's look at the proportion of correct answers on cross-validation (using the skf object created earlier) and on the sample (X_valid, y_valid). For parallelization, set n_jobs=-1.

In [ ]:

from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

In [ ]:

logit = LogisticRegression(random_state=17, n_jobs=-1)

In [ ]:

t_start = time()

logit_cv_score = cross_val_score(logit, X_train, y_train, cv=skf)

logit.fit(X_train, y_train)
logit_y_pred = logit.predict(X_valid)
logit_val_score = accuracy_score(y_valid, logit_y_pred)

print(f"LogReg Cross-Validation Score: {logit_cv_score.mean():.3f}")
print(f"LogReg Validation Score: {logit_val_score:.3f}")
print("Time elapsed: ", time()-t_start)

LogReg Cross-Validation Score: 0.761
LogReg Validation Score: 0.777
Time elapsed:  10.484177112579346

Using LogisticRegressionCV, we will select the parameter C for Logistic Regression first in a wide range: 10 values from 1e-4 to 1e2, use logspace from NumPy. Specify the LogisticRegressionCV parameters multi_class='multinomial' and random_state=17. For cross-validation, we use the skf object created earlier. For parallelization, set n_jobs= -1.

At the end, we will draw validation curves for parameter C.

In [ ]:

t_start = time()

logit_c_values1 = np.logspace(-4, 2, 10)

logit_grid_searcher1 = LogisticRegressionCV(Cs=logit_c_values1, cv=skf,
                                            multi_class="multinomial",
                                            random_state=17)
logit_grid_searcher1.fit(X_train, y_train)

print("Time elapsed: ", time()-t_start)

Time elapsed:  91.80030512809753

The mean proportion of correct answers on cross-validation for each of the 10 values of C.

In [ ]:

logit_mean_cv_scores1 = np.array(
    list(logit_grid_searcher1.scores_.values())).mean(axis=(0, 1))
logit_mean_cv_scores1

Out[ ]:

array([0.31954964, 0.47307397, 0.55202236, 0.64875035, 0.71438846,
       0.75177962, 0.76092382, 0.75848551, 0.749849  , 0.74029823])
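
The averaging step above can be illustrated on a toy dictionary. This sketch assumes that scores_ maps each class label to an array of shape (n_folds, n_Cs); the labels and numbers in toy_scores are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for scores_: {class_label: array of shape (n_folds, n_Cs)}.
toy_scores = {
    "user_a": np.array([[0.50, 0.70], [0.60, 0.80], [0.70, 0.90]]),
    "user_b": np.array([[0.50, 0.70], [0.60, 0.80], [0.70, 0.90]]),
}

stacked = np.array(list(toy_scores.values()))  # shape (n_classes, n_folds, n_Cs)
mean_per_C = stacked.mean(axis=(0, 1))         # average over classes and folds
best_index = int(np.argmax(mean_per_C))        # position of the best C in the grid

print(stacked.shape, mean_per_C, best_index)
```

Averaging over axes 0 and 1 leaves one number per candidate C, so the result can be indexed in the same order as the grid of C values.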

Let's output the best proportion of correct answers on cross-validation and the corresponding value of C.

In [ ]:

best_score1 = np.max(logit_mean_cv_scores1)
best_C1 = logit_grid_searcher1.Cs_[np.argmax(logit_mean_cv_scores1)]

print(f"Best Score: {best_score1}")
print(f"Best C: {best_C1}")

Best Score: 0.7609238210638737
Best C: 1.0

Let's plot the proportion of correct answers on cross-validation as a function of C.

In [ ]:

plt.plot(logit_c_values1, logit_mean_cv_scores1);

Now the same thing, but with the values of C taken from np.linspace(0.1, 7, 20). Let's draw the validation curve again and determine the maximum proportion of correct answers on cross-validation.

In [ ]:

t_start = time()
logit_c_values2 = np.linspace(0.1, 7, 20)
logit_grid_searcher2 = LogisticRegressionCV(Cs=logit_c_values2, cv=skf, multi_class='multinomial', random_state=17, n_jobs=-1)
logit_grid_searcher2.fit(X_train, y_train)
print("Time elapsed: ", time()-t_start)

Time elapsed:  105.8310170173645

The mean proportion of correct answers on cross-validation for each of the 20 values of C.

In [ ]:

logit_mean_cv_scores2 = np.array(
    list(logit_grid_searcher2.scores_.values())).mean(axis=(0, 1))
logit_mean_cv_scores2

Out[ ]:

array([0.73481117, 0.75919655, 0.76102545, 0.76082216, 0.76133023,
       0.76143192, 0.75990775, 0.75929811, 0.76000937, 0.75939977,
       0.75919661, 0.75868861, 0.757571  , 0.75736787, 0.75716462,
       0.75614861, 0.75553901, 0.75513262, 0.75401502, 0.75350692])

Let's output the best proportion of correct answers on cross-validation and the corresponding value of C.

In [ ]:

best_score2 = np.max(logit_mean_cv_scores2)
best_C2 = logit_grid_searcher2.Cs_[np.argmax(logit_mean_cv_scores2)]

print(f"Best Score: {best_score2}")
print(f"Best C: {best_C2}")

Best Score: 0.761431920171076
Best C: 1.9157894736842107

Let's plot the proportion of correct answers on cross-validation as a function of C.

In [ ]:

plt.plot(logit_c_values2, logit_mean_cv_scores2);

Let's output the proportion of correct answers on the sample (X_valid, y_valid) for logistic regression with the best value of C found.

In [ ]:

t_start = time()

logit = LogisticRegression(C=best_C2, n_jobs=-1, random_state=17)

logit_cv_score = cross_val_score(logit, X_train, y_train, cv=skf, n_jobs=-1)

logit.fit(X_train, y_train)
logit_y_pred = logit.predict(X_valid)
logit_val_score = accuracy_score(y_valid, logit_y_pred)

print("Time elapsed: ", time()-t_start)

Time elapsed:  9.804409980773926

Question 3. Let's calculate the proportions of correct answers for logit_grid_searcher2 on cross-validation for the best value of parameter C and on the hold-out sample. Round each to 3 decimal places and print them separated by a space.

In [ ]:

print(f"LogReg Cross-Validation Score: {logit_cv_score.mean():.3f}")
print(f"LogReg Validation Score: {logit_val_score:.3f}")

LogReg Cross-Validation Score: 0.762
LogReg Validation Score: 0.782

4. Let's train a linear SVM (LinearSVC) with C=1 and random_state=17 (for reproducibility). Let's look at the proportion of correct answers on cross-validation (using the skf object created earlier) and on the sample (X_valid, y_valid).

In [ ]:

from sklearn.svm import LinearSVC

In [ ]:

t_start = time()
svm = LinearSVC(C=1, random_state=17)
scores_svm = cross_val_score(svm, X_train, y_train, cv=skf, n_jobs=-1)
print("CV scores:", scores_svm)
print("mean:", np.mean(scores_svm))
print("std:", np.std(scores_svm))
print("Time elapsed: ", time()-t_start)

CV scores: [0.75068577 0.73270344 0.7695122 ]
mean: 0.7509671352428245
std: 0.015028426724668621
Time elapsed:  3.3960418701171875

In [ ]:

t_start = time()
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_valid)
print(accuracy_score(y_valid, svm_pred))
print("Time elapsed: ", time()-t_start)

0.7769613652524295
Time elapsed:  2.0046310424804688

Using GridSearchCV, we will select the parameter C for SVM, first in a wide range: 10 values from 1e-4 to 1e4, generated with linspace from NumPy. Let's draw validation curves.

In [ ]:

%%time

svm_params1 = {
    "C": np.linspace(1e-4, 1e4, 10)
}

svm_grid_searcher1 = GridSearchCV(estimator=svm, cv=skf, param_grid=svm_params1,
                                  return_train_score=True)
svm_grid_searcher1.fit(X_train, y_train)

CPU times: user 49.9 s, sys: 34.8 ms, total: 50 s
Wall time: 49.7 s

In [ ]:

t_start = time()

svm_params1 = {'C': np.linspace(1e-4, 1e4, 10)}

svm_grid_searcher1 = GridSearchCV(svm, param_grid=svm_params1, cv=skf, return_train_score=True, n_jobs=-1)
svm_grid_searcher1.fit(X_train, y_train)

print("Time elapsed: ", time()-t_start)

Time elapsed:  44.00320863723755

We will output the best value of the proportion of correct answers on cross-validation and the corresponding value C.

In [ ]:

svm_grid_searcher1.best_params_

Out[ ]:

{'C': 5555.555600000001}

Let's plot the proportion of correct answers on cross-validation as a function of C.

In [ ]:

plot_validation_curves(svm_params1['C'], svm_grid_searcher1.cv_results_)

But recall that with the default regularization parameter (C=1) the proportion of correct answers on cross-validation is higher. This is a (not uncommon) case where it is easy to search the wrong range: we took a uniform grid over a very large interval and missed the region where the good values of C lie. Here it makes much more sense to select C in the vicinity of 1; besides, the model trains faster than with large C.
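
A quick numeric check shows why the uniform grid misses the useful region. This is a minimal sketch using only NumPy; the logarithmic grid is shown for comparison and is an assumption for illustration, not something the search above used.

```python
import numpy as np

linear_grid = np.linspace(1e-4, 1e4, 10)  # the grid used in the wide search
log_grid = np.logspace(-4, 4, 10)         # same endpoints, spaced by order of magnitude

# Count how many candidate values of C are at most 1 in each grid.
n_small_linear = int((linear_grid <= 1).sum())
n_small_log = int((log_grid <= 1).sum())

print(linear_grid[:3])
print(n_small_linear, n_small_log)
```

On the linear grid only the first point lies at or below 1 and the next one already exceeds 1000, whereas the logarithmic grid places half of its points at or below 1.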

Using GridSearchCV, we will select the parameter C for SVM in the range (1e-3, 1): 30 values generated with linspace from NumPy. Let's draw validation curves.

In [ ]:

%%time
svm_params2 = {'C': np.linspace(1e-3, 1, 30)}

svm_grid_searcher2 = GridSearchCV(svm, param_grid=svm_params2, cv=skf, return_train_score=True, n_jobs=-1)
svm_grid_searcher2.fit(X_train, y_train)

CPU times: user 1.77 s, sys: 150 ms, total: 1.92 s
Wall time: 1min 18s

Output the best value of the proportion of correct answers on cross-validation and the corresponding value of C.

In [ ]:

best_score = svm_grid_searcher2.best_score_
best_params = svm_grid_searcher2.best_params_

print(f"Best Score: {best_score}")
print(f"Best Params: {best_params}")

Best Score: 0.7670206386611259
Best Params: {'C': 0.10434482758620689}

Let's plot the proportion of correct answers on cross-validation as a function of C.

In [ ]:

plot_validation_curves(svm_params2['C'], svm_grid_searcher2.cv_results_)

Let's output the proportion of correct answers on the sample (X_valid, y_valid) for LinearSVC with the best value of C found.

In [ ]:

%%time

svm = LinearSVC(**best_params, random_state=17)

svm_cv_score = cross_val_score(svm, X_train, y_train, cv=skf, n_jobs=-1)

svm.fit(X_train, y_train)
svm_y_pred = svm.predict(X_valid)
svm_val_score = accuracy_score(y_valid, svm_y_pred)

CPU times: user 683 ms, sys: 8.26 ms, total: 691 ms
Wall time: 1.74 s

Question 4. Let's calculate the proportions of correct answers for svm_grid_searcher2 on cross-validation for the best value of parameter C and on the hold-out sample. Round each to 3 decimal places and print them separated by a space.

In [ ]:

print(f"SVC Cross-Validation Score: {svm_cv_score.mean():.3f}")
print(f"SVC Validation Score: {svm_val_score:.3f}")

SVC Cross-Validation Score: 0.767
SVC Validation Score: 0.781

Part 2. Selection of parameters - session length and window width

Let's take LinearSVC, which showed the best quality on cross-validation in Part 1, and check how it performs on 8 more samples for 10 users (with different combinations of the parameters session_length and window_size). Since there are more computations here, we will not re-select the regularization parameter C every time.
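
As a reminder of what the two parameters control, here is a minimal sketch of sliding-window session construction. It assumes a new session starts every window_size sites and holds up to session_length site ids, with zero padding at the end; the helper make_sessions and the toy visit list are hypothetical and are not the code used to build the pickled samples.

```python
def make_sessions(site_ids, session_length, window_size):
    """Cut a visit sequence into sessions, padding the last ones with zeros.

    Hypothetical helper: a new session starts every `window_size` sites
    and contains up to `session_length` site ids.
    """
    sessions = []
    for start in range(0, len(site_ids), window_size):
        chunk = site_ids[start:start + session_length]
        chunk = chunk + [0] * (session_length - len(chunk))  # zero-pad short tails
        sessions.append(chunk)
    return sessions


visits = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]  # toy site ids

non_overlapping = make_sessions(visits, session_length=5, window_size=5)
overlapping = make_sessions(visits, session_length=5, window_size=3)

print(len(non_overlapping), len(overlapping))
```

With the same visit log, a window narrower than the session produces overlapping sessions and therefore more training rows.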

Let's define the model_assessment function, whose docstring is given below. The split of the sample with train_test_split should be stratified.

In [ ]:

def model_assessment(estimator, path_to_X_pickle, path_to_y_pickle, cv, random_state=17, test_size=0.3):
    '''
    Estimates CV-accuracy for (1 - test_size) share of (X_sparse, y) 
    loaded from path_to_X_pickle and path_to_y_pickle and holdout accuracy for (test_size) share of (X_sparse, y).
    The split is made with stratified train_test_split with params random_state and test_size.
    
    :param estimator – Scikit-learn estimator (classifier or regressor)
    :param path_to_X_pickle – path to pickled sparse X (instances and their features)
    :param path_to_y_pickle – path to pickled y (responses)
    :param cv – cross-validation as in cross_val_score (use StratifiedKFold here)
    :param random_state –  for train_test_split
    :param test_size –  for train_test_split
    
    :returns mean CV-accuracy for (X_train, y_train) and accuracy for (X_valid, y_valid) where (X_train, y_train)
    and (X_valid, y_valid) are (1 - test_size) and (test_size) shares of (X_sparse, y),
    followed by a label and the time spent on the final fit and prediction.
    '''
    
    with open(path_to_X_pickle, 'rb') as X_pkl:
        X_sparse = pickle.load(X_pkl)
    with open(path_to_y_pickle, 'rb') as y_pkl:
        y = pickle.load(y_pkl)
        
    # Use the function arguments instead of hard-coded values
    X_train, X_valid, y_train, y_valid = train_test_split(X_sparse, y,
                                                          test_size=test_size,
                                                          random_state=random_state, stratify=y)
    
    # Cross-validate with the splitter passed in as `cv`, not the global skf
    cv_scores = cross_val_score(estimator, X_train, y_train, cv=cv, n_jobs=-1)
    
    # Time only the final fit and the hold-out prediction
    t_start = time()
    estimator.fit(X_train, y_train)
    valid_pred = estimator.predict(X_valid)
    
    return(np.mean(cv_scores), accuracy_score(y_valid, valid_pred), " Time elapsed: ", time()-t_start)

Let's make sure that the function works.

In [ ]:

model_assessment(svm_grid_searcher2.best_estimator_, 
                 os.path.join(PATH_TO_DATA, 'X_sparse_10users.pkl'),
        os.path.join(PATH_TO_DATA, 'y_10users.pkl'), skf, random_state=17, test_size=0.3)

Out[ ]:

(0.7670206386611259, 0.7807537331121118, ' Time elapsed: ', 0.6428191661834717)

Let's apply the model_assessment function to the best algorithm from the previous part (namely, svm_grid_searcher2.best_estimator_) and to the 9 samples for 10 users with different combinations of the parameters session_length and window_size. In the loop we will print the session_length and window_size parameters together with the result of the model_assessment function.

Here, for convenience, it is worth creating copies of previously created pickle files X_sparse_10users.pkl, X_sparse_150users.pkl, y_10users.pkl and y_150users.pkl, adding s10_w10 to their names, which means the session length is 10 and the window width is 10.

In [ ]:

!cp $PATH_TO_DATA/X_sparse_10users.pkl $PATH_TO_DATA/X_sparse_10users_s10_w10.pkl 
!cp $PATH_TO_DATA/X_sparse_150users.pkl $PATH_TO_DATA/X_sparse_150users_s10_w10.pkl 
!cp $PATH_TO_DATA/y_10users.pkl $PATH_TO_DATA/y_10users_s10_w10.pkl 
!cp $PATH_TO_DATA/y_150users.pkl $PATH_TO_DATA/y_150users_s10_w10.pkl 

In [ ]:

%%time

# for 10 users

estimator = svm_grid_searcher2.best_estimator_

for window_size, session_length in itertools.product([10, 7, 5], [15, 10, 7, 5]):
    if window_size <= session_length:
        path_to_X_pkl = os.path.join(
            PATH_TO_DATA, f"X_sparse_10users_s{session_length}_w{window_size}.pkl")
        path_to_y_pkl = os.path.join(
            PATH_TO_DATA, f"y_10users_s{session_length}_w{window_size}.pkl")
        print(window_size, session_length, 
              model_assessment(estimator=estimator, 
                               path_to_X_pickle=path_to_X_pkl,
                               path_to_y_pickle=path_to_y_pkl,
                               cv=skf))

10 15 (0.8243252292702751, 0.8404835269021095, ' Time elapsed: ', 1.0584142208099365)
10 10 (0.7670206386611259, 0.7807537331121118, ' Time elapsed: ', 0.6238193511962891)
7 15 (0.8495024256089474, 0.8543222166915547, ' Time elapsed: ', 1.623363971710205)
7 10 (0.7983645917156946, 0.8073668491786958, ' Time elapsed: ', 0.974440336227417)
7 7 (0.754765400423003, 0.7617388418782147, ' Time elapsed: ', 0.5486903190612793)
5 15 (0.8670355547005402, 0.8752963489805595, ' Time elapsed: ', 2.1419732570648193)
5 10 (0.8177520250854086, 0.8245614035087719, ' Time elapsed: ', 1.218752145767212)
5 7 (0.772939529035208, 0.7853247984826932, ' Time elapsed: ', 0.761998176574707)
5 5 (0.7254849424351582, 0.7362494073020389, ' Time elapsed: ', 0.501572847366333)
CPU times: user 11.2 s, sys: 107 ms, total: 11.3 s
Wall time: 33.2 s

In [ ]:

%%time

# for 150 users

estimator = svm_grid_searcher2.best_estimator_

for window_size, session_length in itertools.product([10, 7, 5], [15, 10, 7, 5]):
    if window_size <= session_length:
        path_to_X_pkl = os.path.join(
            PATH_TO_DATA, f"X_sparse_150users_s{session_length}_w{window_size}.pkl")
        path_to_y_pkl = os.path.join(
            PATH_TO_DATA, f"y_150users_s{session_length}_w{window_size}.pkl")
        print(window_size, session_length, 
              model_assessment(estimator=estimator, 
                               path_to_X_pickle=path_to_X_pkl,
                               path_to_y_pickle=path_to_y_pkl,
                               cv=skf))

10 15 (0.5488098589346596, 0.5751471804602735, ' Time elapsed: ', 223.8541784286499)
10 10 (0.46308633866107823, 0.4836276942538802, ' Time elapsed: ', 136.27440857887268)
7 15 (0.5828479247872232, 0.6084920121265797, ' Time elapsed: ', 309.6582760810852)
7 10 (0.5015547672228792, 0.5239295568348264, ' Time elapsed: ', 213.65414190292358)
7 7 (0.43694798464211154, 0.45307763054808053, ' Time elapsed: ', 149.04103302955627)
5 15 (0.6139887051608968, 0.6360295906945053, ' Time elapsed: ', 467.08626675605774)
5 10 (0.5265866745928696, 0.5458826106000876, ' Time elapsed: ', 317.86096453666687)
5 7 (0.46509602699080665, 0.48189516717769015, ' Time elapsed: ', 195.35350489616394)
5 5 (0.4084080325808655, 0.4217282328320436, ' Time elapsed: ', 142.6359510421753)
CPU times: user 36min 40s, sys: 7.97 s, total: 36min 48s
Wall time: 1h 29min 29s

Question 5. Let's calculate the proportion of correct answers for LinearSVC with the tuned parameter C on the sample X_sparse_10users_s15_w5. Report the proportions of correct answers on cross-validation and on the hold-out sample. Round each to 3 decimal places and print them separated by a space.

In [ ]:

%%time

estimator = svm_grid_searcher2.best_estimator_

path_to_X_pkl = os.path.join(PATH_TO_DATA, "X_sparse_10users_s15_w5.pkl")
path_to_y_pkl = os.path.join(PATH_TO_DATA, "y_10users_s15_w5.pkl")

with open(path_to_X_pkl, 'rb') as X_sparse_10users_pkl:
  X_sparse_10users = pickle.load(X_sparse_10users_pkl)
with open(path_to_y_pkl, 'rb') as y_10users_pkl:
  y_10users = pickle.load(y_10users_pkl)

X_train, X_valid, y_train, y_valid = train_test_split(X_sparse_10users, y_10users, 
                                                  test_size=0.3, 
                                                  random_state=17, stratify=y_10users)

scores_svm = cross_val_score(estimator, X_train, y_train, cv=skf, n_jobs=-1)

estimator.fit(X_train, y_train)

svm_pred = estimator.predict(X_valid)

print(f"SVC Cross-Validation Score: {np.mean(scores_svm):.3f}")
print(f"SVC Validation Score: {accuracy_score(y_valid, svm_pred):.3f}")

SVC Cross-Validation Score: 0.867
SVC Validation Score: 0.875
CPU times: user 2.39 s, sys: 25.6 ms, total: 2.41 s
Wall time: 6.38 s

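Let's draw a conclusion about how the quality of classification depends on the session length and the window width. In the results above, accuracy grows with session length and, for a fixed session length, with a smaller window width: for 10 users the best pair is session_length=15, window_size=5 (0.867 on cross-validation, 0.875 on the hold-out sample).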

In [ ]:

%%time

estimator = svm_grid_searcher2.best_estimator_

for window_size, session_length in [(5, 5), (7, 7), (10, 10)]:
    path_to_X_pkl = os.path.join(
        PATH_TO_DATA, f"X_sparse_150users_s{session_length}_w{window_size}.pkl")
    path_to_y_pkl = os.path.join(
        PATH_TO_DATA, f"y_150users_s{session_length}_w{window_size}.pkl")
    print(window_size, session_length, 
          model_assessment(estimator=estimator, 
                           path_to_X_pickle=path_to_X_pkl,
                           path_to_y_pickle=path_to_y_pkl,
                           cv=skf))

5 5 (0.4084080325808655, 0.4217282328320436, ' Time elapsed: ', 133.36306834220886)
7 7 (0.43694798464211154, 0.45307763054808053, ' Time elapsed: ', 120.4077639579773)
10 10 (0.46308633866107823, 0.4836276942538802, ' Time elapsed: ', 114.10968589782715)
CPU times: user 6min 16s, sys: 1.8 s, total: 6min 17s
Wall time: 16min 33s

Question 6. Calculate the proportions of correct answers for LinearSVC with the tuned parameter C on the sample X_sparse_150users. Report the proportions of correct answers on cross-validation and on the hold-out sample. Round each to 3 decimal places and separate them with a space.

In [ ]:

%%time

estimator = svm_grid_searcher2.best_estimator_

path_to_X_pkl = os.path.join(PATH_TO_DATA, "X_sparse_150users.pkl")
path_to_y_pkl = os.path.join(PATH_TO_DATA, "y_150users.pkl")

with open(path_to_X_pkl, 'rb') as X_sparse_150users_pkl:
  X_sparse_150users = pickle.load(X_sparse_150users_pkl)
with open(path_to_y_pkl, 'rb') as y_150users_pkl:
  y_150users = pickle.load(y_150users_pkl)

X_train, X_valid, y_train, y_valid = train_test_split(X_sparse_150users, y_150users, 
                                                  test_size=0.3, 
                                                  random_state=17, stratify=y_150users)

scores_svm = cross_val_score(estimator, X_train, y_train, cv=skf, n_jobs=-1)

estimator.fit(X_train, y_train)

svm_pred = estimator.predict(X_valid)

print(f"SVC Cross-Validation Score: {np.mean(scores_svm):.3f}")
print(f"SVC Validation Score: {accuracy_score(y_valid, svm_pred):.3f}")

SVC Cross-Validation Score: 0.463
SVC Validation Score: 0.484
CPU times: user 1min 46s, sys: 339 ms, total: 1min 47s
Wall time: 4min 37s

Part 3. Identification of a specific user and learning curves

The multiclass proportion of correct answers on the 150-user sample is low, which may be disappointing, but the good news is that an individual user can be identified quite well.

Let's load the previously serialized objects X_sparse_150users and y_150users, corresponding to the training sample for 150 users with parameters (session_length, window_size) = (10, 10), and split them 70% / 30% in exactly the same way as before.

In [ ]:

with open(os.path.join(PATH_TO_DATA, 'X_sparse_150users.pkl'), 'rb') as X_sparse_150users_pkl:
     X_sparse_150users = pickle.load(X_sparse_150users_pkl)
with open(os.path.join(PATH_TO_DATA, 'y_150users.pkl'), 'rb') as y_150users_pkl:
    y_150users = pickle.load(y_150users_pkl)

In [ ]:

X_train_150, X_valid_150, y_train_150, y_valid_150 = train_test_split(X_sparse_150users, 
                                                                      y_150users, test_size=0.3, 
                                                     random_state=17, stratify=y_150users)

Let's train LogisticRegressionCV for a single value of the parameter C (the best one on cross-validation in Part 1; use the exact value, not an eyeballed one). We now solve 150 one-vs-all tasks, so we specify the argument multi_class='ovr'. As always, where possible, specify n_jobs=-1 and random_state=17.

In [ ]:

from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

best_C2_tmp = 1.9157894736842107

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)

In [ ]:

# slice digits
y_train_150_tmp = []
for i in y_train_150:
  y_train_150_tmp.append(int(i[4:]))

# convert to int (the np.int alias was removed in NumPy 1.24; the builtin int is equivalent)
y_train_150_work = np.array(y_train_150_tmp, dtype=int)

In [ ]:

%%time

logit_cv_150users = LogisticRegressionCV(Cs=[best_C2_tmp], cv=skf, multi_class="ovr", 
                                         n_jobs=-1, random_state=17)
logit_cv_150users.fit(X_train_150, y_train_150_work)

CPU times: user 6min 55s, sys: 7min 30s, total: 14min 25s
Wall time: 15min 25s

Let's look at the mean proportion of correct answers on cross-validation for the task of identifying each user individually.

In [ ]:

cv_scores_by_user = logit_cv_150users.scores_
for user_id in logit_cv_150users.scores_:
    print(f"User {user_id}, CV score: {cv_scores_by_user[user_id].mean()}")

User 6, CV score: 0.996058928403866
User 13, CV score: 0.9963091551718745
User 15, CV score: 0.9952352652925046
User 16, CV score: 0.9918676300397236
User 28, CV score: 0.9903766955470061
User 31, CV score: 0.9943907499504759
User 33, CV score: 0.9937651830304546
User 39, CV score: 0.9858621876075193
User 46, CV score: 0.9980398903172666
User 49, CV score: 0.9952039869465036
User 50, CV score: 0.9943386193738076
User 53, CV score: 0.9937130524537862
User 65, CV score: 0.9969451482072295
User 66, CV score: 0.9948077945638234
User 82, CV score: 0.996361285748543
User 85, CV score: 0.9963717118638766
User 89, CV score: 0.9908667229676894
User 92, CV score: 0.994422028296477
User 100, CV score: 0.9944741588731455
User 102, CV score: 0.9911586541970326
User 103, CV score: 0.980565721018006
User 105, CV score: 0.9969034437458948
User 106, CV score: 0.9948494990251583
User 118, CV score: 0.990950131890359
User 119, CV score: 0.9965906602858841
User 120, CV score: 0.994286488797139
User 126, CV score: 0.995058021331832
User 127, CV score: 0.9915965510410477
User 128, CV score: 0.9846944626901463
User 138, CV score: 0.9970598354759
User 158, CV score: 0.9970598354759
User 160, CV score: 0.9968200348232252
User 165, CV score: 0.997362192820577
User 172, CV score: 0.9964863991325471
User 177, CV score: 0.9967783303618906
User 203, CV score: 0.9975707151272508
User 207, CV score: 0.9877805928289178
User 223, CV score: 0.9965489558245494
User 233, CV score: 0.9963091551718746
User 235, CV score: 0.9966636430932199
User 236, CV score: 0.9900117815103271
User 237, CV score: 0.9893862145903057
User 238, CV score: 0.9962570245952062
User 240, CV score: 0.9957461449438553
User 241, CV score: 0.9959650933658629
User 242, CV score: 0.9951414302545015
User 245, CV score: 0.9960067978271976
User 246, CV score: 0.9970181310145653
User 249, CV score: 0.9950058907551634
User 252, CV score: 0.9965072513632145
User 254, CV score: 0.9920135956543952
User 256, CV score: 0.9961006328652008
User 258, CV score: 0.9959338150198618
User 259, CV score: 0.9949120557171603
User 260, CV score: 0.9974038972819117
User 261, CV score: 0.9897511286269848
User 263, CV score: 0.9927955543044217
User 264, CV score: 0.9966010864012178
User 269, CV score: 0.9871341736782293
User 270, CV score: 0.9894279190516405
User 273, CV score: 0.9944011760658097
User 287, CV score: 0.9901681732403323
User 294, CV score: 0.9957461449438553
User 298, CV score: 0.9912941936963707
User 301, CV score: 0.9972058010905717
User 308, CV score: 0.9957044404825206
User 315, CV score: 0.9975498628965833
User 318, CV score: 0.9958921105585269
User 327, CV score: 0.9966427908625525
User 332, CV score: 0.9968096087078915
User 333, CV score: 0.9962674507105397
User 339, CV score: 0.9971223921679022
User 340, CV score: 0.9967157736698883
User 342, CV score: 0.99225339630707
User 344, CV score: 0.9966532169778862
User 351, CV score: 0.99245149249841
User 356, CV score: 0.9976019934732517
User 361, CV score: 0.9965802341705504
User 363, CV score: 0.9964863991325471
User 411, CV score: 0.9912212108890349
User 417, CV score: 0.9967261997852219
User 425, CV score: 0.9942239321051369
User 430, CV score: 0.9962674507105399
User 435, CV score: 0.9970389832452327
User 436, CV score: 0.9951831347158362
User 440, CV score: 0.9970389832452327
User 444, CV score: 0.9978105157799256
User 475, CV score: 0.9892506750909679
User 476, CV score: 0.9969868526685642
User 486, CV score: 0.9954125092531774
User 515, CV score: 0.9942135059898033
User 533, CV score: 0.9937964613764558
User 561, CV score: 0.9845276448448073
User 563, CV score: 0.9968304609385589
User 564, CV score: 0.9956835882518532
User 568, CV score: 0.991784221117054
User 569, CV score: 0.9893028056676362
User 570, CV score: 0.9982901170852752
User 573, CV score: 0.9907624618143527
User 575, CV score: 0.9900534859716618
User 576, CV score: 0.9941613754131349
User 580, CV score: 0.9867484074108828
User 583, CV score: 0.9808368000166817
User 584, CV score: 0.9811912879380271
User 600, CV score: 0.9915756988103803
User 603, CV score: 0.9957044404825206
User 605, CV score: 0.9975290106659159
User 640, CV score: 0.9972579316672402
User 647, CV score: 0.9976436979345866
User 653, CV score: 0.9973830450512443
User 664, CV score: 0.9952144130618373
User 665, CV score: 0.9969138698612284
User 677, CV score: 0.9966323647472187
User 692, CV score: 0.9969347220918957
User 697, CV score: 0.9959963717118638
User 705, CV score: 0.9964342685558788
User 722, CV score: 0.9947035334104865
User 740, CV score: 0.996694921439221
User 741, CV score: 0.9968513131692264
User 756, CV score: 0.9955793270985164
User 780, CV score: 0.9965489558245494
User 784, CV score: 0.9966532169778862
User 785, CV score: 0.9969555743225631
User 797, CV score: 0.995756571059189
User 812, CV score: 0.9949224818324941
User 844, CV score: 0.9970285571298989
User 859, CV score: 0.9981337253552699
User 868, CV score: 0.9965489558245494
User 875, CV score: 0.9957148665978544
User 932, CV score: 0.990512235046344
User 996, CV score: 0.9933168600711061
User 1014, CV score: 0.9971328182832359
User 1040, CV score: 0.9970389832452327
User 1054, CV score: 0.9964655469018799
User 1248, CV score: 0.9977375329725898
User 1267, CV score: 0.9973309144745759
User 1299, CV score: 0.996924295976562
User 1371, CV score: 0.9934419734551104
User 1797, CV score: 0.994891203486493
User 1798, CV score: 0.9966740692085535
User 1993, CV score: 0.9967991825925578
User 2118, CV score: 0.9978522202412603
User 2174, CV score: 0.995860832212526
User 2191, CV score: 0.9952665436385058
User 2250, CV score: 0.9973413405899096
User 2355, CV score: 0.995860832212526
User 2408, CV score: 0.9937547569151209
User 2493, CV score: 0.9966115125165516
User 2625, CV score: 0.9961423373265355
User 2902, CV score: 0.9971223921679022

The results look impressive, but we may be forgetting about class imbalance: a high proportion of correct answers can be obtained with a constant prediction. For each user, let's calculate the difference between the cross-validation accuracy (just computed with LogisticRegressionCV) and the proportion of labels in y_train_150 that differ from this user's ID. The latter is the accuracy of a classifier that always "says" the session does not belong to user i in the i-vs-All classification task.

In [ ]:

class_distr = np.bincount(y_train_150_work)
acc_diff_vs_constant = []

for user_id in np.unique(y_train_150_work):
    val = (class_distr.sum() - class_distr[user_id]) / class_distr.sum()
    print(user_id)
    diff = cv_scores_by_user[user_id].mean() - val
    acc_diff_vs_constant.append(diff)
    print(f"User: {user_id} Val: {val:.3f} Diff: {diff}")

6
User: 6 Val: 0.984 Diff: 0.011656396943062974
13
User: 13 Val: 0.996 Diff: 0.000604714689353858
15
User: 15 Val: 0.994 Diff: 0.0008340892266949229
16
User: 16 Val: 0.985 Diff: 0.007152315118909902
28
User: 28 Val: 0.988 Diff: 0.0024292848727492933
31
User: 31 Val: 0.994 Diff: -6.255669200216918e-05
33
User: 33 Val: 0.993 Diff: 0.0012198554940413553
39
User: 39 Val: 0.984 Diff: 0.0019496835673996626
46
User: 46 Val: 0.997 Diff: 0.0009174981493643708
49
User: 49 Val: 0.994 Diff: 0.0013762472240467227
50
User: 50 Val: 0.994 Diff: 0.00018767007600650754
53
User: 53 Val: 0.992 Diff: 0.0016681784533899569
65
User: 65 Val: 0.997 Diff: 2.0852230667389726e-05
66
User: 66 Val: 0.995 Diff: -5.213057666852983e-05
82
User: 82 Val: 0.996 Diff: 1.0426115333750374e-05
85
User: 85 Val: 0.996 Diff: 0.00017724396067264614
89
User: 89 Val: 0.990 Diff: 0.0007923847653602545
92
User: 92 Val: 0.994 Diff: 0.0002710789986758444
100
User: 100 Val: 0.995 Diff: -0.0002710789986758444
102
User: 102 Val: 0.990 Diff: 0.0007194019580243349
103
User: 103 Val: 0.977 Diff: 0.0035657314441213117
105
User: 105 Val: 0.996 Diff: 0.0008862198033635638
106
User: 106 Val: 0.987 Diff: 0.007631916424259533
118
User: 118 Val: 0.990 Diff: 0.0009487764953656219
119
User: 119 Val: 0.996 Diff: 0.0006464191506886374
120
User: 120 Val: 0.994 Diff: 0.0006255669200211367
126
User: 126 Val: 0.994 Diff: 0.0010009070720340407
127
User: 127 Val: 0.988 Diff: 0.004087037210805722
128
User: 128 Val: 0.980 Diff: 0.005098370398173291
138
User: 138 Val: 0.997 Diff: -5.2130576668418804e-05
158
User: 158 Val: 0.997 Diff: 0.00028150511400959477
160
User: 160 Val: 0.997 Diff: 0.00028150511400959477
165
User: 165 Val: 0.997 Diff: 0.00028150511400959477
172
User: 172 Val: 0.996 Diff: 0.00022937453734106494
177
User: 177 Val: 0.996 Diff: 0.0002815051140097058
203
User: 203 Val: 0.996 Diff: 0.0013866733393805841
207
User: 207 Val: 0.986 Diff: 0.0014179516853815022
223
User: 223 Val: 0.996 Diff: 8.34089226695589e-05
233
User: 233 Val: 0.996 Diff: 7.298280733591955e-05
235
User: 235 Val: 0.997 Diff: -8.34089226695589e-05
236
User: 236 Val: 0.989 Diff: 0.0012302816093752167
237
User: 237 Val: 0.988 Diff: 0.0013762472240467227
238
User: 238 Val: 0.996 Diff: -1.0426115333639352e-05
240
User: 240 Val: 0.996 Diff: -3.12783460011401e-05
241
User: 241 Val: 0.996 Diff: -0.0001355394993378667
242
User: 242 Val: 0.995 Diff: 0.00047960130534974166
245
User: 245 Val: 0.996 Diff: 0.00023980065267481532
246
User: 246 Val: 0.997 Diff: -0.00010426115333694863
249
User: 249 Val: 0.995 Diff: -9.383503800330928e-05
252
User: 252 Val: 0.996 Diff: 0.00020852230667367522
254
User: 254 Val: 0.990 Diff: 0.0020956491820712797
256
User: 256 Val: 0.996 Diff: 0.0005421579973517998
258
User: 258 Val: 0.996 Diff: 0.00017724396067275716
259
User: 259 Val: 0.995 Diff: -0.0002710789986758444
260
User: 260 Val: 0.997 Diff: 0.0003232095753442632
261
User: 261 Val: 0.989 Diff: 0.0006464191506885264
263
User: 263 Val: 0.992 Diff: 0.0011990032633740766
264
User: 264 Val: 0.996 Diff: 0.00039619238268018275
269
User: 269 Val: 0.986 Diff: 0.0013866733393805841
270
User: 270 Val: 0.985 Diff: 0.004305985632813036
273
User: 273 Val: 0.994 Diff: 0.0006672713813560271
287
User: 287 Val: 0.988 Diff: 0.001928831336732162
294
User: 294 Val: 0.996 Diff: -0.00020852230667389726
298
User: 298 Val: 0.990 Diff: 0.0015222128387184508
301
User: 301 Val: 0.995 Diff: 0.001730735145392126
308
User: 308 Val: 0.995 Diff: 0.0007298280733580853
315
User: 315 Val: 0.997 Diff: 0.0004378968440148512
318
User: 318 Val: 0.995 Diff: 0.0005525841126853281
327
User: 327 Val: 0.997 Diff: 1.0426115333639352e-05
332
User: 332 Val: 0.997 Diff: -1.0426115333750374e-05
333
User: 333 Val: 0.995 Diff: 0.0012928383013771638
339
User: 339 Val: 0.996 Diff: 0.0011990032633741876
340
User: 340 Val: 0.997 Diff: 0.00020852230667367522
342
User: 342 Val: 0.992 Diff: 0.00034406180601176395
344
User: 344 Val: 0.997 Diff: -0.00012511338400422733
351
User: 351 Val: 0.991 Diff: 0.0014179516853815022
356
User: 356 Val: 0.997 Diff: 0.0003857662673464324
361
User: 361 Val: 0.997 Diff: -5.2130576668418804e-05
363
User: 363 Val: 0.995 Diff: 0.0011885771480403262
411
User: 411 Val: 0.989 Diff: 0.001991388028734442
417
User: 417 Val: 0.997 Diff: 9.383503800308723e-05
425
User: 425 Val: 0.994 Diff: 6.255669200205816e-05
430
User: 430 Val: 0.995 Diff: 0.0010217593027015415
435
User: 435 Val: 0.997 Diff: -0.00011468726867058798
436
User: 436 Val: 0.995 Diff: 0.0001668178453390068
440
User: 440 Val: 0.997 Diff: -5.2130576668418804e-05
444
User: 444 Val: 0.997 Diff: 0.0007923847653603655
475
User: 475 Val: 0.988 Diff: 0.0012615599553762458
476
User: 476 Val: 0.996 Diff: 0.0008549414573624237
486
User: 486 Val: 0.995 Diff: 0.0001668178453390068
515
User: 515 Val: 0.994 Diff: 0.00022937453734117597
533
User: 533 Val: 0.993 Diff: 0.0011572988020394082
561
User: 561 Val: 0.981 Diff: 0.0035553053287875613
563
User: 563 Val: 0.996 Diff: 0.0005213057666844101
564
User: 564 Val: 0.995 Diff: 0.0005525841126854392
568
User: 568 Val: 0.992 Diff: -4.170446133477945e-05
569
User: 569 Val: 0.985 Diff: 0.0042538550561445065
570
User: 570 Val: 0.996 Diff: 0.0025856766027545497
573
User: 573 Val: 0.991 Diff: -0.00019809619133992484
575
User: 575 Val: 0.989 Diff: 0.0014492300313825313
576
User: 576 Val: 0.994 Diff: -0.00022937453734106494
580
User: 580 Val: 0.980 Diff: 0.006266095315546338
583
User: 583 Val: 0.966 Diff: 0.014794657658502963
584
User: 584 Val: 0.978 Diff: 0.0027524944480935565
600
User: 600 Val: 0.991 Diff: 0.000980054841366762
603
User: 603 Val: 0.995 Diff: 0.0006464191506886374
605
User: 605 Val: 0.995 Diff: 0.0024605632187502113
640
User: 640 Val: 0.997 Diff: 0.00021894842200753661
647
User: 647 Val: 0.996 Diff: 0.0016264739920553994
653
User: 653 Val: 0.997 Diff: 0.0001355394993378667
664
User: 664 Val: 0.995 Diff: 0.0006359930353549981
665
User: 665 Val: 0.997 Diff: 0.00023980065267481532
677
User: 677 Val: 0.996 Diff: 0.0007298280733580853
692
User: 692 Val: 0.997 Diff: 6.255669200205816e-05
697
User: 697 Val: 0.995 Diff: 0.000750680304025475
705
User: 705 Val: 0.996 Diff: 0.0002502267680085657
722
User: 722 Val: 0.994 Diff: 0.0010843159947033776
740
User: 740 Val: 0.996 Diff: 0.0008340892266950339
741
User: 741 Val: 0.997 Diff: 7.298280733591955e-05
756
User: 756 Val: 0.995 Diff: 0.0002710789986758444
780
User: 780 Val: 0.996 Diff: 0.0007819586500266151
784
User: 784 Val: 0.995 Diff: 0.001699456799391097
785
User: 785 Val: 0.997 Diff: -8.34089226695589e-05
797
User: 797 Val: 0.995 Diff: 0.0005421579973516888
812
User: 812 Val: 0.990 Diff: 0.005421579973517776
844
User: 844 Val: 0.997 Diff: 6.255669200205816e-05
859
User: 859 Val: 0.997 Diff: 0.0012198554940414663
868
User: 868 Val: 0.997 Diff: -0.00022937453734106494
875
User: 875 Val: 0.990 Diff: 0.0055779717035230325
932
User: 932 Val: 0.988 Diff: 0.0024084326420817925
996
User: 996 Val: 0.990 Diff: 0.0028359033707631154
1014
User: 1014 Val: 0.996 Diff: 0.0013449688780456936
1040
User: 1040 Val: 0.995 Diff: 0.0018662746447301037
1054
User: 1054 Val: 0.996 Diff: 0.000114687268670699
1248
User: 1248 Val: 0.997 Diff: 0.00040661849801393313
1267
User: 1267 Val: 0.997 Diff: 0.0003336356906780136
1299
User: 1299 Val: 0.997 Diff: 9.383503800308723e-05
1371
User: 1371 Val: 0.989 Diff: 0.0047126041308269695
1797
User: 1797 Val: 0.992 Diff: 0.0031486868307737392
1798
User: 1798 Val: 0.995 Diff: 0.001240707724708745
1993
User: 1993 Val: 0.996 Diff: 0.0008757936880298134
2118
User: 2118 Val: 0.996 Diff: 0.0015117867233847004
2174
User: 2174 Val: 0.995 Diff: 0.0005108796513507707
2191
User: 2191 Val: 0.995 Diff: 0.0004378968440149622
2250
User: 2250 Val: 0.997 Diff: -9.383503800319826e-05
2355
User: 2355 Val: 0.987 Diff: 0.008424301189619787
2408
User: 2408 Val: 0.992 Diff: 0.0013345427627119433
2493
User: 2493 Val: 0.996 Diff: 0.0003023573446770955
2625
User: 2625 Val: 0.995 Diff: 0.0006776974966896665
2902
User: 2902 Val: 0.995 Diff: 0.00207479695140389

In [ ]:

num_better_than_default = (np.array(acc_diff_vs_constant) > 0).sum()
num_better_than_default

Out[ ]:

Question 7. Calculate the proportion of users for whom logistic regression on cross-validation gives a better prediction than the constant one. Round it to 3 decimal places.

In [ ]:

better = num_better_than_default / len(acc_diff_vs_constant)

print(better)

0.8466666666666667
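
The comparison with the constant prediction can also be checked on a toy example. The sketch below is only an illustration: the labels in `y_toy`, the accuracies in `cv_accuracy_toy`, and the helper `constant_baseline_accuracy` are hypothetical and are not part of the project data.

```python
import numpy as np

def constant_baseline_accuracy(y, positive_label):
    """Accuracy of always answering 'not this user' in a one-vs-all task."""
    y = np.asarray(y)
    return float(np.mean(y != positive_label))

# Hypothetical labels: user 1 owns 2 of 10 sessions, user 2 owns 1, user 3 owns 7.
y_toy = np.array([1, 1, 2, 3, 3, 3, 3, 3, 3, 3])
# Hypothetical mean cross-validation accuracies of the one-vs-all classifiers.
cv_accuracy_toy = {1: 0.9, 2: 0.85, 3: 0.6}

# Positive difference means the classifier beats the constant prediction.
diffs = {
    user: acc - constant_baseline_accuracy(y_toy, user)
    for user, acc in cv_accuracy_toy.items()
}

# Share of users for whom the classifier is better, rounded to 3 decimal places.
share_better = round(sum(d > 0 for d in diffs.values()) / len(diffs), 3)
print(share_better)
```

Note that for the rare user 2 the constant prediction is already correct for 9 sessions out of 10, so an accuracy of 0.85 is worse than doing nothing at all.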

Next, we will build learning curves for a specific user, for example, user 128. Let's make a new binary vector based on y_150users: its values will be 1 or 0, depending on whether the user ID is 128.

In [ ]:

y_binary_128 = y_150users == 'user0128'
y_binary_128.astype("int")

Out[ ]:

array([0, 0, 0, ..., 0, 0, 0])

In [ ]:

from sklearn.model_selection import learning_curve

def plot_learning_curve(val_train, val_test, train_sizes, 
                        xlabel='Training Set Size', ylabel='score'):
    def plot_with_err(x, data, **kwargs):
        mu, std = data.mean(1), data.std(1)
        lines = plt.plot(x, mu, '-', **kwargs)
        plt.fill_between(x, mu - std, mu + std, edgecolor='none',
                         facecolor=lines[0].get_color(), alpha=0.2)
    plot_with_err(train_sizes, val_train, label='train')
    plot_with_err(train_sizes, val_test, label='valid')
    plt.xlabel(xlabel); plt.ylabel(ylabel)
    plt.legend(loc='lower right');

Let's calculate the proportion of correct answers on cross-validation in the classification problem "user128-vs-All" as a function of the training sample size.

In [ ]:

%%time

train_sizes = np.linspace(0.25, 1, 20)

estimator = svm_grid_searcher2.best_estimator_

n_train, val_train, val_test = learning_curve(
    estimator=estimator,
    X=X_sparse_150users,
    y=y_binary_128,
    train_sizes=train_sizes,
    cv=skf,
    n_jobs=-1,
    random_state=17)

CPU times: user 630 ms, sys: 148 ms, total: 777 ms
Wall time: 20.5 s

In [ ]:

plot_learning_curve(val_train, val_test, n_train, 
                    xlabel='train_size', ylabel='accuracy')

Ways to improve

Of course, you can try many other algorithms, for example, Xgboost, but in this task it is very unlikely that anything will do better than linear methods.
It is interesting to check the quality of the algorithm on data where sessions are split not by the number of sites visited but by time, for example, 5, 7, 10, and 15 minutes. The data from our competition deserves a separate mention.
Again, if resources allow, you can check how well the problem can be solved for 3000 users.
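
The time-based sessions suggested above could be built roughly as follows. This is a minimal sketch under assumptions: the helper `split_sessions_by_time` is hypothetical, the visit log is assumed to be a DataFrame with `timestamp` and `site` columns, and a session is assumed to close once more than `window_minutes` have passed since its first visit.

```python
import pandas as pd

def split_sessions_by_time(visits, window_minutes=10):
    """Label each visit with a session id; a session spans at most
    window_minutes counted from its first visit (an assumed rule)."""
    visits = visits.sort_values("timestamp").reset_index(drop=True)
    window = pd.Timedelta(minutes=window_minutes)
    session_ids = []
    session_start = None
    current_id = -1
    for ts in visits["timestamp"]:
        # Open a new session when the time window is exceeded.
        if session_start is None or ts - session_start > window:
            current_id += 1
            session_start = ts
        session_ids.append(current_id)
    return visits.assign(session_id=session_ids)

# Hypothetical visit log of a single user.
toy_visits = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2014-01-01 10:00:00",
        "2014-01-01 10:03:00",
        "2014-01-01 10:09:00",
        "2014-01-01 10:12:00",
        "2014-01-01 10:40:00",
    ]),
    "site": ["a.com", "b.com", "a.com", "c.com", "a.com"],
})

sessions = split_sessions_by_time(toy_visits, window_minutes=10)
print(sessions)
```

With this rule the sessions have a variable number of sites, so a bag-of-sites representation built per `session_id` would be a natural next step before training.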

Next week we will return to linear models trained by stochastic gradient descent and enjoy how much faster they work. We will also make our first (or not the first) submissions to the Kaggle Inclass [competition](https://inclass.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2).