Let's first set up some notation and ideas:
Second, let's define two more things
The main goal of Supervised Learning can be stated using the Empirical Risk Minimization framework of Statistical Learning.
We are looking for a function $f\in \mathbb{F}$ that minimizes the expected loss:
Logistic Regression: a member of the class of generalized linear models (glm) using the logit as its link function.
The goal of Logistic Regression is to model the posterior probability of membership in class $c_i$ as a function of $X$. I.e.,
How do we fit Logistic Regression into the ERM framework?
We find the parameters $\alpha$ and $\beta$ using the method of Maximum Likelihood Estimation.
If we consider each observation to be an indepedent Bernoulli draw with $p_i=P(y_i|x_i)$, then the likelihood of each draw can be defined as: $p_i^{y_i}(1-p_i)^{1-y_i}$, with $p_i$ given by the inverse logit function. In MLE, we wish to maximize the likelihood of observing the data as a function of the independent parameters of the model (i.e., $\alpha$ and $\beta$). The total likelihood function looks like:
'''
Let's train an actual model and see how well it generalizes
'''
import math
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn import linear_model
import numpy as np
import os
import course_utils as bd
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
import imp
imp.reload(bd)
%matplotlib inline
#For mac users
cwd = os.getcwd()
datadir = '/'.join(cwd.split('/')[0:-1]) + '/data/'
#For window's users, hardcode the dir:
#datadir =
#Load data and downsample for a 1/10 pos/neg ratio, then split into a train/test
f = datadir + 'ads_dataset_cut.txt'
target = 'y_buy'
tdat = pd.read_csv(f, header = 0, sep = '\t')
moddat = bd.downSample(tdat, target, 10)
#We know the dataset is sorted so we can just split by index
train_split = 0.75
train = moddat[:int(math.floor(moddat.shape[0]*train_split))]
test = moddat[int(math.floor(moddat.shape[0]*train_split)):]
#Using Scikit-learn the model is built with two easy steps.
logreg = linear_model.LogisticRegression(C = 1e30)
logreg.fit(train.drop(target, 1), train[target])
#But we are going to also build using the statsmodel package
logit_sm = sm.Logit(train[target], train.drop(target, 1))
lr_fit = logit_sm.fit()
Warning: Maximum number of iterations has been exceeded. Current function value: 0.245774 Iterations: 35
Logistic Regression has long been used as a tool in statistics and econometrics so there are a lot of additional data points one can get out of logistic regression model than one might get with standard machine learning tools.
We showed how to use scikit-learn to fit a model, but we also used statsmodel. The reason is that statsmodel returns summary statistics on each coefficient fit to the variables. In machine learning, we often only focus on the generalizability of the prediction. But in many analytical applications we want to know how statistically significant are the estimates within our model.
#Use statsmodel if you want to understand the fit statistics of the LR model
lr_fit.summary()
Dep. Variable: | y_buy | No. Observations: | 2087 |
---|---|---|---|
Model: | Logit | Df Residuals: | 2074 |
Method: | MLE | Df Model: | 12 |
Date: | Mon, 24 Sep 2018 | Pseudo R-squ.: | 0.1601 |
Time: | 20:18:41 | Log-Likelihood: | -512.93 |
converged: | False | LL-Null: | -610.74 |
LLR p-value: | 2.621e-35 |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
isbuyer | -23.0178 | 6.48e+04 | -0.000 | 1.000 | -1.27e+05 | 1.27e+05 |
buy_freq | 23.9668 | 6.48e+04 | 0.000 | 1.000 | -1.27e+05 | 1.27e+05 |
visit_freq | -0.0074 | 0.016 | -0.462 | 0.644 | -0.039 | 0.024 |
buy_interval | -0.0109 | 0.014 | -0.804 | 0.422 | -0.038 | 0.016 |
sv_interval | 0.0102 | 0.007 | 1.423 | 0.155 | -0.004 | 0.024 |
expected_time_buy | 0.0106 | 0.016 | 0.656 | 0.512 | -0.021 | 0.042 |
expected_time_visit | -0.0334 | 0.008 | -4.313 | 0.000 | -0.049 | -0.018 |
last_buy | 0.0102 | 0.005 | 2.086 | 0.037 | 0.001 | 0.020 |
last_visit | -0.0635 | 0.006 | -10.851 | 0.000 | -0.075 | -0.052 |
multiple_buy | -22.6427 | 6.48e+04 | -0.000 | 1.000 | -1.27e+05 | 1.27e+05 |
multiple_visit | -0.1607 | 0.213 | -0.755 | 0.450 | -0.578 | 0.257 |
uniq_urls | -0.0103 | 0.002 | -6.443 | 0.000 | -0.013 | -0.007 |
num_checkins | 8.671e-05 | 8.28e-05 | 1.048 | 0.295 | -7.55e-05 | 0.000 |
A Practical Aside
What exactly does the estimate of $\beta$ really mean? How can we interpret it?
Recall that $Ln \frac{p}{1-p}=\alpha+\beta x$. This means that a unit change in the value of $x$ changes the log-odds by the value of $\beta$. This is a mathematical statement that IMHO does not offer much intuitive value.
In this example we test the sensitivity of out-of-sample performance to training set sample size. Our goal is to plot test-set $AUC$ as a function of $N$, the number of samples in the training set. Because we expect a lot of variance in the lower range of $N$, we use bootstrap algorithm to compute standard errors of AUC measurements.
To Bootstrap:
'''
The datasets train and test are defined in the above examples.
'''
target='y_buy'
def modAUC(X_train, Y_train, X_test, Y_test):
'''
trains a model on train set and returns AUC on test set
'''
logreg = linear_model.LogisticRegression(C = 10)
logreg.fit(X_train, Y_train)
return roc_auc_score(Y_test, logreg.predict_proba(X_test)[:, 1])
def LrBootstrapper(train, test, nruns, sampsize):
'''
Samples with replacement, runs multiple train/eval attempts
returns mean and stdev of AUC
'''
auc_res = []
for i in range(nruns):
train_samp = train.loc[np.random.randint(0, len(train), size=sampsize)]
try:
auc_res.append(modAUC(train_samp.drop(target,1), train_samp[target], test.drop(target,1), test[target]))
except:
oops = 1
return (np.mean(auc_res), np.percentile(auc_res, 2.5), np.percentile(auc_res, 97.5))
#Run the analysis
n_seq = np.logspace(3, 7, base=2.0, num=30)
avg = []; lowers = []; uppers = []; sz = []
for n in n_seq:
mu, low, up =LrBootstrapper(train, test, 500, int(n))
avg.append(mu)
lowers.append(low)
uppers.append(up)
sz.append(n)
#Plot the analysis
#lower = np.ones(len(avg)) * (avg[len(avg)-1]-1.96*stderr[len(avg)-1])
fig = plt.figure(figsize = (12, 8))
ax = fig.add_subplot(111)
plt.title('Bootstrapped AUC + Confidence Limits \n as a Function of N Samples')
plt.plot(np.log2(n_seq), np.array(avg), label='Avg AUC')
#plt.plot(np.log2(n_seq), np.array(avg) + 1.96 * np.array(stderr), 'k--+', label = 'Upper 95% CI')
#plt.plot(np.log2(n_seq), np.array(avg) - 1.96 * np.array(stderr), 'k--', label = 'Lower 95% CI')
plt.plot(np.log2(n_seq), np.array(uppers), 'k--+', label = 'Upper 95% CI')
plt.plot(np.log2(n_seq), np.array(lowers), 'k--', label = 'Lower 95% CI')
#plt.plot(np.log2(n_seq), lower,'r-')
plt.legend(loc = 4)
ax.set_xlabel('Log2(N)')
ax.set_ylabel('Test Set AUC')
Text(0,0.5,'Test Set AUC')
We can see in the above plot that Logistic Regression does fairly well with small sample sizes. The lower bound of the $95%$ at the $max(N)$ overlaps with the confidence interval at most levels of $N$, suggesting that in expectation, the smaller samples could perform as well as the larger samples.
While this is true, always try to use as much data as you can to reduce the variance!