A Primer on Empirical Risk Minimization¶

Notations and Definitions¶

Let's first set up some notation and ideas:

Let $X \in \mathbb{R}^p$ be a $p$-dimensional real valued vector of predictor variables

Let $Y$ be a target variable, where
- $Y\in \mathbb{R}$ is real valued
- $Y\in C$ is an element in some set of classes $C=\{c_1,c_2,...c_k\}$

$X$ and $Y$ are governed by a joint distribution $P(Y,X)$ (that we likely don't know)

We seek a function $f(X)$ for predicting $Y$, given $X$, whose output can be

Real valued, i.e. $f(X)=E[Y|X]$
Discrete valued, i.e. $f(x) \in \{c_1,c_2,...c_k\}$

Second, let's define two more things

$\mathbb{F}$ is a family of functions, such that $f(x) \in \mathbb{F}$, examples are:
- All linear hyper-planes, such that $f(x)=\alpha+\beta x$
- All quadratic polynomials, such that $f(x)=\alpha+\beta_1 x +\beta_2x^2$
- All decision trees with max(depth)=$k$

A loss function $\mathbb{L}(f(X),Y)$ that measures how well $f(X)$ approximates $Y$.
- Squared Loss: $\mathbb{L}(f(x),y)=(f(x)-y)^2$
- 0-1 Loss: $\mathbb{L}(f(x),y)=\mathbb{I}(f(x)==y)$
- Logistic Loss: $\mathbb{L}(f(x),y)=-[y*Ln(f(x))+(1-y)*Ln(1-f(x))]$
- Hinge Loss: $\mathbb{L}(f(x),y)=max(0,1-f(x)*y)$

Emprical Risk Minimization¶

The main goal of Supervised Learning can be stated using the Empirical Risk Minimization framework of Statistical Learning.

We are looking for a function $f\in \mathbb{F}$ that minimizes the expected loss:

$E[\mathbb{L}(f(x),y)]=\int \mathbb{L}(f(x),y)\, P(x,y)\:\mathrm{d}x\mathrm{d}y$
Because we don't know the distribution $P(X,Y)$, we can't minimize the expected loss. However, we can minimize the empirical loss, or risk, by computing the average loss over our training data.

Thus, in Supervised Learning, we choose the function $f(X)$ that minimizes the loss over training data:

$f^{opt}= \underset{f \in \mathbb{F}} {\mathrm{argmin}} \frac{1}{n} \sum\limits_{i=1}^n \mathbb{L}(f(x_i),y_i)$

Logistic Regression¶

Definition¶

Logistic Regression: a member of the class of generalized linear models (glm) using the logit as its link function.

The goal of Logistic Regression is to model the posterior probability of membership in class $c_i$ as a function of $X$. I.e.,

$P(c_i|x)=f(x)=\frac{1}{1+e^{-(\alpha+\beta x))}}$

To make this a linear model in $X$, we take the log of the odds ratio of $p$ (called the log-odds):

$ln \frac{P(c_i|x)}{1-P(c_i|x)} = ln \frac{1}{e^{-(\alpha+\beta x))}} = \alpha+\beta x$

And effectively we do a linear regression against the log-odds of $P(c_i|x)$ (though we don't use least squares).

LogReg as ERM¶

How do we fit Logistic Regression into the ERM framework?

We find the parameters $\alpha$ and $\beta$ using the method of Maximum Likelihood Estimation.

If we consider each observation to be an indepedent Bernoulli draw with $p_i=P(y_i|x_i)$, then the likelihood of each draw can be defined as: $p_i^{y_i}(1-p_i)^{1-y_i}$, with $p_i$ given by the inverse logit function. In MLE, we wish to maximize the likelihood of observing the data as a function of the independent parameters of the model (i.e., $\alpha$ and $\beta$). The total likelihood function looks like:

$L(\alpha,\beta|X,Y)=\prod\limits_{i=1}^nP(x_i,y_i|\alpha,\beta)=\prod\limits_{i=1}^np_i^{y_i}(1-p_i)^{1-y_i}$

This is actually a difficult equation to maximize directly, so we do a little trick. We take the negative log and call this our loss function for ERM! $\mathbb{L}(f(X),Y)=-Ln [L(\alpha,\beta|X,Y)]=-\sum\limits_{i=1}^n y_i\,ln\,(p_i)+(1-y_i)\,ln\,(1-p_i)$

Example 1 - Building and Looking at a Model¶

In [4]:

'''
Let's train an actual model and see how well it generalizes
'''
import math
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn import linear_model
import numpy as np
import os
import course_utils as bd
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
import imp
imp.reload(bd)
%matplotlib inline

#For mac users
cwd = os.getcwd()
datadir = '/'.join(cwd.split('/')[0:-1]) + '/data/'
#For window's users, hardcode the dir:
#datadir = 


#Load data and downsample for a 1/10 pos/neg ratio, then split into a train/test
f = datadir + 'ads_dataset_cut.txt'
target = 'y_buy'
tdat = pd.read_csv(f, header = 0, sep = '\t')
moddat = bd.downSample(tdat, target, 10)

#We know the dataset is sorted so we can just split by index
train_split = 0.75
train = moddat[:int(math.floor(moddat.shape[0]*train_split))]
test = moddat[int(math.floor(moddat.shape[0]*train_split)):]

#Using Scikit-learn the model is built with two easy steps.
logreg = linear_model.LogisticRegression(C = 1e30)
logreg.fit(train.drop(target, 1), train[target])

#But we are going to also build using the statsmodel package
logit_sm = sm.Logit(train[target], train.drop(target, 1))
lr_fit = logit_sm.fit()

Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.245774
         Iterations: 35

Logistic Regression has long been used as a tool in statistics and econometrics so there are a lot of additional data points one can get out of logistic regression model than one might get with standard machine learning tools.

We showed how to use scikit-learn to fit a model, but we also used statsmodel. The reason is that statsmodel returns summary statistics on each coefficient fit to the variables. In machine learning, we often only focus on the generalizability of the prediction. But in many analytical applications we want to know how statistically significant are the estimates within our model.

In [5]:

#Use statsmodel if you want to understand the fit statistics of the LR model
lr_fit.summary()

Out[5]:

Logit Regression Results
Dep. Variable:	y_buy	No. Observations:	2087
Model:	Logit	Df Residuals:	2074
Method:	MLE	Df Model:	12
Date:	Mon, 24 Sep 2018	Pseudo R-squ.:	0.1601
Time:	20:18:41	Log-Likelihood:	-512.93
converged:	False	LL-Null:	-610.74
		LLR p-value:	2.621e-35

	coef	std err	z	P>\|z\|	[0.025	0.975]
isbuyer	-23.0178	6.48e+04	-0.000	1.000	-1.27e+05	1.27e+05
buy_freq	23.9668	6.48e+04	0.000	1.000	-1.27e+05	1.27e+05
visit_freq	-0.0074	0.016	-0.462	0.644	-0.039	0.024
buy_interval	-0.0109	0.014	-0.804	0.422	-0.038	0.016
sv_interval	0.0102	0.007	1.423	0.155	-0.004	0.024
expected_time_buy	0.0106	0.016	0.656	0.512	-0.021	0.042
expected_time_visit	-0.0334	0.008	-4.313	0.000	-0.049	-0.018
last_buy	0.0102	0.005	2.086	0.037	0.001	0.020
last_visit	-0.0635	0.006	-10.851	0.000	-0.075	-0.052
multiple_buy	-22.6427	6.48e+04	-0.000	1.000	-1.27e+05	1.27e+05
multiple_visit	-0.1607	0.213	-0.755	0.450	-0.578	0.257
uniq_urls	-0.0103	0.002	-6.443	0.000	-0.013	-0.007
num_checkins	8.671e-05	8.28e-05	1.048	0.295	-7.55e-05	0.000

A Practical Aside

What exactly does the estimate of $\beta$ really mean? How can we interpret it?

Recall that $Ln \frac{p}{1-p}=\alpha+\beta x$. This means that a unit change in the value of $x$ changes the log-odds by the value of $\beta$. This is a mathematical statement that IMHO does not offer much intuitive value.

So what can we learn by looking at betas? (IMHO, not much!)

Some helpful tips, garnered from theory and experience:

$|\beta_1|>|\beta_2|$ does not guarantee that feature $X_1$ is more predictive than $X_2$. The magnitude of $\beta$ is inversely proportional to the scale of $X$, so comparing betas only makes sense when the features have the same scale (such as binary features).

Likewise, the z-score of $\beta$ is influences by sample size and should not be used to rank features by predictiveness

sign($\beta$) does tell you whether $Y$ is positively or negatively correlated with $X$. However, if the features have a lot of multi-collinearity, sign($\beta$) can be misleading.

Multi-collinearity in $X$ means the betas will have covariance with each other. The betas will "split" the effect. Sometimes they'll split the effect as positive numbers (i.e. 1=0.5+0.5) and other times they'll split as negatives (i.e., 1=2-1). This makes interpreting $\beta$ that much more difficult.

Example 2 - Robustness¶

In this example we test the sensitivity of out-of-sample performance to training set sample size. Our goal is to plot test-set $AUC$ as a function of $N$, the number of samples in the training set. Because we expect a lot of variance in the lower range of $N$, we use bootstrap algorithm to compute standard errors of AUC measurements.

To Bootstrap:

Generate $k$ training sets of size $N$, where we sample from original training set with replacement.
Compute and store the AUC on the test set for each bootstrapped sample
Get Mean and Standard Error of bootstrapped AUC estimates

In [6]:

'''
The datasets train and test are defined in the above examples.
'''

target='y_buy'

def modAUC(X_train, Y_train, X_test, Y_test):
    '''
    trains a model on train set and returns AUC on test set
    '''
    logreg = linear_model.LogisticRegression(C = 10)
    logreg.fit(X_train, Y_train)
    return roc_auc_score(Y_test, logreg.predict_proba(X_test)[:, 1])

def LrBootstrapper(train, test, nruns, sampsize):
    '''
    Samples with replacement, runs multiple train/eval attempts
    returns mean and stdev of AUC
    '''
    auc_res = []
    for i in range(nruns):
        train_samp = train.loc[np.random.randint(0, len(train), size=sampsize)]
        try:
            auc_res.append(modAUC(train_samp.drop(target,1), train_samp[target], test.drop(target,1), test[target]))
        except:
            oops = 1
    return (np.mean(auc_res), np.percentile(auc_res, 2.5), np.percentile(auc_res, 97.5))
            
#Run the analysis  
n_seq = np.logspace(3, 7, base=2.0, num=30)

avg = []; lowers = []; uppers = []; sz = []
for n in n_seq:
    mu, low, up =LrBootstrapper(train, test, 500, int(n))
    avg.append(mu)
    lowers.append(low)
    uppers.append(up)
    sz.append(n)    

    


#Plot the analysis
#lower = np.ones(len(avg)) * (avg[len(avg)-1]-1.96*stderr[len(avg)-1])

fig = plt.figure(figsize = (12, 8))
ax = fig.add_subplot(111)
plt.title('Bootstrapped AUC + Confidence Limits \n as a Function of N Samples')
plt.plot(np.log2(n_seq), np.array(avg), label='Avg AUC')
#plt.plot(np.log2(n_seq), np.array(avg) + 1.96 * np.array(stderr), 'k--+', label = 'Upper 95% CI')
#plt.plot(np.log2(n_seq), np.array(avg) - 1.96 * np.array(stderr), 'k--', label = 'Lower 95% CI')

plt.plot(np.log2(n_seq), np.array(uppers), 'k--+', label = 'Upper 95% CI')
plt.plot(np.log2(n_seq), np.array(lowers), 'k--', label = 'Lower 95% CI')

#plt.plot(np.log2(n_seq), lower,'r-')

plt.legend(loc = 4)
ax.set_xlabel('Log2(N)')
ax.set_ylabel('Test Set AUC')

Out[6]:

Text(0,0.5,'Test Set AUC')

We can see in the above plot that Logistic Regression does fairly well with small sample sizes. The lower bound of the $95%$ at the $max(N)$ overlaps with the confidence interval at most levels of $N$, suggesting that in expectation, the smaller samples could perform as well as the larger samples.

While this is true, always try to use as much data as you can to reduce the variance!