#!/usr/bin/env python
# coding: utf-8

# # Logistic regression
#
# Logistic regression can serve as a stepping stone towards neural network
# algorithms and supervised deep learning. For logistic regression, the
# minimization of the cost function leads to a non-linear equation in the
# parameters $\boldsymbol{\beta}$. The optimization of the problem therefore
# calls for minimization algorithms. This is the bottleneck of all machine
# learning algorithms, namely how to find reliable minima of a multi-variable
# function. This leads us to the family of gradient descent methods, the
# workhorses of all modern machine learning algorithms.
#
# ### Basics
#
# We consider the case where the outcomes $y_i$ are discrete and only take values
# from $k=0,\dots,K-1$ (i.e. $K$ classes).
#
# The goal is to predict the
# output classes from the design matrix $\boldsymbol{X}\in\mathbb{R}^{n\times p}$
# made of $n$ samples, each of which carries $p$ features or predictors. The
# primary goal is to identify the classes to which new unseen samples
# belong.
#
# Let us specialize to the case of two classes only, with outputs
# $y_i=0$ and $y_i=1$. Our outcomes could represent the status of a
# credit card user who could default or not on her/his credit card
# debt.
#
# One simple way to get a discrete output is to use a sign
# function that maps the output of a linear regressor to the values $\{0,1\}$,
# $f(s_i)=\mathrm{sign}(s_i)=1$ if $s_i\ge 0$ and $0$ otherwise.
# We will encounter this model in our first demonstration of neural networks.
# Historically it is called the "perceptron" model in the machine learning
# literature. This model is extremely simple. However, in many cases it is more
# favorable to use a "soft" classifier that outputs
# the probability of a given category. This is achieved by the logistic function.
#
# ### The logistic (or Fermi) function
#
# In most classification models it is favorable to have a "soft" classifier that
# outputs the probability of a given category rather than a single value. For
# example, given $x_i$, the classifier outputs the probability of being in a
# category $k$. Logistic regression is the most common example of such a soft
# classifier. In logistic regression, the probability that a data point $x_i$
# belongs to a category $y_i\in\{0,1\}$ is given by the so-called logistic
# function (also known as the sigmoid, or the Fermi function
# $f(t)=1/(e^t+1)$ of statistical physics), which is meant to represent the
# likelihood of a given event,
#
# $$
# p(t) = \frac{1}{1+e^{-t}}=\frac{e^t}{1+e^t}=f(-t).
# $$
#
# Often the $\tanh$ function is used as well; it is related to the Fermi
# function by $\tanh(x/2)=f(-x)-f(x)=2p(x)-1$.
#
# We assume now that we have two outputs, with $y_i$ either $0$ or $1$.
# Furthermore, we assume that we have only two parameters $\beta$ to fit. We
# define the probabilities for the different outcomes as follows
#
# $$
# \begin{align*}
# p(y_i=1|x_i,\boldsymbol{\beta}) &= \frac{\exp{(\beta_0+\beta_1x_i)}}{1+\exp{(\beta_0+\beta_1x_i)}},\nonumber\\
# p(y_i=0|x_i,\boldsymbol{\beta}) &= 1 - p(y_i=1|x_i,\boldsymbol{\beta}),
# \end{align*}
# $$
#
# where $\boldsymbol{\beta}$ are the weights we wish to extract from the data,
# in our case $\beta_0$ and $\beta_1$.
#
# Here we read $p(y_i=1|x_i,\boldsymbol{\beta})$ as the probability of the
# outcome $y_i=1$ given the input variable $x_i$ and fixed parameters
# $\boldsymbol{\beta}$.
#
# Note that we used
#
# $$
# p(y_i=0\vert x_i, \boldsymbol{\beta})+p(y_i=1\vert x_i, \boldsymbol{\beta}) = 1.
# $$
#
# At the output we define $\widetilde{y}_i=1$ if
# $p(y_i=1\vert x_i, \boldsymbol{\beta})>1/2$ and $\widetilde{y}_i=0$ otherwise.
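# As a quick, minimal illustration of these definitions (the parameter values
# $\beta_0=-1$, $\beta_1=2$ and the inputs below are chosen arbitrarily, not
# taken from any dataset), we can evaluate $p(y_i=1|x_i,\boldsymbol{\beta})$
# with the logistic function and turn it into the hard prediction
# $\widetilde{y}_i$.

# In[ ]:


import numpy as np

def logistic(t):
    """The logistic function p(t) = 1/(1+exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

# arbitrary parameters and inputs, for illustration only
beta0, beta1 = -1.0, 2.0
x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

p = logistic(beta0 + beta1 * x)   # p(y_i=1 | x_i, beta)
y_tilde = (p > 0.5).astype(int)   # hard classification: 1 if p > 1/2, else 0

print("p(y=1|x):", p)
print("predicted classes:", y_tilde)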
# ## Maximum likelihood
#
# In order to define the total likelihood for all possible outcomes of a
# dataset $\mathcal{D}=\{(y_i,x_i)\}$, with binary labels
# $y_i\in\{0,1\}$ and data points drawn independently, we use the so-called
# [Maximum Likelihood Estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) (MLE) principle.
#
# We thus aim at maximizing the probability of seeing the observed data.
# Since the data points are independent, the likelihood is the product of the
# individual probabilities of a specific outcome $y_i$, that is
#
# $$
# \begin{align*}
# P(\mathcal{D}|\boldsymbol{\beta})& = \prod_{i=1}^n \left[p(y_i=1|x_i,\boldsymbol{\beta})\right]^{y_i}\left[1-p(y_i=1|x_i,\boldsymbol{\beta})\right]^{1-y_i}\nonumber
# \end{align*}
# $$
#
# If this probability is maximized, so is its $\log$ value, and its $-\log$
# value is minimized. We can then define a corresponding **cost** function as
#
# $$ C(\boldsymbol{\beta}) = -\log P(\mathcal{D}|\boldsymbol{\beta}), $$
#
# which reads explicitly
#
# $$
# C(\boldsymbol{\beta}) = \sum_{i=1}^n \left\{-y_i\log[p(y_i=1|x_i,\boldsymbol{\beta})] - (1-y_i)\log\left[1-p(y_i=1|x_i,\boldsymbol{\beta})\right]\right\}.
# $$
#
# This expression is known in statistics as the **cross entropy**.
#
# Finally, we note that just as in linear regression, in practice we often
# supplement the cross entropy with additional regularization terms, usually
# $L_1$ or $L_2$ penalties as we did for Lasso and Ridge regression.
#
# ## Minimizing the cross entropy cost function
#
# The derivative is easily obtained,
#
# \begin{equation}
# \frac{\partial C(\boldsymbol{\beta})}{\partial\beta_j}=\sum_{i=1}^n \left(-\frac{y_i}{p_i}+\frac{1-y_i}{1-p_i}\right)\frac{\partial p_i}{\partial\beta_j},
# \end{equation}
#
# where we denoted $p(y_i=1|x_i,\boldsymbol{\beta})=p_i$ and
#
# \begin{equation}
# \frac{\partial p_i}{\partial\beta_j}= \frac{\partial}{\partial\beta_j}\frac{1}{1+e^{-(\beta_0+\beta_1x_i)}}=
# \frac{1}{1+e^{-(\beta_0+\beta_1x_i)}}\frac{1}{1+e^{(\beta_0+\beta_1x_i)}}(\delta_{j=0}+x_i\delta_{j=1})=p_i (1-p_i) (\delta_{j=0}+x_i\delta_{j=1}).
# \end{equation}
#
# We therefore see that
#
# \begin{equation}
# \frac{\partial C(\boldsymbol{\beta})}{\partial\beta_j}=\sum_{i=1}^n \left(-\frac{y_i}{p_i}+\frac{1-y_i}{1-p_i}\right)p_i (1-p_i)(\delta_{j=0}+x_i\delta_{j=1})=\sum_{i=1}^n (p_i-y_i)(\delta_{j=0}+x_i\delta_{j=1}).
# \end{equation}
#
# If we define a design matrix $\mathbf{X}$ with two columns, whose rows are
# $[1,x_i]$, i.e., the first column is all ones and the second contains the
# input observables $x_i$, then we can write the gradient of the cost function as
#
# $$\frac{\partial C(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}= \mathbf{X}^T (\mathbf{p}-\mathbf{y}).$$
#
# Here $\mathbf{p},\mathbf{y}\in \mathbb{R}^n$ and $\mathbf{X}\in \mathbb{R}^{n\times 2}$.
#
# The second derivative of the cost function is
#
# \begin{equation}
# \frac{\partial^2 C(\boldsymbol{\beta})}{\partial\beta_l\partial\beta_j}=\frac{\partial}{\partial\beta_l}\sum_{i=1}^n (p_i-y_i)(\delta_{j=0}+x_i\delta_{j=1})=\sum_{i=1}^n
# (\delta_{j=0}+x_i\delta_{j=1})\frac{\partial p_i}{\partial\beta_l}=
# \sum_{i=1}^n
# (\delta_{j=0}+x_i\delta_{j=1})\,p_i(1-p_i)\,(\delta_{l=0}+x_i \delta_{l=1}).
# \end{equation}
#
# This can be compactly written as
#
# \begin{equation}
# \frac{\partial^2 C(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}^T\partial\boldsymbol{\beta}}= \mathbf{X}^T \mathbf{W} \mathbf{X},
# \end{equation}
#
# where $\mathbf{W}$ is a diagonal matrix with entries $W_{ii}=p_i(1-p_i)$.
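# As a sanity check of the formulas above, here is a minimal sketch (the helper
# name `cross_entropy_and_derivatives` and the toy data are ours, chosen purely
# for illustration) that evaluates the cross entropy, its gradient
# $\mathbf{X}^T(\mathbf{p}-\mathbf{y})$ and the Hessian
# $\mathbf{X}^T\mathbf{W}\mathbf{X}$ for the two-parameter model.

# In[ ]:


import numpy as np

def cross_entropy_and_derivatives(beta, x, y):
    """Cross entropy, gradient X^T (p - y) and Hessian X^T W X for the model
    p_i = 1/(1 + exp(-(beta_0 + beta_1 x_i)))."""
    X = np.column_stack([np.ones_like(x), x])   # rows of the design matrix: [1, x_i]
    p = 1.0 / (1.0 + np.exp(-X @ beta))         # p_i
    eps = 1e-12                                 # guard against log(0)
    cost = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    grad = X.T @ (p - y)
    hessian = X.T @ np.diag(p * (1 - p)) @ X    # W_ii = p_i (1 - p_i)
    return cost, grad, hessian

# toy data, for illustration only
x = np.array([-2.0, -1.0, -0.5, 0.3, 0.5, 1.2, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])

cost, grad, hessian = cross_entropy_and_derivatives(np.zeros(2), x, y)
print("cost:", cost)
print("gradient:", grad)
print("Hessian:\n", hessian)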
#
# This matrix is positive (semi-)definite, which can be proven by noting that we
# can redefine $\widetilde{X}_{ij}=\sqrt{p_i(1-p_i)}\, X_{ij}$, in terms of which
# the second derivative is just $\widetilde{X}^T\widetilde{X}$. The proof is then
# identical to the one in linear regression: the singular values $\sigma$ of
# $\widetilde{X}$ exist, the eigenvalues of $\widetilde{X}^T\widetilde{X}$ are
# equal to $\sigma^2\ge 0$, and when all singular values are nonzero (i.e.
# $\widetilde{X}$ has full column rank) the matrix is positive definite.
#
# The consequence is that the cost function is convex, a global minimum exists,
# and it is found at the solution of
#
# $$\frac{\partial C(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}=0.$$
#
# Finally, to find the minimum with respect to the parameters $\beta_0,\beta_1$,
# we need to solve a nonlinear set of equations, which can only be done
# numerically.
#
# The method of choice is **Newton's method**; its simplification, in which the
# inverse Hessian is replaced by a single number, is the **gradient descent**
# method.
#
# ### Gradient descent method
#
# If we want to find the zero of the gradient,
#
# $$0=\frac{\partial C}{\partial \beta_j}= \sum_i X_{ij}(p_i(\boldsymbol{\beta})-y_i) \equiv g_j(\boldsymbol{\beta}),$$
#
# we can Taylor expand $g$ around the current iterate $\beta_k$ and solve for its
# zero, which gives us the famous Newton-Raphson method
#
# $$\beta_{k+1} = \beta_k - (\nabla_\beta g(\beta_k))^{-1} g(\beta_k),$$
#
# which is a generalization of the 1D case
# $x_{k+1}=x_k - \frac{g(x_k)}{g'(x_k)}$. Note that $g(\beta_k)$ is a vector, and
# its derivative is a matrix.
#
# We note that
#
# $$\nabla_\beta g(\beta_k)\equiv \frac{\partial^2 C(\boldsymbol{\beta})}{\partial\beta_l\partial\beta_j}$$
#
# is the second derivative, the so-called Hessian matrix. We will denote it by
#
# $$H_{lj}(\boldsymbol{\beta})\equiv\frac{\partial^2 C(\boldsymbol{\beta})}{\partial\beta_l\partial\beta_j}.$$
#
# In Newton's method we thus need to iterate the following equation
#
# $$ \beta_{k+1} = \beta_k - H^{-1}(\beta_k)\, g(\beta_k).$$
#
# In most practical applications, the Hessian matrix is very expensive to
# calculate and to invert. Instead, in most gradient descent methods one
# replaces the inverse Hessian $H^{-1}$ by a single number, which is called
# **the learning rate**.
#
# The latter is usually a parameter that is adjusted during the iterations, so
# that the minimum of $C$ is found. Note that the equation
#
# $$\beta_{k+1}=\beta_k - \gamma\, g(\beta_k)$$
#
# for an appropriately chosen number $\gamma>0$ also approaches the minimum,
# since $-g(\beta)$ points in the direction of descent of the function
# $C(\beta)$. The problem is that we do not know a priori how large $\gamma$
# should be.
#
# This scheme is usually termed the **gradient descent (GD)** method. It relies
# on the fact that a function $F(\mathbf{x})$,
# $\mathbf{x} \equiv (x_1,\cdots,x_n)$, decreases fastest if one goes from
# $\mathbf{x}$ in the direction of the negative gradient $-\nabla F(\mathbf{x})$.
#
# It can be shown that if
#
# $$
# \mathbf{x}_{k+1} = \mathbf{x}_k - \gamma_k \nabla F(\mathbf{x}_k),
# $$
#
# with $\gamma_k > 0$ small enough, then $F(\mathbf{x}_{k+1}) \leq
# F(\mathbf{x}_k)$. This means that for a sufficiently small $\gamma_k$
# we are always moving towards smaller function values, i.e., a minimum.
#
# In machine learning we are often faced with non-convex, high-dimensional
# cost functions with many local minima. Since **GD** is deterministic, we
# will get stuck in a local minimum if the method converges, unless we
# have a very good initial guess. This also implies that the scheme is
# sensitive to the chosen initial condition.
#
# Many of these shortcomings can be alleviated by introducing
# randomness. One such method is Stochastic Gradient Descent
# (SGD).
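# To make the iterations concrete, here is a minimal sketch (the toy data and
# the function names `newton_logistic` and `gd_logistic` are ours, chosen for
# illustration only) of the Newton-Raphson update
# $\beta_{k+1} = \beta_k - H^{-1}(\beta_k) g(\beta_k)$ and, for comparison,
# plain gradient descent with a fixed learning rate $\gamma$.

# In[ ]:


import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def newton_logistic(x, y, n_iter=20):
    """Newton-Raphson for the two-parameter logistic model: full Hessian step."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        g = X.T @ (p - y)                     # gradient
        H = X.T @ np.diag(p * (1 - p)) @ X    # Hessian X^T W X
        beta = beta - np.linalg.solve(H, g)   # beta_{k+1} = beta_k - H^{-1} g
    return beta

def gd_logistic(x, y, gamma=0.1, n_iter=5000):
    """Gradient descent: the inverse Hessian is replaced by a learning rate."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        beta = beta - gamma * X.T @ (p - y)
    return beta

# toy, non-separable data, for illustration only
x = np.array([-2.0, -1.0, -0.5, 0.3, 0.5, 1.2, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])

print("Newton-Raphson:", newton_logistic(x, y))
print("Gradient descent:", gd_logistic(x, y))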
#
# ## Stochastic Gradient Descent
#
# The underlying idea of SGD comes from the observation that the cost
# function, which we want to minimize, can almost always be written as a
# sum over $n$ data points $\{\mathbf{x}_i\}_{i=1}^n$,
#
# $$
# C(\boldsymbol{\beta}) = \sum_{i=1}^n c_i(\mathbf{x}_i,
# \boldsymbol{\beta}).
# $$
#
# This in turn means that the gradient can be
# computed as a sum over $i$-gradients
#
# $$
# \nabla_\beta C(\boldsymbol{\beta}) = \sum_{i=1}^n \nabla_\beta c_i(\mathbf{x}_i,
# \boldsymbol{\beta}).
# $$
#
# Stochasticity/randomness is introduced by only taking the
# gradient on a subset of the data called a minibatch. If there are $n$
# data points and the size of each minibatch is $M$, there will be $n/M$
# minibatches. We denote these minibatches by $B_k$ where
# $k=1,\cdots,n/M$. We have
#
# $$
# \beta_{j+1} = \beta_j - \gamma_j \sum_{i \in B_k} \nabla_\beta c_i(\mathbf{x}_i,
# \boldsymbol{\beta}),
# $$
#
# where $k$ is picked at random with equal
# probability from $[1,n/M]$. An iteration over the number of
# minibatches ($n/M$) is commonly referred to as an epoch. Thus it is
# typical to choose a number of epochs and for each epoch iterate over
# the number of minibatches.
#
# More sophisticated optimizers, in particular ADAM, are used in machine
# learning; they are more technical and beyond the scope of these lectures.
#
# ## Extending to more classes (beyond two outputs)
#
# If the output requires more than two classes, let us say $K$ (example: the
# digits $0$-$9$, i.e. $K=10$), we can generalize the sigmoid function to the
# so-called softmax function, in which each output neuron has a probability
# proportional to $e^{z_j}$ with $j=1,\dots,K$. The probability takes the
# following form:
#
# $$
# f(z_i) = \frac{\exp{(z_i)}}{\sum_{m=1}^K\exp{(z_m)}},
# $$
#
# where $z_i=\sum_{j}\beta_{i j} x_j + b_i$.
#
# This is called the **softmax function**, which is essentially a
# Boltzmann-weighted sum.
#
# Note that the derivatives are now a bit more involved, i.e.,
#
# $$
# \frac{\partial f(z_i)}{\partial \beta_{jk}} =
# \frac{\exp{(z_i)}}{\sum_{m=1}^K\exp{(z_m)}}\frac{\partial z_i}{\partial \beta_{jk}}
# -\sum_p \frac{\exp{(z_i)}}{\left(\sum_{m=1}^K\exp{(z_m)}\right)^2} \exp{(z_p)}\frac{\partial z_p}{\partial \beta_{jk}}.
# $$
#
# Since
#
# $$
# \frac{\partial z_p}{\partial \beta_{jk}}=\delta_{p,j}\, x_k,
# $$
#
# we have
#
# $$
# \frac{\partial f(z_i)}{\partial \beta_{jk}}= f(z_i)\left(\delta_{ij}-f(z_j)\right)x_k,\\
# \frac{\partial f(z_i)}{\partial b_{j}}=f(z_i)\left(\delta_{ij}-f(z_j)\right),
# $$
#
# which in the simple binary model reduces to the case $i=j$.

# In[ ]:
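# The sketch below (toy data and all names are ours, purely illustrative)
# combines the two ideas of this section: a minibatch SGD loop for the binary
# logistic model, and the softmax function, which for K=2 reduces to the
# sigmoid used above.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def softmax(z):
    """Softmax over the last axis: f(z_i) = exp(z_i) / sum_m exp(z_m)."""
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# toy binary data, for illustration only
x = np.array([-2.0, -1.0, -0.5, 0.3, 0.5, 1.2, 2.0, 3.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

beta = np.zeros(2)
gamma, M, n_epochs = 0.1, 2, 500            # learning rate, minibatch size, epochs
n = len(y)

for epoch in range(n_epochs):
    perm = rng.permutation(n)               # shuffle, then sweep over the n/M minibatches
    for k in range(n // M):
        idx = perm[k * M:(k + 1) * M]       # minibatch B_k
        p = sigmoid(X[idx] @ beta)
        beta = beta - gamma * X[idx].T @ (p - y[idx])

print("SGD estimate of beta:", beta)
print("softmax of [0, 1, 2]:", softmax(np.array([0.0, 1.0, 2.0])))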