#!/usr/bin/env python
# coding: utf-8

# ## Week 2+3. Regularization, Linear Classification, towards Logistic regression
#
# Augustin Cosse
#
# __Material covered:__ Ridge and LASSO regression, linear classification through multiple discriminants, OLS and normal equations, one-vs-rest and one-vs-one classifiers. Use of the meshgrid function to display the classification boundaries.

# ### Example 1. Binary discriminant through OLS and Normal Equations
#
# In this first demo, we will solve the classification problem as follows.
#
# - We first assign 0/1 labels to each of the points (recall that we are in the classification framework, so the target now takes a finite set of values). That is, we define our targets as 1 and 0 depending on whether our points are from the yellow or the purple class. If we have $N$ points, this gives a vector
# $$\mathbf{t} = [t^{(1)}, t^{(2)}, \ldots, t^{(N)}]$$
#
# - We then store the sample points in a matrix $\mathbf{X}$ as we did for regression. In this case the points are 2D, so we have
#
# $$\mathbf{X} = \left[\begin{array}{c}
# (\mathbf{x}^{(1)})^T\\
# (\mathbf{x}^{(2)})^T\\
# \vdots\\
# (\mathbf{x}^{(N)})^T
# \end{array}\right]$$
#
# where each $\mathbf{x}^{(i)} = \left[\begin{array}{c}
# x^{(i)}_1\\
# x^{(i)}_2
# \end{array}\right]$ now encodes the two coordinates of the corresponding point in the dataset below. We want to learn a model of the form
#
# $y(\mathbf{x}) = \beta_0 + \beta_1X_1 + \beta_2X_2$
#
# that outputs a prediction $y(\mathbf{x}^{(i)})$ that is as close as possible to the target $t^{(i)}$ of the point. We encode this model by adding an additional column of $1$'s to the matrix $\mathbf{X}$ above to get
#
# $$\mathbf{X} = \left[\begin{array}{cc}
# 1 & (\mathbf{x}^{(1)})^T\\
# 1 & (\mathbf{x}^{(2)})^T\\
# \vdots & \vdots\\
# 1 & (\mathbf{x}^{(N)})^T
# \end{array}\right]$$
#
# We can then write the model as $\mathbf{y} = \mathbf{X}\mathbf{\beta}$, and we want $\mathbf{y}$ to be as close as possible to $\mathbf{t}$ (given that what we can achieve is limited by the linearity of the model).
#
# A natural approach, given what we have learned so far, is thus to minimize the OLS criterion
#
# $$\min_{\beta_0, \beta_1, \beta_2} \frac{1}{N}\sum_{i=1}^N |t^{(i)} - (\beta_0 + \beta_1X^{(i)}_1 + \beta_2X^{(i)}_2)|^2$$
#
# As we saw in regression, this criterion can be written in matrix form: defining the residual vector
#
# $$\mathbf{v} = \mathbf{X}\mathbf{\beta} - \mathbf{t}$$
#
# the problem becomes
#
# $$\min_{\mathbf{\beta}}\frac{1}{N}\mathbf{v}^T\mathbf{v} = \min_{\mathbf{\beta}} \frac{1}{N}\left(\mathbf{X}\mathbf{\beta} - \mathbf{t}\right)^T\left(\mathbf{X}\mathbf{\beta} - \mathbf{t}\right)$$
#
# Instead of using a gradient descent approach, we can alternatively set the derivative of the loss with respect to the weights $\mathbf{\beta}$ to zero and solve the resulting equations. In this case (you can verify it by computing the derivatives with respect to each of the $\beta_j$ and setting them to zero), this gives the normal equations
#
# $$\mathbf{X}^T\left(\mathbf{X}\mathbf{\beta} - \mathbf{t}\right) = 0 \leftrightarrow \mathbf{X}^T\mathbf{X}\mathbf{\beta} = \mathbf{X}^T\mathbf{t}$$
#
# which can be solved by using the inverse of $\mathbf{X}^T\mathbf{X}$, giving $\mathbf{\beta} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{t}$.
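# Before solving the normal equations, here is a minimal sketch (not part of the original demo)
# of the gradient descent alternative mentioned above: the gradient of the loss is
# (2/N) X^T (X beta - t), and repeatedly stepping against it converges to the same beta as the
# closed-form solution. The data, step size and number of iterations below are illustrative choices only.

# In[ ]:


import numpy as np

rng = np.random.default_rng(0)
X_demo = np.hstack((np.ones((50, 1)), rng.normal(size=(50, 2))))   # 50 points, with a column of ones
t_demo = (X_demo[:, 1] + X_demo[:, 2] > 0).astype(float)           # illustrative 0/1 targets

beta_gd = np.zeros(3)
step = 0.1
for _ in range(5000):
    gradient = 2 / len(t_demo) * X_demo.T.dot(X_demo.dot(beta_gd) - t_demo)
    beta_gd = beta_gd - step * gradient

beta_closed_form = np.linalg.solve(X_demo.T.dot(X_demo), X_demo.T.dot(t_demo))
print(beta_gd)
print(beta_closed_form)   # the two vectors should (approximately) coincide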
# In[1]:


# We start by generating data using the make_blobs function from scikit-learn.
# We first consider a two-class dataset. We then learn the discriminant by solving the normal equations.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, t = make_blobs(n_samples=100, n_features=2, centers=[[0,0], [1,1]], cluster_std=[.2, .2])
plt.scatter(X[:,0], X[:,1], c = t)
plt.show()

num_samples = X.shape[0]

# 1) Learn the discriminant through the normal equations
sampleMat = np.hstack((np.ones((num_samples,1)), X))
RHS = np.dot(sampleMat.T, t)
normalMat = np.dot(sampleMat.T, sampleMat)

# beta = (X^T X)^{-1} (X^T t)
# the line below solves the linear system normalMat * beta = RHS
beta = np.linalg.solve(normalMat, RHS)
print(beta)


# In[3]:


# Now that we have learned beta, we can represent our model by dividing the space into a grid of equispaced points.
# Concretely, we first compute the minimum and maximum X1 and X2. Once we have those, we generate all the
# points between X1_min and X1_max on the one hand, and between X2_min and X2_max on the other. Call those two
# vectors v1 and v2; they contain the X1 coordinates and the X2 coordinates of the grid points.
x1min = np.amin(X[:,0])
x1max = np.amax(X[:,0])
x2min = np.amin(X[:,1])
x2max = np.amax(X[:,1])

x1mesh = np.linspace(x1min, x1max, 100)
x2mesh = np.linspace(x2min, x2max, 100)

# To compute the prediction, however, we need to generate the pairs (X1, X2) for all the points
# on the grid. This is achieved through the meshgrid function, which takes as first and second arguments
# the vectors of X1 and X2 coordinates respectively and returns the X1 and X2 coordinates of all the points on the grid.
# As an example, if we want a grid with X1 = (1,2,3) and X2 = (1,2,3), meshgrid(X1, X2) will return the matrices
# (1 2 3; 1 2 3; 1 2 3) and (1 1 1; 2 2 2; 3 3 3), as the small check below illustrates.
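# (Quick illustration of the example above; this tiny check is not part of the original notebook and can be removed.)
demo_X1, demo_X2 = np.meshgrid(np.array([1, 2, 3]), np.array([1, 2, 3]))
print(demo_X1)   # rows repeat (1 2 3): the X1 coordinate of every grid point
print(demo_X2)   # columns repeat (1 2 3)^T: the X2 coordinate of every grid point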
x1MeshMat, x2MeshMat = np.meshgrid(x1mesh, x2mesh)

# Meshgrid thus gives us the X1 and X2 coordinates of all the points on the grid (not just the border).
# Those coordinates are returned as matrices, so to build the X matrix used to compute our prediction,
# we need to flatten the X1 and X2 meshgrid matrices and then add a column of 1's in front of those coordinates.
x1MeshMat = x1MeshMat.flatten()
x2MeshMat = x2MeshMat.flatten()
X12 = np.vstack((x1MeshMat, x2MeshMat)).T
totalNumPointsGrid = len(x1MeshMat)

# generating the additional vector of ones
addon = np.ones((totalNumPointsGrid, 1))

# stacking it to the matrix of grid point coordinates
pointsGrid = np.hstack((addon.reshape(-1,1), X12))

# Now that we have computed the X matrix, and since we have previously obtained the weights of our discriminant,
# we can get the predicted labels for the points over the whole space.
prediction = np.dot(pointsGrid, beta)

# Since we chose the targets to be 0 or 1, we classify our points depending on whether the output of
# the model is larger or smaller than 1/2.
prediction[prediction < .5] = 0
prediction[prediction >= .5] = 1
print(prediction)

# To display the class of each point, we use the function contourf. This function takes as first and second
# arguments a matrix of X1 coordinates (coordinates of all points on the grid) and a matrix of X2 coordinates:
# the reverse of meshgrid, if you will.
coordinatesX1 = np.reshape(x1MeshMat, (100,100))
coordinatesX2 = np.reshape(x2MeshMat, (100,100))
reshaped_prediction = np.reshape(prediction, (100,100))

fig, ax = plt.subplots(constrained_layout=True)
plt.scatter(X[:,0], X[:,1], c = t)
ax.contourf(coordinatesX1, coordinatesX2, reshaped_prediction, alpha=0.1)
plt.show()


# ### Example 2. From the binary classifier to the multiclass discriminant
#
# __Material covered:__ one-vs-rest, one-vs-one, multiclass discriminant, one-hot encoding

# In[71]:


import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, t = make_blobs(n_samples=100, n_features=2, centers=[[-1,0], [1,0], [.5, 1]], cluster_std=[.2, .2, .2])
plt.scatter(X[:,0], X[:,1], c = t)
plt.show()

# We consider a 3-class problem for which we code the one-vs-rest and one-vs-one classifiers.
K = 3
dim = 2
numSamplePoints = X.shape[0]

# 1) one vs rest
target = np.zeros((numSamplePoints, K))
Xtilde = np.hstack((np.ones((numSamplePoints, 1)), X))
beta = np.zeros((dim+1, K))

for i in np.arange(0, K):
    target[t == i, i] = 1
    beta[:,i] = np.dot(np.linalg.inv(np.dot(Xtilde.T, Xtilde)), np.dot(Xtilde.T, target[:,i]))

# plotting and generating the grid
x1min = np.amin(X[:,0])
x1max = np.amax(X[:,0])
x2min = np.amin(X[:,1])
x2max = np.amax(X[:,1])

x1mesh = np.linspace(x1min, x1max, 100)
x2mesh = np.linspace(x2min, x2max, 100)

X1meshMat, X2meshMat = np.meshgrid(x1mesh, x2mesh)
coordinates_grid = np.vstack((X1meshMat.flatten(), X2meshMat.flatten())).T
coordinates_tilde = np.hstack((np.ones((np.shape(coordinates_grid)[0], 1)), coordinates_grid))

# getting the predictions
out_classifier = np.zeros((np.shape(coordinates_grid)[0], K))
indClassification = np.zeros((np.shape(coordinates_grid)[0], K))
target_out_total = np.zeros((np.shape(coordinates_grid)[0], 1))

for i in np.arange(0, K):
    out_classifier[:,i] = np.dot(coordinates_tilde, beta[:,i])
    indClassification[out_classifier[:,i] > 0.5, i] = 1
    target_out_total[out_classifier[:,i] > 0.5] = i

# looking for overlaps (grid points claimed by several classes or by none)
misclassification = np.sum(indClassification, axis=1)
ind_misclassified1 = np.where(misclassification > 1)
ind_misclassified2 = np.where(misclassification == 0)
target_out_total[ind_misclassified1] = K
target_out_total[ind_misclassified2] = K
print(ind_misclassified1)
print(ind_misclassified2)


# In[75]:


# plotting the resulting classifier
fig, ax = plt.subplots(constrained_layout=True)
plt.scatter(X[:,0], X[:,1], c = t)
ax.contourf(X1meshMat, X2meshMat, target_out_total.reshape((100,100)), alpha=0.2)
plt.show()


# The plot above also shows that the classifier cannot properly classify some regions of
# the (x_1, x_2) space (shown in yellow). In those regions, points are either classified as belonging to
# multiple classes (such as in the center) or they do not get any label at all (such as the triangles on the border).
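# A small, optional check (not in the original notebook) that quantifies the claim above, assuming
# the variables from the one-vs-rest cell (in particular `misclassification`) are still in scope:
# it counts how many of the 100x100 grid points are claimed by several classes or by none.

# In[ ]:


num_grid_points = misclassification.shape[0]
claimed_by_several = np.sum(misclassification > 1)
claimed_by_none = np.sum(misclassification == 0)
print("grid points claimed by more than one class:", claimed_by_several)
print("grid points claimed by no class:", claimed_by_none, "out of", num_grid_points)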
# In[111]:


# We now code the one-vs-one classifier. We still consider the three-class dataset from above.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, t = make_blobs(n_samples=100, n_features=2, centers=[[-1,0], [1,0], [.5, 1]], cluster_std=[.2, .2, .2])
plt.scatter(X[:,0], X[:,1], c = t)
plt.show()

K = 3
dim = 2
numSamplePoints = X.shape[0]

Xtilde = np.hstack((np.ones((numSamplePoints, 1)), X))

# we will store our beta vectors in a 3D array (one weight vector per pair of classes)
beta_tensor = np.zeros((K, K, dim+1))

# generating the grid (the grid points are classified as soon as each pairwise classifier is learned)
x1min = np.amin(X[:,0])
x1max = np.amax(X[:,0])
x2min = np.amin(X[:,1])
x2max = np.amax(X[:,1])

x1mesh = np.linspace(x1min, x1max, 100)
x2mesh = np.linspace(x2min, x2max, 100)

X1meshMat, X2meshMat = np.meshgrid(x1mesh, x2mesh)
coordinates_grid = np.vstack((X1meshMat.flatten(), X2meshMat.flatten())).T
coordinates_tilde = np.hstack((np.ones((np.shape(coordinates_grid)[0], 1)), coordinates_grid))

# array counting the votes received by each grid point for each class
count = np.zeros((np.shape(coordinates_grid)[0], K))

for i in np.arange(0, K):
    for j in np.arange(0, K):
        if i > j:
            # restrict the dataset to the points of classes i and j
            ind_ij = np.concatenate((np.where(t == i)[0], np.where(t == j)[0]))
            XtildeLocal = Xtilde[ind_ij, :]
            tlocal = t[ind_ij]
            tlocal2 = np.zeros(np.shape(tlocal))
            tlocal2[tlocal == i] = 1   # here we do i vs j
            tlocal2[tlocal == j] = 0
            beta_ij = np.dot(np.linalg.inv(np.dot(XtildeLocal.T, XtildeLocal)), np.dot(XtildeLocal.T, tlocal2))
            beta_tensor[i, j, :] = beta_ij
            beta_tensor[j, i, :] = beta_ij
            prediction = np.dot(coordinates_tilde, beta_ij)
            count[prediction > 0.5, i] += 1
            count[prediction < 0.5, j] += 1

# Now that we have all the beta vectors, we compute each of the regions through a majority vote.
# That is, for every grid point we look at the label returned by each of the binary classifiers and
# add one vote each time the point is classified in a given class; the final label is the class with the most votes.
count_final = np.squeeze(np.zeros((np.shape(coordinates_grid)[0], 1)))
for i in np.arange(np.shape(coordinates_grid)[0]):
    count_final[i] = np.argmax(count[i,:])

# (one could add a few lines highlighting the regions where the vote is tied; see the small check after the plot below)


# In[112]:


# plotting the resulting classifier
fig, ax = plt.subplots(constrained_layout=True)
plt.scatter(X[:,0], X[:,1], c = t)
ax.contourf(X1meshMat, X2meshMat, count_final.reshape((100,100)), alpha=0.2)
plt.show()


# Note that although the plot does not show it, there can be regions in space where two classes receive the same
# number of votes, so there is also some form of ambiguity here. When taking the max of the votes, np.argmax simply
# returns the first index that achieves the maximum in case of a tie, which is why the ambiguity is not visible on the plot.
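# A minimal sketch (not part of the original notebook) that makes the remark above concrete, assuming
# `count` from the one-vs-one cell is still in scope: it counts the grid points where the two largest
# vote totals are equal, i.e. where the majority vote is tied.

# In[ ]:


sorted_votes = np.sort(count, axis=1)
tied = sorted_votes[:, -1] == sorted_votes[:, -2]
print("grid points with a tied vote:", np.sum(tied), "out of", count.shape[0])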
# In[142]:


# To conclude, we consider a K-class discriminant function. We still consider the 3-class example from above.
# In this case, the first step is to encode each of the targets through a one-hot encoding.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, t = make_blobs(n_samples=100, n_features=2, centers=[[-1,0], [1,0], [.5, 1]], cluster_std=[.2, .2, .2])
plt.scatter(X[:,0], X[:,1], c = t)
plt.show()

K = 3
dim = 2
numSamplePoints = X.shape[0]

# one-hot encoding of the targets: row i has a single 1, in the column given by the class t[i]
oneHotEncodingT = np.zeros((len(t), K))
oneHotEncodingT[np.arange(0, len(t)), t] = 1

# As before, we build the matrix Xtilde by adding a column of 1's to X.
Xtilde = np.hstack((np.ones((np.shape(X)[0], 1)), X))

# We then solve the matrix system Xtilde*BETA = T (here both the unknown BETA and the target T are matrices).
solutionBeta = np.dot(np.linalg.inv(np.dot(Xtilde.T, Xtilde)), np.dot(Xtilde.T, oneHotEncodingT))

# If you print the shape of solutionBeta you will see that it has size (dim+1) x K = 3 x 3, so that we
# now have one weight vector per class. We then apply this weight matrix to the meshgrid data, as we did before.
x1min = np.amin(X[:,0])
x1max = np.amax(X[:,0])
x2min = np.amin(X[:,1])
x2max = np.amax(X[:,1])

x1mesh = np.linspace(x1min, x1max, 100)
x2mesh = np.linspace(x2min, x2max, 100)

X1meshMat, X2meshMat = np.meshgrid(x1mesh, x2mesh)
Xmeshgrid = np.vstack((X1meshMat.flatten(), X2meshMat.flatten()))
XmeshgridTilde = np.hstack((np.ones((np.shape(Xmeshgrid.T)[0], 1)), Xmeshgrid.T))

prediction_grid = np.dot(XmeshgridTilde, solutionBeta)

# From this, we define the class of each grid point as the index of the column of prediction_grid that gives
# the highest value in the corresponding row (one row per grid point, one column per class).
final_classificationTargets = np.squeeze(np.zeros((np.shape(prediction_grid)[0], 1)))
for i in np.arange(0, len(final_classificationTargets)):
    final_classificationTargets[i] = np.argmax(prediction_grid[i,:])

# Now we display the prediction using contourf.


# In[143]:


# plotting the resulting classifier
fig, ax = plt.subplots(constrained_layout=True)
plt.scatter(X[:,0], X[:,1], c = t)
ax.contourf(X1meshMat, X2meshMat, final_classificationTargets.reshape((100,100)), alpha=0.2)
plt.show()
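# A small, optional sanity check (not in the original notebook), assuming Xtilde, solutionBeta and t
# from the cells above are still in scope: apply the K-class discriminant to the training points themselves
# and compare the predicted class (argmax over the K outputs) with the true labels.

# In[ ]:


train_prediction = np.argmax(np.dot(Xtilde, solutionBeta), axis=1)
print("training accuracy of the K-class discriminant:", np.mean(train_prediction == t))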
# We are now guaranteed to predict a single class for each of the points from our dataset
# (except if our classifier outputs exactly the same value twice, which is quite unlikely).
# The OLS classifier which we used throughout the various steps above is however not robust to outliers.

# ### Example 3. The sigmoid function

# In[145]:


# To resolve the lack of robustness of the OLS classifier with respect to outliers, we can compose it with an
# additional nonlinear function which returns approximately the same value regardless of the distance of a point
# to the separating plane. That is, we would like the function to return a value alpha_0 when the output of the
# OLS classifier is above the decision threshold and a value alpha_1 when it is below. A further improvement
# is achieved if we can 'force' those two values to be 0 and 1: in that case, taking the decision regarding the
# classes becomes a lot easier. As we will see, this is approximately what the sigmoid gives us.
# From the two figures below, you can see that the sigmoid always returns a value between 0 and 1 and that it
# quickly saturates as we move away from 0.

x = np.linspace(-10, 10, 100)
sigmoid = np.true_divide(1, 1 + np.exp(-x))
plt.plot(x, sigmoid)
plt.show()


# In[146]:


x = np.linspace(-100, 100, 100)
sigmoid = np.true_divide(1, 1 + np.exp(-x))
plt.plot(x, sigmoid)
plt.show()
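# To close the loop with the title of the notebook ("towards Logistic regression"), here is a short,
# self-contained sketch (not part of the original notebook) of the idea motivated above: squash the output
# of a linear discriminant, fitted here with the normal equations for simplicity, through a sigmoid so that
# every point receives a score close to 0 or 1 whatever its distance to the separating plane. The data,
# the centering at the 0.5 threshold and the scale factor 10 are illustrative choices only; logistic
# regression itself would instead fit the weights by maximising the likelihood under this sigmoid model.

# In[ ]:


import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X_demo, t_demo = make_blobs(n_samples=100, n_features=2, centers=[[0, 0], [1, 1]], cluster_std=[.2, .2])
Xtilde_demo = np.hstack((np.ones((X_demo.shape[0], 1)), X_demo))

# linear discriminant fitted by least squares (normal equations)
beta_demo = np.linalg.solve(np.dot(Xtilde_demo.T, Xtilde_demo), np.dot(Xtilde_demo.T, t_demo))

raw_output = np.dot(Xtilde_demo, beta_demo)                    # unbounded linear scores
squashed_output = 1 / (1 + np.exp(-10 * (raw_output - 0.5)))   # sigmoid centered at the 0.5 threshold

plt.scatter(raw_output, squashed_output, c=t_demo)
plt.xlabel('linear (OLS) output')
plt.ylabel('after the sigmoid')
plt.show()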