import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
First we'll generate a random matrix
#Number of columns (features)
K = 5
#Number of records
N = 1000
#Generate an NxK matrix of uniform random variables
X = np.random.random([N, K])
Let's peak at our data to confirm it looks as we expect it
X[100, :]
array([0.9786862 , 0.50933013, 0.47327265, 0.57154566, 0.17725616])
X.shape
(1000, 5)
This exercise is about designing a scoring function for a logistic regression. As we are not concerned with fitting a model to data, we can just make up a logistic regression.
For quick intro, the Logistic Regression takes the form of ˆY=f(x∗βT), where x is the 1xK vector of features and β is the 1xK vector of weights. The function f, called a 'link' function, is the inverse logit:
In this notebook we'll write a function that, given inputs of X and β, returns a value for ˆY.
First let's generate a random set of weights to represent β.
beta = np.random.uniform(low= -1.0, high= 1.0, size=(K,))
beta
array([-0.21854798, 0.15267024, -0.07139126, 0.19223845, 0.99029314])
Notice how we applied a neat NumPy trick here. The numpy.random.random() function returns an array, yet we applied what appears to be a scalar operation on the vector. This is an example of what NumPy calls vectorization and broadcasting, which offers us both a very fast way to do run vector computations as well as a clean and concise method of coding.
Question: we designed the above beta vector such that E[βi]=0. How can we confirm that we did this correctly?
#start by taking the mean of the beta we already calculated
beta.mean()
0.20905252029504284
#It is likely the above is not equal to zero. Let's simulate this 100k times and see what the average mean is
sims = 100000
means = []
for i in range(sims):
means.append(2 * (np.random.random(K) - 0.5).mean())
Now let's use matplotlibs hist function to plot the histogram of means here.
plt.hist(means)
plt.show()
We should expect the distribution to be centered around zero. Is it? As fun technical side, let's dive a little deeper into what this distribution should look like. The histogram shows a distribution of the average of a sample of 5 uniformly distributed random variables taken over N different samples. Can we compare this to a theoretical distribution?
Yes we can! We sampled each βi from a uniform distribution over the interval [−1,1]. The variance of a sample of uniformly distributed variables is given by (1/12)∗(b−a)2, where b and a are the min/max of the support interval. The standard error (or the standard deviation of the mean) of a sample of size K with with Var(X)=σ2 is σ/√(K).
Given the above knowledge, we should expect our distribution of averages to be normally distributed with mean = 0 and var = (12∗5)−1∗(1−(−1))2=0.66667. Let's compare this normal distribution to our sample above.
#Compute a vector from the normal distribution specified above
from scipy.stats import norm
mu = 0
sig = np.sqrt(4 / 60.0)
xs = np.linspace(-1, 1, 1000)
ys = norm.pdf(xs, mu, sig)
plt.hist(means, density = True)
plt.plot(xs, ys)
plt.show()
Now let's write our scoring function. Let's try to use as much of Numpy's inner optimization as possible (hint, this can be done in two lines and without writing any loops). The key is that numpy functions that would normally take in a scalar can also take in an array, and the function applies the operations element wise to the array and returns an array. i.e.:
ex_array = np.array([-1, 1])
np.abs(ex_array)
array([1, 1])
Let's use this feature to write a fast and clean scoring function
def score_logistic_regression(X, beta):
'''
This function takes in an NxK matrix X and 1xK vector beta.
The function should apply the logistic scoring function to each record of X.
The output should be an Nx1 vector of scores
'''
#First let's calculate X*beta - make sure to use numpy's 'dot' method
xbeta = X.dot(beta)
#Now let's input this into the link function
prob_score = 1 / (1 + np.exp(-1 * xbeta))
return prob_score
So how much faster is it by using Numpy? We can test this be writing the same function that uses no Numpy and executes via loops.
def score_logistic_regression_NoNumpy(X, beta):
'''
This function takes in an NxK matrix X and 1xK vector beta.
The function should apply the logistic scoring function to each record of X.
The output should be an Nx1 vector of scores
'''
#Let's calculate xbeta using loops
xbeta = []
for row in X:
xb = 0
for i, el in enumerate(row):
xb += el * beta[i]
xbeta.append(xb)
#Now let's apply the link function to each xbeta
prob_score = []
for xb in xbeta:
prob_score.append(1 / (1 + np.exp(-1 * xb)))
return prob_score
Before doing any analysis, let's test the output of each to make sure they equal
diff = np.abs(score_logistic_regression_NoNumpy(X, beta) - score_logistic_regression(X, beta))
print(f'Mean Absolute Deviation = {np.round(diff.sum(), 8)}')
Mean Absolute Deviation = 0.0
If they equal then we can proceed with timing analysis
%timeit score_logistic_regression_NoNumpy(X, beta)
6.17 ms ± 55.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit score_logistic_regression(X, beta)
19.2 µs ± 476 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)