Here's a typical linear regression problem. We're trying to predict the prices of individual houses, and we're given three pieces of information about each house, three features:
floor area
distance to public transport
number of rooms
And we're going to represent our features in a matrix with as many rows as we have houses and three columns, one for each of our input features. We'll call that matrix $X$. And we're trying to predict this vector $Y$, which represents the housing prices.
# getting numpy setup
import numpy as np
# setting up training data for X and Y
X_train = np.array([
    [1250, 350, 3],
    [1700, 900, 6],
    [1400, 600, 3]
])
Y_train = np.array([345000, 580000, 360000])
Next we have to consider our model. The model is the set of functions that we're going to consider for mapping $X$ to $Y$.
So the parameters of this model will be the three weights that correspond to each feature, and the intercept:
weights = np.array([300, -10, -1])
intercept = -26497
And the key operation of this model will be multiplying the matrix $X$ by the weight vector. Then we'll add the intercept to each element of the result to get our final prediction.
def model(X, weights, intercept):
    # multiply the features by the weights, then add the intercept to every prediction
    return X.dot(weights) + intercept
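As a quick sanity check, here's roughly what this model predicts for our three training houses with the example weights and intercept above; the numbers just follow from the arithmetic:
# predictions for the training houses with the example parameters above
Y_hat = model(X_train, weights, intercept)
# roughly [345000, 474497, 387500] -- the first house happens to be matched exactly
print(Y_hat)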
Now the next ingredient we'll need is a cost function, also called a loss function. We need this to measure how good or bad a set of parameters is: how close our predictions are getting to the actual values. Here we'll use the sum of squared errors.
def cost(Y_hat, Y):
    # sum of squared errors between predictions and actual prices
    return np.sum((Y_hat - Y)**2)
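For instance, feeding the predictions from the example parameters into this cost function gives the total squared error over our three training houses:
# total squared error of the example parameters on the training data
print(cost(model(X_train, weights, intercept), Y_train))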
Now we need to actually find the parameters that give us the best fit. In other words, holding $X$ and $Y$ constant, we'll adjust our parameters to minimize the cost. Each set of parameters will yield a cost, so we can plot cost against parameter values. Our goal in optimization is to find the parameters that correspond to that lowest point.
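To make that picture concrete, here's a minimal sketch that sweeps just the floor-area weight over a range of candidate values (the range is arbitrary), holding the other parameters fixed at the example values above, and records the cost at each point; plotting cost against these values would trace out the curve whose lowest point we're after.
# sweep the floor-area weight, holding the other parameters fixed,
# and record the cost at each candidate value
w0_values = np.linspace(0, 600, 61)
costs = [cost(model(X_train, np.array([w0, -10, -1]), intercept), Y_train)
         for w0 in w0_values]
# the candidate with the lowest cost is the best floor-area weight
# (with the other parameters held fixed)
best_w0 = w0_values[np.argmin(costs)]
Of course, sweeping every combination of parameters like this doesn't scale, which is why we need a smarter way of moving downhill.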
And we're going to do that by trial and error. By this I don't mean just trying random sets of parameters and seeing what works best, but the trial and error you do when you're, say, practising how to shoot hoops and you're trying to adjust your angle. So you shoot, and you miss by a couple inches. You're too far to the right. So you adjust your angle to the left and try again.
That's what we're going to do here too. We'll try a set of parameters, calculate our cost, and then follow the gradient of the cost curve at that point down towards the minimum. This process is called gradient descent.
So we need to be able to calculate the gradient of the cost, $\epsilon$, with respect to each of the weights and the intercept:
$$ \frac{\partial\epsilon}{\partial w_i} = \frac{d\epsilon}{d\hat{y}}\frac{\partial\hat{y}}{\partial w_i} \qquad \text{(chain rule)} $$
Applying the chain rule, we can break that up into two pieces: the gradient of the cost with respect to $\hat{y}$ (i.e. the predicted $y$), and the gradient of $\hat{y}$ with respect to the weight. So let's calculate those:
$$ \hat{y} = w_0x_0 + w_1x_1 + w_2x_2 + b \\ \frac{\partial\hat{y}}{\partial w_0} = x_0 $$
The gradient of $\hat{y}$ with respect to $w_0$ is pretty simple. All the other terms are constant with respect to $w_0$, so they go to zero and we're left with $x_0$.
$$ \epsilon = (y-\hat{y})^2 \\ \frac{d\epsilon}{d\hat{y}} = -2(y-\hat{y}) $$
For the other piece, the gradient of the cost with respect to $\hat{y}$, we bring down the power and then apply the chain rule again to bring out that negative sign.
$$ \frac{\partial\hat{y}}{\partial w_0} = x_0 \\ \frac{d\epsilon}{d\hat{y}} = -2(y-\hat{y}) \\ \frac{\partial\epsilon}{\partial w_0} = -2(y-\hat{y})x_0 $$
So to get our desired gradient, we multiply those together to get this expression. And the same goes for all the other weights.
$$ \hat{y} = w_0x_0 + w_1x_1 + w_2x_2 + b\cdot 1 \\ \frac{\partial\epsilon}{\partial b} = -2(y-\hat{y})\cdot 1 $$
As for the intercept $b$, we can consider that a special weight where the $x$ it corresponds to is always 1. So that's the form the gradient will take with respect to $b$.
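Before using these formulas, it's worth double-checking them numerically. Here's a minimal sketch that compares the analytic gradient $\frac{\partial\epsilon}{\partial w_0} = -2(y-\hat{y})x_0$ against a finite-difference estimate on the second training example (the choice of example and step size are arbitrary); the two numbers should agree closely.
# gradient check for w_0 on a single training example
x, y = X_train[1], Y_train[1]

# analytic gradient from the derivation above
y_hat = x.dot(weights) + intercept
analytic = -2 * (y - y_hat) * x[0]

# finite-difference estimate: nudge w_0 up and down and measure the change in cost
eps = 1e-4
w_plus = weights.astype(float)
w_plus[0] += eps
w_minus = weights.astype(float)
w_minus[0] -= eps
numerical = (((y - (x.dot(w_plus) + intercept))**2 -
              (y - (x.dot(w_minus) + intercept))**2) / (2 * eps))
# the two values should match to within floating-point error
print(analytic, numerical)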
Putting all that together, here's how training goes. Whatever our current weights and intercept are, we calculate our prediction, calculate our error, calculate the gradients, and update our parameters by gradient descent.
def training_round(x, y, weights, intercept, alpha=0.05):
    # calculate our prediction
    y_hat = model(x, weights, intercept)
    # calculate error
    delta_y = y - y_hat
    # calculate gradients
    gradient_weights = -2 * delta_y * x
    gradient_intercept = -2 * delta_y
    # update parameters
    weights = weights - alpha * gradient_weights
    intercept = intercept - alpha * gradient_intercept
    return weights, intercept
That was a single round of training. The entire training process involves initializing our parameters and then doing some number of training rounds over the data. Whatever the weights and intercept are at the end, those are what we'll use to predict with.
NUM_EPOCHS = 10

def train(X, Y):
    # initialize parameters
    weights = np.random.randn(3)
    intercept = 0
    # training rounds
    for i in range(NUM_EPOCHS):
        for (x, y) in zip(X, Y):
            weights, intercept = training_round(x, y, weights, intercept)
    # return weights and intercept
    return weights, intercept
And testing is simple: we get our predictions and figure out how far off we were on average, i.e. the root-mean-squared error.
def test(X_test, Y_test, weights, intercept):
    Y_predicted = model(X_test, weights, intercept)
    # root-mean-squared error: mean of the squared errors, then the square root
    return np.sqrt(cost(Y_predicted, Y_test) / len(Y_test))
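Putting it all together, here's one way we might run the whole thing on our toy dataset. One caveat: with raw feature values this large, the default learning rate of 0.05 will make the updates blow up, so this sketch scales each feature column down to roughly the 0–1 range first (that scaling step is an illustrative assumption, not something derived above). And since we only have three houses, it tests on the training data itself; normally you'd hold out a separate test set.
# end-to-end run on the toy dataset
# scale each feature column to roughly 0-1 so the default learning rate is stable
X_scaled = X_train / X_train.max(axis=0)
trained_weights, trained_intercept = train(X_scaled, Y_train)
# with only three houses we test on the training data itself
rmse = test(X_scaled, Y_train, trained_weights, trained_intercept)
print(trained_weights, trained_intercept, rmse)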