In this post, we will cover the basic concept of Multi-variable Linear Regression. Unlike Simple Linear Regression, Multi-variable Linear Regression has several independent variables, so its hypothesis is different from the one we saw in previous posts.
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 8)
plt.style.use('seaborn')
In Simple Linear Regression, we expressed the hypothesis like this,
$$ H(x) = W x + b $$But most real-world problems involve several variables. For the case with 3 variables, the hypothesis can be expanded like this,
$$ H(x_1, x_2, x_3) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b $$The cost function for this hypothesis also changes accordingly, since it now uses the expanded hypothesis,
$$ \text{cost}(W, b) = \frac{1}{m} \sum_{i=1}^{m} (H(x_{1i}, x_{2i}, x_{3i}) - y_i)^2 $$In general, if we have $n$ variables for regression, the hypothesis will be,
$$ H(x_1, x_2, x_3, \dots, x_n) = w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_n x_n + b $$Writing the hypothesis out term by term becomes unwieldy as $n$ grows. Instead, we can express it in matrix multiplication form. Suppose $X$ is the vector of the $x$'s and $W$ is the vector of the $w$'s; then the hypothesis in matrix form looks like this,
$$ H(X) = W \cdot X = \begin{bmatrix} w_1 & w_2 & \dots & w_n \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n $$Or we can reverse the order of $W$ and $X$ (the result is the same),
$$ H(X) = X \cdot W = \begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n $$Note: we omit the bias term ($b$) for simplicity. If you want to express the hypothesis with the bias included, append a constant $1$ to $X$ and the bias $b$ to $W$, so each shape grows by 1,$$ H(X) = X \cdot W = \begin{bmatrix} x_1 & x_2 & \dots & x_n & 1 \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \\ b \end{bmatrix} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b $$
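As a small aside (not part of the original code), here is a NumPy sketch of that bias trick; the numbers are made up purely for illustration.
# Bias trick: append a constant 1 to each input row and the bias b to the
# weight vector, so that X_aug @ W_aug equals X @ W + b.
X_demo = np.array([[73., 80., 75.],
                   [93., 88., 93.]])                 # 2 samples, 3 features
W_demo = np.array([[1.], [2.], [3.]])                # weights as a column vector, shape (3, 1)
b_demo = 10.
X_aug = np.hstack([X_demo, np.ones((2, 1))])         # append a constant 1 column -> shape (2, 4)
W_aug = np.vstack([W_demo, [[b_demo]]])              # append the bias -> shape (4, 1)
print(np.allclose(X_aug @ W_aug, X_demo @ W_demo + b_demo))   # True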
One advantage of the matrix form is parallelization. In the previous example, we expressed the formula with just one row of $X$. What if $X$ has many rows? The same formula extends naturally,
$$ H(X) = X \cdot W = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1n} \\ x_{21} & x_{22} & \dots & x_{2n} \end{bmatrix} \cdot \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} = \begin{bmatrix} w_1 x_{11} + w_2 x_{12} + \dots + w_n x_{1n} \\ w_1 x_{21} + w_2 x_{22} + \dots + w_n x_{2n} \end{bmatrix}$$We can also expand the dimension of the weight term, meaning that $W$ becomes a matrix with several columns. With matrix multiplication, we don't need to expand it manually, and GPUs (Graphics Processing Units) are especially well suited to computing matrix multiplications thanks to their architecture.
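As a minimal sketch of this batched form (the values are arbitrary and only illustrate the shapes, not the dataset used below), a single tf.matmul computes the predictions for every row at once:
X_batch = tf.constant([[1., 2., 3.],
                       [4., 5., 6.]])        # 2 samples, 3 features -> shape (2, 3)
W_col = tf.constant([[0.1], [0.2], [0.3]])   # weight column vector, shape (3, 1)
H_batch = tf.matmul(X_batch, W_col)          # one matmul evaluates all rows in parallel
print(H_batch.shape)                         # (2, 1)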
Suppose we have three independent variables. Then, the hypothesis will be
$$ H(x_1, x_2, x_3) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b $$We prepare the dataset, initialize the weights randomly, and define the hypothesis in TensorFlow.
# Data
x1 = [73., 93., 89., 96., 73.]
x2 = [80., 88., 91., 98., 66.]
x3 = [75., 93., 90., 100., 70.]
y = [152., 185., 180., 196., 142.]
# random weights
w1 = tf.Variable(tf.random.normal([1]))
w2 = tf.Variable(tf.random.normal([1]))
w3 = tf.Variable(tf.random.normal([1]))
b = tf.Variable(tf.random.normal([1]))
# Hypothesis
h = w1 * x1 + w2 * x2 + w3 * x3 + b
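Before training, we can evaluate the cost once as a sanity check; this snippet is optional and its value depends on the random initialization.
cost = tf.reduce_mean(tf.square(h - y))   # mean squared error of the initial hypothesis
print(cost.numpy())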
We can build the training process with Gradient Descent.
learning_rate = 0.00001
for e in range(1000):
    # Record the gradient history of the cost function
    with tf.GradientTape() as tape:
        h = w1 * x1 + w2 * x2 + w3 * x3 + b
        cost = tf.reduce_mean(tf.square(h - y))
    # Calculate the gradient of each weight
    w1_grad, w2_grad, w3_grad, b_grad = tape.gradient(cost, [w1, w2, w3, b])
    # update the weight
    w1.assign_sub(learning_rate * w1_grad)
    w2.assign_sub(learning_rate * w2_grad)
    w3.assign_sub(learning_rate * w3_grad)
    b.assign_sub(learning_rate * b_grad)
    if e % 100 == 0:
        print("epoch: {:5} | cost: {:12.4f}".format(e, cost.numpy()))
epoch:     0 | cost:    7312.8384
epoch:   100 | cost:      35.9158
epoch:   200 | cost:      34.0776
epoch:   300 | cost:      32.3360
epoch:   400 | cost:      30.6861
epoch:   500 | cost:      29.1231
epoch:   600 | cost:      27.6422
epoch:   700 | cost:      26.2393
epoch:   800 | cost:      24.9103
epoch:   900 | cost:      23.6510
We can also express this in matrix multiplication form. To do this, we merge the dataset into one NumPy array.
data = np.array([
[73., 80., 75., 152.],
[93., 88., 93., 185.],
[89., 91., 90., 180.],
[96., 98., 100., 196.],
[73., 66., 70., 142. ]
], dtype=np.float32)
X = data[:, :-1]
y = data[:, [-1]]
W = tf.Variable(tf.random.normal(shape=[X.shape[1], 1]))
b = tf.Variable(tf.random.normal(shape=[1]))
# Replace hypothesis with predict function
def predict(X):
    return tf.matmul(X, W) + b
Note: it may look confusing to generate y by slicing with [-1]. We index with a list ([-1]) instead of a plain -1 because it keeps the matrix (column) shape,
data[:, -1].shape
(5,)
data[:, [-1]].shape
(5, 1)
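Equivalent ways to keep the column shape (just a NumPy aside, not part of the original code) are a plain slice or an explicit reshape,
data[:, -1:].shape                  # (5, 1) -- slicing with -1: keeps the last axis
data[:, -1].reshape(-1, 1).shape    # (5, 1) -- explicit reshape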
The same learning process with gradient descent is applied,
learning_rate = 0.00001
for e in range(2000):
    # Record the gradient history of the cost function
    with tf.GradientTape() as tape:
        cost = tf.reduce_mean(tf.square(predict(X) - y))
    # Calculate the gradient of each weight
    W_grad, b_grad = tape.gradient(cost, [W, b])
    # update the weight
    W.assign_sub(learning_rate * W_grad)
    b.assign_sub(learning_rate * b_grad)
    if e % 100 == 0:
        print("epoch: {:5} | cost: {:12.4f}".format(e, cost.numpy()))
epoch:     0 | cost:  103242.9219
epoch:   100 | cost:       1.8692
epoch:   200 | cost:       1.8591
epoch:   300 | cost:       1.8491
epoch:   400 | cost:       1.8392
epoch:   500 | cost:       1.8294
epoch:   600 | cost:       1.8198
epoch:   700 | cost:       1.8103
epoch:   800 | cost:       1.8009
epoch:   900 | cost:       1.7917
epoch:  1000 | cost:       1.7825
epoch:  1100 | cost:       1.7735
epoch:  1200 | cost:       1.7645
epoch:  1300 | cost:       1.7557
epoch:  1400 | cost:       1.7469
epoch:  1500 | cost:       1.7382
epoch:  1600 | cost:       1.7296
epoch:  1700 | cost:       1.7211
epoch:  1800 | cost:       1.7127
epoch:  1900 | cost:       1.7044
As you can see from the result, the cost decreases significantly within the first 100 epochs. You can also see the advantage of matrix multiplication: we no longer need to define each weight manually. ($w_1, w_2, w_3 \to W$)
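As a usage sketch, we can also feed a new sample through the trained predict function; the feature values below are hypothetical, and the exact output depends on the random initialization and training run.
new_X = np.array([[90., 88., 93.]], dtype=np.float32)   # one hypothetical sample
print(predict(new_X).numpy())                           # predicted value, shape (1, 1)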
In this post, we extended linear regression from a single variable to multiple variables. Expressing the hypothesis with matrix multiplication notation also helps us run gradient descent efficiently.