We show that a simple linear neuron can be "learned" with Gaussian elimination, and that doing so is much faster and more accurate than iterative training. (Of course, much of machine learning is non-linear.)
Our model of the universe is that we have an unknown 3-vector
$w = \left[ \begin{array}{c} w_1 \\ w_2 \\ w_3 \end{array} \right]$
that we wish to learn. We have three 3-vectors $x_1, x_2, x_3$ and the corresponding scalar values $y_1 = w \cdot x_1$, $\ y_2 = w \cdot x_2$, $\ y_3 = w \cdot x_3$. (Caution: the $x_i$ are 3-vectors, not components.) We will show that Gaussian elimination learns $w$ very quickly, while a standard deep learning approach (a version of gradient descent known as ADAM, currently considered among the best) can require many steps, may be inaccurate, and can be inconsistent from run to run.
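Stacking the three equations $y_i = w \cdot x_i$ as the rows of a matrix gives a $3\times 3$ linear system for the unknown $w$:
$\left[ \begin{array}{c} x_1^T \\ x_2^T \\ x_3^T \end{array} \right] w = \left[ \begin{array}{c} y_1 \\ y_2 \\ y_3 \end{array} \right], \qquad \text{i.e. } \ Xw = y.$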
One practical issue is how to organize the "x" data and the "y" data. The "x"s can be the columns or rows of a matrix, or they can be kept as a vector of vectors; many applications prefer the matrix approach. The "y"s can similarly be bundled into a vector. (A small sketch of the two "x" layouts follows; in this notebook we put the $x_i$ into the rows of a matrix.)
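For concreteness, here is a minimal sketch of the two layouts (the names X_rows and xs are illustrative only and are not used later):
x1, x2, x3 = rand(3), rand(3), rand(3)   # three sample 3-vectors
X_rows = [x1'; x2'; x3']                 # matrix layout: each x is a row of a 3×3 matrix
xs = [x1, x2, x3]                        # vector-of-vectors layout: xs[2] is the second data point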
w = rand(3) ## We are setting up a w. We will know it, but the learning algorithm will only have X and y data below.
3-element Array{Float64,1}:
 0.982331
 0.1774
 0.212845
# Here is the data. Each "x" is a 3-vector. Each "y" is a number.
n = 3
x1 = rand(3); y1=w ⋅ x1 # dot product (type \cdot then tab); on Julia ≥ 0.7 this needs "using LinearAlgebra"
x2 = rand(3); y2=w ⋅ x2
x3 = rand(3); y3=w ⋅ x3
# Gather the "x" data into the rows of a matrix and "y" into a vector
X=[x1 x2 x3]'
y=[y1; y2; y3]
3-element Array{Float64,1}:
 0.881336
 1.0557
 0.485883
# We check that the linear system for the "unknown" w is X*w = y
X*w-y
3-element Array{Float64,1}:
 0.0
 0.0
 0.0
## Recover w with Gaussian Elimination
X\y
3-element Array{Float64,1}:
 0.982331
 0.1774
 0.212845
w
3-element Array{Float64,1}:
 0.982331
 0.1774
 0.212845
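For a square system like this one, Julia's backslash performs Gaussian elimination (an LU factorization with partial pivoting) followed by triangular solves. A minimal sketch of the same computation done explicitly (the names F and w_ge are illustrative):
using LinearAlgebra
F = lu(X)      # LU factorization of X, i.e. Gaussian elimination with pivoting
w_ge = F \ y   # forward/back substitution with the triangular factors
w_ge ≈ w       # true, up to roundoff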
## Recover w with a machine learning package -- 18.06 students might just want to execute as a black box
using Flux
We now show how the same problem is commonly solved with a machine learning package. Notice that many learning cycles seem to be needed.
# t ... a model to be learned to fit the data
t = Dense(3,1)                   # a linear layer: t(x) = W*x .+ b
loss(x,y) = Flux.mse(t(x),y)     # mean squared error between prediction and data
opt = ADAM(Flux.params(t)[1:1])  # optimize only the first parameter (the weight matrix W); older Flux API where the optimizer takes the params
Flux.train!(loss, Iterators.repeated( (X',y'), 20000), opt) # 20000 steps of training
println((t.W).data, " : <== estimate after training")
[0.982331 0.1774 0.212845] : <== estimate after training
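To quantify the accuracy, one can compare the trained weights with the true $w$ and with the Gaussian-elimination answer. A minimal sketch (the name w_est is illustrative; on recent Flux versions the weights may live in t.weight rather than t.W.data):
w_est = vec((t.W).data)   # trained weights as a 3-vector
norm(w_est - w)           # error of the trained estimate
norm(X\y - w)             # error of the Gaussian-elimination estimate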
## Adding more data does not help a whole lot
n = 3000
X = randn(n,3)
y = X*w
t = Dense(3,1)
loss(x,y) = Flux.mse(t(x),y)
opt = ADAM(Flux.params(t)[1:1])
Flux.train!(loss, Iterators.repeated( (X',y'), 2000), opt) # 2000 steps of training
println((t.W).data, " : <== estimate after training")
[0.948837 0.17883 0.218774] : <== estimate after training
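By contrast, a single linear-algebra solve on the same 3000×3 data (now a least-squares problem, handled by the same backslash) recovers $w$ essentially to machine precision. A minimal sketch, assuming the X, y, and w defined above are still in scope:
@time w_ls = X \ y   # one least-squares solve on the 3000×3 data
norm(w_ls - w)       # essentially zero (roundoff level)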