This notebook is focused on the basics of using MXNet in Julia to create a simple Multilayer Perceptron (MLP). This basic neural network building block is described more fully in Wikipedia.
We will be making predictions on the famous MNIST data set, which is a labeled set of individual handwritten digits from zero to nine. A brief description is available in Wikipedia. The data is available at http://yann.lecun.com/exdb/mnist/, where there are also descriptions of the effectiveness of the different approaches applied to this data and references to related papers. Instead of this source, however, I will be using the data from Kaggle as I have learned a lot from the competitions there and like to use it as an example.
I also intend to use this as a brief tutorial on looking at data using the Julia language. It assumes some basic knowledge of Julia, although someone who knows Python or R could probably figure it all out.
You can get the data from https://www.kaggle.com/c/digit-recognizer/data. The Kaggle website describes the files as follows:
The data files train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine.
Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.
The training data set, train.csv, has 785 columns. The first column, called "label", is the digit that was drawn by the user. The rest of the columns contain the pixel-values of the associated image.
Each pixel column in the training set has a name like pixelx, where x is an integer between 0 and 783, inclusive. To locate this pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located on row i and column j of a 28 x 28 matrix, (indexing by zero).
For example, pixel31 indicates the pixel that is in the fourth column from the left, and the second row from the top, as in the ascii-diagram below.
Visually, if we omit the "pixel" prefix, the pixels make up the image like this:
000 001 002 003 ... 026 027
028 029 030 031 ... 054 055
056 057 058 059 ... 082 083
| | | | ... | |
728 729 730 731 ... 754 755
756 757 758 759 ... 782 783
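This indexing rule is easy to check in Julia. Here is a quick sketch (the helper names `pixel_to_rowcol` and `rowcol_to_pixel` are my own, not part of the data set):

```julia
# Map a 0-based pixel index x to its 0-based (row, column) using x = i * 28 + j,
# and back again. Helper names here are just for illustration.
pixel_to_rowcol(x) = divrem(x, 28)      # returns (i, j)
rowcol_to_pixel(i, j) = i * 28 + j

pixel_to_rowcol(31)                     # (1, 3): second row, fourth column, as in the example
```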
The test data set, test.csv, is the same as the training set, except that it does not contain the "label" column.
Your submission file should be in the following format: For each of the 28000 images in the test set, output a single line with the digit you predict. For example, if you predict that the first image is of a 3, the second image is of a 7, and the third image is of an 8, then your submission file would look like:
3
7
8
(27997 more lines)
Of course, we can learn a lot of this by just looking at the data. The code below assumes you downloaded the csv files and put them in a folder named data.
using DataFrames, CSV
A few comments for those new to Julia. Most people start working with the REPL, where the contents of any line you type are immediately evaluated and returned. Working with Jupyter notebooks via IJulia makes it possible to render such results graphically, and in the case of DataFrames, they are rendered in a nice tabular format. In this case, by ending the line with a semicolon, I suppress the normal output, as I don't want to clutter the notebook with a table that has 42000 rows in it. I encourage you to download this notebook and experiment with it, removing the semicolon and seeing what the raw data looks like. Of course, you can use functions like first() and last(), as demonstrated in the Stats section.
I am also using the @time macro here, as I like to track how long different processes take and how much memory is involved. You can delete the leading @time from any line if you just want to execute the operation.
@time train = CSV.read("data/train.csv");
6.867796 seconds (15.87 M allocations: 751.361 MiB, 6.01% gc time)
first(train, 6)
|   | label | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | pixel9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
|   | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# to get the size/shape of the DataFrame
size(train)
(42000, 785)
# first element is label, rest is 28x28=784 pixel values, so to look at first row
reshape([train[1,col] for col in 2:785], 28, 28)
28×28 Array{Int64,2} (output truncated): mostly zeros, with nonzero pixel values (up to 255) clustered near the center of the image
There are several plotting libraries available for Julia. Plots.jl is one of my favorites as it is developing an excellent ecosystem that allows multiple backends including PyPlot which is a wrapper around Matplotlib and Plotly which is an interface to the plotly.js visualization library. Use StatPlots if you want to plot a DataFrame directly.
using Plots
# I like Plotly for interactivity.
plotly(legend=false)
# to use it on GitHub, you have to setup Plotly Online and inject their code into the notebook
# so just switch to PyPlot
pyplot()
┌ Info: Recompiling stale cache file /home/milton/.julia/compiled/v1.1/PyPlot/oatAj.ji for PyPlot [d330b81b-6aea-500a-939a-2ce795aea3ee] └ @ Base loading.jl:1184
Plots.PyPlotBackend()
# heatmap() plots a 2D array
@time heatmap(reshape([train[1,col] for col in 2:785], 28, 28), aspect_ratio=:equal)
12.920942 seconds (27.09 M allocations: 1.324 GiB, 7.05% gc time)
# to see it upright, use the rotl90() function to rotate left 90 degrees
# there are also rotr90() and rot180() functions
@time heatmap(rotl90(reshape([train[1,col] for col in 2:785], 28, 28)), aspect_ratio=:equal)
0.071421 seconds (110.27 k allocations: 5.517 MiB)
# we can use an array comprehension to plot the first sixteen rows
# the Plots library automatically resizes everything for you and adds the axes
plot([heatmap(rotl90(reshape([train[i,col] for col in 2:785], 28, 28)), aspect_ratio=:equal) for i=1:16]...)
Now that we have confirmed that the data really are digits arranged as described in the data description, let's take a quick look at some summary statistics just to make sure there is nothing strange.
# note that summary in Julia just gives basic information as it is supposed to work on ANY datatype
summary(train)
"42000×785 DataFrame"
# those who are used to R may be looking for the describe() function which is for DataFrames
describe(train)
|   | variable | mean | min | median | max | nunique | nmissing | eltype |
|---|---|---|---|---|---|---|---|---|
|   | Symbol | Float64 | Int64 | Float64 | Int64 | Nothing | Nothing | DataType |
| 1 | label | 4.45664 | 0 | 4.0 | 9 | | | Int64 |
| 2 | pixel0 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 3 | pixel1 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 4 | pixel2 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 5 | pixel3 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 6 | pixel4 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 7 | pixel5 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 8 | pixel6 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 9 | pixel7 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 10 | pixel8 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 11 | pixel9 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 12 | pixel10 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 13 | pixel11 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 14 | pixel12 | 0.003 | 0 | 0.0 | 116 | | | Int64 |
| 15 | pixel13 | 0.0111905 | 0 | 0.0 | 254 | | | Int64 |
| 16 | pixel14 | 0.00514286 | 0 | 0.0 | 216 | | | Int64 |
| 17 | pixel15 | 0.000214286 | 0 | 0.0 | 9 | | | Int64 |
| 18 | pixel16 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 19 | pixel17 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 20 | pixel18 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 21 | pixel19 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 22 | pixel20 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 23 | pixel21 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 24 | pixel22 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 25 | pixel23 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 26 | pixel24 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 27 | pixel25 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 28 | pixel26 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 29 | pixel27 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| 30 | pixel28 | 0.0 | 0 | 0.0 | 0 | | | Int64 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
# describe() on a single column gives the basic stats
# column 406 is pixel404, since column 1 is the label
describe(train[:,406])
Summary Stats:
Length:         42000
Missing Count:  0
Mean:           97.163476
Minimum:        0.000000
1st Quartile:   0.000000
Median:         35.000000
3rd Quartile:   231.000000
Maximum:        255.000000
Type:           Int64
# can also specify a column in a DataFrame with a Symbol
# Symbols in Julia are prefixed with a colon and are interned when created (:pixel404)
# I picked 404 because it is near the center of the 28x28 image
describe(train[!, :pixel404])
Summary Stats:
Length:         42000
Missing Count:  0
Mean:           97.163476
Minimum:        0.000000
1st Quartile:   0.000000
Median:         35.000000
3rd Quartile:   231.000000
Maximum:        255.000000
Type:           Int64
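If you want these statistics without the DataFrames machinery, they are easy to reproduce from the Statistics standard library. A minimal sketch (`summarize` is my own name, and `col` stands in for a column such as `train[!, :pixel404]`):

```julia
using Statistics

# Reproduce the per-column summary with stdlib functions; `col` stands in
# for any numeric column, e.g. train[!, :pixel404].
summarize(col) = (mean = mean(col), min = minimum(col),
                  q1 = quantile(col, 0.25), median = median(col),
                  q3 = quantile(col, 0.75), max = maximum(col))
```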
# see the distribution of values in that column
histogram(train[!, :pixel404], bins=20)
# a point a little more to the edge has a lot more zeros
histogram(train[!, :pixel397], bins=20)
# distribution of labels - seems fairly even
histogram(train[!, :label], ticks=collect(0:9))
# we can plot different subsets to eyeball variation
histogram([train[1:8400,:label] train[8401:16800,:label] train[16801:25200,:label] train[25201:33600,:label] train[33601:end,:label]], ticks=collect(0:9), legend=true,
label=["first fifth" "second fifth" "third fifth" "fourth fifth" "last fifth"], alpha=0.4)
If you want to check more rigorously, you could run a chi-square or Kolmogorov–Smirnov test to see whether the distribution is uniform. You can do these with the HypothesisTests package in Julia. Documentation is at https://juliastats.org/HypothesisTests.jl/stable/. The real focus of this notebook is the MLP, so let us move on and start setting it up.
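For intuition, here is a minimal sketch of the chi-square statistic against a uniform distribution in plain Julia, with no HypothesisTests dependency (`chisq_uniform` is my own name, and `labels` stands in for `train[!, :label]`):

```julia
# Chi-square statistic for digit counts against a uniform distribution;
# a large value suggests the labels are not evenly distributed.
function chisq_uniform(labels)
    counts = [count(==(d), labels) for d in 0:9]
    expected = length(labels) / 10
    sum((counts .- expected) .^ 2 ./ expected)
end
```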
? permutedims
search: permutedims permutedims! PermutedDimsArray
permutedims(A::AbstractArray, perm)

Permute the dimensions of array A. perm is a vector specifying a permutation of length ndims(A).

See also: PermutedDimsArray.
julia> A = reshape(Vector(1:8), (2,2,2))
2×2×2 Array{Int64,3}:
[:, :, 1] =
1 3
2 4
[:, :, 2] =
5 7
6 8
julia> permutedims(A, [3, 2, 1])
2×2×2 Array{Int64,3}:
[:, :, 1] =
1 3
5 7
[:, :, 2] =
2 4
6 8
permutedims(m::AbstractMatrix)

Permute the dimensions of the matrix m, by flipping the elements across the diagonal of the matrix. Differs from LinearAlgebra's transpose in that the operation is not recursive.
julia> a = [1 2; 3 4];
julia> b = [5 6; 7 8];
julia> c = [9 10; 11 12];
julia> d = [13 14; 15 16];
julia> X = [[a] [b]; [c] [d]]
2×2 Array{Array{Int64,2},2}:
[1 2; 3 4] [5 6; 7 8]
[9 10; 11 12] [13 14; 15 16]
julia> permutedims(X)
2×2 Array{Array{Int64,2},2}:
[1 2; 3 4] [9 10; 11 12]
[5 6; 7 8] [13 14; 15 16]
julia> transpose(X)
2×2 Transpose{Transpose{Int64,Array{Int64,2}},Array{Array{Int64,2},2}}:
[1 3; 2 4] [9 11; 10 12]
[5 7; 6 8] [13 15; 14 16]
permutedims(v::AbstractVector)

Reshape vector v into a 1 × length(v) row matrix. Differs from LinearAlgebra's transpose in that the operation is not recursive.
julia> permutedims([1, 2, 3, 4])
1×4 Array{Int64,2}:
1 2 3 4
julia> V = [[[1 2; 3 4]]; [[5 6; 7 8]]]
2-element Array{Array{Int64,2},1}:
[1 2; 3 4]
[5 6; 7 8]
julia> permutedims(V)
1×2 Array{Array{Int64,2},2}:
[1 2; 3 4] [5 6; 7 8]
julia> transpose(V)
1×2 Transpose{Transpose{Int64,Array{Int64,2}},Array{Array{Int64,2},1}}:
[1 3; 2 4] [5 7; 6 8]
@time a = convert(Matrix, train[:,2:end])
1.337898 seconds (6.89 k allocations: 502.799 MiB, 54.34% gc time)
42000×784 Array{Int64,2} (output truncated; mostly zeros)
@time transpose(a)
0.134896 seconds (7.85 k allocations: 453.027 KiB)
784×42000 LinearAlgebra.Transpose{Int64,Array{Int64,2}} (output truncated; mostly zeros)
transpose(a) == permutedims(a)
@time permutedims(a)
0.419683 seconds (6 allocations: 251.221 MiB, 1.94% gc time)
784×42000 Array{Int64,2} (output truncated; mostly zeros)
# need to split train into training and eval sets
# convert the DataFrame to an Array, remembering Julia is column-major like R and MATLAB (unlike Python)
X = permutedims(convert(Matrix, train[:,2:end]))
y = train[:,1]
size(X)
(784,42000)
N = size(X)[2]
42000
length(y)
42000
extrema(X)
(0,255)
using Statistics  # mean() lives in the Statistics standard library in Julia 1.x
mean(X)
33.408911169825075
# scale X to range 0-1
X = X./255;
# we need to split the train data into a training set (cv_X) and an eval set (eval_X)
split = 0.8
cv_X = X[:,1:floor(Int,split*N)]
eval_X = X[:,floor(Int,split*N)+1:N]
784×8400 Array{Float64,2} (output truncated; mostly zeros)
cv_y = y[1:floor(Int,split*N)]
eval_y = y[floor(Int,split*N)+1:N]
8400-element Array{Int64,1}: 0 7 7 2 2 6 5 7 8 5 3 0 2 ⋮ 0 5 3 1 9 6 4 0 1 7 6 9
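The slicing above generalizes to a small helper under the same column-major convention. A sketch (`train_eval_split` is my own name, not an MXNet function):

```julia
# Split a features-by-samples matrix X and label vector y into training and
# evaluation portions along the sample (column) dimension.
function train_eval_split(X, y; frac = 0.8)
    n = size(X, 2)
    cut = floor(Int, frac * n)
    (X[:, 1:cut], y[1:cut]), (X[:, cut+1:n], y[cut+1:n])
end
```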
Here we get to the heart of this notebook: setting up the problem in MXNet. Full documentation for the Julia MXNet.jl library is available at http://dmlc.ml/MXNet.jl/latest/. Here is the big picture:
Neural networks in MXNet are set up as Models. Presently there are only Feedforward models, which you define with the Symbolic API. The model is fed data through a Data Provider. The fit() (or train()) function trains a Model on a Data Provider with a chosen Optimizer, EvalMetric, and Initializer.
There is also the KVStore, a system that synchronizes data across different devices such as multiple CPUs or GPUs on the same or different machines. It automatically attempts to parallelize any operations that can be split, for faster performance. This happens under the hood, so you usually don't have to worry about it unless you have a complicated setup.
Let's see how this all works in a basic MLP.
using MXNet
batch_size = 1000
1000
MXNet sets up an AbstractDataProvider class that abstracts the data retrieval for the Executor from the KVStore. (I hope that is not too abstract.) There are a few predefined concrete classes, including one that specifically provides the MNIST dataset, appropriately named MNISTIter() or MNISTProvider. I want to demonstrate a more general case, so I will be using the ArrayDataProvider with an Array filled from the Kaggle file above. Documentation is at http://dmlc.ml/MXNet.jl/latest/api/io/#MXNet.mx.ArrayDataProvider, but briefly, the parameters are:
ArrayDataProvider(data[, label]; batch_size, shuffle)
shuffle is a Boolean indicating whether to reshuffle the data at each epoch of training.
Note that although the Executor can process in parallel, access to a DataProvider can NOT do so. Don't call a DataProvider from more than one place.
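To make the abstraction concrete, here is a rough plain-Julia sketch of what an ArrayDataProvider does with our column-major data (`minibatches` is my own illustrative function, not part of the MXNet API):

```julia
using Random

# Iterate (X, y) minibatches over the sample (column) dimension, optionally
# reshuffling the sample order, as a data provider would each epoch.
function minibatches(X, y; batch_size = 1000, shuffle = false)
    idx = shuffle ? randperm(size(X, 2)) : collect(1:size(X, 2))
    [(X[:, idx[i:min(i + batch_size - 1, end)]],
      y[idx[i:min(i + batch_size - 1, end)]])
     for i in 1:batch_size:length(idx)]
end
```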
# since this is a training set, we specify shuffle to help randomize the training
train_provider = mx.ArrayDataProvider(cv_X, cv_y, batch_size=batch_size, shuffle=true)
MXNet.mx.ArrayDataProvider(Array{Float32,N}[ Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]],Symbol[:data],Array{Float32,N}[ Float32[1.0 0.0 … 2.0 2.0]],Symbol[:softmax_label],1000,33600,true,0.0f0,0.0f0,MXNet.mx.NDArray[mx.NDArray{Float32}(784,1000)],MXNet.mx.NDArray[mx.NDArray{Float32}(1000,)])
# we don't want to randomize the eval set during training, so shuffle is false
eval_provider = mx.ArrayDataProvider(eval_X, eval_y, batch_size=batch_size, shuffle=false)
MXNet.mx.ArrayDataProvider(Array{Float32,N}[ Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]],Symbol[:data],Array{Float32,N}[ Float32[0.0 7.0 … 6.0 9.0]],Symbol[:softmax_label],1000,8400,false,0.0f0,0.0f0,MXNet.mx.NDArray[mx.NDArray{Float32}(784,1000)],MXNet.mx.NDArray[mx.NDArray{Float32}(1000,)])
At this point, we want to define the model by specifying the number of nodes in each layer and how they are connected to each other. Consider the paper: LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324. The article is available on ResearchGate at https://www.researchgate.net/profile/Yann_Lecun/publication/2985446_Gradient-based_learning_applied_to_document_recognition/links/0deec519dfa1983fc2000000/Gradient-based-learning-applied-to-document-recognition.pdf. They compared one-hidden-layer nets (28x28-300-10 and 28x28-1000-10), which had error rates of 4.7% and 4.5% on MNIST, to a two-layer net (28x28-300-100-10) with an error rate of 3.05%, and to their preference, a convolutional net with a rate of 0.95%. The advantage of the ConvNet is that it captures more spatial relations without having to make the network 5 layers deep. I look at the ConvNet in a separate notebook. For now, let's try creating the first two fully-connected nets.
For this network, we need to create a FullyConnected SymbolicNode for our hidden layer(s), which sets up all the weights and, if desired, a layer bias. We also need an Activation SymbolicNode so each node can do the job of separation. The Activation() function can specify an act_type of 'relu', 'sigmoid', 'softrelu', or 'tanh'. There are also specific SoftmaxActivation() and LeakyReLU() functions, the latter of which has options for act_type of 'elu', 'leaky', 'prelu', or 'rrelu'. These are all described in the Wikipedia article on Activation Functions.
Besides the hidden layer(s), we need input and output. The input is labeled with the Variable() function and will be taken from the DataProvider specified in the fit() call. We also need to create a SymbolicNode for the output layer. The MXNet API provides several choices, including LinearRegressionOutput, LogisticRegressionOutput, MAERegressionOutput (mean absolute error), and SoftmaxOutput. The layers are connected by specifying the appropriate input data for each function.
data = mx.Variable(:data)
fc1 = mx.FullyConnected(data, name=:fc1, num_hidden=300)
act1 = mx.Activation(fc1, name=:tanh1, act_type=:tanh)
fc2 = mx.FullyConnected(act1, name=:fc2, num_hidden=10)
mlp1 = mx.SoftmaxOutput(fc2, name=:softmax)
MXNet.mx.SymbolicNode(MXNet.mx.MX_SymbolHandle(Ptr{Void} @0x000000001962b220))
MXNet.jl has a to_graphviz() function to produce a .dot file from a model. Unfortunately, I don't have a way to directly display dot output in Jupyter, so we will save it as a dot file, convert it to png, and insert that into the notebook. Of course, if you are playing with this yourself, you can use a dot file viewer directly.
# save file - note that the `do` block automatically closes the filestream
open("mlp1graph.dot", "w") do fs
print(fs, mx.to_graphviz(mlp1))
end
# we can run the `dot` program to convert to png if it is installed on your computer
run(pipeline(`dot -Tpng mlp1graph.dot`, stdout="mlp1graph.png"))
We can display this file:
# change context to gpu(number) if you have a gpu and want to use that for processing
model = mx.FeedForward(mlp1, context=mx.cpu())
MXNet.mx.FeedForward(MXNet.mx.SymbolicNode(MXNet.mx.MX_SymbolHandle(Ptr{Void} @0x000000001962b220)),MXNet.mx.Context[CPU0],#undef,#undef,#undef)
Now that we have our model and data, we need to choose an Optimizer to do backprop. For more background on how backpropagation uses an error function to adjust network weights, check out the Wikipedia article. To learn about optimization, read the Wikipedia article on Stochastic Gradient Descent. It describes several variants, including AdaGrad, RMSProp, and Adam.
A lot of different optimizer options are available in MXNet. These include SGD (Stochastic Gradient Descent), AdaGrad (Adaptive Gradient), RMSProp (Root Mean Square Propagation), and ADAM (Adaptive Moment Estimation) with its variants AdaMax and Nadam (Nesterov Adam).
Documentation is available at http://dmlc.ml/MXNet.jl/latest/api/optimizer/#built-in-optimizers, and it includes references to the source papers.
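As a reference point, the classic SGD-with-momentum update that such an optimizer applies to each weight array looks roughly like this. This is a sketch of the textbook rule; MXNet's exact implementation details may differ:

```julia
# One SGD-with-momentum step on a weight array `w` with gradient `grad` and
# velocity buffer `v`; weight_decay adds a small L2 penalty to the gradient.
function sgd_momentum!(w, grad, v; lr = 0.1, momentum = 0.9, weight_decay = 1e-5)
    @. v = momentum * v - lr * (grad + weight_decay * w)
    @. w = w + v
    return w
end
```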
# we are going to use the basic Stochastic Gradient Descent optimizer with a fixed learning rate and momentum
optimizer = mx.SGD(lr=0.1, momentum=0.9, weight_decay=0.00001)
MXNet.mx.SGD(MXNet.mx.SGDOptions(0.1,0.9,0,1.0e-5,MXNet.mx.LearningRate.Fixed(0.1),MXNet.mx.Momentum.Fixed(0.9)),#undef)
At this point, we start training with the data providers, model, and optimizer set up above, using the fit() function. We can specify the number of epochs to run. We will start with just one epoch and look at the output that is automatically generated. The first fit() will set up all the memory structures for the model, initialize the weights, create the KVStore, and start training. This first fit() will also take longer than subsequent ones, as Julia does Just-In-Time compilation of code using LLVM. Subsequent calls will not have to be compiled.
Note that fit() also allows the choice of other evaluation metrics, other initializers, and setting up callback routines between epochs. Full documentation is available at http://dmlc.ml/MXNet.jl/latest/api/model/#MXNet.mx.fit-Tuple%7BMXNet.mx.FeedForward,MXNet.mx.AbstractOptimizer,MXNet.mx.AbstractDataProvider%7D. Here, I am just using the default metric (Accuracy()) and default initializer (UniformInitializer(0.01)). The latter will initialize the weights of the network to uniformly distributed random numbers less than 0.01.
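For intuition, the default Accuracy metric boils down to comparing the argmax of the 10 softmax outputs per sample with the label. A sketch in plain Julia (`accuracy` here is my own illustrative function, not the MXNet one):

```julia
using Statistics

# Fraction of samples whose highest-scoring class (1-based argmax shifted to
# the 0-9 digit range) matches the label; `probs` is a 10×N score matrix.
accuracy(probs, labels) =
    mean(argmax(probs[:, n]) - 1 == labels[n] for n in 1:length(labels))
```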
@time mx.fit(model, optimizer, train_provider, eval_data=eval_provider, n_epoch=1)
INFO: Start training on MXNet.mx.Context[CPU0]
INFO: Initializing parameters...
INFO: Creating KVStore...
INFO: TempSpace: Total 4 MB allocated on CPU0
INFO: Start training...
INFO: == Epoch 001/001 ==========
INFO: ## Training summary
INFO: accuracy = 0.6336
INFO: time = 11.8315 seconds
INFO: ## Validation summary
INFO: accuracy = 0.7843
INFO: Finish training on MXNet.mx.Context[CPU0]
32.429736 seconds (7.59 M allocations: 540.899 MB, 2.80% gc time)
I am taking a break in training here so we can take a quick peek at the initial internals of the network. Note that all the parameters of the model are stored as MXNet.NDArrays, which need to be copied into an ordinary Array to play with them. Although the model is defined earlier, no memory is assigned until the fit() function is executed, so if you try to access the arg_params of the model before that, you will get UndefRefError: access to undefined reference.
# note the input layer has 784*300 = 235,200 trainable weights stored in an NDArray. The output layer has 300*10 = 3,000 weights.
model.arg_params
Dict{Symbol,MXNet.mx.NDArray} with 4 entries:
  :fc1_weight => mx.NDArray{Float32}(784,300)
  :fc1_bias   => mx.NDArray{Float32}(300,)
  :fc2_weight => mx.NDArray{Float32}(300,10)
  :fc2_bias   => mx.NDArray{Float32}(10,)
# let's examine the weights connecting the hidden layer to the output layer
w2 = copy(model.arg_params[:fc2_weight])
300×10 Array{Float32,2} (output truncated): small values roughly in the range ±0.3
# let's get the extrema, mean, and standard deviation for each of the 10 output groups
[ [extrema(w2[:,i]) for i in 1:10] [mean(w2[:,i]) for i in 1:10] [std(w2[:,i]) for i in 1:10] ]
10×3 Array{Any,2}:
 (-0.280105,0.208179)   0.0080839     0.0957205
 (-0.267891,0.284985)   0.000389835   0.103438
 (-0.202828,0.246249)  -0.00487307    0.0891263
 (-0.212816,0.190532)   0.003161      0.0817025
 (-0.22252,0.225273)   -0.00620983    0.0842009
 (-0.202098,0.172084)   0.00473183    0.0636334
 (-0.238122,0.210844)  -0.00532095    0.0923733
 (-0.254803,0.263656)   0.00346573    0.0984543
 (-0.180616,0.145933)  -0.00335081    0.0618578
 (-0.194464,0.198503)  -0.000115971   0.0770083
# plot mean and std
plot([ [mean(w2[:,i]) for i in 1:10] [std(w2[:,i]) for i in 1:10] ], legend=true, label=["mean" "std"])
# plot each column as a 10x30 array
heatmap(reshape(w2[:,1], 10,30), aspect_ratio=:equal)
# switch to PyPlot because Plotly doesn't support clims attribute and I will want to compare later
pyplot()
Plots.PyPlotBackend()
plot([heatmap(reshape(w2[:,i], 10,30), aspect_ratio=:equal, clims=(-0.6,0.6)) for i=1:10]...)
Ok. Let's run it for a few epochs and watch what happens. Note that every time we call fit(), the default is to continue training the same model without resetting the weights.
@time mx.fit(model, optimizer, train_provider, eval_data=eval_provider, n_epoch=9)
INFO: Start training on MXNet.mx.Context[CPU0]
INFO: Initializing parameters...
INFO: Creating KVStore...
INFO: TempSpace: Total 4 MB allocated on CPU0
INFO: Start training...
INFO: == Epoch 001/009 ==========
INFO: ## Training summary
INFO: accuracy = 0.8643
INFO: time = 9.1088 seconds
INFO: ## Validation summary
INFO: accuracy = 0.8391
INFO: == Epoch 002/009 ==========
INFO: ## Training summary
INFO: accuracy = 0.9066
INFO: time = 8.3779 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9173
INFO: == Epoch 003/009 ==========
INFO: ## Training summary
INFO: accuracy = 0.9159
INFO: time = 8.7162 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9248
INFO: == Epoch 004/009 ==========
INFO: ## Training summary
INFO: accuracy = 0.9223
INFO: time = 8.3265 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9298
INFO: == Epoch 005/009 ==========
INFO: ## Training summary
INFO: accuracy = 0.9274
INFO: time = 8.3956 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9319
INFO: == Epoch 006/009 ==========
INFO: ## Training summary
INFO: accuracy = 0.9330
INFO: time = 8.5606 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9370
INFO: == Epoch 007/009 ==========
INFO: ## Training summary
INFO: accuracy = 0.9363
INFO: time = 9.2693 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9400
INFO: == Epoch 008/009 ==========
INFO: ## Training summary
INFO: accuracy = 0.9426
INFO: time = 9.4478 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9428
INFO: == Epoch 009/009 ==========
# let's examine how the weights in the hidden layer have changed after 9 more epochs of training (10 total)
w211 = copy(model.arg_params[:fc2_weight])
300×10 Array{Float32,2}:
 -0.18999  0.0523741  0.0521278  …  0.256539  0.0830147  -0.0566305
  ⋮                              ⋱
 (remaining rows of the weight display elided)
# extrema, mean, and standard deviation for each of the 10 groups
[ [extrema(w211[:,i]) for i in 1:10] [mean(w211[:,i]) for i in 1:10] [std(w211[:,i]) for i in 1:10] ]
10×3 Array{Any,2}:
 (-0.357168,0.36403)    0.00906962  0.144172
 (-0.291686,0.346591)   0.000505016 0.124114
 (-0.35688,0.567968)   -0.00167273  0.145727
 (-0.354625,0.29979)    0.0025032   0.126793
 (-0.335417,0.335986)  -0.0121043   0.133443
 (-0.61578,0.406012)    0.0105811   0.144681
 (-0.34189,0.273747)   -0.00809373  0.127449
 (-0.443309,0.390159)   0.00113069  0.145592
 (-0.423108,0.589517)  -0.00721869  0.125767
 (-0.339918,0.404306)   0.00526154  0.132497
# plot comparison to previous
plot([ [mean(w2[:,i]) for i in 1:10] [std(w2[:,i]) for i in 1:10] [mean(w211[:,i]) for i in 1:10] [std(w211[:,i]) for i in 1:10] ],
label=["mean 1" "std 1" "mean 10" "std 10"], legend=true)
plot([heatmap(reshape(w211[:,i], 10,30), aspect_ratio=:equal, clims=(-0.6,0.6)) for i=1:10]...)
# test on eval set
preds = mx.predict(model, eval_provider)
correct = 0
for i = 1:size(preds)[2]
if indmax(preds[:,i]) == eval_y[i]+1
correct += 1
end
end
correct/size(preds)[2]
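As an aside, the same accuracy computation can be written more compactly. This is a hedged sketch on toy stand-in data (the real `preds` and `eval_y` come from the cells above); it uses `argmax`, the name that replaced `indmax` in later Julia versions:

```julia
# Toy stand-ins for the notebook's `preds` (classes × samples probabilities)
# and `eval_y` (0-based labels) -- an assumption, not the real data.
preds = [0.1 0.7;
         0.6 0.2;
         0.3 0.1]          # 3 classes × 2 samples
eval_y = [1, 0]            # 0-based labels, as in the MNIST data

# argmax over each column gives the 1-based predicted class;
# comparing against eval_y[i] + 1 accounts for the 0-based labels
correct = count(i -> argmax(preds[:, i]) == eval_y[i] + 1, 1:size(preds, 2))
accuracy = correct / size(preds, 2)
```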
INFO: TempSpace: Total 1 MB allocated on CPU0
The accuracy already seems to be stabilizing around 95%, but we will run a few more epochs to see what happens.
@time mx.fit(model, optimizer, train_provider, eval_data=eval_provider, n_epoch=10)
INFO: Start training on MXNet.mx.Context[CPU0]
INFO: Initializing parameters...
INFO: Creating KVStore...
INFO: TempSpace: Total 4 MB allocated on CPU0
INFO: Start training...
INFO: == Epoch 001/010 ==========
INFO: ## Training summary
INFO: accuracy = 0.9494
INFO: time = 8.3934 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9477
INFO: == Epoch 002/010 ==========
INFO: ## Training summary
INFO: accuracy = 0.9522
INFO: time = 8.7644 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9524
INFO: == Epoch 003/010 ==========
INFO: ## Training summary
INFO: accuracy = 0.9553
INFO: time = 8.9129 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9527
INFO: == Epoch 004/010 ==========
INFO: ## Training summary
INFO: accuracy = 0.9586
INFO: time = 8.4194 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9558
INFO: == Epoch 005/010 ==========
INFO: ## Training summary
INFO: accuracy = 0.9613
INFO: time = 8.2062 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9571
INFO: == Epoch 006/010 ==========
INFO: ## Training summary
INFO: accuracy = 0.9637
INFO: time = 7.6110 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9586
INFO: == Epoch 007/010 ==========
INFO: ## Training summary
INFO: accuracy = 0.9666
INFO: time = 8.5968 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9579
INFO: == Epoch 008/010 ==========
INFO: ## Training summary
INFO: accuracy = 0.9676
INFO: time = 8.1803 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9600
INFO: == Epoch 009/010 ==========
INFO: ## Training summary
INFO: accuracy = 0.9694
INFO: time = 8.5670 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9596
INFO: == Epoch 010/010 ==========
INFO: ## Training summary
INFO: accuracy = 0.9716
INFO: time = 8.7106 seconds
INFO: ## Validation summary
93.436206 seconds (3.92 M allocations: 2.322 GB, 6.35% gc time)
INFO: accuracy = 0.9629
INFO: Finish training on MXNet.mx.Context[CPU0]
It looks like it might have improved to 97%, but that is on the eval set. Let's go ahead and make predictions with the model we have trained using the Kaggle test set and create a file that can be submitted on the website. Note that running a prediction is a lot faster than training.
@time test = readtable("data/test.csv");
26
size(test)
(28000,784)
test_X = transpose(Array(test))
784×28000 Array{Int64,2}:
 0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0
 ⋮                                   ⋱                                    ⋮
 (remaining rows of the mostly zero pixel display elided)
plot([heatmap(rotl90(reshape(Array(test_X[1:end,i]), 28, 28)), aspect_ratio=:equal) for i=1:16]...)
extrema(test_X)
(0,255)
mean(test_X)
33.3515450983965
# scale test_X to range 0-1
test_X = test_X./255;
test_provider = mx.ArrayDataProvider(test_X, batch_size=batch_size, shuffle=false)
MXNet.mx.ArrayDataProvider(Array{Float32,N}[ Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]],Symbol[:data],Array{Float32,N}[],Symbol[],1000,28000,false,0.0f0,0.0f0,MXNet.mx.NDArray[mx.NDArray{Float32}(784,1000)],MXNet.mx.NDArray[])
# this uses the previously trained model and the DataProvider specified above
@time tpreds = mx.predict(model, test_provider)
INFO: TempSpace: Total 1 MB allocated on CPU0
4.253657 seconds (20.44 k allocations: 86.709 MB, 10.64% gc time)
10×28000 Array{Float32,2}:
 3.32992f-5  0.998368  1.43607f-5  …  1.062f-6  2.4091f-5
 ⋮                                 ⋱
 (remaining rows of the prediction display elided)
# create submission
open("MLP1submission.csv", "w") do f
write(f, "ImageId,Label\n")
for i = 1:size(tpreds)[2]
write(f, string(i),",",string(indmax(tpreds[:,i])-1),"\n")
end
end
When I submit the resulting file to https://kaggle.com/c/digit-recognizer/submit, I get a score of 0.95200, which is pretty close to the error rate of 4.7% reported above for this network. The validation accuracy was just beginning to plateau, so it isn't clear whether we could get a little more out of training, but I'm more interested here in demonstrating how to create the two-layer network. We will duplicate the 300-100 net that got an error rate of 4.5%.
Here we define a new network with two hidden FullyConnected layers and tanh activations. We could use code similar to last time:
data = mx.Variable(:data)
fc1 = mx.FullyConnected(data, name=:fc1, num_hidden=300)
act1 = mx.Activation(fc1, name=:tanh1, act_type=:tanh)
fc2 = mx.FullyConnected(act1, name=:fc2, num_hidden=100)
act2 = mx.Activation(fc2, name=:tanh2, act_type=:tanh)
fc3 = mx.FullyConnected(act2, name=:fc3, num_hidden=10)
mlp2 = mx.SoftmaxOutput(fc3, name=:softmax)
but, to demonstrate a shortcut, we will use the MXNet chain macro to create this net using => connections.
# define new network
mlp2 = @mx.chain mx.Variable(:data) =>
mx.FullyConnected(name=:fc1, num_hidden=300) =>
mx.Activation(name=:tanh1, act_type=:tanh) =>
mx.FullyConnected(name=:fc2, num_hidden=100) =>
mx.Activation(name=:tanh2, act_type=:tanh) =>
mx.FullyConnected(name=:fc3, num_hidden=10) =>
mx.SoftmaxOutput(name=:softmax)
MXNet.mx.SymbolicNode(MXNet.mx.MX_SymbolHandle(Ptr{Void} @0x000000001cfa0a10))
# save file - note that the `do` block automatically closes the filestream
open("mlp2graph.dot", "w") do fs
print(fs, mx.to_graphviz(mlp2))
end
# if Graphviz is installed on your computer, we can run its `dot` program to convert the graph to a PNG
run(pipeline(`dot -Tpng mlp2graph.dot`, stdout="mlp2graph.png"))
We can display this file:
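One way to show the rendered graph inside an IJulia notebook is to hand the raw bytes to `display` with the PNG MIME type. This is a sketch, assuming the `dot` pipeline above succeeded and produced mlp2graph.png:

```julia
# Display the rendered network graph in the notebook output area.
# Assumes mlp2graph.png was created by the `dot` pipeline above.
display("image/png", read("mlp2graph.png"))
```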
# change context to gpu(number) if you have a gpu
model = mx.FeedForward(mlp2, context=mx.cpu())
MXNet.mx.FeedForward(MXNet.mx.SymbolicNode(MXNet.mx.MX_SymbolHandle(Ptr{Void} @0x000000001cfa0a10)),MXNet.mx.Context[CPU0],#undef,#undef,#undef)
@time mx.fit(model, optimizer, train_provider, eval_data=eval_provider, n_epoch=20)
INFO: Start training on MXNet.mx.Context[CPU0]
INFO: Initializing parameters...
INFO: Creating KVStore...
INFO: TempSpace: Total 4 MB allocated on CPU0
INFO: Start training...
INFO: == Epoch 001/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.1096
INFO: time = 11.8936 seconds
INFO: ## Validation summary
INFO: accuracy = 0.1056
INFO: == Epoch 002/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.3144
INFO: time = 12.0328 seconds
INFO: ## Validation summary
INFO: accuracy = 0.5613
INFO: == Epoch 003/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.7609
INFO: time = 12.0750 seconds
INFO: ## Validation summary
INFO: accuracy = 0.8576
INFO: == Epoch 004/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.8721
INFO: time = 11.7286 seconds
INFO: ## Validation summary
INFO: accuracy = 0.8989
INFO: == Epoch 005/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.8987
INFO: time = 12.1276 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9143
INFO: == Epoch 006/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.9137
INFO: time = 12.9629 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9227
INFO: == Epoch 007/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.9231
INFO: time = 12.6956 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9293
INFO: == Epoch 008/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.9311
INFO: time = 13.9586 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9364
INFO: == Epoch 009/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.9381
INFO: time = 14.6798 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9411
INFO: == Epoch 010/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.9443
INFO: time = 12.7957 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9454
INFO: == Epoch 011/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.9491
INFO: time = 12.6748 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9521
INFO: == Epoch 012/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.9544
INFO: time = 13.5925 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9556
INFO: == Epoch 013/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.9591
INFO: time = 13.5508 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9570
INFO: == Epoch 014/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.9620
INFO: time = 12.6269 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9554
INFO: == Epoch 015/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.9640
INFO: time = 12.5352 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9579
INFO: == Epoch 016/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.9683
INFO: time = 11.9030 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9607
INFO: == Epoch 017/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.9711
INFO: time = 13.4604 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9604
INFO: == Epoch 018/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.9729
INFO: time = 11.2488 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9611
INFO: == Epoch 019/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.9736
INFO: time = 13.9827 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9648
INFO: == Epoch 020/020 ==========
INFO: ## Training summary
INFO: accuracy = 0.9768
INFO: time = 13.0553 seconds
INFO: ## Validation summary
INFO: accuracy = 0.9644
INFO: Finish training on MXNet.mx.Context[CPU0]
282.506040 seconds (8.59 M allocations: 4.680 GB, 4.63% gc time)
After 20 epochs of training, we seem to be hitting a ceiling on validation accuracy. Let's use this newly trained model to predict on the test set and submit to Kaggle.
# this uses the recently trained model and the same DataProvider on the test set
@time tpreds = mx.predict(model, test_provider)
INFO: TempSpace: Total 1 MB allocated on CPU0
3.578575 seconds (5.62 k allocations: 86.121 MB, 0.41% gc time)
10×28000 Array{Float32,2}:
 3.17956f-5  0.999099  1.12871f-6  …  1.32057f-6  6.29556f-6
 ⋮                                 ⋱
 (remaining rows of the prediction display elided)
# create submission
open("MLP2submission.csv", "w") do f
write(f, "ImageId,Label\n")
for i = 1:size(tpreds)[2]
write(f, string(i),",",string(indmax(tpreds[:,i])-1),"\n")
end
end
When I submit this new file to Kaggle, I get a score of 0.95657, which is similar to the error of 4.5% from the LeCun paper. Creating MLPs is very easy with MXNet and there is even an MLP() constructor. The equivalent code for this is:
mlp2 = @mx.chain mx.Variable(:data) =>
mx.MLP([300, 100, 10]) =>
mx.SoftmaxOutput(name=:softmax)
Running this model for 20 epochs and submitting to Kaggle got me a better score of 0.96543. I'm not sure why yet, although the defaults for this constructor are different. For example, MLP() defaults to :relu activations, which have become more standard because they are faster to compute than the original :tanh activations I used. ReLUs do have a weakness, though: the kink at zero means the gradient is exactly zero for negative inputs, so units can stop learning, which is why variants like LeakyReLU have been developed, as mentioned above.
I hope this notebook has been helpful. Please leave a comment if you have suggestions for improvement.
Licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.