Statistics with Julia from the ground up

A JuliaCon 2021 workshop by Yoni Nazarathy

Many of the code examples for this workshop are adapted from Statistics with Julia: Fundamentals for Data Science, Machine Learning and Artificial Intelligence by Yoni Nazarathy and Hayden Klok.

Also related (and recommended in this JuliaCon):


Before we start

The tutorial was developed and tested under Julia 1.6.0. It is best to run it with the Project.toml and Manifest.toml files present in the working directory of the notebook. It also uses the following data files:

In [1]:
readdir("./data")
Out[1]:
8-element Vector{String}:
 "L1L2data.csv"
 "fertilizer.csv"
 "machine1.csv"
 "machine2.csv"
 "machine3.csv"
 "purchaseData.csv"
 "temperatures.csv"
 "weightHeight.csv"

You will find all these files and this notebook in the GitHub repo for this workshop. You can either "clone" the repo or download a zip file.

We use a Jupyter notebook. Here is a quick reference sheet. There are many other resources for Jupyter on the web as well; most use Python rather than Julia, but Jupyter itself works the same. By the way, the "J" in "Jupyter" stands for Julia.

Load the packages we will use:

In [2]:
using Pkg
Pkg.activate(".")
Pkg.instantiate()
  Activating environment at `~/git/mine/JuliaCon2021-StatisticsWithJuliaFromTheGroundUp/Project.toml`
In [3]:
Pkg.status()
      Status `~/git/mine/JuliaCon2021-StatisticsWithJuliaFromTheGroundUp/Project.toml`
  [336ed68f] CSV v0.8.5
  [aaaa29a8] Clustering v0.14.2
  [861a8166] Combinatorics v1.0.2
  [a93c6f00] DataFrames v1.2.0
  [31c24e10] Distributions v0.25.11
  [587475ba] Flux v0.12.5
  [38e38edf] GLM v1.5.1
  [09f84164] HypothesisTests v0.10.4
  [5ab0869b] KernelDensity v0.6.3
  [b964fa9f] LaTeXStrings v1.2.1
  [b4fcebef] Lasso v0.6.2
  [eb30cadb] MLDatasets v0.5.7
  [442fdcdd] Measures v0.3.1
  [dbeba491] Metalhead v0.5.3
  [6f286f6a] MultivariateStats v0.8.0
  [91a5bcdd] Plots v1.19.2
  [ce6b1742] RDatasets v0.7.5
  [f2b01f46] Roots v1.0.10
  [276daf66] SpecialFunctions v1.5.1
  [2913bbd2] StatsBase v0.33.8
  [f3b207a7] StatsPlots v0.14.25
In [4]:
using Random, Statistics, LinearAlgebra, Dates #Shipped with Julia
using Distributions, StatsBase #Core statistics
using CSV, DataFrames #Basic Data
using Plots, StatsPlots, LaTeXStrings, Measures #Plotting and Output
using HypothesisTests, KernelDensity, GLM, Lasso, Clustering, MultivariateStats #Statistical/ML methods
using Flux, Metalhead #Deep learning 
using Combinatorics, SpecialFunctions, Roots #Mathematical misc.
using RDatasets, MLDatasets #Example datasets
#uncomment if using R:  using RCall #Interface with R
In [5]:
# We run this before many examples for reproducibility
fix_seed!() = Random.seed!(0)
Out[5]:
fix_seed! (generic function with 1 method)


Why Julia?


Julia Curve

Some ways to run Julia

  • REPL
    • As an application
    • Out of your shell
    • As part of an IDE
  • Jupyter (IJulia)
    • In your web browser
    • Jupyter Lab
  • Google Colab
  • Pluto
  • Visual Studio Code
  • Legacy: Atom (Juno)
  • JuliaHub
  • In RMarkdown with IJulia (e.g. in RStudio)
  • ...

Key resources (My favorites)


What do you mean?


In [6]:
fix_seed!()
data = rand(Normal(),5)
Out[6]:
5-element Vector{Float64}:
  0.6791074260357777
  0.8284134829000359
 -0.3530074003005963
 -0.13485387193052173
  0.5866170746331097
In [7]:
n = length(data)
Out[7]:
5
In [8]:
sum(data)/n
Out[8]:
0.3212553422675611
In [9]:
+(data...)/n
Out[9]:
0.3212553422675611
In [ ]:
? +
In [11]:
"""
This is a function that takes in data and returns its sum.
"""
function my_sum(data)
    s = 0.0
    for d in data
        s += d
    end
    return s
end
Out[11]:
my_sum
In [12]:
? my_sum
search: my_sum

Out[12]:

This is a function that takes in data and returns its sum.

In [13]:
my_sum(data)/length(data)
Out[13]:
0.3212553422675611
In [14]:
mean(data)
Out[14]:
0.3212553422675611
In [15]:
data'*ones(n)/n
Out[15]:
0.3212553422675611
In [16]:
dot(data,ones(n))/n
Out[16]:
0.3212553422675611
In [ ]:
? dot

Doing it a little differently with the "running mean" formula

$$ \overline{X}_i = \frac{1}{i} X_i + \frac{i-1}{i} \overline{X}_{i-1}. $$
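This recursion is just the sample mean with the last observation split off:

$$ \overline{X}_i = \frac{1}{i}\sum_{j=1}^{i} X_j = \frac{1}{i} X_i + \frac{i-1}{i}\cdot\frac{1}{i-1}\sum_{j=1}^{i-1} X_j = \frac{1}{i} X_i + \frac{i-1}{i}\,\overline{X}_{i-1}. $$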
In [17]:
mn = 0
for i in 1:length(data)
    global mn = (1/i)*data[i] + (i-1)/i*mn
end
mn
Out[17]:
0.32125534226756114
In [18]:
mn = 0
for (i,d) in enumerate(data)
    global mn = (1/i)*d + (i-1)/i*mn #Note that in Jupyter `global` isn't needed here but in the REPL it is.
end
mn
Out[18]:
0.32125534226756114
In [19]:
function my_mean(data)
    mn = 0
    for (i,d) in enumerate(data)
        mn = (1/i)*d + (i-1)/i*mn
    end
    return mn
end
Out[19]:
my_mean (generic function with 1 method)
In [20]:
my_mean(data)
Out[20]:
0.32125534226756114
In [21]:
fix_seed!()
data = rand(Normal(),n) + im*rand(Normal(),n)
Out[21]:
5-element Vector{ComplexF64}:
   0.6791074260357777 + 0.29733585084941616im
   0.8284134829000359 + 0.06494754854834232im
  -0.3530074003005963 - 0.10901738508171745im
 -0.13485387193052173 - 0.514210390833322im
   0.5866170746331097 + 1.5743302021369892im
In [22]:
mean(data)
Out[22]:
0.3212553422675611 + 0.2626771651239416im
In [ ]:
methods(mean)
In [23]:
@which mean(data)
In [24]:
methods(my_mean)
Out[24]:
# 1 method for generic function my_mean:
  • my_mean(data) in Main at In[19]:1
In [25]:
function my_mean(data::Vector{Float64})::Float64
    mn = 0.0
    for (i,d) in enumerate(data)
        mn = (1/i)*d + (i-1)/i*mn
    end
    return mn
end
Out[25]:
my_mean (generic function with 2 methods)
In [26]:
methods(my_mean)
Out[26]:
# 2 methods for generic function my_mean:
  • my_mean(data::Vector{Float64}) in Main at In[25]:1
  • my_mean(data) in Main at In[19]:1
In [27]:
@which my_mean(data)
Out[27]:
my_mean(data) in Main at In[19]:1
In [28]:
@which my_mean([1,2])
Out[28]:
my_mean(data) in Main at In[19]:1
In [29]:
@which my_mean([1.0,2.0])
Out[29]:
my_mean(data::Vector{Float64}) in Main at In[25]:1
In [30]:
@which my_mean([1.0,2])
Out[30]:
my_mean(data::Vector{Float64}) in Main at In[25]:1
In [31]:
function my_mean(data::Vector{T})::T where T
    mn = zero(data[begin])
    for (i,d) in enumerate(data)
        mn = (1/i)*d + (i-1)/i*mn
    end
    return mn
end
Out[31]:
my_mean (generic function with 3 methods)
In [32]:
methods(my_mean)
Out[32]:
# 3 methods for generic function my_mean:
  • my_mean(data::Vector{Float64}) in Main at In[25]:1
  • my_mean(data::Vector{T}) where T in Main at In[31]:1
  • my_mean(data) in Main at In[19]:1

The mean of other types

In [33]:
fix_seed!()
data = [rand(Normal(),2,2) for _ in 1:5]
Out[33]:
5-element Vector{Matrix{Float64}}:
 [0.6791074260357777 -0.3530074003005963; 0.8284134829000359 -0.13485387193052173]
 [0.5866170746331097 0.06494754854834232; 0.29733585084941616 -0.10901738508171745]
 [-0.514210390833322 -0.6889071278256981; 1.5743302021369892 -0.7628038164104581]
 [0.39748240921816347 -0.34635460427879816; 0.8116296225068749 -0.18757268194516005]
 [-1.6072563241277753 2.2762328327845243; -2.48079273065994 0.21969346754254096]
In [34]:
mean(data)
Out[34]:
2×2 Matrix{Float64}:
 -0.091652   0.190582
  0.206183  -0.194911
In [35]:
my_mean(data)
Out[35]:
2×2 Matrix{Float64}:
 -0.091652   0.190582
  0.206183  -0.194911
In [36]:
@which my_mean(data)
Out[36]:
my_mean(data::Vector{T}) where T in Main at In[31]:1

A glimpse under the hood

In [ ]:
@code_lowered my_mean(data)
In [ ]:
@code_llvm my_mean(data)
In [ ]:
@code_native my_mean(data)

An array of arrays

In [38]:
fix_seed!()
data = [rand(Normal(μ,1),5) for μ in 1:5] #\mu + [TAB]
Out[38]:
5-element Vector{Vector{Float64}}:
 [1.6791074260357777, 1.8284134829000358, 0.6469925996994037, 0.8651461280694783, 1.5866170746331099]
 [2.297335850849416, 2.0649475485483424, 1.8909826149182825, 1.485789609166678, 3.574330202136989]
 [2.311092872174302, 2.2371961835895418, 3.3974824092181635, 3.811629622506875, 2.6536453957212016]
 [3.81242731805484, 2.3927436758722247, 1.5192072693400598, 6.276232832784524, 4.219693467542541]
 [4.882862469267065, 4.398746405814617, 6.14227643894132, 4.911383681799527, 5.2794662520898985]
In [39]:
mean.(data)
Out[39]:
5-element Vector{Float64}:
 1.321255342267561
 2.262677165123942
 2.8822092966420163
 3.644060912718838
 5.122947049582486

A matrix of data

In [40]:
data = hcat(data...)
Out[40]:
5×5 Matrix{Float64}:
 1.67911   2.29734  2.31109  3.81243  4.88286
 1.82841   2.06495  2.2372   2.39274  4.39875
 0.646993  1.89098  3.39748  1.51921  6.14228
 0.865146  1.48579  3.81163  6.27623  4.91138
 1.58662   3.57433  2.65365  4.21969  5.27947
In [41]:
heatmap(data',yflip=true)
Out[41]:
In [42]:
mean.(data) #broadcasts mean over each scalar entry, so it does nothing useful here
Out[42]:
5×5 Matrix{Float64}:
 1.67911   2.29734  2.31109  3.81243  4.88286
 1.82841   2.06495  2.2372   2.39274  4.39875
 0.646993  1.89098  3.39748  1.51921  6.14228
 0.865146  1.48579  3.81163  6.27623  4.91138
 1.58662   3.57433  2.65365  4.21969  5.27947
In [43]:
mean(data,dims=1) #better!
Out[43]:
1×5 Matrix{Float64}:
 1.32126  2.26268  2.88221  3.64406  5.12295

Dictionaries

In [44]:
data = Dict()
data[:cats] = rand(Normal(2.3,1),5)
data["dogs"] = rand(Normal(5.2,1),5)
data[25] = rand(Normal(7.8,1),5)
data
Out[44]:
Dict{Any, Any} with 3 entries:
  "dogs" => [6.62305, 5.60839, 5.78862, 4.90372, 5.89111]
  25     => [8.30687, 7.74307, 6.02898, 9.39062, 9.19706]
  :cats  => [2.41142, 1.94212, 2.77371, 2.60023, 1.53732]
In [45]:
keys(data)
Out[45]:
KeySet for a Dict{Any, Any} with 3 entries. Keys:
  "dogs"
  25
  :cats
In [46]:
mean(data[:cats])
Out[46]:
2.2529618709854136
In [47]:
mean(data["dogs"])
Out[47]:
5.762977963045932
In [48]:
mean(data["cats"])
KeyError: key "cats" not found

Stacktrace:
 [1] getindex(h::Dict{Any, Any}, key::String)
   @ Base ./dict.jl:482
 [2] top-level scope
   @ In[48]:1
 [3] eval
   @ ./boot.jl:360 [inlined]
 [4] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
In [49]:
mean(data[Symbol("cats")])
Out[49]:
2.2529618709854136
In [50]:
data[Symbol("cats")] |> mean
Out[50]:
2.2529618709854136
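The `Dict{Any, Any}` above happily mixes `Symbol`, `String`, and `Int` keys, which is exactly what caused the `KeyError` for `"cats"`. A minimal sketch of how a type-annotated dictionary rules such mix-ups out (the values here are made up for illustration, not the random data above):

```julia
using Statistics

# Restricting the key type means data[:cats] and data["cats"] can no longer
# silently refer to different (or missing) entries.
data = Dict{String, Vector{Float64}}()
data["cats"] = [2.4, 1.9, 2.8]   # illustrative values
data["dogs"] = [6.6, 5.6, 5.8]
mean(data["cats"])
```

With this declaration, an attempt like `data[:cats] = ...` fails immediately with a `MethodError` at assignment time, rather than creating a stray entry that a later `String` lookup cannot find.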

Tuples

In [51]:
fix_seed!()
rn() = rand(Normal())
data = (rn(), rn(),rn())
Out[51]:
(0.6791074260357777, 0.8284134829000359, -0.3530074003005963)
In [52]:
typeof(data)
Out[52]:
Tuple{Float64, Float64, Float64}
In [53]:
mean(data)
Out[53]:
0.3848378362117391
In [54]:
data[2]
Out[54]:
0.8284134829000359
In [55]:
data[2] = 7.8 #Tuples are immutable!
MethodError: no method matching setindex!(::Tuple{Float64, Float64, Float64}, ::Float64, ::Int64)

Stacktrace:
 [1] top-level scope
   @ In[55]:1
 [2] eval
   @ ./boot.jl:360 [inlined]
 [3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
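Since tuples cannot be mutated, the idiomatic move is to build a modified copy; a minimal sketch (with illustrative values):

```julia
# Tuples are immutable, so "changing" an entry means constructing a new tuple.
data = (0.5, 0.8, -0.3)            # illustrative values
updated = (data[1], 7.8, data[3])  # copy with the second entry replaced
```

Recent Julia versions also provide a non-mutating `Base.setindex` for tuples that does the same thing.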

Structs

In [56]:
struct Person
    weight::Float64
    height::Float64
end
In [57]:
someone = Person(102, 180)
Out[57]:
Person(102.0, 180.0)
In [58]:
someone.weight = 103 #Structs are immutable unless declared as mutable struct
setfield! immutable struct of type Person cannot be changed

Stacktrace:
 [1] setproperty!(x::Person, f::Symbol, v::Int64)
   @ Base ./Base.jl:34
 [2] top-level scope
   @ In[58]:1
 [3] eval
   @ ./boot.jl:360 [inlined]
 [4] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
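If field updates are genuinely needed, the struct can be declared mutable; a minimal sketch, using a hypothetical MutablePerson so as not to clash with Person above:

```julia
# Declaring the struct with `mutable` allows fields to be reassigned.
mutable struct MutablePerson
    weight::Float64
    height::Float64
end

someone = MutablePerson(102, 180)
someone.weight = 103    # allowed: the struct was declared mutable
someone.weight          # → 103.0
```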
In [59]:
me = Person(102, 180)
you = Person(83, 170)
me + you #can't add persons just like that
MethodError: no method matching +(::Person, ::Person)
Closest candidates are:
  +(::Any, ::Any, ::Any, ::Any...) at operators.jl:560
  +(::T, ::Any) where T<:Intervals.Interval at /Users/uqjnazar/.julia/packages/Intervals/ua9cq/src/interval.jl:294
  +(::ChainRulesCore.Tangent{P, T} where T, ::P) where P at /Users/uqjnazar/.julia/packages/ChainRulesCore/IHuAH/src/differential_arithmetic.jl:162
  ...

Stacktrace:
 [1] top-level scope
   @ In[59]:3
 [2] eval
   @ ./boot.jl:360 [inlined]
 [3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
In [60]:
import Base: + 

function +(x::Person, y::Person)
    Person(x.weight + y.weight, x.height + y.height)
end
Out[60]:
+ (generic function with 472 methods)
In [61]:
me + you #Now you can
Out[61]:
Person(185.0, 350.0)
In [62]:
import Base: /
function /(x::Person, n::Real)
    Person(x.weight/n, x.height/n)
end
Out[62]:
/ (generic function with 209 methods)
In [63]:
mean([me,you]) #With + and / (by number) we can compute the mean of persons
Out[63]:
Person(92.5, 175.0)

Dataframes

In [64]:
ENV["LINES"] = 10
df = CSV.File("./data/temperatures.csv") |> DataFrame
Out[64]:

777 rows × 5 columns

 Row │ Year   Month  Day    Brisbane  GoldCoast
     │ Int64  Int64  Int64  Float64   Float64
─────┼─────────────────────────────────────────
   1 │  2015      1      1      31.3       30.9
   2 │  2015      1      2      30.5       30.1
   3 │  2015      1      3      28.9       30.1
   4 │  2015      1      4      30.2       30.1
   5 │  2015      1      5      28.1       28.0
   6 │  2015      1      6      29.5       29.3
   7 │  2015      1      7      27.1       26.4
   8 │  2015      1      8      28.4       27.7
   9 │  2015      1      9      29.5       29.6
  10 │  2015      1     10      30.2       29.5
In [65]:
data = df.GoldCoast
mean(data)
Out[65]:
26.163835263835264

All kinds of other descriptive statistics

In [66]:
println("Sample Mean: ", mean(data))
println("Harmonic ≤ Geometric ≤ Arithmetic ",      #\le + [TAB]
            (harmmean(data), geomean(data), mean(data)))

println("Sample Variance: ",var(data))
println("Sample Standard Deviation: ",std(data))
println("Minimum: ", minimum(data))
println("Maximum: ", maximum(data))
println("Median: ", median(data))
println("95th percentile: ", percentile(data, 95))
println("0.95 quantile: ", quantile(data, 0.95))
println("Interquartile range: ", iqr(data),"\n")

summarystats(data)
Sample Mean: 26.163835263835264
Harmonic ≤ Geometric ≤ Arithmetic (25.65423319190783, 25.915563409789065, 26.163835263835264)
Sample Variance: 12.367253570765161
Sample Standard Deviation: 3.516710618001595
Minimum: 15.4
Maximum: 36.5
Median: 26.8
95th percentile: 30.9
0.95 quantile: 30.9
Interquartile range: 5.5

Out[66]:
Summary Stats:
Length:         777
Missing Count:  0
Mean:           26.163835
Minimum:        15.400000
1st Quartile:   23.400000
Median:         26.800000
3rd Quartile:   28.900000
Maximum:        36.500000
In [67]:
min(5,2,-3) #Minimum of individual function arguments
Out[67]:
-3
In [68]:
minimum([5,2,-3]) #Minimum of an array (AbstractArray)
Out[68]:
-3
In [69]:
min([5,2,-3]) #doesn't work
MethodError: no method matching min(::Vector{Int64})
Closest candidates are:
  min(::Any, ::Missing) at missing.jl:127
  min(::T1, ::DataValues.DataValue{T2}) where {T1, T2} at /Users/uqjnazar/.julia/packages/DataValues/N7oeL/src/scalar/operations.jl:75
  min(::Any, ::Any) at operators.jl:433
  ...

Stacktrace:
 [1] top-level scope
   @ In[69]:1
 [2] eval
   @ ./boot.jl:360 [inlined]
 [3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094
In [70]:
min([5,2,-3]...) #The splat operator passes the array elements as individual arguments
Out[70]:
-3
In [71]:
fix_seed!()
data = rand(Normal(),10^6);
In [72]:
@time begin
    minimum(data)
end
  0.001196 seconds (1 allocation: 16 bytes)
Out[72]:
-4.846415885152558
In [73]:
@time begin
    min(data...)
end
  0.551472 seconds (2.10 M allocations: 82.025 MiB, 11.16% gc time, 7.09% compilation time)
Out[73]:
-4.846415885152558

Something rand.


The Monty Hall Problem

In [74]:
fix_seed!()

function montyHall(switch_policy)
    prize, choice = rand(1:3), rand(1:3)
    if prize == choice
        revealed = rand(setdiff(1:3,choice))
    else
        revealed = rand(setdiff(1:3,[prize,choice]))
    end

    if switch_policy
        choice = setdiff(1:3,[revealed,choice])[1]
    end
    return choice == prize
end

N = 10^6
println("Success probability with policy I (stay): ",  sum([montyHall(false) for _ in 1:N])/N)
println("Success probability with policy II (switch): ", sum([montyHall(true) for _ in 1:N])/N)
Success probability with policy I (stay): 0.334146
Success probability with policy II (switch): 0.666056

Common random numbers

$$ X \sim \text{Uniform}(0,\, 2\lambda(1-\lambda)) $$

$$ \mathbb{E}_{\lambda}[X] = \lambda(1-\lambda) $$
In [75]:
seed = 1
N = 100
λ_grid = 0.01:0.01:0.99

theorM(λ) = mean(Uniform(0,2λ*(1-λ)))
estM(λ) = mean(rand(Uniform(0,2λ*(1-λ)),N))

function estM(λ, seed)
    Random.seed!(seed)
    estM(λ)
end

trueM = theorM.(λ_grid)
estM0 = estM.(λ_grid)
estMCRN = estM.(λ_grid,seed)

plot(λ_grid, trueM, c=:black, label="Expected curve")
plot!(λ_grid, estM0, c=:blue, label="No CRN estimate")
plot!(λ_grid, estMCRN, c=:red, label="CRN estimate", 
    xlims=(0,1), ylims=(0,0.4), xlabel=L"\lambda", ylabel = "Mean")
Out[75]:

Multiple RNGs

Say we want to estimate the mean of $$ X=\sum_{i=1}^{N} Z_{i} $$ with $N \sim \operatorname{Poisson}(K \lambda)$ and $Z_i \sim \text{Uniform}(0,\, 2(1-\lambda))$, where $\lambda \in (0,1)$ and $K>0$.

$$ \mathbb{E}_{\lambda}[X]=K \lambda(1-\lambda) $$
In [76]:
K = 50

prn(λ,rng) = quantile(Poisson(λ),rand(rng))
zDist(λ) = Uniform(0,2*(1-λ))

rv(λ,rng) = sum([rand(rng,zDist(λ)) for _ in 1:prn(K*λ, rng)])
rv2(λ,rng1,rng2) = sum([rand(rng1,zDist(λ)) for _ in 1:prn(K*λ,rng2)])

mEst(λ,rng) = mean([rv(λ,rng) for _ in 1:N])
mEst2(λ,rng1,rng2) = mean([rv2(λ,rng1,rng2) for _ in 1:N])

#No Common Random Numbers (CRN)
function mGraph0(seed)
    singleRng = MersenneTwister(seed)
    [mEst(λ,singleRng) for λ in λ_grid]
end

#CRN but single random number generator (RNG)
mGraph1(seed) = [mEst(λ,MersenneTwister(seed)) for λ in λ_grid]

#CRN with two random number generators
mGraph2(seed1,seed2) = [mEst2(λ,MersenneTwister(seed1), MersenneTwister(seed2)) for λ in λ_grid]

plot(λ_grid,mGraph0(1987), c=:red, label="No CRN")
plot!(λ_grid,mGraph1(1987), c=:green, label="CRN and one RNG")
plot!(λ_grid,mGraph2(1987,1988), c=:blue, label="CRN and two RNGs", xlims=(0,1),ylims=(0,14),
    xlabel=L"\lambda", ylabel = "Mean")
Out[76]:
In [77]:
argMaxλ(graph) = λ_grid[findmax(graph)[2]]

M = 10^3
std0 = std([argMaxλ(mGraph0(seed)) for seed in 1:M])
std1 = std([argMaxλ(mGraph1(seed)) for seed in 1:M])
std2 = std([argMaxλ(mGraph2(seed,seed+M)) for seed in 1:M])

println("Standard deviation with no CRN: ", std0)
println("Standard deviation with CRN and single RNG: ", std1)
println("Standard deviation with CRN and two RNGs: ", std2)
Standard deviation with no CRN: 0.03708052002015299
Standard deviation with CRN and single RNG: 0.03411444555309959
Standard deviation with CRN and two RNGs: 0.014645353747396703

Do you still miss R? Then just RCall.


See the docs for RCall.jl

In [78]:
using RCall

data1 = DataFrame(CSV.File("./data/machine1.csv", header=false))[:,1]
data2 = DataFrame(CSV.File("./data/machine2.csv", header=false))[:,1]
data3 = DataFrame(CSV.File("./data/machine3.csv", header=false))[:,1]

function R_ANOVA(allData)
    data = vcat([ [x fill(i, length(x))] for (i, x) in enumerate(allData) ]...)
    df = DataFrame(data, [:Diameter, :MachNo])
    @rput df

    R"""
    df$MachNo <- as.factor(df$MachNo)
    anova <- summary(aov( Diameter ~ MachNo, data=df))
    fVal <- anova[[1]]["F value"][[1]][1]
    pVal <- anova[[1]]["Pr(>F)"][[1]][1]
    """
    println("R ANOVA f-value: ", @rget fVal)
    println("R ANOVA p-value: ", @rget pVal)
end

R_ANOVA([data1, data2, data3])
R ANOVA f-value: 10.516968568709089
R ANOVA p-value: 0.00014236168817139574
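For comparison, the one-way ANOVA F statistic can also be computed from first principles in pure Julia; a minimal sketch with made-up toy data (the f_statistic helper is illustrative, not part of the workshop code):

```julia
using Statistics

# One-way ANOVA F statistic: ratio of between-group to within-group
# mean squares, computed directly from the definitions.
function f_statistic(groups)
    k = length(groups)                # number of groups
    n = sum(length, groups)           # total sample size
    grand = mean(vcat(groups...))     # grand mean over all observations
    ssb = sum(length(g) * (mean(g) - grand)^2 for g in groups)  # between-group SS
    ssw = sum(sum((x - mean(g))^2 for x in g) for g in groups)  # within-group SS
    (ssb / (k - 1)) / (ssw / (n - k))
end

f_statistic([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])  # ≈ 13.0
```

Applied to data1, data2, data3 this should reproduce the f-value R reported above, since it is the same standard one-way ANOVA computation.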

To use Julia from R, use JuliaCall

Some Plots.


See example image gallery with code - part of "Statistics with Julia" book.

See docs for Plots.jl and StatsPlots.jl

In [79]:
function hailLength(x::Int)
    n = 0
    while x != 1
        if x % 2 == 0
            x = x ÷ 2 #\div + [TAB]
        else
            x = 3x + 1
        end
        n += 1
    end
    return n
end

lengths = [hailLength(x₀) for x₀ in 2:10^7] #\_0 + [TAB]

histogram(lengths, bins=1000, normed=:true, 
    fill=(:blue, true), la=0, legend=:none,
    xlims=(0, 500), ylims=(0, 0.012),
    xlabel="Length", ylabel="Frequency")
Out[79]:
In [80]:
data = CSV.File("./data/temperatures.csv") |> DataFrame
brisbane = data.Brisbane
goldcoast = data.GoldCoast

diff = brisbane - goldcoast #note: this binding shadows Base.diff
dates = [Date(
            Year(data.Year[i]), 
            Month(data.Month[i]), 
            Day(data.Day[i])
        ) for i in 1:nrow(data)]

fortnightRange = 250:263
brisFortnight = brisbane[fortnightRange]
goldFortnight = goldcoast[fortnightRange]

p1 = plot(dates, [brisbane goldcoast], 
        c=[:blue :red], xlabel="Time", ylabel="Temperature", label=["Brisbane" "Gold Coast"])
p2 = plot(dates[fortnightRange], [brisFortnight goldFortnight], xticks=:none,
        c=[:blue :red], m=(:dot, 5, Plots.stroke(1)), ylabel="Temperature",
        label=["Brisbane" "Gold Coast"], xlabel="Time within one fortnight")
p3 = plot(dates, diff, 
        c=:black, ylabel="Temperature Difference",legend=false)
p4 = histogram(diff, bins=-4:0.5:6, 
        ylims=(0,140), legend = false,
        xlabel="Temperature Difference", ylabel="Frequency")
plot(p1,p2,p3,p4, size = (800,500), margin = 5mm)
Out[80]:
In [81]:
fix_seed!()

μ₁, σ₁ = 10, 5 #\mu + [TAB] \_1 + [TAB] etc...
μ₂, σ₂ = 40, 12
dist1, dist2 = Normal(μ₁,σ₁), Normal(μ₂,σ₂)
p = 0.3
mixRv() = (rand() <= p) ? rand(dist1) : rand(dist2)

n = 2000
data = [mixRv() for _ in 1:n]

density(data, c=:blue, label="Density via StatsPlots", xlims=(-20,80), ylims=(0,0.035))
stephist!(data, bins=50, c=:black, norm=true, label="Histogram", xlabel="x", ylabel = "Density")
Out[81]:
In [82]:
mixPDF(x) = p*pdf(dist1,x) + (1-p)*pdf(dist2,x)

kdeDist = kde(data)

xGrid = -20:0.1:80
pdfKDE = pdf(kdeDist,xGrid)

p1 = plot(xGrid, pdfKDE, c=:blue, label="KDE PDF")
stephist!(data, bins=50, c=:black, normed=:true, label="Histogram")
plot!(xGrid, mixPDF.(xGrid), c=:red, label="Underlying PDF",
    xlims=(-20,80), ylims=(0,0.035), legend=:topleft,
    xlabel="X", ylabel = "Density")

hVals = [0.5,2,10]
kdeS = [kde(data,bandwidth=h) for h in hVals]

p2 = plot(xGrid, pdf(kdeS[1],xGrid), c = :green, label= "h=$(hVals[1])")
plot!(xGrid, pdf(kdeS[2],xGrid), c = :blue, label= "h=$(hVals[2])")
plot!(xGrid, pdf(kdeS[3],xGrid), c = :purple, label= "h=$(hVals[3])",
    xlims=(-20,80), ylims=(0,0.035), legend=:topleft, 
    xlabel="X", ylabel = "Density")

plot(p1,p2,size = (800,400))
Out[82]:
In [87]:
fix_seed!()

mixCDF(x) = p*cdf(dist1,x) + (1-p)*cdf(dist2,x)

n = [30, 100]
data1 = [mixRv() for _ in 1:n[1]]
data2 = [mixRv() for _ in 1:n[2]]

empiricalCDF1 = ecdf(data1)
empiricalCDF2 = ecdf(data2)

xGrid = -10:0.1:80
plot(xGrid,empiricalCDF1.(xGrid), c=:blue, label="ECDF with n = $(n[1])")
plot!(xGrid,empiricalCDF2.(xGrid), c=:red, label="ECDF with n = $(n[2])")
plot!(xGrid, mixCDF.(xGrid), c=:black, label="Underlying CDF",
    xlims=(-10,80), ylims=(0,1), 
    xlabel="x", ylabel="Probability", legend=:topleft)
Out[87]:
In [83]:
fix_seed!()

b1, b2 = 0.5, 2
dist1, dist2 = Beta(b1,b1), Beta(b2,b2)
 
n = 2000
data1 = rand(dist1,n)
data2 = rand(dist2,n)

stephist(data1, bins=15, label = "beta($b1,$b1)", c = :red, normed = true)
p1 = stephist!(data2, bins=15, label = "beta($b2,$b2)",
        c = :blue, xlabel="x", ylabel="Density",normed = true)

p2 = qqplot(data1, data2, c=:black, ms=1, msw =0,
        xlabel="Quantiles for beta($b1,$b1) sample",
        ylabel="Quantiles for beta($b2,$b2) sample",
        legend=false)

plot(p1, p2, size=(800,400), margin = 5mm)
Out[83]: