For a more elementary introduction to MLJ, see Getting Started.
Note. Be sure this file has not been separated from the accompanying Project.toml and Manifest.toml files, which should not be altered unless you know what you are doing. Using them, the following code block instantiates a Julia environment with a tested bundle of packages known to work with the rest of this script:
using Pkg
Pkg.activate(@__DIR__)
Pkg.instantiate()
Activating environment at `~/Google Drive/Julia/MLJ/MLJ/examples/lightning_tour/Project.toml`
Assuming Julia 1.6
In MLJ a model is just a container for hyper-parameters, and that's all. Here we will apply several kinds of model composition before binding the resulting "meta-model" to data in a machine for evaluation, using cross-validation.
Loading and instantiating a gradient tree-boosting model:
using MLJ
MLJ.color_off()
Booster = @load EvoTreeRegressor # loads code defining a model type
booster = Booster(max_depth=2) # specify hyper-parameter at construction
[ Info: For silent loading, specify `verbosity=0`.
import EvoTrees ✔
EvoTreeRegressor(
    loss = EvoTrees.Linear(),
    nrounds = 10,
    λ = 0.0,
    γ = 0.0,
    η = 0.1,
    max_depth = 2,
    min_weight = 1.0,
    rowsample = 1.0,
    colsample = 1.0,
    nbins = 64,
    α = 0.5,
    metric = :mse,
    rng = MersenneTwister(444),
    device = "cpu") @522
booster.nrounds = 50  # or mutate post facto
booster
EvoTreeRegressor(
    loss = EvoTrees.Linear(),
    nrounds = 50,
    λ = 0.0,
    γ = 0.0,
    η = 0.1,
    max_depth = 2,
    min_weight = 1.0,
    rowsample = 1.0,
    colsample = 1.0,
    nbins = 64,
    α = 0.5,
    metric = :mse,
    rng = MersenneTwister(444),
    device = "cpu") @522
This model is an example of an iterative model. As it stands, the number of iterations nrounds is fixed.

Let's create a new model that automatically learns the number of iterations, using the NumberSinceBest(3) criterion, as applied to an out-of-sample l1 loss:
using MLJIteration
iterated_booster = IteratedModel(model=booster,
resampling=Holdout(fraction_train=0.8),
controls=[Step(2), NumberSinceBest(3), NumberLimit(300)],
measure=l1,
retrain=true)
DeterministicIteratedModel(
    model = EvoTreeRegressor(
        loss = EvoTrees.Linear(),
        nrounds = 50,
        λ = 0.0,
        γ = 0.0,
        η = 0.1,
        max_depth = 2,
        min_weight = 1.0,
        rowsample = 1.0,
        colsample = 1.0,
        nbins = 64,
        α = 0.5,
        metric = :mse,
        rng = MersenneTwister(444),
        device = "cpu"),
    controls = Any[Step(2), NumberSinceBest(3), NumberLimit(300)],
    resampling = Holdout(
        fraction_train = 0.8,
        shuffle = false,
        rng = Random._GLOBAL_RNG()),
    measure = LPLoss(
        p = 1),
    weights = nothing,
    class_weights = nothing,
    operation = MLJModelInterface.predict,
    retrain = true,
    check_measure = true,
    iteration_parameter = nothing,
    cache = true) @630
Combining the model with categorical feature encoding:
pipe = @pipeline ContinuousEncoder iterated_booster
Pipeline290(
    continuous_encoder = ContinuousEncoder(
        drop_last = false,
        one_hot_ordered_factors = false),
    deterministic_iterated_model = DeterministicIteratedModel(
        model = EvoTreeRegressor{Float64,…} @522,
        controls = Any[Step(2), NumberSinceBest(3), NumberLimit(300)],
        resampling = Holdout @258,
        measure = LPLoss{Int64} @628,
        weights = nothing,
        class_weights = nothing,
        operation = MLJModelInterface.predict,
        retrain = true,
        check_measure = true,
        iteration_parameter = nothing,
        cache = true)) @585
First, we define a hyper-parameter range for optimization of a (nested) hyper-parameter:
max_depth_range = range(pipe,
:(deterministic_iterated_model.model.max_depth),
lower = 1,
upper = 10)
typename(MLJBase.NumericRange)(Int64, :(deterministic_iterated_model.model.max_depth), ... )
Now we can wrap the pipeline model in an optimization strategy to make it "self-tuning":
self_tuning_pipe = TunedModel(model=pipe,
tuning=RandomSearch(),
ranges = max_depth_range,
resampling=CV(nfolds=3, rng=456),
measure=l1,
acceleration=CPUThreads(),
n=50)
DeterministicTunedModel(
    model = Pipeline290(
        continuous_encoder = ContinuousEncoder @294,
        deterministic_iterated_model = DeterministicIteratedModel{EvoTreeRegressor{Float64,…}} @630),
    tuning = RandomSearch(
        bounded = Distributions.Uniform,
        positive_unbounded = Distributions.Gamma,
        other = Distributions.Normal,
        rng = Random._GLOBAL_RNG()),
    resampling = CV(
        nfolds = 3,
        shuffle = true,
        rng = MersenneTwister(456)),
    measure = LPLoss(
        p = 1),
    weights = nothing,
    operation = MLJModelInterface.predict,
    range = NumericRange(
        field = :(deterministic_iterated_model.model.max_depth),
        lower = 1,
        upper = 10,
        origin = 5.5,
        unit = 4.5,
        scale = :linear),
    selection_heuristic = MLJTuning.NaiveSelection(nothing),
    train_best = true,
    repeats = 1,
    n = 50,
    acceleration = CPUThreads{Int64}(5),
    acceleration_resampling = CPU1{Nothing}(nothing),
    check_measure = true,
    cache = true) @132
Loading a selection of features and labels from the Ames House Price dataset:
X, y = @load_reduced_ames;
Binding the "self-tuning" pipeline model to data in a machine (which will additionally store learned parameters):
mach = machine(self_tuning_pipe, X, y)
Machine{DeterministicTunedModel{RandomSearch,…},…} @769 trained 0 times; caches data
  args:
    1:	Source @557 ⏎ `Table{Union{AbstractVector{Continuous}, AbstractVector{Count}, AbstractVector{Multiclass{15}}, AbstractVector{Multiclass{25}}, AbstractVector{OrderedFactor{10}}}}`
    2:	Source @538 ⏎ `AbstractVector{Continuous}`
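Before evaluating, note that a machine bound this way can also be trained and used for prediction directly. A minimal sketch, where `Xnew` is a hypothetical table with the same schema as `X` (here we just reuse the first three training rows):

```julia
# Train the self-tuning pipeline on all supplied data
# (runs the random search, then retrains the best model):
fit!(mach)

# Predict on new data with the same schema as `X`:
Xnew = selectrows(X, 1:3)
yhat = predict(mach, Xnew)
```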
Evaluating the "self-tuning" pipeline model's performance using 5-fold cross-validation (implies multiple layers of nested resampling):
evaluate!(mach,
measures=[l1, l2],
resampling=CV(nfolds=5, rng=123),
acceleration=CPUThreads())
[ Info: Performing evaluations using 5 threads.
Evaluating over 5 folds: 100%[=========================] Time: 0:06:01
┌────────────────────┬───────────────┬──────────────────────────────────────────
│ _.measure │ _.measurement │ _.per_fold ⋯
├────────────────────┼───────────────┼──────────────────────────────────────────
│ LPLoss{Int64} @628 │ 17000.0 │ [16500.0, 16200.0, 16600.0, 16600.0, 19 ⋯
│ LPLoss{Int64} @406 │ 6.86e8 │ [6.18e8, 6.16e8, 6.08e8, 6.21e8, 9.66e8 ⋯
└────────────────────┴───────────────┴──────────────────────────────────────────
1 column omitted
_.per_observation = [[[24800.0, 29400.0, ..., 5360.0], [4300.0, 31900.0, ..., 12600.0], [22400.0, 51600.0, ..., 35700.0], [1940.0, 22200.0, ..., 1920.0], [8920.0, 17900.0, ..., 6750.0]], [[6.15e8, 8.67e8, ..., 2.88e7], [1.85e7, 1.02e9, ..., 1.59e8], [5.03e8, 2.66e9, ..., 1.27e9], [3.76e6, 4.91e8, ..., 3.7e6], [7.96e7, 3.19e8, ..., 4.55e7]]]
_.fitted_params_per_fold = [ … ]
_.report_per_fold = [ … ]
_.train_test_rows = [ … ]
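Once the machine has been trained with `fit!(mach)`, the outcome of the random search can be inspected via the machine's report. A hedged sketch, assuming the report field names of recent MLJTuning versions:

```julia
# Inspect the tuning results (field names may vary across MLJ versions):
r = report(mach)
best = r.best_model  # the pipeline with the winning max_depth

# Drill down to the optimized nested hyper-parameter:
best.deterministic_iterated_model.model.max_depth
```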
This notebook was generated using Literate.jl.