MLJ provides a set of functions that generate random data sets, closely resembling the functions of the same name in scikit-learn. They are useful for testing machine learning models (e.g., user-defined composite models; see Composing Models).
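As a minimal sketch of that use case, the snippet below generates a small classification data set and runs a cross-validated evaluation of a baseline model on it. The choice of `ConstantClassifier`, the number of folds, and the integer seed `123` are illustrative assumptions rather than anything this page prescribes; a composite model could be substituted for the baseline.

```julia
using MLJ

# Synthetic two-class data; the integer `rng` is an assumed seed for repeatability.
X, y = make_blobs(100, 3; centers=2, rng=123)

# A trivial baseline that ignores the features and predicts class frequencies.
model = ConstantClassifier()
mach = machine(model, X, y)

# 5-fold cross-validation, scored with log loss.
e = evaluate!(mach; resampling=CV(nfolds=5, shuffle=true, rng=123),
              measure=log_loss, verbosity=0)
println(e.measurement[1])
```

Because the data is cheap to regenerate, the same pattern can serve as a quick smoke test whenever a model definition changes.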
using MLJ, VegaLite, DataFrames
make_blobs
X, y = make_blobs(100, 3; centers=2, cluster_std=[1.0, 3.0])
dfBlobs = DataFrame(X)
dfBlobs.y = y
first(dfBlobs, 3)
| | x1 | x2 | x3 | y |
|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Cat… |
| 1 | 5.30601 | 7.33548 | 9.9446 | 2 |
| 2 | 5.14757 | 5.8813 | 8.84096 | 2 |
| 3 | 3.34118 | 9.36617 | 12.1529 | 2 |
dfBlobs |> @vlplot(:point, x=:x1, y=:x2, color = :"y:n")
dfBlobs |> @vlplot(:point, x=:x1, y=:x3, color = :"y:n")
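The output above changes on every run because the data is random. The following sketch, which assumes the `rng` and `as_table` keyword arguments of `make_blobs`, shows one way to get repeatable draws and a plain matrix instead of a table. The seed `42` and the sizes are arbitrary.

```julia
using MLJ

# The same integer seed twice: the two draws are expected to coincide.
X1, y1 = make_blobs(50, 2; centers=3, rng=42)
X2, y2 = make_blobs(50, 2; centers=3, rng=42)
same = MLJ.matrix(X1) == MLJ.matrix(X2) && y1 == y2

# as_table=false returns the features as a matrix rather than a table.
Xmat, ymat = make_blobs(50, 2; centers=3, rng=42, as_table=false)
println(same, " ", size(Xmat))
```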
make_circles
X, y = make_circles(100; noise=0.05, factor=0.3)
dfCircles = DataFrame(X)
dfCircles.y = y
first(dfCircles, 3)
| | x1 | x2 | y |
|---|---|---|---|
| | Float64 | Float64 | Cat… |
| 1 | -0.342997 | -0.00629956 | 0 |
| 2 | 1.02085 | 0.0437288 | 1 |
| 3 | -0.21868 | 0.958061 | 1 |
dfCircles |> @vlplot(:circle, x=:x1, y=:x2, color = :"y:n")
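To see what `factor` controls, the sketch below switches the noise off and measures each point's distance from the origin. It assumes the `rng` and `as_table` keywords; the seed and sample size are arbitrary. With no noise, the two distinct radii are expected to be in the ratio given by `factor`.

```julia
using MLJ

# noise=0 places every point exactly on one of the two circles.
X, y = make_circles(200; noise=0, factor=0.3, rng=1, as_table=false)

# Distance of each point from the origin.
r = vec(sqrt.(sum(abs2, X; dims=2)))

# Distinct radii, rounded to absorb floating-point error.
radii = sort(unique(round.(r; digits=6)))
println(radii)
```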
make_moons
X, y = make_moons(100; noise=0.05)
dfHalfCircles = DataFrame(X)
dfHalfCircles.y = y
first(dfHalfCircles, 3)
| | x1 | x2 | y |
|---|---|---|---|
| | Float64 | Float64 | Cat… |
| 1 | 1.24069 | -0.6675 | 1 |
| 2 | 1.241 | -0.632259 | 1 |
| 3 | 0.0699098 | 1.06216 | 0 |
dfHalfCircles |> @vlplot(:circle, x=:x1, y=:x2, color = :"y:n")
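The `y` column in the table above is categorical rather than numeric. A short sketch of how one might inspect the generated data before passing it to a model, using MLJ's `schema` and `scitype` (the seed is an arbitrary assumption):

```julia
using MLJ

X, y = make_moons(100; noise=0.05, rng=3)

# Column names and scientific types of the feature table.
println(schema(X))

# Scientific type of the target; a categorical (Multiclass) vector is expected.
println(scitype(y))
```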
make_regression
X, y = make_regression(100, 5; noise=0.5, sparse=0.2, outliers=0.1)
dfRegression = DataFrame(X)
dfRegression.y = y
first(dfRegression, 3)
| | x1 | x2 | x3 | x4 | x5 | y |
|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.443821 | 0.136731 | -1.10758 | -0.504443 | 1.08749 | 0.215017 |
| 2 | -0.727496 | 0.843299 | 0.468311 | -0.922993 | -0.297077 | -0.59015 |
| 3 | -0.412518 | -1.26038 | 0.932722 | 0.116239 | -0.570425 | -0.712242 |
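As a rough illustration of using this data to exercise a regression routine, the sketch below fits ordinary least squares with Julia's backslash operator. This is a stand-in for an MLJ model, not part of the `make_regression` API; the `rng` and `as_table` keywords and the seed `7` are assumptions made for the example.

```julia
using MLJ
using LinearAlgebra

# as_table=false gives a feature matrix; a single target is generated by default.
X, y = make_regression(100, 5; noise=0.5, sparse=0.2, outliers=0.1,
                       rng=7, as_table=false)
yv = vec(y)  # leaves a vector unchanged; guards against an n-by-1 matrix

# Ordinary least squares with an appended intercept column.
A = hcat(X, ones(size(X, 1)))
θ = A \ yv

# Root-mean-square residual of the fit.
rms_residual = sqrt(sum(abs2, A * θ .- yv) / length(yv))
println(length(θ), " ", rms_residual)
```

With `outliers=0.1` some targets are deliberately corrupted, so a plain least-squares fit is a reasonable baseline against which a robust regressor can be compared.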