The models fit by functions such as ols and lasso are specified in a compact symbolic form. The tilde operator ~ is central to the formation of such models. An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model.
import $ivy.`com.github.haifengl::smile-scala:2.6.0`
import $ivy.`org.slf4j:slf4j-simple:1.7.30`
import scala.language.postfixOps
import smile._
import smile.data._
import smile.data.formula._
import smile.data.formula.Terms.$
import smile.data.`type`._
import smile.data.measure._
In the simplest case, the right-hand side of ~ can be empty. In that case, all variables except the response variable are used as predictors.
val f = "class" ~
where class is the response variable. When your data is already prepared, such a simple model is usually sufficient. For feature engineering and selection, however, you may want to be more specific about the features. In those cases, a model consists of a series of terms separated by + operators. The terms themselves consist of variable and factor names separated by :: operators. Such a term is interpreted as the interaction of all the variables and factors appearing in the term.
val f = "class" ~ "salary" + ("gender"::"race") + "age"
It is also possible to create a formula without a response variable, for example in unsupervised learning. In such cases, the formula is used only to generate features.
val f = ~ "salary" + "gender" + "age"
In addition to + and ::, a number of other operators are useful in model formulae. The && operator denotes factor crossing: a && b is interpreted as a + b + a::b. The - operator removes the specified terms.
"salary" ~ "." + ("a" && "b" && "c") - "d"
This formula includes all the cross terms of a, b, and c, and removes the term d. Here, . means all other variables in the data that are not otherwise in the formula. Most mathematical functions can be applied to terms. For example,
"salary" ~ "." + log("age") + "gender"
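Returning to the &&, . and - operators described above, the following is a minimal sketch in plain Scala of how such a formula could be resolved into a flat list of terms. It is not Smile's implementation: the object FormulaSketch, its methods, and the column names salary, a, b, c, d, e are all hypothetical and serve only to illustrate the rules stated in the text.

```scala
// Hypothetical illustration only; none of these names are part of Smile.
object FormulaSketch {
  // A term is the list of variables that interact in it;
  // a plain variable is a one-element list.
  type Term = List[String]

  // Factor crossing: all main effects first, then every higher-order
  // interaction, so cross("a", "b") yields a, b, a::b.
  def cross(factors: String*): List[Term] =
    (1 to factors.length).toList
      .flatMap(k => factors.toList.combinations(k).toList)

  // Resolve "response ~ . + added - removed" against the data columns.
  // "." expands to every column that is not the response and is not
  // mentioned anywhere else in the formula.
  def resolve(columns: List[String], response: String,
              added: List[Term], removed: List[Term]): List[Term] = {
    val mentioned = (added ++ removed).flatten.toSet + response
    val dot = columns.filterNot(mentioned).map(List(_))
    (dot ++ added).filterNot(t => removed.contains(t))
  }

  // Render terms in the notation used in the text.
  def show(terms: List[Term]): String =
    terms.map(_.mkString("::")).mkString(" + ")
}

// Mirrors "salary" ~ "." + ("a" && "b" && "c") - "d" on made-up columns.
val resolved = FormulaSketch.resolve(
  columns = List("salary", "a", "b", "c", "d", "e"),
  response = "salary",
  added = FormulaSketch.cross("a", "b", "c"),
  removed = List(List("d")))

println(FormulaSketch.show(resolved))
// prints: e + a + b + c + a::b + a::c + b::c + a::b::c
```

In this sketch d never appears in the result, both because it is mentioned in the formula (so . does not pick it up) and because it is explicitly removed.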
So far, we have defined several abstract formulas. We may bind a formula to a schema, which associates the formula's variables with the schema's columns. The output schema is inferred from the input data types.
val inputSchema = DataTypes.struct(
  new StructField("water", DataTypes.ByteType, new NominalScale("dry", "wet")),
  new StructField("sowing_density", DataTypes.ByteType, new NominalScale("low", "high")),
  new StructField("wind", DataTypes.ByteType, new NominalScale("weak", "strong"))
)
val formula = ~ "water" + "sowing_density" + "wind" + ("water" :: "sowing_density" :: "wind")
val outputSchema = formula.bind(inputSchema)
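One way to think about the three-way interaction term in this formula is as a new nominal variable whose levels are the Cartesian product of the levels of its factors. The sketch below enumerates those joint levels in plain Scala. The object InteractionSketch and the ":" separator are assumptions made for illustration; the names and encoding that Smile actually uses in the bound schema may differ.

```scala
// Hypothetical helper, not part of Smile.
object InteractionSketch {
  // Joint levels of an interaction: the Cartesian product of the
  // factors' levels, joined with ":" (an assumed separator).
  def levels(factors: List[List[String]]): List[String] =
    factors.foldLeft(List("")) { (acc, factorLevels) =>
      for (prefix <- acc; level <- factorLevels)
        yield if (prefix.isEmpty) level else prefix + ":" + level
    }
}

val jointLevels = InteractionSketch.levels(List(
  List("dry", "wet"),     // water
  List("low", "high"),    // sowing_density
  List("weak", "strong")  // wind
))

jointLevels.foreach(println)
```

With three two-level factors, the interaction has 2 × 2 × 2 = 8 joint levels, from dry:low:weak to wet:high:strong.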
Now let's apply a formula to a data frame. In a program or the Scala REPL, we should be able to use the $ method directly. However, the $ sign has a special use in the notebook, so we use the fully qualified name Terms.$ here.
val iris = read.arff("../data/weka/iris.arff")
val formula = log("petallength") ~ sin(exp("petalwidth")) + (Terms.$("sepalwidth") + Terms.$("sepallength")) + "." - "class"
formula.frame(iris)
We can also train a random forest model with a formula.
smile.classification.randomForest("class" ~ ".", iris)
Lastly, we apply a formula with factor crossing to the weather data.
val weather = read.arff("../data/weka/weather.nominal.arff")
val formula = "class" ~ "outlook" + ("temperature" && "humidity" && "windy")
formula.frame(weather)