Formula

The models fit by functions such as ols and lasso are specified in a compact symbolic form. The tilde operator ~ is the basis for forming such models. An expression of the form y ~ model specifies that the response y is modelled by a linear predictor described symbolically by model.

In [ ]:

import $ivy.`com.github.haifengl::smile-scala:2.6.0`
import $ivy.`org.slf4j:slf4j-simple:1.7.30`  

import scala.language.postfixOps
import smile._
import smile.data._
import smile.data.formula._
import smile.data.formula.Terms.$
import smile.data.`type`._
import smile.data.measure._

In the simplest case, the right-hand side of ~ is empty, which means that all variables except the response are used as predictors.

In [ ]:

val f = "class" ~

Here, class is the response variable. When your data is already prepared, such a simple model is usually sufficient. For feature engineering and selection, however, you may want to be more specific about the features. In those cases, a model consists of a series of terms separated by + operators. The terms themselves consist of variable and factor names separated by :: operators. Such a term is interpreted as the interaction of all the variables and factors appearing in it.

In [ ]:

val f = "class" ~ "salary" + ("gender"::"race") + "age"
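To illustrate what an interaction term such as "gender"::"race" stands for, here is a minimal plain-Scala sketch that does not use Smile. The helper interactionLevels and the level names are hypothetical. The sketch assumes the usual statistical reading, in which an interaction of categorical variables behaves like a single factor with one level per combination; it is not a description of Smile's internals.

```scala
// Hypothetical helper, not part of Smile: lists the combined levels that
// an interaction of categorical factors ranges over.
// Assumes at least one factor is given.
def interactionLevels(factors: Seq[Seq[String]]): Seq[String] =
  factors.reduce { (acc, levels) =>
    // Pair every combination built so far with every level of the next factor.
    for (a <- acc; b <- levels) yield s"$a:$b"
  }

// Placeholder level names, for illustration only.
val gender = Seq("g1", "g2")
val race = Seq("r1", "r2", "r3")

val levels = interactionLevels(Seq(gender, race))
println(levels.mkString(", "))
```

With two levels for one factor and three for the other, the interaction ranges over six combined levels.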

A formula can also be created without a response variable, for example in unsupervised learning. In that case, the formula is used only to generate features.

In [ ]:

val f = ~ "salary" + "gender" + "age"

In addition to + and ::, a number of other operators are useful in model formulae. The && operator denotes factor crossing: a&&b is interpreted as a+b+a::b. The - operator removes the specified terms.

In [ ]:

"salary" ~ "." + ("a" && "b" && "c") - "d"

This formula includes all the cross terms of a, b, and c, and removes the term d. Here, . stands for all other variables in the data that do not otherwise appear in the formula. Most mathematical functions can be applied to terms. For example,

In [ ]:

"salary" ~ "." + log("age") + "gender"
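The operator semantics described above can be made concrete with a small plain-Scala sketch that is independent of Smile. The functions cross and resolve and the column list are hypothetical. The sketch assumes that crossing more than two factors expands to every interaction up to the full-order term, and that . excludes the response and every variable named elsewhere in the formula; Smile's actual resolution logic may differ in detail.

```scala
// Plain-Scala sketch, independent of Smile; all names here are hypothetical.

// a && b && ... expands to every interaction among the factors,
// from the single terms up to the full-order term.
def cross(factors: Seq[String]): Seq[String] =
  (1 to factors.length).flatMap { order =>
    factors.combinations(order).map(_.mkString("::"))
  }

// Resolves the right-hand side of a formula into a flat list of terms.
// "." becomes every column that is neither the response nor named
// elsewhere in the formula; removed terms are dropped at the end.
def resolve(columns: Seq[String], response: String,
            added: Seq[String], removed: Seq[String]): Seq[String] = {
  val named = (added ++ removed).flatMap(_.split("::")).toSet + response
  val expanded = added.flatMap {
    case "."  => columns.filterNot(c => named.contains(c))
    case term => Seq(term)
  }
  expanded.filterNot(t => removed.contains(t))
}

// Mirrors "salary" ~ "." + ("a" && "b" && "c") - "d" on a made-up column list.
val columns = Seq("salary", "a", "b", "c", "d", "age")
val terms = resolve(columns, "salary", "." +: cross(Seq("a", "b", "c")), Seq("d"))
println(terms.mkString(" + "))
```

On this made-up column list, . contributes only age, the crossing contributes the three single terms, the three pairwise interactions, and the three-way interaction, and d does not appear in the result.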

So far, we have defined several abstract formulas. We may bind a formula to a schema, which associates the formula's variables with the schema's columns. The output schema is inferred from the input data types.

In [ ]:

val inputSchema = DataTypes.struct(
  new StructField("water", DataTypes.ByteType, new NominalScale("dry", "wet")),
  new StructField("sowing_density", DataTypes.ByteType, new NominalScale("low", "high")),
  new StructField("wind", DataTypes.ByteType, new NominalScale("weak", "strong"))
)

val formula = ~ "water" + "sowing_density" + "wind" + ("water" :: "sowing_density" :: "wind")

val outputSchema = formula.bind(inputSchema)

Now let's apply a formula to a data frame. In a program or the Scala REPL, we should be able to use the $ method directly. In the notebook, however, the $ sign has a special meaning, so we use the fully qualified name Terms.$ here.

In [ ]:

val iris = read.arff("../data/weka/iris.arff")
val formula = log("petallength") ~ sin(exp("petalwidth")) + (Terms.$("sepalwidth") + Terms.$("sepallength")) + "." - "class"
formula.frame(iris)

We can also train a random forest model with a formula.

In [ ]:

smile.classification.randomForest("class" ~ ".", iris)

Lastly, we apply a formula with factor crossing to the weather data.

In [ ]:

val weather = read.arff("../data/weka/weather.nominal.arff")
// The response column in Weka's weather.nominal.arff is named "play", not "class".
val formula = "play" ~ "outlook" + ("temperature" && "humidity" && "windy")
formula.frame(weather)