DataFrame

Many Smile algorithms take plain double[] arrays as input, but Smile also provides the encapsulation class DataFrame. As shown in the Data notebook, most Smile data parsers return a DataFrame object. DataFrames are immutable and contain a fixed number of named columns.

In [ ]:

import $ivy.`com.github.haifengl::smile-scala:3.0.2`
import $ivy.`org.slf4j:slf4j-simple:2.0.7`  

import scala.language.postfixOps
import org.apache.commons.csv.CSVFormat
import java.nio.file.{Files, Paths}
import smile._
import smile.data._

def display(df: DataFrame, limit: Int = 20, truncate: Boolean = true) = {
  import xml.Utility.escape
  val header = df.names
  val rows = df.toStrings(limit, truncate)
  kernel.publish.html(
    s"""
      <table>
        <tr>${header.map(h => s"<th>${escape(h)}</th>").mkString}</tr>
        ${rows.map { row =>
          s"<tr>${row.map{c => s"<td>${escape(c)}</td>" }.mkString}</tr>"
        }.mkString}
      </table>
    """
  )
}

In this session, we will explore the functionality of DataFrame with the iris data. The iris data comes from the early statistical work of R.A. Fisher, who used three species of Iris flowers to develop linear discriminant analysis.

In [ ]:

val iris = read.arff("../data/weka/iris.arff")

First, let's check out the statistical summary of the numeric columns in the data.

In [ ]:

iris.summary

We can get a row with the array syntax.

In [ ]:

iris(0)

Selecting a row returns a Tuple, which is an immutable finite ordered list (sequence) of elements. We can also slice a DataFrame into a new one.

In [ ]:

iris.slice(10, 20)

We can refer to a column by its name, which returns a vector.

In [ ]:

iris("sepallength")

Similarly, we can select a few columns to create a new data frame.

In [ ]:

iris.select("sepallength", "sepalwidth")

Advanced operations such as exists, forall, find, and filter are also supported. The predicates of these functions expect a Tuple.

In [ ]:

iris.exists(_.getDouble(0) > 4.5)

In this example, we test whether there is any sample with sepallength > 4.5. Since sepallength is the first column, we use getDouble(0) to retrieve the value in the predicate lambda. Note that Tuple allows generic access through the get() method, which incurs boxing overhead for primitives. Therefore, Tuple also provides native primitive access methods getXXX(), where XXX is the type.

It is invalid to use the native primitive interface to retrieve a value that is null. Instead, check isNullAt before attempting to retrieve a value that might be null.
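The check-then-read pattern can be wrapped once so that predicates stay short. The following is a minimal sketch: `nullSafe` is a hypothetical helper, not part of Smile, and the demonstration uses plain values as stand-ins so it runs without a DataFrame.

```scala
// Hypothetical helper (not part of Smile): `read` is a by-name parameter,
// so it is evaluated only when `isNull` is false.
def nullSafe[A](isNull: Boolean)(read: => A): Option[A] =
  if (isNull) None else Some(read)

// Intended use with a Tuple named `row`, for example inside a predicate:
//   nullSafe(row.isNullAt(0))(row.getDouble(0)).exists(_ > 4.5)

// Stand-in demonstration without a DataFrame.
val present = nullSafe(false)(4.9)
// The read expression would throw, but it is never evaluated here.
val missing = nullSafe[Double](true)(sys.error("never evaluated"))
```

Returning an Option makes the missing case explicit at the call site, at the cost of boxing the value, so the raw isNullAt check is likely preferable in tight loops.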

In [ ]:

iris.forall(_.getDouble(0) < 10)

In contrast to exists, the function forall returns true only if all rows pass the test.

In [ ]:

iris.find(_("class") == 1)

The find method returns the first row that passes the test, if one exists. Otherwise, it returns Optional.empty. Note that _("class") in the example returns an Integer object because nominal data are stored as integers (byte, short, or int, depending on the number of levels). To get the string representation of class, one can use the getString() method.
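Since the result is a Java Optional, it can be convenient to convert it to a Scala Option before pattern matching or chaining. The sketch below assumes Scala 2.13 or later (for `scala.jdk.OptionConverters`) and uses hand-made Optional values as stand-ins for the result of find.

```scala
import java.util.Optional
import scala.jdk.OptionConverters._

// Stand-ins for a find result: one hit and one miss.
val hit: Optional[String] = Optional.of("Iris-versicolor")
val miss: Optional[String] = Optional.empty[String]()

// toScala converts java.util.Optional to scala.Option.
val hitOpt: Option[String] = hit.toScala
val missOpt: Option[String] = miss.toScala

// A default value handles the case where nothing matched.
val label: String = missOpt.getOrElse("no match")
```

The same `.toScala` call should apply to the value returned by find, after which the usual Option methods such as map, foreach, and getOrElse are available.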

In [ ]:

iris.find(_.getString("class").equals("Iris-versicolor"))

Let's combine what we just learned into an example of filter.

In [ ]:

iris.filter { row => row.getDouble(1) > 3 && row("class") != 0 }

For data wrangling, the most important functions of DataFrame are map and groupBy.

In [ ]:

iris.map { row =>
  val x = new Array[Double](6)
  for (i <- 0 until 4) x(i) = row.getDouble(i)
  x(4) = x(0) * x(1)
  x(5) = x(2) * x(3)
  x
}
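The map cell above builds six values per row: the four measurements followed by two products. As a sketch, the same transformation can be written as a standalone function on an array, which makes it easy to check on a hand-made input. The name `withProducts` is hypothetical.

```scala
// Hypothetical helper: keeps the first four measurements and appends
// the products x(0) * x(1) and x(2) * x(3).
def withProducts(x: Array[Double]): Array[Double] = {
  require(x.length >= 4, "expected at least four measurements")
  x.take(4) ++ Array(x(0) * x(1), x(2) * x(3))
}

// Hand-made input chosen so the products are exact in floating point.
val features = withProducts(Array(2.0, 3.0, 4.0, 0.5))
```

Here `features` holds the original four values followed by 6.0 and 2.0.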

In [ ]:

iris.groupBy(row => row.getString("class"))
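To illustrate the idea behind groupBy, the sketch below uses plain Scala collections as a stand-in: rows are partitioned by a key, and each group can then be aggregated separately. It assumes that DataFrame's groupBy follows the same key-to-group idea. The sample values are toy numbers, not taken from the iris data.

```scala
// Toy (class, sepallength) pairs standing in for DataFrame rows.
val samples = Seq(
  ("Iris-setosa", 5.0),
  ("Iris-setosa", 4.0),
  ("Iris-versicolor", 7.0),
  ("Iris-versicolor", 6.0)
)

// Partition the rows by class label: Map[String, Seq[(String, Double)]].
val groups = samples.groupBy(_._1)

// Aggregate each group, here with the mean of the second field.
val meanByClass: Map[String, Double] = groups.map { case (label, rows) =>
  label -> (rows.map(_._2).sum / rows.size)
}
```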

Besides numeric and nominal values, many other data types are also supported in DataFrame.

In [ ]:

val strings = read.arff("../data/weka/string.arff")
strings.filter(_.getString(0).startsWith("AS"))

In [ ]:

val dates = read.arff("../data/weka/date.arff")