import $ivy.`com.github.haifengl::smile-scala:3.0.2`
import $ivy.`org.slf4j:slf4j-simple:2.0.7`
import scala.language.postfixOps
import org.apache.commons.csv.CSVFormat
import java.nio.file.{Files, Paths}
import smile._
import smile.data._
def display(df: DataFrame, limit: Int = 20, truncate: Boolean = true) = {
  import xml.Utility.escape
  val header = df.names
  val rows = df.toStrings(limit, truncate)
  kernel.publish.html(
    s"""
      <table>
        <tr>${header.map(h => s"<th>${escape(h)}</th>").mkString}</tr>
        ${rows.map { row =>
          s"<tr>${row.map { c => s"<td>${escape(c)}</td>" }.mkString}</tr>"
        }.mkString}
      </table>
    """
  )
}
In this session, we will explore the functionality of DataFrame with the iris data. The iris data comes from the early statistical work of R.A. Fisher, who used three species of Iris flowers to develop linear discriminant analysis.
val iris = read.arff("../data/weka/iris.arff")
First, let's check the summary statistics of the numeric columns in the data.
iris.summary
We can get a row with the array syntax.
iris(0)
Selecting a row returns a Tuple, which is an immutable finite ordered list (sequence) of elements. Moreover, we can slice a DataFrame into a new one.
iris.slice(10, 20)
We can refer to a column by its name, which returns a vector.
iris("sepallength")
Similarly, we can select a few columns to create a new data frame.
iris.select("sepallength", "sepalwidth")
Advanced operations such as exists, forall, find, and filter are also supported. The predicates of these functions expect a Tuple.
iris.exists(_.getDouble(0) > 4.5)
In this example, we test whether any sample has sepallength > 4.5. Since sepallength is the first column, we use getDouble(0) to retrieve the value in the predicate lambda. Note that Tuple allows generic access through the get() method, which incurs boxing overhead for primitives. Therefore, Tuple also provides native primitive access methods getXXX(), where XXX is the type.
It is invalid to use the native primitive interface to retrieve a value that is null; instead, a user must check isNullAt before attempting to retrieve a value that might be null.
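The null-check rule can be illustrated without the library. The following is a minimal sketch in plain Scala: BoxedRow and safeDouble are hypothetical stand-ins written for this example, not Smile's Tuple implementation. They only mirror the method names mentioned above (get, isNullAt, getDouble) to show why the check has to come before the primitive read.

```scala
// Hypothetical stand-in for a row; this is NOT Smile's Tuple, only an illustration.
final class BoxedRow(values: Array[AnyRef]) {
  // Generic access: the caller receives a boxed object (or null).
  def get(i: Int): AnyRef = values(i)

  // True when the cell holds no value.
  def isNullAt(i: Int): Boolean = values(i) == null

  // Primitive access: unboxing a null cell throws a NullPointerException.
  def getDouble(i: Int): Double =
    values(i).asInstanceOf[java.lang.Double].doubleValue()
}

// Guarded read: check isNullAt first, then use the primitive accessor.
def safeDouble(row: BoxedRow, i: Int): Option[Double] =
  if (row.isNullAt(i)) None else Some(row.getDouble(i))

val complete = new BoxedRow(Array[AnyRef](Double.box(5.1), "Iris-setosa"))
val missing  = new BoxedRow(Array[AnyRef](null, "Iris-setosa"))

println(safeDouble(complete, 0))
println(safeDouble(missing, 0))
```

With a real DataFrame, the same pattern would amount to testing isNullAt on the row inside the predicate before calling getDouble.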
iris.forall(_.getDouble(0) < 10)
In contrast to exists, the function forall returns true only if all rows pass the test.
iris.find(_("class") == 1)
The find method returns the first row that passes the test, if one exists. Otherwise, it returns Optional.empty. Note that _("class") in the example returns an Integer object because nominal data are stored as integers (byte, short, or int, depending on the number of levels). To get the string representation of class, one can use the getString() method.
iris.find(_.getString("class").equals("Iris-versicolor"))
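Since find is described above as returning a Java Optional, it can be convenient to convert the result into a Scala Option before working with it. The sketch below assumes Scala 2.13 or later (for scala.jdk.OptionConverters) and uses hand-built Optional values in place of an actual find result, so the values hit and miss are illustrative only.

```scala
import java.util.Optional
import scala.jdk.OptionConverters._

// Stand-ins for what a search might return: one match, one empty result.
val hit: Optional[String]  = Optional.of("Iris-versicolor")
val miss: Optional[String] = Optional.empty[String]()

// toScala turns java.util.Optional into scala.Option,
// so map and getOrElse work in the usual way.
val label    = hit.toScala.map(_.stripPrefix("Iris-")).getOrElse("none")
val fallback = miss.toScala.getOrElse("none")

println(label)
println(fallback)
```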
Let's combine what we have just learned into an example of filter.
iris.filter { row => row.getDouble(1) > 3 && row("class") != 0 }
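To make the integer comparison on class easier to read, here is a small sketch of how nominal codes relate to labels. It assumes that codes follow the order in which the levels are declared in the ARFF header (setosa, versicolor, virginica); that ordering is an assumption of this example, and NominalRow, levels, and the sample rows are hypothetical stand-ins rather than Smile types.

```scala
// Assumed level order, mirroring the class attribute declaration in iris.arff.
val levels = Array("Iris-setosa", "Iris-versicolor", "Iris-virginica")

// Hypothetical row holding only the two fields used by the filter above.
case class NominalRow(sepalwidth: Double, classCode: Int)

val rows = Seq(
  NominalRow(3.5, 0),
  NominalRow(3.2, 1),
  NominalRow(2.8, 1),
  NominalRow(3.3, 2)
)

// Same predicate shape as the filter example: wide sepals, and not the level coded 0.
val kept = rows.filter(r => r.sepalwidth > 3 && r.classCode != 0)

// Translate the integer codes back into their string labels.
val labels = kept.map(r => levels(r.classCode))

println(labels)
```

Under this assumption, comparing against 0 excludes Iris-setosa, and comparing against 1 selects Iris-versicolor.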
For data wrangling, the most important functions of DataFrame are map and groupBy.
iris.map { row =>
val x = new Array[Double](6)
for (i <- 0 until 4) x(i) = row.getDouble(i)
x(4) = x(0) * x(1)
x(5) = x(2) * x(3)
x
}
iris.groupBy(row => row.getString("class"))
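The shape of these two operations can be seen on an ordinary Scala collection. The sketch below is plain Scala and does not use Smile: Flower and sample are hypothetical names, and the three rows are only a small illustrative sample. It applies the same feature expansion as the map example (four measurements plus two products) and then groups rows by their class label.

```scala
// Hypothetical record type standing in for a DataFrame row.
case class Flower(features: Array[Double], species: String)

val sample = Seq(
  Flower(Array(5.1, 3.5, 1.4, 0.2), "Iris-setosa"),
  Flower(Array(4.9, 3.0, 1.4, 0.2), "Iris-setosa"),
  Flower(Array(7.0, 3.2, 4.7, 1.4), "Iris-versicolor")
)

// map: turn each row into a 6-element feature vector.
val expanded = sample.map { f =>
  val x = new Array[Double](6)
  for (i <- 0 until 4) x(i) = f.features(i)
  x(4) = x(0) * x(1) // sepal length * sepal width
  x(5) = x(2) * x(3) // petal length * petal width
  x
}

// groupBy: partition the rows by label, then count each group.
val bySpecies = sample.groupBy(_.species)
val counts = bySpecies.map { case (species, group) => species -> group.size }

println(counts)
```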
Besides numeric and nominal values, many other data types are also supported in DataFrame.
val strings = read.arff("../data/weka/string.arff")
strings.filter(_.getString(0).startsWith("AS"))
val dates = read.arff("../data/weka/date.arff")