A feature is an individual measurable property of a phenomenon being observed. Features are also called explanatory variables, independent variables, predictors, regressors, etc. Any attribute could be a feature, but choosing informative, discriminating and independent features is a crucial step for effective algorithms in machine learning. Features are usually numeric and a set of numeric features can be conveniently described by a feature vector. Structural features such as strings, sequences and graphs are also used in areas such as natural language processing, computational biology, etc.
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive. It requires the experimentation of multiple possibilities and the combination of automated techniques with the intuition and knowledge of the domain expert.
Generally speaking, there are two major types of attributes:
The data values are non-numeric categories. Examples: Blood type, Gender.
The data values are counts or numerical measurements. A quantitative variable can be either discrete such as the number of students receiving an 'A' in a class, or continuous such as GPA, salary and so on.
Another way of classifying data is by the measurement scales. In statistics, there are four generally used measurement scales:
Data values are non-numeric group labels. For example, Gender variable can be defined as male = 0 and female =1.
Data values are categorical and may be ranked in some numerically meaningful way. For example, strongly disagree to strong agree may be defined as 1 to 5.
-- Interval data: Data values are ranged in a real interval, which can be as large as from negative infinity to positive infinity. The difference between two values are meaningful, however, the ratio of two interval data is not meaningful. For example temperature, IQ. -- Ratio data: Both difference and ratio of two values are meaningful. For example, salary, weight. Many machine learning algorithms can only handle numeric attributes while a few such as decision trees can process nominal attribute directly. Date attribute is useful in plotting. With some feature engineering, values like day of week can be used as nominal attribute. String attribute could be used in text mining and natural language processing.
Smile provides a variety of data loading utilities for popular data formats, such as Parquet, Avro, Arrow, SAS7BDAT, Weka's ARFF files, LibSVM's file format, delimited text files, JSON, and binary sparse data. We will demonstrate these parsers with the sample data in the data directory. In Scala API, the parsing functions are in the smile.read
object. Firstly, let's import Smile's Scala API.
import $ivy.`com.github.haifengl::smile-scala:3.1.0`
import $ivy.`org.slf4j:slf4j-simple:2.0.13`
import scala.language.postfixOps
import org.apache.commons.csv.CSVFormat
import java.nio.file.{Files, Paths}
import smile._
import smile.data._
Most parsers return a DataFrame
object. By default, almond displays it as a String by calling toString
. To show a DataFrame
in HTML, let's define the below helper function.
def display(df: DataFrame, limit: Int = 20, truncate: Boolean = true): Unit = {
import xml.Utility.escape
val header = df.names
val rows = df.toStrings(limit, truncate)
kernel.publish.html(
s"""
<table>
<tr>${header.map(h => s"<th>${escape(h)}</th>").mkString}</tr>
${rows.map { row =>
s"<tr>${row.map{c => s"<td>${escape(c)}</td>" }.mkString}</tr>"
}.mkString}
</table>
"""
)
}
Apache Parquet is a columnar storage format that supports nested data structures. It uses the record shredding and assembly algorithm described in the Dremel paper. To read Parquet files, its library and related Hadoop libraries should be provided. Since we don't ship these libraries with Smile, we import them explicitly in the below.
import $ivy.`org.apache.parquet:parquet-hadoop:1.13.1`
import $ivy.`org.apache.hadoop:hadoop-common:3.3.5`
val userdata1 = read.parquet("../data/kylo/userdata1.parquet")
display(userdata1)
Apache Avro is a data serialization system. Avro provides rich data structures, a compact, fast, binary data format, a container file, to store persistent data, and remote procedure call (RPC). Avro relies on schemas. When Avro data is stored in a file, its schema is stored with it. Avro schemas are defined with JSON.
import $ivy.`org.apache.avro:avro:1.11.1`
import $ivy.`org.apache.hadoop:hadoop-common:3.3.5`
val userdata1 = read.avro("../data/kylo/userdata1.avro", "../data/avro/userdata.avsc")
display(userdata1)
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.
Feather uses the Apache Arrow columnar memory specification to represent binary data on disk. This makes read and write operations very fast. This is particularly important for encoding null/NA values and variable-length types like UTF8 strings. Feather is a part of the broader Apache Arrow project. Feather defines its own simplified schemas and metadata for on-disk representation.
In the below example, we first load the data from a SQLite database, write it into Feather file, and then read it back.
import $ivy.`org.apache.arrow:arrow-vector:12.0.1`
import $ivy.`org.apache.arrow:arrow-memory:12.0.1`
import $ivy.`org.apache.arrow:arrow-memory-netty:12.0.1`
import $ivy.`org.xerial:sqlite-jdbc:3.42.0.0`
// load SQLite JDBC driver
Class.forName("org.sqlite.JDBC")
val url = String.format("jdbc:sqlite:%s", "../data/sqlite/chinook.db")
val sql = """select e.firstname as 'Employee First', e.lastname as 'Employee Last', c.firstname as 'Customer First', c.lastname as 'Customer Last', c.country, i.total
from employees as e
join customers as c on e.employeeid = c.supportrepid
join invoices as i on c.customerid = i.customerid"""
val conn = java.sql.DriverManager.getConnection(url)
val stmt = conn.createStatement
val rs = stmt.executeQuery(sql)
// Create a DataFrame from JDBC ResultSet
val chinook = read.jdbc(rs)
val temp = java.io.File.createTempFile("chinook", "arrow")
val path = temp.toPath()
write.arrow(chinook, path)
val chinook2 = read.arrow(path)
SAS7BDAT is currently the main format used for storing SAS datasets across all platforms.
val airline = read.sas("../data/sas/airline.sas7bdat")
Weka ARFF (attribute relation file format) is an ASCII text file format that is essentially a CSV file with a header that describes the meta data. ARFF was developed for use in the Weka machine learning software.
A dataset is firstly described, beginning with the name of the dataset (or the relation in ARFF terminology). Each of the variables (or attribute in ARFF terminology) used to describe the observations is then identified, together with their data type, each definition on a single line. The actual observations are then listed, each on a single line, with fields separated by commas, much like a CSV file.
Missing values in an ARFF dataset are identified using the question mark '?'. Comments can be included in the file, introduced at the beginning of a line with a '%', whereby the remainder of the line is ignored.
A significant advantage of the ARFF data file over the CSV data file is the meta data information. Also, the ability to include comments ensure we can record extra information about the data set, including how it was derived, where it came from, and how it might be cited.
val cpu = read.arff("https://github.com/haifengl/smile/blob/master/shell/src/universal/data/weka/cpu.arff?raw=true")
So far, we have read local data. The above example shows that Smile can remote data (http, ftp, hdfs, etc.) seamlessly.
The delimited text files are widely used in machine learning research community. The comma-separated values (CSV) file is a special case. Smile provides flexible parser for them.
def csv(file: String,
delimiter: Char = ',',
header: Boolean = true,
quote: Char = '"',
escape: Char = '\\',
schema: StructType = null): DataFrame
val zip = read.csv("../data/usps/zip.train", delimiter = ' ', header = false)
display(zip)
Note that Smile infers the schema from the data. If the inferred schema is not correct, the user can also provided a customized schema.
LibSVM is a very fast and popular library for support vector machines. LibSVM uses a sparse format where zero values do not need to be stored. Each line of a libsvm file is in the format:
<label> <index1>:<value1> <index2>:<value2> ...
where <label>
is the target value of the training data. For classification, it should be an integer which identifies a class (multi-class classification is supported). For regression, it's any real number. For one-class SVM, it's not used so can be any number. <index>
is an integer starting from 1, and <value>
is a real number. The indices must be in an ascending order. The labels in the testing data file are only used to calculate accuracy or error. If they are unknown, just fill this column with a number.
val glass = read.libsvm("../data/libsvm/glass.txt")
read.libsvm
returns a Dataset[Instance[SparseArray]]
object.
The function SparseDataset.from(Path path, int arrayIndexOrigin)
can read sparse data in coordinate triple tuple list format. The parameter arrayIndexOrigin is the starting index of array. By default, it is 0 as in C/C++ and Java. But it could be 1 to parse data produced by other programming language such as Fortran.
The coordinate file stores a list of (row, column, value) tuples:
instanceID attributeID value
instanceID attributeID value
instanceID attributeID value
instanceID attributeID value
...
instanceID attributeID value
instanceID attributeID value
instanceID attributeID value
Ideally, the entries are sorted (by row index, then column index) to improve random access times. This format is good for incremental matrix construction.
Optionally, there may be 2 header lines
D // The number of instances
W // The number of attributes
or 3 header lines
D // The number of instances
W // The number of attributes
N // The total number of nonzero items in the dataset.
These header lines will be ignored.
The sample data data/sparse/kos.txt
is in the coordinate format.
val kos = SparseDataset.from(java.nio.file.Paths.get("../data/sparse/kos.txt"), 1)
In Harwell-Boeing column-compressed sparse matrix file, nonzero values are stored in an array (top-to-bottom, then left-to-right-bottom). The row indices corresponding to the values are also stored. Besides, a list of pointers are indexes where each column starts. The class SparseMatrix
supports two formats for Harwell-Boeing files. The simple one is organized as follows:
The first line contains three integers, which are the number of rows, the number of columns, and the number of nonzero entries in the matrix.
Following the first line, there are m + 1 integers that are the indices of columns, where m is the number of columns. Then there are n integers that are the row indices of nonzero entries, where n is the number of nonzero entries. Finally, there are n float numbers that are the values of nonzero entries.
The function SparseMatrix.text(Path path)
can read this simple format. In the directory data/matrix
, there are several sample files in the Harwell-Boeing format.
import smile.math.matrix.SparseMatrix
val blocks = SparseMatrix.text(Paths.get("../data/matrix/08blocks.txt"))
The second format is more complicated and powerful, called Harwell-Boeing Exchange Format. For details, see http://people.sc.fsu.edu/~jburkardt/data/hb/hb.html. Note that our implementation supports only real-valued matrix and we ignore the optional right hand side vectors. This format is supported by the function SparseMatrix.harwell(Path path)
.