Notebook

Performance Prediction: How Well Does The Model Work?

This chapter describes some of the many performance prediction methods that can be applied to machine learning and other data minining models.

Selecting Test Data

Let's assume we have chosen a performance metric that we will used to determine how well a given model performs. It is necessary to select a set of *test* data used as model input for measuring the selected metric. The *test* data should **not** contain any of the data used in training or optimizing the model.

"Big Data"

If the available data set is large enough, it can be divided into three sets:

training data - used by one or more learning schemes
validation data - used to optimize parameters
test data - used to calculate the performance metric

How large is "large enough"? Ideally, the training and test data sets should be representative of the data that will be observed in practice. This means that the data should span the possible cases and distribution of cases should be the same as that observed in practice. In general, it is impossible to know that the data is representative. If one assumes the data set is representative, the requirement for selecting training and test data is that it is stratified, that is that the distribution of cases is the same between the test and training data.

Cross Validation - "Normal Data"

Most problems of interest have limited data available for training and testing. In such cases, it is not desirable or feasible to divide the data into training and testing sets. The *standard* procedure in these cases is to use *stratified tenfold cross-validation performed ten times*. A single stratified *N* fold cross-validation involves first splitting the data into N approximately equal stratified sets. Then N-1 sets of the data are combined and used for training and the Nth data set is used for performance prediction. This is done in turn, so that an error estimate is yielded for each of the N data sets. The final error estimate is the average of the N estimates. In tenfold cross-validation, N is equal to 10. The standard method is to repeat the tenfold cross-validation ten times, that is divide the data into 10 sets 10 times, running the cross-validation each time. This helps minimize error prediction variation resulting from the random selection of the N=10 data folds.

Other Methods

Other testing methods include *Leave-One-Out Cross Validation*, essentially this n-fold cross-validation where n is the size of the complete data set, and *0.632 Bootstrap* validation, whihc uses sampling with replacement to form the training data set.

Performance Metrics

There are a variety of metrics that can be used to predict how well a given learning scheme will perform. The appropriate measure, naturally, depends on the problem.

Classification Success

If the problem is a classification problem, then one possible metric is the percentage of inputs correctly classified. See page 151 of [Data Mining](http://www.amazon.com/Data-Mining-Practical-Techniques-Management/dp/0123748569/ref=sr_1_1?s=books&ie=UTF8&qid=1356718659&sr=1-1&keywords=data+mining) for confidence interval discussion.

Predicting Probabilities

Often the outcome of a learning scheme is a probability measure. Here we consider the case where the learning scheme assigns a probability to *K* possible outcomes for a given instance. Assume these are output as a probability vector $\mathbf{p} = \left(p_1, p_2, \ldots, p_k \right)\hspace{2pt}$. Express the actual class for the instance as a vector $\mathbf{a} = \left(a_1, a_2, \ldots, a_k \right) \hspace{3pt}$ where $a_i$ equals 1 if *i* is the class the instance actually belongs to and 0 otherwise. A performance metric that may apply in such situations is the use of a loss function that is calculated for a given instance. If the test set contains several instances the loss function is summed over all of them. Two are described here.

Quadratic Loss Function

$$\sum_{j=1}^K \left(p_j-a_j\right)^2$$

Informational Loss Function

$$-\log_2 p_i$$

where i is the actual class for the instance. This function represents the information (in bits) required to express the actual class i with respect to the probability distribution $\mathbf{p}$, i.e. if one has the knowledge of the distribution, this is the number of bits required to communicate a specific class if done optimally.

In [ ]: