import addutils.toc ; addutils.toc.js(ipy_notebook=True)
Data in scikit-learn, with very few exceptions, is assumed to be stored as a two-dimensional array of size [n_samples, n_features].

Most machine learning algorithms implemented in scikit-learn expect data to be stored in a two-dimensional array or matrix. The arrays can be either numpy arrays, pandas DataFrames, or in some cases scipy.sparse matrices. The size of the array is expected to be [n_samples, n_features].

The number of features must be fixed in advance. However, it can be very high dimensional (e.g. millions of features), with most of them being zero for a given sample. This is a case where scipy.sparse matrices can be useful, in that they are much more memory-efficient than numpy arrays.
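As a minimal sketch of the memory saving (using a made-up random matrix, not one of the datasets used below), a mostly-zero feature matrix stored as a scipy.sparse CSR matrix only keeps the non-zero entries:

import numpy as np
from scipy import sparse

# made-up feature matrix: 1000 samples, 10000 features, ~1% of the entries non-zero
X_sparse = sparse.random(1000, 10000, density=0.01, format='csr', random_state=0)
X_dense = X_sparse.toarray()                 # the same data as a dense numpy array

print(X_dense.nbytes)                        # dense storage: 1000*10000*8 bytes = 80 MB
print(X_sparse.data.nbytes                   # CSR stores only the non-zero values,
      + X_sparse.indices.nbytes              # their column indices,
      + X_sparse.indptr.nbytes)              # and one row pointer per row: a small fraction of the dense size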
import scipy.io
import numpy as np
import pandas as pd
import bokeh.plotting as bk   # bk is used by the bk.figure/bk.show calls below
from sklearn import datasets
from addutils import css_notebook
css_notebook()
A typical scikit-learn dataset is a dictionary-like object that holds all the data and metadata:

the .data field holds a 2D array of shape [n_samples, n_features];
the .target field holds a 1D array of length n_samples.

Scikit-learn makes available a host of datasets for testing learning algorithms:
sklearn.datasets.load_*
sklearn.datasets.fetch_*
sklearn.datasets.make_*
Try by yourself:
datasets.load_<TAB>
datasets.fetch_<TAB>
datasets.make_<TAB>
#datasets.make_
Features in the Iris dataset: sepal length (cm), sepal width (cm), petal length (cm), petal width (cm).
Target classes to predict: setosa, versicolor, virginica.
d = datasets.load_iris()
Try by yourself one of the following commands, where 'd' is the variable containing the dataset:
print(d.keys())          # Structure of the contained data
print(d.DESCR)           # A complete description of the dataset
print(d.data.shape)      # [n_samples, n_features]
print(d.target.shape)    # [n_samples,]
print(d.feature_names)
datasets.get_data_home() # This is where the datasets are stored
print(d.keys())
print(d.target_names)
print(d.feature_names)
dict_keys(['target_names', 'DESCR', 'data', 'feature_names', 'target'])
['setosa' 'versicolor' 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
The Digits dataset contains 1797 samples made of 64 features: each feature represents the grey-scale value of one pixel of an 8x8 digit image:
from bokeh.palettes import Greys9
from bokeh.models.ranges import Range1d
import addutils.palette as pal
import addutils.imagegrid as ig
digits = datasets.load_digits()
# plot the digits: each image is 8x8 pixels
images = [ digits.images[i][::-1, :] for i in range(40) ]
txt = [ str(i) for i in range(10) ] * 4
fig = ig.imagegrid_figure(figure_plot_width=760, figure_plot_height=100,
figure_title=None,
images=images, grid_size=(20, 2),
text=txt, text_font_size='9pt', text_color='red',
palette=Greys9[::-1], padding=0.2)
bk.show(fig)
import seaborn as sns
cat_colors = list(map(pal.to_hex, sns.color_palette('Paired', 7)))
data, color_indices = datasets.make_blobs(n_samples=2000, n_features=2, centers=7,
center_box=(-4.0, 6.0), cluster_std=0.5)
fig = bk.figure(title=None)
fig.circle(data[:,0], data[:,1],
line_color='black', line_alpha=0.5, size=8,
fill_color=pal.linear_map(color_indices, cat_colors,
low=0, high=6))
bk.show(fig)
Although it is not required, in many cases it can be easier to manage the data pre-processing with pandas:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
pd.options.display.notebook_repr_html = True
df = pd.DataFrame(d.data, columns=d.feature_names)
df['y'] = d.target
df.head(3)
 | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | y
---|---|---|---|---|---
0 | 5.1 | 3.5 | 1.4 | 0.2 | 0
1 | 4.9 | 3.0 | 1.4 | 0.2 | 0
2 | 4.7 | 3.2 | 1.3 | 0.2 | 0
pd.set_option('display.precision', 3)
df[df.columns[:4]].describe().iloc[[1, 2, 3, 7]]   # mean, std, min, max
 | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm)
---|---|---|---|---
mean | 5.84 | 3.05 | 3.76 | 1.20
std | 0.83 | 0.43 | 1.76 | 0.76
min | 4.30 | 2.00 | 1.00 | 0.10
max | 7.90 | 4.40 | 6.90 | 2.50
x_feat, y_feat = 2, 3 # Choose the features to plot (0-3)
fig = bk.figure(title=None)
colors = ['#006CD1', '#26D100', '#D10000']
color_series = [ colors[i] for i in df['y'] ]
fig.scatter(df[df.columns[x_feat]], df[df.columns[y_feat]],
line_color='black', fill_color=color_series,
radius=0.1)
fig.xaxis.axis_label = df.columns[x_feat]
fig.yaxis.axis_label = df.columns[y_feat]
bk.show(fig)
Pandas contains some utility and plotting functions that can help in previewing the data. In this case we use the pandas scatter_matrix plot to plot the four features one against the other:

TODO - Check if it's possible to add class colors to the following scatter matrix with the property group_col.
%matplotlib inline
import matplotlib.pyplot as plt
pd.plotting.scatter_matrix(df[df.columns[:4]], figsize=(10, 10),
c=df['y'], diagonal='hist', marker='o')
plt.show()
scipy.io.loadmat supports v4 (Level 1.0), v6, and v7 to 7.2 mat-files. To read MATLAB 7.3 format mat-files an HDF5 Python library is required. Please check the scipy documentation for more information.
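As a hedged sketch (assuming the h5py library is installed and using a hypothetical file name), a v7.3 mat-file, which is actually an HDF5 file, could be read like this:

import h5py

# hypothetical MATLAB v7.3 file: each variable is a top-level HDF5 dataset
with h5py.File('example_data/matlab_test_data_v73.mat', 'r') as f:
    print(list(f.keys()))            # variable names stored in the file
    X_v73 = np.array(f['X']).T       # MATLAB stores arrays column-major: transpose on load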
The data can be generated with the following MATLAB code:
% Generate Regression Test Data
X = [1 2 3
4 5 6
7 8 9
0 1 2] + 0.1;
y = sum(X,2);
feat_names = strvcat('Feature One', 'Feature Two', 'Feature Three');
save ('matlab_test_data_01', 'X','y', 'feat_names')
mat_data = scipy.io.loadmat('example_data/matlab_test_data_01.mat')
The variable names included in the .mat file are keys of the mat_data dictionary. Moreover, the key '__header__' contains the mat-file information.

Here we load the two variables into pandas DataFrames:
mat_data.keys()
dict_keys(['__globals__', '__header__', 'X', 'y', '__version__', 'feat_names'])
In the following code the .strip() method is used to remove the trailing white spaces used by MATLAB to make all the feature names the same length:
X = pd.DataFrame(mat_data['X'], columns=[s.strip() for s in list(mat_data['feat_names'])])
y = pd.DataFrame(mat_data['y'], columns=['measured'])
print(X, '\n\n', y)
   Feature One  Feature Two  Feature Three
0          1.1          2.1            3.1
1          4.1          5.1            6.1
2          7.1          8.1            9.1
3          0.1          1.1            2.1 

   measured
0       6.3
1      15.3
2      24.3
3       3.3
Standardization of datasets is a common requirement for many machine learning estimators: they might behave badly if the individual features do not look more or less like standard normally distributed data (Gaussian with zero mean and unit variance). In practice we often ignore the shape of the distribution and just transform the data to center it and scale it by dividing by the standard deviation.

If the input variables are combined via a distance function (such as the Euclidean distance), standardizing the inputs can be crucial: if one input has a range of 0 to 1 while another input has a range of 0 to 1,000,000, then the contribution of the first input to the distance will be swamped by the second.
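A quick sketch of this effect on two made-up samples: before standardization the Euclidean distance is driven almost entirely by the large-scale feature, afterwards both features contribute:

from sklearn import preprocessing

A = np.array([[0.1, 200000.],        # feature 0 ranges ~[0, 1], feature 1 ~[0, 1000000]
              [0.9, 210000.]])
print(np.linalg.norm(A[0] - A[1]))   # ~10000: dominated by the second feature

A_std = preprocessing.scale(A)                 # standardize each column
print(np.linalg.norm(A_std[0] - A_std[1]))     # ~2.83: both features now count equally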
It is sometimes not enough to center and scale the features independently, since a downstream model can further make some assumptions on the linear independence of the features. To address this issue you can use sklearn.decomposition.PCA or sklearn.decomposition.RandomizedPCA with whiten=True to further remove the linear correlation across features.
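A minimal sketch of whitening on made-up correlated data (in recent scikit-learn versions RandomizedPCA has been merged into PCA, which accepts svd_solver='randomized'):

from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
x1 = rng.normal(size=500)
X_corr = np.column_stack([x1, 2 * x1 + 0.1 * rng.normal(size=500)])   # two correlated features

X_white = PCA(whiten=True).fit_transform(X_corr)
print(np.cov(X_corr, rowvar=False).round(2))    # large off-diagonal terms: correlated features
print(np.cov(X_white, rowvar=False).round(2))   # ~identity: decorrelated, unit variance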
scale and StandardScaler work out-of-the-box with 1d arrays.

preprocessing.scale scales each column of the feature matrix to mean=0 and std=1. This is also called "STANDARDIZATION".
from sklearn import preprocessing
X = np.array([[ 10., 1., 0.],
[ 20., 0., 2.]])
X_sc_1 = preprocessing.scale(X)
print(X)
print('\nScaled Values: ')
print(X_sc_1)
[[ 10.   1.   0.]
 [ 20.   0.   2.]]

Scaled Values: 
[[-1.  1. -1.]
 [ 1. -1.  1.]]
preprocessing.StandardScaler keeps the values of .mean_ and .scale_ (called .std_ in older scikit-learn versions) so that transform and inverse_transform can be applied later. Mean and std scaling can be controlled independently:
scaler = preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)
scaler.fit(X)
X_sc_2 = scaler.transform(X)
print('\nScaled Values: ')
print(X_sc_2)
print('\nStandard Scaler Mean: ', scaler.mean_)
print('Standard Scaler Std: ', scaler.scale_)
print('\nInverse Transform: ')
print(scaler.inverse_transform(X_sc_2))
Scaled Values: 
[[-1.  1. -1.]
 [ 1. -1.  1.]]

Standard Scaler Mean:  [ 15.    0.5   1. ]
Standard Scaler Std:  [ 5.   0.5  1. ]

Inverse Transform: 
[[ 10.   1.   0.]
 [ 20.   0.   2.]]
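As a small sketch on the same X, the two scalings can indeed be switched off separately, for example keeping only the centering:

center_only = preprocessing.StandardScaler(with_mean=True, with_std=False)
print(center_only.fit_transform(X))   # mean removed, variance left untouched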
X = np.array([[ 10., 1., 0.],
[ 20., 0., 2.]])
df = pd.DataFrame(X)
df
 | 0 | 1 | 2
---|---|---|---
0 | 10 | 1 | 0
1 | 20 | 0 | 2
df.describe().iloc[[1, 2]]   # mean and std
 | 0 | 1 | 2
---|---|---|---
mean | 15.00 | 0.50 | 1.00
std | 7.07 | 0.71 | 1.41
scaler = preprocessing.StandardScaler(copy=False, with_mean=True, with_std=True).fit(df)
print('\nStandard Scaler Mean: ', scaler.mean_)
print('Standard Scaler Std: ', scaler.scale_)
df_sc_2 = pd.DataFrame(scaler.transform(df))
df_sc_2
Standard Scaler Mean:  [ 15.    0.5   1. ]
Standard Scaler Std:  [ 5.   0.5  1. ]
 | 0 | 1 | 2
---|---|---|---
0 | -1 | 1 | -1
1 | 1 | -1 | 1
df_inv = pd.DataFrame(scaler.inverse_transform(df_sc_2))
df_inv
 | 0 | 1 | 2
---|---|---|---
0 | 10 | 1 | 0
1 | 20 | 0 | 2
preprocessing.normalize scales each vector to unit norm (norm=1). By default axis=1, so the rows (samples) are normalized; if it is necessary to normalize the columns instead of the rows, axis must be set to 0.
X = np.array([[ 10., 1., 0.],
[ 20., 0., 2.],
[ 20., 0., 0.]])
X_nrm = preprocessing.normalize(X, norm='l2', axis=0)
print('\nNormalized Values: ')
print(X_nrm)
Normalized Values: 
[[ 0.33333333  1.          0.        ]
 [ 0.66666667  0.          1.        ]
 [ 0.66666667  0.          0.        ]]
Feature extraction (module sklearn.feature_extraction) consists of transforming arbitrary data, such as text or images, into numerical features usable for machine learning.

Derived features are obtained by pre-processing the data to generate features that are somehow more informative. They may be linear or nonlinear combinations of the original features (such as in polynomial regression), or some more sophisticated transform of the features. The latter is often used in image processing.
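As a minimal sketch of derived features built as polynomial combinations, sklearn.preprocessing.PolynomialFeatures expands two made-up feature columns into all their terms up to degree 2:

from sklearn.preprocessing import PolynomialFeatures

X_small = np.array([[2., 3.],
                    [1., 5.]])
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X_small))    # columns: 1, x1, x2, x1^2, x1*x2, x2^2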
For example, scikit-image provides a variety of feature
extractors designed for image data: see the skimage.feature
submodule.
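For instance (a sketch assuming scikit-image is installed), the HOG descriptor from skimage.feature turns one of the 8x8 digit images seen above into a fixed-length numerical feature vector:

from skimage.feature import hog

digit_img = datasets.load_digits().images[0]     # one 8x8 grey-scale digit image
hog_features = hog(digit_img, orientations=8,
                   pixels_per_cell=(4, 4), cells_per_block=(1, 1))
print(hog_features.shape)                        # a flat vector of gradient-orientation features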
feature_extraction.DictVectorizer implements what is called one-of-K or "one-hot" coding for categorical (aka nominal, discrete) features.

In the following code the method vec.fit_transform(t) returns a sparse matrix that is transformed to dense by .toarray().
from sklearn import feature_extraction
t = [{'city': 'Dubai', 'temperature': 33.},
{'city': 'London', 'temperature': 12.},
{'city': 'San Fransisco', 'temperature': 18.},]
vec = feature_extraction.DictVectorizer()
t_vec = vec.fit_transform(t)
df = pd.DataFrame(t_vec.toarray(), columns=vec.get_feature_names())
df
 | city=Dubai | city=London | city=San Fransisco | temperature
---|---|---|---|---
0 | 1 | 0 | 0 | 33
1 | 0 | 1 | 0 | 12
2 | 0 | 0 | 1 | 18
We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
CountVectorizer
implements both tokenization and occurrence counting in a single class:
feature_extraction.text.CountVectorizer?
corpus = ['This is the x-first document.', 'This is the second second document.',
'And the third third third one.', 'Is this the first document?']
vec = feature_extraction.text.CountVectorizer()
X = vec.fit_transform(corpus)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
df
 | and | document | first | is | one | second | the | third | this
---|---|---|---|---|---|---|---|---|---
0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1
1 | 0 | 1 | 0 | 1 | 0 | 2 | 1 | 0 | 1
2 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 3 | 0
3 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1
The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does this step can be requested explicitly:
analyzer = vec.build_analyzer()
analyzer("This is the x-first document.")
['this', 'is', 'the', 'first', 'document']
Words that were not seen in the training corpus will be completely ignored in future calls to the transform method:
print(vec.transform(['Something completely new.']).toarray())
[[0 0 0 0 0 0 0 0 0]]
Note that with the previous encoding we lose the information that the last document is an interrogative form, because each word is encoded individually. To preserve some local ordering information we can extract 2-grams of words in addition to the 1-grams (the words themselves):
vec_bi = feature_extraction.text.CountVectorizer(min_df=1, ngram_range=(1, 2))
X_bi = vec_bi.fit_transform(corpus)
pd.set_option('display.max_columns', None)
df_bi = pd.DataFrame(X_bi.toarray(), columns=vec_bi.get_feature_names())
df_bi
 | and | and the | document | first | first document | is | is the | is this | one | second | second document | second second | the | the first | the second | the third | third | third one | third third | this | this is | this the
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0
1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0
2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 3 | 1 | 2 | 0 | 0 | 0
3 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1
In a large text corpus some words will be very frequent (e.g. "the", "a", "is" in English) and hence carry very little meaningful information. For this reason it is very common to use the tf–idf (term-frequency / inverse document-frequency) transform: each row is normalized to have unit Euclidean norm. The weights of each feature computed by the fit method call are stored in a model attribute:
corpus = [ 'aa aa cc dd', 'aa cc',
'aa bb cc ff', "aa aa"]
tfidf = feature_extraction.text.TfidfVectorizer()
X = tfidf.fit_transform(corpus)
print(X.toarray())
[[ 0.66052121  0.          0.40395613  0.63287533  0.        ]
 [ 0.63295194  0.          0.77419109  0.          0.        ]
 [ 0.31878155  0.61087812  0.38991559  0.          0.61087812]
 [ 1.          0.          0.          0.          0.        ]]
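The learned weights can be inspected on the fitted vectorizer, for example (a short check on the small corpus above):

print(tfidf.vocabulary_)    # term -> column index mapping
print(tfidf.idf_)           # one inverse-document-frequency weight per term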
Check the scikit-learn documentation for additional information on feature extraction.
Visit www.add-for.com for more tutorials and updates.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.