import addutils.toc ; addutils.toc.js(ipy_notebook=True)
Data in scikit-learn, with very few exceptions, is assumed to be stored as a two-dimensional array of size [n_samples, n_features].

Most machine learning algorithms implemented in scikit-learn expect data to be stored in a two-dimensional array or matrix. The arrays can be either numpy arrays, pandas DataFrames, or in some cases scipy.sparse matrices. The size of the array is expected to be [n_samples, n_features].

The number of features must be fixed in advance. However, it can be very high dimensional (e.g. millions of features), with most of them being zero for a given sample. This is a case where scipy.sparse matrices can be useful, in that they are much more memory-efficient than numpy arrays.
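As a minimal sketch of the memory saving (using a made-up random matrix, not one of the datasets used below), a mostly-zero feature matrix stored as a scipy.sparse CSR matrix only keeps the non-zero entries:

import numpy as np
from scipy import sparse

# made-up feature matrix: 1000 samples, 10000 features, ~1% of the entries non-zero
X_sparse = sparse.random(1000, 10000, density=0.01, format='csr', random_state=0)
X_dense = X_sparse.toarray()                 # the same data as a dense numpy array

print(X_dense.nbytes)                        # dense storage: 1000*10000*8 bytes = 80 MB
print(X_sparse.data.nbytes                   # CSR stores only the non-zero values,
      + X_sparse.indices.nbytes              # their column indices,
      + X_sparse.indptr.nbytes)              # and one row pointer per row: a small fraction of the dense size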
import scipy.io
import numpy as np
import pandas as pd
import bokeh.plotting as bk   # bk is used by the bk.figure/bk.show calls below
from sklearn import datasets
from addutils import css_notebook
css_notebook()
A typical scikit-learn dataset is a dictionary-like object that holds all the data and metadata:

the .data field holds a 2D array of shape [n_samples, n_features];
the .target field holds a 1D array of length n_samples.

Scikit-learn makes available a host of datasets for testing learning algorithms:
sklearn.datasets.load_*
sklearn.datasets.fetch_*
sklearn.datasets.make_*
Try by yourself:
datasets.load_<TAB>
datasets.fetch_<TAB>
datasets.make_<TAB>
#datasets.make_
Features in the Iris dataset: sepal length (cm), sepal width (cm), petal length (cm), petal width (cm).
Target classes to predict: setosa, versicolor, virginica.
d = datasets.load_iris()
Try by yourself one of the following commands, where 'd' is the variable containing the dataset:
print(d.keys())          # Structure of the contained data
print(d.DESCR)           # A complete description of the dataset
print(d.data.shape)      # [n_samples, n_features]
print(d.target.shape)    # [n_samples,]
print(d.feature_names)
datasets.get_data_home() # This is where the datasets are stored
print(d.keys())
print(d.target_names)
print(d.feature_names)
dict_keys(['target_names', 'DESCR', 'data', 'feature_names', 'target'])
['setosa' 'versicolor' 'virginica']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
The Digits dataset contains 1797 samples made of 64 features: each feature represents the grey-scale value of one pixel of an 8x8 digit image:
from bokeh.palettes import Greys9
from bokeh.models.ranges import Range1d
import addutils.palette as pal
import addutils.imagegrid as ig
digits = datasets.load_digits()
# plot the digits: each image is 8x8 pixels
images = [ digits.images[i][::-1, :] for i in range(40) ]
txt = [ str(i) for i in range(10) ] * 4
fig = ig.imagegrid_figure(figure_plot_width=760, figure_plot_height=100,
figure_title=None,
images=images, grid_size=(20, 2),
text=txt, text_font_size='9pt', text_color='red',
palette=Greys9[::-1], padding=0.2)
bk.show(fig)
import seaborn as sns
cat_colors = list(map(pal.to_hex, sns.color_palette('Paired', 7)))
data, color_indices = datasets.make_blobs(n_samples=2000, n_features=2, centers=7,
center_box=(-4.0, 6.0), cluster_std=0.5)
fig = bk.figure(title=None)
fig.circle(data[:,0], data[:,1],
line_color='black', line_alpha=0.5, size=8,
fill_color=pal.linear_map(color_indices, cat_colors,
low=0, high=6))
bk.show(fig)
Although it is not required, in many cases it can be easier to manage the data pre-processing with pandas:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
pd.options.display.notebook_repr_html = True
df = pd.DataFrame(d.data, columns=d.feature_names)
df['y'] = d.target
df.head(3)
 | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | y
---|---|---|---|---|---
0 | 5.1 | 3.5 | 1.4 | 0.2 | 0
1 | 4.9 | 3.0 | 1.4 | 0.2 | 0
2 | 4.7 | 3.2 | 1.3 | 0.2 | 0
pd.set_option('display.precision', 3)
df[df.columns[:4]].describe().iloc[[1, 2, 3, 7]]   # mean, std, min, max
 | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm)
---|---|---|---|---
mean | 5.84 | 3.05 | 3.76 | 1.20
std | 0.83 | 0.43 | 1.76 | 0.76
min | 4.30 | 2.00 | 1.00 | 0.10
max | 7.90 | 4.40 | 6.90 | 2.50
x_feat, y_feat = 2, 3 # Choose the features to plot (0-3)
fig = bk.figure(title=None)
colors = ['#006CD1', '#26D100', '#D10000']
color_series = [ colors[i] for i in df['y'] ]
fig.scatter(df[df.columns[x_feat]], df[df.columns[y_feat]],
line_color='black', fill_color=color_series,
radius=0.1)
fig.xaxis.axis_label = df.columns[x_feat]
fig.yaxis.axis_label = df.columns[y_feat]
bk.show(fig)
Pandas contains some utility and plotting functions that can help in previewing the data. In this case we use the pandas scatter_matrix plot to plot the four features one against the other:

TODO - Check if it's possible to add class colors to the following scatter matrix with the property group_col.
%matplotlib inline
import matplotlib.pyplot as plt
pd.plotting.scatter_matrix(df[df.columns[:4]], figsize=(10, 10),
c=df['y'], diagonal='hist', marker='o')
plt.show()
scipy.io.loadmat supports v4 (Level 1.0), v6, and v7 to 7.2 mat-files. To read MATLAB 7.3 format mat-files an HDF5 Python library is required. Please check the scipy documentation for more information.
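As a hedged sketch (assuming the h5py library is installed and using a hypothetical file name), a v7.3 mat-file, which is actually an HDF5 file, could be read like this:

import h5py

# hypothetical MATLAB v7.3 file: each variable is a top-level HDF5 dataset
with h5py.File('example_data/matlab_test_data_v73.mat', 'r') as f:
    print(list(f.keys()))            # variable names stored in the file
    X_v73 = np.array(f['X']).T       # MATLAB stores arrays column-major: transpose on load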
The data can be generated with the following MATLAB code:
% Generate Regression Test Data
X = [1 2 3
4 5 6
7 8 9
0 1 2] + 0.1;
y = sum(X,2);
feat_names = strvcat('Feature One', 'Feature Two', 'Feature Three');
save ('matlab_test_data_01', 'X','y', 'feat_names')
mat_data = scipy.io.loadmat('example_data/matlab_test_data_01.mat')
The variable names included in the .mat file are keys of the mat_data dictionary. Moreover, the key '__header__' contains the mat-file information.

Here we load the two variables into pandas DataFrames:
mat_data.keys()
dict_keys(['__globals__', '__header__', 'X', 'y', '__version__', 'feat_names'])
In the following code the .strip() method is used to remove the trailing white spaces used by MATLAB to make all the feature names the same length:
X = pd.DataFrame(mat_data['X'], columns=[s.strip() for s in list(mat_data['feat_names'])])
y = pd.DataFrame(mat_data['y'], columns=['measured'])
print(X, '\n\n', y)
   Feature One  Feature Two  Feature Three
0          1.1          2.1            3.1
1          4.1          5.1            6.1
2          7.1          8.1            9.1
3          0.1          1.1            2.1 

   measured
0       6.3
1      15.3
2      24.3
3       3.3
Standardization of datasets is a common requirement for many machine learning estimators: they might behave badly if the individual features do not look more or less like standard normally distributed data (Gaussian with zero mean and unit variance). In practice we often ignore the shape of the distribution and just transform the data to center it and scale it by dividing by the standard deviation.

If the input variables are combined via a distance function (such as the Euclidean distance), standardizing the inputs can be crucial: if one input has a range of 0 to 1 while another input has a range of 0 to 1,000,000, then the contribution of the first input to the distance will be swamped by the second.
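A quick sketch of this effect on two made-up samples: before standardization the Euclidean distance is driven almost entirely by the large-scale feature, afterwards both features contribute:

from sklearn import preprocessing

A = np.array([[0.1, 200000.],        # feature 0 ranges ~[0, 1], feature 1 ~[0, 1000000]
              [0.9, 210000.]])
print(np.linalg.norm(A[0] - A[1]))   # ~10000: dominated by the second feature

A_std = preprocessing.scale(A)                 # standardize each column
print(np.linalg.norm(A_std[0] - A_std[1]))     # ~2.83: both features now count equally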
It is sometimes not enough to center and scale the features independently, since a downstream model can further make some assumptions on the linear independence of the features. To address this issue you can use sklearn.decomposition.PCA or sklearn.decomposition.RandomizedPCA with whiten=True to further remove the linear correlation across features.
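A minimal sketch of whitening on made-up correlated data (in recent scikit-learn versions RandomizedPCA has been merged into PCA, which accepts svd_solver='randomized'):

from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
x1 = rng.normal(size=500)
X_corr = np.column_stack([x1, 2 * x1 + 0.1 * rng.normal(size=500)])   # two correlated features

X_white = PCA(whiten=True).fit_transform(X_corr)
print(np.cov(X_corr, rowvar=False).round(2))    # large off-diagonal terms: correlated features
print(np.cov(X_white, rowvar=False).round(2))   # ~identity: decorrelated, unit variance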
scale and StandardScaler work out-of-the-box with 1d arrays.

preprocessing.scale scales each column of the feature matrix to mean=0 and std=1. This is also called "STANDARDIZATION".
from sklearn import preprocessing
X = np.array([[ 10., 1., 0.],
[ 20., 0., 2.]])
X_sc_1 = preprocessing.scale(X)
print(X)
print('\nScaled Values: ')
print(X_sc_1)
[[ 10.   1.   0.]
 [ 20.   0.   2.]]

Scaled Values: 
[[-1.  1. -1.]
 [ 1. -1.  1.]]
preprocessing.StandardScaler keeps the values of .mean_ and .scale_ (called .std_ in older scikit-learn versions) so that transform and inverse_transform can be applied later. Mean and std scaling can be controlled independently:
scaler = preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)
scaler.fit(X)
X_sc_2 = scaler.transform(X)
print('\nScaled Values: ')
print(X_sc_2)
print('\nStandard Scaler Mean: ', scaler.mean_)
print('Standard Scaler Std: ', scaler.scale_)
print('\nInverse Transform: ')
print(scaler.inverse_transform(X_sc_2))
Scaled Values: 
[[-1.  1. -1.]
 [ 1. -1.  1.]]

Standard Scaler Mean:  [ 15.    0.5   1. ]
Standard Scaler Std:  [ 5.   0.5  1. ]

Inverse Transform: 
[[ 10.   1.   0.]
 [ 20.   0.   2.]]
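As a small sketch on the same X, the two scalings can indeed be switched off separately, for example keeping only the centering:

center_only = preprocessing.StandardScaler(with_mean=True, with_std=False)
print(center_only.fit_transform(X))   # mean removed, variance left untouched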
X = np.array([[ 10., 1., 0.],
[ 20., 0., 2.]])
df = pd.DataFrame(X)
df
 | 0 | 1 | 2
---|---|---|---
0 | 10 | 1 | 0
1 | 20 | 0 | 2
df.describe().iloc[[1, 2]]   # mean and std
 | 0 | 1 | 2
---|---|---|---
mean | 15.00 | 0.50 | 1.00
std | 7.07 | 0.71 | 1.41
scaler = preprocessing.StandardScaler(copy=False, with_mean=True, with_std=True).fit(df)
print('\nStandard Scaler Mean: ', scaler.mean_)
print('Standard Scaler Std: ', scaler.scale_)
df_sc_2 = pd.DataFrame(scaler.transform(df))
df_sc_2
Standard Scaler Mean:  [ 15.    0.5   1. ]
Standard Scaler Std:  [ 5.   0.5  1. ]
 | 0 | 1 | 2
---|---|---|---
0 | -1 | 1 | -1
1 | 1 | -1 | 1
df_inv = pd.DataFrame(scaler.inverse_transform(df_sc_2))
df_inv
 | 0 | 1 | 2
---|---|---|---
0 | 10 | 1 | 0
1 | 20 | 0 | 2
preprocessing.normalize scales each vector to unit norm (norm=1). By default axis=1, so the rows (samples) are normalized; if it is necessary to normalize the columns instead of the rows, axis must be set to 0.
X = np.array([[ 10., 1., 0.],
[ 20., 0., 2.],
[ 20., 0., 0.]])
X_nrm = preprocessing.normalize(X, norm='l2', axis=0)
print('\nNormalized Values: ')
print(X_nrm)
Normalized Values: 
[[ 0.33333333  1.          0.        ]
 [ 0.66666667  0.          1.        ]
 [ 0.66666667  0.          0.        ]]
Feature extraction (module sklearn.feature_extraction) consists of transforming arbitrary data, such as text or images, into numerical features usable for machine learning.

Derived features are obtained by pre-processing the data to generate features that are somehow more informative. They may be linear or nonlinear combinations of the original features (such as in polynomial regression), or some more sophisticated transform of the features. The latter is often used in image processing.
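As a minimal sketch of derived features built as polynomial combinations, sklearn.preprocessing.PolynomialFeatures expands two made-up feature columns into all their terms up to degree 2:

from sklearn.preprocessing import PolynomialFeatures

X_small = np.array([[2., 3.],
                    [1., 5.]])
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X_small))    # columns: 1, x1, x2, x1^2, x1*x2, x2^2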
For example, scikit-image provides a variety of feature
extractors designed for image data: see the skimage.feature
submodule.
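For instance (a sketch assuming scikit-image is installed), the HOG descriptor from skimage.feature turns one of the 8x8 digit images seen above into a fixed-length numerical feature vector:

from skimage.feature import hog

digit_img = datasets.load_digits().images[0]     # one 8x8 grey-scale digit image
hog_features = hog(digit_img, orientations=8,
                   pixels_per_cell=(4, 4), cells_per_block=(1, 1))
print(hog_features.shape)                        # a flat vector of gradient-orientation features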
feature_extraction.DictVectorizer implements what is called one-of-K or "one-hot" coding for categorical (aka nominal, discrete) features.

In the following code the method vec.fit_transform(t) returns a sparse matrix that is transformed to dense by .toarray().
from sklearn import feature_extraction
t = [{'city': 'Dubai', 'temperature': 33.},
{'city': 'London', 'temperature': 12.},
{'city': 'San Fransisco', 'temperature': 18.},]
vec = feature_extraction.DictVectorizer()
t_vec = vec.fit_transform(t)
df = pd.DataFrame(t_vec.toarray(), columns=vec.get_feature_names())
df
 | city=Dubai | city=London | city=San Fransisco | temperature
---|---|---|---|---
0 | 1 | 0 | 0 | 33
1 | 0 | 1 | 0 | 12
2 | 0 | 0 | 1 | 18
We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
CountVectorizer
implements both tokenization and occurrence counting in a single class:
feature_extraction.text.CountVectorizer?
corpus = ['This is the x-first document.', 'This is the second second document.',
'And the third third third one.', 'Is this the first document?']
vec = feature_extraction.text.CountVectorizer()
X = vec.fit_transform(corpus)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
df
 | and | document | first | is | one | second | the | third | this
---|---|---|---|---|---|---|---|---|---
0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1
1 | 0 | 1 | 0 | 1 | 0 | 2 | 1 | 0 | 1
2 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 3 | 0
3 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1
The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does this step can be requested explicitly:
analyzer = vec.build_analyzer()
analyzer("This is the x-first document.")
['this', 'is', 'the', 'first', 'document']
Words that were not seen in the training corpus will be completely ignored in future calls to the transform method:
print(vec.transform(['Something completely new.']).toarray())
[[0 0 0 0 0 0 0 0 0]]
Note that with the previous encoding we lose the information that the last document is an interrogative form, because each word is encoded individually. To preserve some local ordering information we can extract 2-grams of words in addition to the 1-grams (the words themselves):
vec_bi = feature_extraction.text.CountVectorizer(min_df=1, ngram_range=(1, 2))
X_bi = vec_bi.fit_transform(corpus)
pd.set_option('display.max_columns', None)
df_bi = pd.DataFrame(X_bi.toarray(), columns=vec_bi.get_feature_names())
df_bi
 | and | and the | document | first | first document | is | is the | is this | one | second | second document | second second | the | the first | the second | the third | third | third one | third third | this | this is | this the
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0
1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0
2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 3 | 1 | 2 | 0 | 0 | 0
3 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1
In a large text corpus some words will be very frequent (e.g. "the", "a", "is" in English) and hence carry very little meaningful information. For this reason it is very common to use the tf–idf (term-frequency / inverse document-frequency) transform: each row is normalized to have unit Euclidean norm. The weights of each feature computed by the fit method call are stored in a model attribute:
corpus = [ 'aa aa cc dd', 'aa cc',
'aa bb cc ff', "aa aa"]
tfidf = feature_extraction.text.TfidfVectorizer()
X = tfidf.fit_transform(corpus)
print(X.toarray())
[[ 0.66052121  0.          0.40395613  0.63287533  0.        ]
 [ 0.63295194  0.          0.77419109  0.          0.        ]
 [ 0.31878155  0.61087812  0.38991559  0.          0.61087812]
 [ 1.          0.          0.          0.          0.        ]]
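The learned weights can be inspected on the fitted vectorizer, for example (a short check on the small corpus above):

print(tfidf.vocabulary_)    # term -> column index mapping
print(tfidf.idf_)           # one inverse-document-frequency weight per term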
Check the scikit-learn documentation for additional information on feature extraction.
Visit www.add-for.com for more tutorials and updates.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.