Chapter 3 - Modeling and prediction¶

In [2]:

%pylab inline

Populating the interactive namespace from numpy and matplotlib

The Titanic dataset¶

We use the Pandas library to import the Titanic survival dataset.

In [27]:

import pandas
data = pandas.read_csv("data/titanic.csv")
data[:5]

Out[27]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35	0	373450	8.0500	NaN	S

In [30]:

# We make a 80/20% train/test split of the data
data_train = data[:int(0.8*len(data))]
data_test = data[int(0.8*len(data)):]

Preparing the data¶

In [46]:

# The categorical-to-numerical function from chapter 2
# Changed to automatically add column names
def cat_to_num(data):
    categories = unique(data)
    features = {}
    for cat in categories:
        binary = (data == cat)
        features["%s=%s" % (data.name, cat)] = binary.astype("int")
    return pandas.DataFrame(features)

In [57]:

def prepare_data(data):
    """Takes a dataframe of raw data and returns ML model features
    """
    
    # Initially, we build a model only on the available numerical values
    features = data.drop(["PassengerId", "Survived", "Fare", "Name", "Sex", "Ticket", "Cabin", "Embarked"], axis=1)
    
    # Setting missing age values to -1
    features["Age"] = data["Age"].fillna(-1)
    
    # Adding the sqrt of the fare feature
    features["sqrt_Fare"] = sqrt(data["Fare"])
    
    # Adding gender categorical value
    features = features.join( cat_to_num(data['Sex']) )
    
    # Adding Embarked categorical value
    features = features.join( cat_to_num(data['Embarked']) )
    
    return features

Building a logistic regression classifier with Scikit-Learn¶

In [58]:

#cat_to_num(data['Sex'])
features = prepare_data(data_train)
features[:5]

Out[58]:

	Pclass	Age	SibSp	sqrt_Fare	Sex=female	Sex=male	Embarked=C	Embarked=S
0	3	22	1	2.692582	0	1	0	1
1	1	38	1	8.442944	1	0	1	0
2	3	26	0	2.815138	1	0	0	1
3	1	35	1	7.286975	1	0	0	1
4	3	35	0	2.837252	0	1	0	1

In [59]:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(features, data_train["Survived"])

Out[59]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)

In [60]:

# Make predictions
model.predict(prepare_data(data_test))

Out[60]:

array([0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0])

In [64]:

# The accuracy of the model on the test data
# (this will be introduced in more details in chapter 4)
model.score(prepare_data(data_test), data_test["Survived"])

Out[64]:

0.86033519553072624

Non-linear model with Support Vector Machines¶

In [62]:

from sklearn.svm import SVC
model = SVC()
model.fit(features, data_train["Survived"])

Out[62]:

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [63]:

model.score(prepare_data(data_test), data_test["Survived"])

Out[63]:

0.86033519553072624

Classification with multiple classes: hand-written digits¶

We use the popular non-linear multi-class K-nearest neighbor algorithm to predict hand-written digits from the MNIST dataset.

In [65]:

mnist = pandas.read_csv("data/mnist_small.csv")
mnist_train = mnist[:int(0.8*len(mnist))]
mnist_test = mnist[int(0.8*len(mnist)):]

In [66]:

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(mnist_train.drop("label", axis=1), mnist_train['label'])

Out[66]:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=10, p=2, weights='uniform')

In [71]:

preds = knn.predict_proba(mnist_test.drop("label", axis=1))
pandas.DataFrame(preds[:5], index=["Digit %d"%(i+1) for i in range(5)])

Out[71]:

	0	3	4	5	6	7	8	9
Digit 1	0.0	0.0	0.0	1	0.0	0.0	0.0	0.0
Digit 2	0.0	1.0	0.0	0	0.0	0.0	0.0	0.0
Digit 3	0.3	0.0	0.0	0	0.6	0.0	0.1	0.0
Digit 4	0.0	0.1	0.0	0	0.0	0.5	0.0	0.4
Digit 5	0.0	0.0	0.7	0	0.0	0.0	0.0	0.3

In [73]:

knn.score(mnist_test.drop("label", axis=1), mnist_test['label'])

Out[73]:

0.81999999999999995

Predicting numerical values with a regression model¶

We use the the Linear Regression algorithm to predict miles-per-gallon of various automobiles.

In [79]:

auto = pandas.read_csv("data/auto-mpg.csv")

# Convert origin to categorical variable
auto = auto.join(cat_to_num(auto['origin']))
auto = auto.drop('origin', axis=1)

# Split in train/test set
auto_train = auto[:int(0.8*len(auto))]
auto_test = auto[int(0.8*len(auto)):]

auto[:5]

Out[79]:

	mpg	cylinders	displacement	horsepower	weight	acceleration	modelyear	origin=1
0	18	8	307	130	3504	12.0	70	1
1	15	8	350	165	3693	11.5	70	1
2	18	8	318	150	3436	11.0	70	1
3	16	8	304	150	3433	12.0	70	1
4	17	8	302	140	3449	10.5	70	1

In [80]:

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(auto_train.drop('mpg', axis=1), auto_train["mpg"])

Out[80]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [83]:

pred_mpg = reg.predict(auto_test.drop('mpg',axis=1))

In [89]:

plot(auto_test.mpg, pred_mpg, 'o')
x = linspace(10,40,5)
plot(x, x, '-');