# imports for the tutorial
import numpy as np
import pandas as pd
import re # regex
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
%matplotlib notebook
20 Questions - one player is chosen to be the answerer. That player chooses a subject (an object) but does not reveal it to the others. All the other players are questioners; they take turns asking questions that can be answered with a simple "Yes" or "No."
In this tutorial we deal with classification tasks, where the inputs can be continuous features. But what if we want to perform a regression task, that is, output a value rather than a class?
Decision Trees are also capable of performing regression. We will not go into the details, but the idea is the same:
instead of predicting a class, the tree predicts a value, which is the mean of the target values of the training samples in that branch.
The CART algorithm is almost the same, except that instead of minimizing the impurity, it minimizes the MSE.
You can read more about it in Scikit-Learn's documentation, under DecisionTreeRegressor.
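As a quick illustration, here is a minimal sketch of a regression tree (the data below is synthetic and only for demonstration; it is not part of the exercises that follow):
# minimal sketch: a regression tree fit on synthetic data (values are made up for illustration)
from sklearn.tree import DecisionTreeRegressor

X_demo = np.arange(0, 5, 0.1).reshape(-1, 1)                           # one continuous feature
y_demo = np.sin(X_demo).ravel() + 0.1 * np.random.randn(len(X_demo))   # noisy continuous target
tree_reg = DecisionTreeRegressor(max_depth=3)   # CART here minimizes the MSE instead of the impurity
tree_reg.fit(X_demo, y_demo)
print(tree_reg.predict([[2.5]]))                # the prediction is the mean target value in that leaf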
To estimate class probabilities, call tree_clf.predict_proba([list of samples]). If the classifier is built with criterion="entropy", the entropy will be used to build the tree instead of the Gini impurity.

As of September 2012, 800 extrasolar planets have been identified in our galaxy. Supersecret surveying spaceships sent to all these planets have established whether they are habitable for humans or not, but sending a spaceship to each planet is expensive. In this problem, you will come up with decision trees to predict if a planet is habitable based only on features observable using telescopes.
In the following table you are given the data from all 800 planets surveyed so far. The features observed by telescope are Size (“Big” or “Small”), and Orbit (“Near” or “Far”). Each row indicates the values of the features and habitability, and how many times that set of values was observed. So, for example, there were 20 “Big” planets “Near” their star that were habitable.
Size | Orbit | Habitable | Count
---|---|---|---
Big | Near | Yes | 20
Big | Far | Yes | 170
Small | Near | Yes | 139
Small | Far | Yes | 45
Big | Near | No | 130
Big | Far | No | 30
Small | Near | No | 11
Small | Far | No | 255
Derive and draw the decision tree on this data (use the maximum information gain (entropy) criterion for splits, don’t do any pruning).
Let's first calculate the entropy of starting state: $$ H(Habitable) = -\frac{20 + 170 + 139+45}{800} \log_2 \frac{20 + 170 + 139+45}{800} -\frac{130 + 30 + 11+255}{800} \log_2 \frac{130 + 30 + 11+255}{800}$$
$$ = -\frac{374}{800} \log_2 \frac{374}{800} -\frac{426}{800} \log_2 \frac{426}{800} = 0.5128 + 0.4841 = 0.9969 \approx 1 $$
Let's calculate the Information Gain from each split:
$$ H(Habitable|Size) = \frac{190+160}{800}H(Size=Big) + \frac{184+266}{800}H(Size=Small) $$
$$ = \frac{350}{800} \cdot (-\frac{190}{350}\log_2 (\frac{190}{350}) -\frac{160}{350}\log_2 (\frac{160}{350})) + \frac{450}{800}\cdot(-\frac{184}{450}\log_2 (\frac{184}{450}) -\frac{266}{450}\log_2 (\frac{266}{450}))$$
$$ = 0.984 \rightarrow IG(Habitable|Size) = 1 - 0.984 = 0.016 $$
$$ H(Habitable|Orbit) = \frac{159+141}{800}H(Orbit=Near) + \frac{215+285}{800}H(Orbit=Far) $$
$$ = \frac{300}{800} \cdot (-\frac{159}{300}\log_2 (\frac{159}{300}) -\frac{141}{300}\log_2 (\frac{141}{300})) + \frac{500}{800}\cdot(-\frac{215}{500}\log_2 (\frac{215}{500}) -\frac{285}{500}\log_2 (\frac{285}{500})) $$
$$ = 0.9901 \rightarrow IG(Habitable|Orbit) = 1 - 0.9901 = 0.0099 $$
The information gain from splitting on Size is larger; thus, the first split will be based on the Size feature.
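These numbers are easy to check numerically. Here is a small sketch (using only the counts from the table above) that recomputes the entropies and information gains:
# sanity-check the entropy / information gain calculation with numpy
def entropy(counts):
    p = np.array(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

h_root = entropy([374, 426])                                                   # habitable vs. not habitable
h_size = (350 / 800) * entropy([190, 160]) + (450 / 800) * entropy([184, 266])
h_orbit = (300 / 800) * entropy([159, 141]) + (500 / 800) * entropy([215, 285])
# the gains come out slightly smaller than the rounded values above because
# H(Habitable) is ~0.997 rather than exactly 1, but Size still wins
print("IG(Size)  = {:.4f}".format(h_root - h_size))    # ~0.013
print("IG(Orbit) = {:.4f}".format(h_root - h_orbit))   # ~0.007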
This section's material goes beyond the scope of this course, but it is important to understand why we can't really build an optimal tree and what the computational costs of training the tree and of making predictions with it are.
Scikit-Learn can presort the data (set presort=True) to speed up training on small datasets, but this slows down training considerably for larger training sets.

To restrict the tree's freedom during training, DecisionTreeClassifier exposes several regularization hyperparameters: min_samples_leaf, min_samples_split, max_depth (the default is None, which means unlimited), max_leaf_nodes, min_weight_fraction_leaf, and max_features (more in the Scikit-Learn docs). A short example follows below.
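As a rough sketch of how these hyperparameters are passed (the specific values below are arbitrary and not tuned for any dataset):
# illustrative only: arbitrary regularization values, not tuned for any particular dataset
regularized_tree = DecisionTreeClassifier(
    max_depth=4,            # limit the depth of the tree
    min_samples_split=10,   # a node needs at least 10 samples to be split
    min_samples_leaf=5,     # every leaf must contain at least 5 samples
    max_leaf_nodes=20,      # cap the total number of leaves
    max_features=None,      # consider all features at each split (the default)
)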
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
We will build a decision tree to decide what sorts of people were likely to survive.
To read more about the dataset and its fields, see the dataset's online documentation.
# let's load the titanic dataset and separate it into train and test sets
dataset = pd.read_csv('./datasets/titanic_dataset.csv')
# count the rows in the dataset and use 80% of them for training
number_of_rows = len(dataset)
num_train = int(0.8 * number_of_rows)
# reminder, the data looks like this
dataset.sample(10)
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
---|---|---|---|---|---|---|---|---|---|---|---|---|
201 | 202 | 0 | 3 | Sage, Mr. Frederick | male | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
333 | 334 | 0 | 3 | Vander Planke, Mr. Leo Edmondus | male | 16.00 | 2 | 0 | 345764 | 18.0000 | NaN | S |
639 | 640 | 0 | 3 | Thorneycroft, Mr. Percival | male | NaN | 1 | 0 | 376564 | 16.1000 | NaN | S |
803 | 804 | 1 | 3 | Thomas, Master. Assad Alexander | male | 0.42 | 0 | 1 | 2625 | 8.5167 | NaN | C |
871 | 872 | 1 | 1 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.00 | 1 | 1 | 11751 | 52.5542 | D35 | S |
571 | 572 | 1 | 1 | Appleton, Mrs. Edward Dale (Charlotte Lamson) | female | 53.00 | 2 | 0 | 11769 | 51.4792 | C101 | S |
99 | 100 | 0 | 2 | Kantor, Mr. Sinai | male | 34.00 | 1 | 0 | 244367 | 26.0000 | NaN | S |
783 | 784 | 0 | 3 | Johnston, Mr. Andrew G | male | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
73 | 74 | 0 | 3 | Chronopoulos, Mr. Apostolos | male | 26.00 | 1 | 0 | 2680 | 14.4542 | NaN | C |
653 | 654 | 1 | 3 | O'Leary, Miss. Hanora "Norah" | female | NaN | 0 | 0 | 330919 | 7.8292 | NaN | Q |
# preprocessing
original_ds = dataset.copy() # using 'copy()' clones the dataset, creating a different object with the same values
# Create new feature Has_Cabin: 1 if a cabin is recorded, 0 otherwise (a missing cabin is NaN, which is a float)
dataset['Has_Cabin'] = dataset["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
# Create new feature FamilySize as a combination of SibSp and Parch
dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
# Create new feature IsAlone from FamilySize
dataset['IsAlone'] = 0
dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
# Fill all NULLs in the Embarked column with the most common port, 'S'
dataset['Embarked'] = dataset['Embarked'].fillna('S')
# Fill all NULLs in the Fare column with the median fare
dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())
# Fill all NULLs in the Age column with random ages within one standard deviation of the mean
age_avg = dataset['Age'].mean()
age_std = dataset['Age'].std()
age_null_count = dataset['Age'].isnull().sum()
age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
# assign the random ages to the missing entries (using .loc avoids a pandas assignment warning)
dataset.loc[np.isnan(dataset['Age']), 'Age'] = age_null_random_list
dataset['Age'] = dataset['Age'].astype(int)
# Define function to extract titles from passenger names
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""
dataset['Title'] = dataset['Name'].apply(get_title)
# Group all non-common titles into one single grouping "Rare"
dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major',
'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
# mapping
# Mapping Sex
dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
# Mapping titles
title_mapping = {"Mr": 1, "Master": 2, "Mrs": 3, "Miss": 4, "Rare": 5}
dataset['Title'] = dataset['Title'].map(title_mapping)
dataset['Title'] = dataset['Title'].fillna(0)
# Mapping Embarked
dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
# Mapping Fare
dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
dataset['Fare'] = dataset['Fare'].astype(int)
# Mapping Age
dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
# Feature selection: remove variables no longer containing relevant information
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp']
dataset = dataset.drop(drop_elements, axis=1)
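Before moving on, a quick optional sanity check (a small sketch) confirms that the preprocessing left no missing values and only numeric columns:
# sanity check: after preprocessing there should be no missing values left,
# and every remaining column should be numeric
print(dataset.isnull().sum())
print(dataset.dtypes)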
# show a sample of the preprocessed dataset
dataset.sample(10)
 | Survived | Pclass | Sex | Age | Parch | Fare | Embarked | Has_Cabin | FamilySize | IsAlone | Title
---|---|---|---|---|---|---|---|---|---|---|---|
359 | 1 | 3 | 0 | 1 | 0 | 0 | 2 | 0 | 1 | 1 | 4 |
505 | 0 | 1 | 1 | 1 | 0 | 3 | 1 | 1 | 2 | 0 | 1 |
566 | 0 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
878 | 0 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
843 | 0 | 3 | 1 | 2 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
884 | 0 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
685 | 0 | 2 | 1 | 1 | 2 | 3 | 1 | 0 | 4 | 0 | 1 |
851 | 0 | 3 | 1 | 4 | 0 | 0 | 0 | 0 | 1 | 1 | 1
64 | 0 | 1 | 1 | 2 | 0 | 2 | 1 | 0 | 1 | 1 | 1 |
373 | 0 | 1 | 1 | 1 | 0 | 3 | 1 | 0 | 1 | 1 | 1 |
# Take the relevant training data and labels
x = dataset.drop('Survived', axis=1).values
y = dataset['Survived'].values # 1 for Survived, 0 for Dead
# shuffle
rand_gen = np.random.RandomState(0)
shuffled_indices = rand_gen.permutation(np.arange(len(x)))
x_train = x[shuffled_indices[:num_train]]
y_train = y[shuffled_indices[:num_train]]
x_test = x[shuffled_indices[num_train:]]
y_test = y[shuffled_indices[num_train:]]
print("total training samples: {}, total test samples: {}".format(num_train, number_of_rows - num_train))
total training samples: 712, total test samples: 179
# let's build a tree with no limitations
tree_clf = DecisionTreeClassifier(criterion='gini')
tree_clf.fit(x_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')
# let's check the accuracy on the test set
y_pred = tree_clf.predict(x_test)
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("accuracy: {:.3f} %".format(accuracy * 100))
accuracy: 75.419 %
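The unrestricted tree is very likely overfitting: it can keep splitting until it fits the training set almost perfectly. A quick way to see this (a sketch; exact numbers will vary between runs) is to compare training and test accuracy:
# compare training vs. test accuracy to gauge overfitting (exact numbers will vary between runs)
train_acc = np.mean(tree_clf.predict(x_train) == y_train)
test_acc = np.mean(tree_clf.predict(x_test) == y_test)
print("train accuracy: {:.3f} %, test accuracy: {:.3f} %".format(train_acc * 100, test_acc * 100))
# an unrestricted tree typically reaches near-100% training accuracy while the test accuracy lags behind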
# let's add regularization
tree_clf = DecisionTreeClassifier(criterion='gini', max_depth=3)
tree_clf.fit(x_train, y_train)
# let's check the accuracy on the test set
y_pred = tree_clf.predict(x_test)
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("accuracy: {:.3f} %".format(accuracy * 100))
accuracy: 83.799 %
# predicting probabilities
# for the first passenger in the (shuffled) training set, the probability of survival
# (predict_proba returns one column per class, ordered [Dead, Survived]):
prob = tree_clf.predict_proba([x_train[0]])
print("passenger 0 chances of survival: {:.3f} %".format(prob[0][1] * 100))
passenger 0 chances of survival: 10.682 %
# generate an image graph
from sklearn.tree import export_graphviz
export_graphviz(tree_clf, out_file="./titanic_tree.dot", feature_names=dataset.columns.values[1:].tolist(),
class_names=['Dead', 'Survived'], rounded=True, filled=True)
# open the file with your favourite text editor and copy-paste its contents into http://www.webgraphviz.com/
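If you have a recent version of Scikit-Learn (0.21 or later), the tree can also be drawn directly with matplotlib, without Graphviz; here is a sketch using the fitted tree_clf from above:
# alternative visualization without Graphviz (requires scikit-learn >= 0.21)
from sklearn.tree import plot_tree

fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(tree_clf, feature_names=dataset.columns.values[1:].tolist(),
          class_names=['Dead', 'Survived'], filled=True, rounded=True, ax=ax)
plt.show()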
from sklearn.ensemble import RandomForestClassifier
random_state = 42
rnd_clf = RandomForestClassifier(random_state=random_state, n_estimators=50, max_depth=3)
rnd_clf.fit(x_train, y_train)
# let's check the accuracy on the test set
y_pred = rnd_clf.predict(x_test)
accuracy = np.sum(y_pred == y_test) / len(y_test)
print("accuracy: {:.3f} %".format(accuracy * 100))
accuracy: 84.916 %
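As a final sketch, the fitted forest exposes feature_importances_, which shows how much each feature contributed to its splits (the sorting below is just for readability):
# inspect which features the forest relies on most
feature_names = dataset.columns.values[1:]   # everything except 'Survived'
importances = sorted(zip(rnd_clf.feature_importances_, feature_names), reverse=True)
for importance, name in importances:
    print("{:<12s} {:.3f}".format(name, importance))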