In this lab, we will work with classification and regression.
To complete this lab, write your answers in the code cells and check them with the ok.grade() functions where asked. However, you should keep notes for your answers to the more reflective questions in a separate MS Word document (you can use this template for Lab 2).

If we have a data set with two variables that depend on each other, then with the help of linear regression we can build a predictive model. We try to find a relationship between the variables, where a dependent variable is predicted from one or more independent variables. We will use a dataset that describes the heights and weights of men and women.
Let's first set up the notebook by importing the components we will use in our code:
import pandas as pd
from sklearn import linear_model
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from graphviz import Source
from sklearn import tree
import warnings
warnings.filterwarnings('ignore')
from client.api.notebook import Notebook
ok = Notebook('lab2.ok')
To read the dataset, run the following code cell:
body_stats = pd.read_csv("weight-height.csv")
body_stats.head()
This dataset is actually in Imperial units, with height measured in inches and weight measured in pounds. Let's first change this to metric values! Run the following cell, which will do just that:
body_stats = pd.read_csv("weight-height.csv")                      # re-read the data so the conversion is only applied once
body_stats.Height = body_stats.Height.apply(lambda x: x * 2.54)    # inches to centimetres
body_stats.Weight = body_stats.Weight.apply(lambda x: x / 2.2046)  # pounds to kilograms
body_stats
Q1. How many rows and columns are in the dataset?
body_stats_number_of_rows = ...
body_stats_number_of_columns = ...
_ = ok.grade('q21')
Now that we can see the dataset shape and what the data points look like, let's get a feel for the data by visualizing it. Run the next code cell to create a scatterplot of the data:
_ = body_stats.plot.scatter(x='Weight', y='Height')
Q2. Can you distinguish between the male and female groups in the dataset? Explain your observation.
One way that we can distinguish between distinct groups that we know about is to colour them differently in our plots. Run the next cell to create the same scatterplot but with male and female data distinguished by colour.
cmap = {'Male': 'blue', 'Female': 'pink'}
_ = body_stats.plot.scatter(x='Weight', y='Height', c=[cmap.get(c) for c in body_stats.Gender])
Now let's look at each category of data in isolation:
male_stats = body_stats[body_stats.Gender == 'Male']
_ = male_stats.plot.scatter(x='Weight', y='Height', c='blue')
female_stats = body_stats[body_stats.Gender == 'Female']
_ = female_stats.plot.scatter(x='Weight', y='Height', c='pink')
The reason we use linear regression is to find the line that lies as close to all the points as possible, so that we can enter, for example, a weight and then infer the corresponding height (make a prediction). As you can see in the scatterplots, the data is quite spread out, and it is actually quite difficult to manually fit the right line $y = ax + b$ to represent the linear relationship.
Luckily, we can use the LinearRegression
model from the sklearn
Python package to build a linear regression model. First we fit the data. This trains the model based on the two variables, in this case male_stats.Weight
and male_stats.Height
:
lm_male = linear_model.LinearRegression()
# fit() expects a 2D array of features, so each weight is wrapped in its own list
lm_male.fit([[x] for x in male_stats.Weight], male_stats.Height)
m = lm_male.coef_[0]     # slope of the fitted line
b = lm_male.intercept_   # intercept of the fitted line
print("slope=", m, "intercept=", b)
To see how well the linear model we trained on the male weight and height data fits, we can plot the linear relationship over the scatterplot we created earlier, using the slope and intercept we extracted from our linear model lm_male
:
plt.scatter(x=male_stats.Weight, y=male_stats.Height, c='blue')
predicted_values_m = [lm_male.coef_ * i + lm_male.intercept_ for i in male_stats.Weight]
plt.plot(male_stats.Weight, predicted_values_m, 'black')
plt.xlabel("Weight")
plt.ylabel("Height")
Run the next two cells to do the same for the female weight and height data:
lm_female = linear_model.LinearRegression()
lm_female.fit([[x] for x in female_stats.Weight], female_stats.Height)
m = lm_female.coef_[0]
b = lm_female.intercept_
print("slope=", m, "intercept=", b)
plt.scatter(female_stats.Weight, female_stats.Height, c='pink')
predicted_values_f = [lm_female.coef_ * i + lm_female.intercept_ for i in female_stats.Weight]
plt.plot(female_stats.Weight, predicted_values_f, 'purple')
plt.ylabel("Height")
plt.xlabel("Weight")
Finally, let's visualize both together:
plt.scatter(body_stats.Weight, body_stats.Height, c=[cmap.get(c) for c in body_stats.Gender])
plt.plot(male_stats.Weight, predicted_values_m, 'black')
plt.plot(female_stats.Weight, predicted_values_f, 'purple')
plt.xlabel("Weight")
plt.ylabel("Height")
Apart from plotting the fitted linear relationship over the scatterplot, our linear regression model also provides us with a function predict()
that can take an input variable and output a prediction. For example, assuming a fitted linear model named my_linear_model
we might use the code:
my_linear_model.predict([[64.7]])
to find out how tall, in cm, a person weighing 64.7 kg is.
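Note that predict() returns an array of predictions rather than a single number, so you may want to take the first element if you need a plain value. A minimal sketch, again assuming a fitted model named my_linear_model:
result = my_linear_model.predict([[64.7]])   # returns an array with one prediction
height_cm = result[0]                        # extract the single predicted height
print(round(height_cm, 2))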
Q3. Based on your linear models for male and female data, what are the predicted heights for a male weighing 80.6kg (the average weight of a man in Sweden) and a female weighing 64.7 kg (the average weight for a woman in Sweden)? Provide your answers to a minimum of 2 decimal places.
Write your own code in each of the two following cells to calculate this using the linear models.
... # use lm_male
... # use lm_female
predicted_male_height = ...
predicted_female_height = ...
_ = ok.grade('q23')
Next, we will explore classification by looking at a dataset describing the passengers of the Titanic disaster and whether they survived. If you are not familiar with the Titanic, you could watch the film, but it is more than 3 hours long, so do not watch it during this lab. In this lab we will use a decision tree as a predictive model to predict whether a person with particular features might have lived or died on the Titanic.
First, we load the dataset:
titantic_passengers = pd.read_csv("titanic.csv")
titantic_passengers
The dataset contains the following columns:
- survived - Whether the person survived or not (0 = No, 1 = Yes)
- pclass - Passenger class
- sex - Gender of the person (male or female)
- age - Age in years
- sibsp - # of siblings/spouses aboard the Titanic
- parch - # of parents/children aboard the Titanic
- ticket - Ticket ID number
- fare - Passenger fare
- cabin - Cabin number
- embarked - Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Q4. According to this data set, how many passengers were on board the Titanic?
num_passengers = ...
_ = ok.grade('q24')
Before we build our classifier, let us explore the individual features that might affect the survival outcome. We can extract the number of survivors according to this data by looking at the survived column. Note that for each record a survivor is represented with a 1 and a non-survivor with a 0 (zero). This means we can simply retrieve the count of each category from the survived column using the value_counts() function:
titantic_passengers.survived.value_counts()
We can additionally plot this quite easily:
_ = titantic_passengers.survived.value_counts().plot.bar()
Q5. According to this dataset, what percentage of passengers perished when the Titanic sank? Give your answer to the nearest whole percentage.
percentage_perished = ...
_ = ok.grade('q25')
As we can see from this figure, we can hypothesize that if we had been aboard the Titanic we would have been more likely than not to die, since more than half of the total number of passengers perished. This is not a very exciting analysis, though, and it does not reveal who might have had a better chance of survival.
In this section we will make somewhat more specific observations by looking at individual columns that the outcome may depend on.
Let's look at the distribution based on gender:
titantic_passengers.sex.value_counts()
_ = titantic_passengers.sex.value_counts().plot.bar()
We can see that there were almost twice as many men as women on board. Let's check the distribution of survival rates based on gender:
survived_by_gender = titantic_passengers.groupby('sex').survived.mean()
_ = survived_by_gender.plot.bar()
survived_by_gender
Q6. What do you observe in the data, and why do you think the distribution looks the way it does? Is it possible to reason about why this is the case from the data alone?
Let's look at the distribution based on age:
survived_by_age = titantic_passengers.groupby('age').survived.sum()
_ = survived_by_age.plot.bar()
survived_by_age.shape
Q7. As you can see, there are quite a few values that make the plot impossible to read. How many unique values occur in our age distribution (see the shape output above)?
num_unique_ages_in_distribution = ...
_ = ok.grade('q27')
To be able to do a better analysis, we can apply some age categorisation. For example, we can add a column to the dataset that categorises a passenger as a child or not. The next code cell adds this column to the original dataset, where we define a child as a person aged 15 or under:
titantic_passengers['age_range'] = pd.cut(titantic_passengers.age, [0, 15, 80], labels=['child', 'adult'])
titantic_passengers.head()
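If pd.cut is new to you, the following small illustration shows how the binning above works; the toy ages are made up and are not part of the Titanic data:
toy_ages = pd.Series([4, 15, 16, 70])                            # made-up example ages
print(pd.cut(toy_ages, [0, 15, 80], labels=['child', 'adult']))
# ages in (0, 15] become 'child', ages in (15, 80] become 'adult'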
Now we can plot the proportion of adults and children who survived:
survived_by_age = titantic_passengers.groupby('age_range').survived.mean()
_ = survived_by_age.plot.bar()
survived_by_age
Q8. What does the plot showing the proportion of adults and children who survived indicate?
We could continue exploring all of the different features in the dataset like this, but instead we will do something a bit more interesting by building a classifier using decision trees.
So far we have tried to identify what might have been crucial for survival. However, there are other ways to identify these features. What we are going to do is build a decision tree model, which we can then use to make predictions.
First of all, we need to do some data cleaning in order to build our model correctly. By running the info()
function we can see what columns have missing (null) cells:
titantic_passengers.info()
From the output, we can see that there is actually quite a lot missing. However, we will ignore some of the incomplete columns in our analysis and fix the ones we need to keep.
Q9. Which columns are identified as having missing (null) values? Provide your answer by adding the incomplete columns to the list of strings in the code cell below.
incomplete_columns = ["age", "fare", ]
_ = ok.grade('q29')
In this case, we need to fix our age and fare columns, and we can do that by filling the null values with the mean value of the column. This is called mean imputation.
"Mean imputation is the replacement of a missing observation with the mean of the non-missing observations for that variable."
Imputing the mean preserves the mean of the original data. If the data is missing completely at random, this ensures that the estimate of the mean remains unbiased. Also, by imputing the mean, you are able to preserve the full sample size; otherwise you might have to drop the rows with missing data. To see this, let's compare the age column of the original data before and after imputation:
pd.DataFrame({
"original": titantic_passengers.age.describe(),
"imputed": titantic_passengers.age.fillna(titantic_passengers.age.mean()).describe()
})
You should observe that the total count of the data is increased in the imputed column, because we fill those NaN
values with the mean of the column. At the same time, we can see that most of the summary features of the column remain unchanged (the mean, min, and max are all preserved).
Mean imputation is not without its issues, but for the purposes of this exercise we will not worry about that.
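One such issue, sketched below with a made-up toy series (not the lab data), is that filling in the mean leaves the mean unchanged but shrinks the spread (standard deviation) of the column:
toy = pd.Series([150, 160, None, 170, 180])   # made-up values with one missing entry
imputed = toy.fillna(toy.mean())
print(toy.mean(), imputed.mean())   # the mean is preserved (165.0 in both cases)
print(toy.std(), imputed.std())     # the standard deviation decreases after imputation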
Run the following cell to impute the age and fare columns and modify the titantic_passengers DataFrame (note that this cell fills the missing values with the median rather than the mean; the median is a common, more outlier-robust alternative):
titantic_passengers.age = titantic_passengers.age.fillna(titantic_passengers.age.median())
titantic_passengers.fare = titantic_passengers.fare.fillna(titantic_passengers.fare.median())
titantic_passengers.info()
Our decision tree model from sklearn needs all of the input data to be encoded numerically. This means that the categorical information that is not already encoded as numbers also needs to be converted. The following line of code converts the sex column into 0 for female and 1 for male, adding a new column sex_male to support this encoding.
titantic_passengers = pd.get_dummies(titantic_passengers, columns=['sex'], drop_first=True)
titantic_passengers
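If you are wondering why only a sex_male column appears (and there is no sex_female), the small illustration below with made-up data shows the effect of drop_first=True: with the categories 'female' and 'male', the first one is dropped and a single sex_male column is enough to encode both:
toy = pd.DataFrame({'sex': ['male', 'female', 'female']})    # made-up example rows
print(pd.get_dummies(toy, columns=['sex'], drop_first=True))
# a single sex_male column remains: 1/True for male, 0/False for female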
You should notice that we now have a new column called sex_male
with a numerical representation in it. Now we can drop the parts of the table we are no longer interested in:
survived_data = titantic_passengers.survived # save this for training later
titantic_passengers = titantic_passengers[['sex_male', 'fare', 'age', 'sibsp']]
titantic_passengers.info()
In order to train our model we need to split the data into a training dataset and a test dataset. This means that we can train our classifier on the training dataset and then validate it after training using the test dataset, which was kept separate during the training process.
We use components from the sklearn
package to help us. We split two datasets: (1) The full set of features and (2) the target labels (survived or not). We split the data 75% / 25% since we wish to use as much data as possible to train the classifier but preserve enough data to validate the classifier after training.
X_train, X_test, y_train, y_test = train_test_split(titantic_passengers, survived_data, test_size=0.25)
print("Our training data has {} rows".format(len(X_train)))
print("Our test data has {} rows".format(len(X_test)))
Next, we train our DecisionTreeClassifier on the training data by asking sklearn to fit the features found in X_train (the 75% sample from the titantic_passengers data) against y_train (the corresponding 75% sample of the survival labels).
classifier = DecisionTreeClassifier(max_depth=3)
classifier.fit(X_train.values, y_train.values)
Now our decision tree has been created. We can test this by taking some test data that we split before into X_test
. Note that the records in X_test
were not used during the training process, which is why we can use it to validate our trained model.
The next cell takes the first 10 records (for convenience) as a sample table, then we run the classifier.predict()
function on the sample and add it to the sample table for us to view the results:
sample = X_test.head(10)
sample['survived'] = classifier.predict(sample)
sample
Q10. Inspect the table created by our prediction on the sample records. How many men survived and how many women survived in our prediction?
number_of_predicted_male_survivors = ...
number_of_predicted_female_survivors = ...
_ = ok.grade('q210')
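Before answering the next question, it may help to quantify how the model performs on the whole test set rather than only the 10-record sample. A minimal sketch, assuming the classifier and the train/test split created above:
accuracy = classifier.score(X_test.values, y_test.values)   # fraction of correct predictions on held-out data
print("Test accuracy: {:.2f}".format(accuracy))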
Q11. How well do you think the model makes its predicted classifications?
Even though we can see the effects by running some data through the classifier to view the predicted outcomes, we cannot see how the decision tree model makes its decisions. For that, we can visualize the model we trained.
Run the following cell to visualize our classifier:
tree_plot = Source(tree.export_graphviz(classifier, out_file=None,
feature_names=X_train.columns, class_names=['Dead', 'Alive'],
filled=True, rounded=True, special_characters=True))
tree_plot
Q12. Take a look at the decision tree that was created (ignore the variables that we have not discussed, such as gini, samples, and value). What do you think about how the decision tree has logically rationalised the selection of survivors?
When you are finished, let a teacher know, and check with them if you have any questions about the lab.
If you wish to save your work in this notebook, choose Save and Checkpoint from the File menu, then choose Download as Notebook from the File menu and save it to your computer or USB stick.