In this notebook, we go over these main concepts:
We'll build on the machine learning models of the previous notebooks to exemplify these concepts.
import pandas as pd
import numpy as np
import sqlite3
import sklearn
from sklearn.linear_model import LogisticRegression, LinearRegression
# Database Connection
DB = 'ncdoc.db'
conn = sqlite3.connect(DB)
cur = conn.cursor()
Recap: Our Machine Learning Problem
Of all prisoners released, we would like to predict who is likely to reenter jail within 5 years of the day we make our prediction. For instance, say it is Jan 1, 2009 and we want to identify which prisoners are likely to re-enter jail between now and the end of 2013. We can run our predictive model and identify who is most at risk. This is an example of a binary classification problem.
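To make the binary outcome concrete, here is a minimal sketch of turning a prediction window into a 0/1 label. The toy table and the `next_admit_date` column are hypothetical and only illustrate the idea; this is not how `create_labels` is implemented.

```python
import pandas as pd

# Hypothetical toy data: one row per released person, with the date of
# their next admission (missing if they were never readmitted).
toy = pd.DataFrame({
    "inmate_id": [1, 2, 3],
    "next_admit_date": pd.to_datetime(["2010-06-01", None, "2015-03-15"]),
})

# Prediction window from the example above: Jan 1, 2009 to the end of 2013.
window_start = pd.Timestamp("2009-01-01")
window_end = pd.Timestamp("2013-12-31")

# Label is 1 if the next admission falls inside the window, 0 otherwise.
# Missing dates compare as False, so never-readmitted people get 0.
toy["recidivism"] = (
    toy["next_admit_date"].between(window_start, window_end).astype(int)
)
print(toy)
```

Person 3 is readmitted, but only after the window closes, so their label is 0 for this prediction date.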
We need to munge our data into labels (1_Machine_Learning_Labels.ipynb) and features (2_Machine_Learning_Features.ipynb) before we can train and evaluate machine learning models (3_Machine_Learning_Models.ipynb).
This notebook assumes that you have already worked through the 1_Machine_Learning_Labels and 2_Machine_Learning_Features notebooks. If that is not the case, the following imports bring in the functions developed in those notebooks from .py scripts.
# We are bringing in the create_labels and create_features functions covered in previous notebooks
# Note that these are brought in from scripts rather than from external packages
from create_labels import create_labels
from create_features import create_features
# These functions make sure that the tables have been created in the database.
create_labels('2008-12-31', '2009-01-01', '2013-12-31', conn)
create_features('2008-12-31', '2009-01-01', '2013-12-31', conn)
We create a training set that takes people at the beginning of 2009 and defines the outcome based on data from 2009-2013. The features for each person are based on data up to the end of 2008.
sql_string = "drop table if exists train_matrix;"
cur.execute(sql_string)
sql_string = "create table train_matrix as "
sql_string += "select l.inmate_doc_number, l.recidivism, f.num_admits, f.length_longest_sentence, f.age_first_admit, f.age "
sql_string += "from recidivism_labels_2009_2013 l "
sql_string += "left join features_2000_2008 f on f.inmate_doc_number = l.inmate_doc_number "
sql_string += ";"
cur.execute(sql_string)
We then load the training data into df_training.
sql_string = "SELECT * "
sql_string += "FROM train_matrix "
sql_string += ";"
df_training = pd.read_sql(sql_string, con = conn)
df_training.head(5)
As in the previous notebooks, we'll drop any rows where the age at first admit is below 14 or above 99.
drop = (df_training['age_first_admit'] < 14) | (df_training['age_first_admit'] > 99)
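# Use ~ to invert a boolean mask (unary minus is not supported on boolean data);
# .copy() avoids chained-assignment warnings when we add columns later
df_training = df_training[~drop].copy()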
Sometimes, we have variables with missing (or unknown) data. Instead of dropping those values, there are methods to fill those in, in order to be able to use the data.
Keep in mind that these imputed values will be approximations, and must be treated as such. If you choose to impute missing values in your project or future work, you must acknowledge your process and clearly state which variables you imputed values for.
First, let's see how many missing values we have for each of our features.
len(df_training) - df_training['num_admits'].count()
len(df_training) - df_training['length_longest_sentence'].count()
len(df_training) - df_training['age_first_admit'].count()
len(df_training) - df_training['age'].count()
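The same counts can be obtained for all columns at once. The sketch below uses a small made-up table as a stand-in for `df_training`; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_training; the values are made up.
toy = pd.DataFrame({
    "num_admits": [1, 2, np.nan, 4],
    "length_longest_sentence": [100.0, np.nan, np.nan, 365.0],
    "age": [25, 40, 31, np.nan],
})

# isna() flags each missing cell; summing the flags counts them per column.
missing_counts = toy.isna().sum()
print(missing_counts)
```

Calling `.isna().mean()` instead would give the share of missing values per column, which is often easier to compare across features.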
We will try to impute the unknown data using the following methods: mean imputation, regression imputation, and K-nearest neighbors imputation.
We start by creating indicator variables that take the value 1 if we observe a missing value for a given feature (and 0 otherwise).
for col in ["num_admits", "length_longest_sentence", "age_first_admit", "age"]:
    # Store the indicator as 1/0 rather than True/False
    df_training[col+"_missing"] = df_training[col].isnull().astype(int)
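As an alternative to the loop, the indicators can be built in one vectorized step. This is a minimal sketch on a made-up table, not a replacement for the code above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the feature columns; the values are made up.
toy = pd.DataFrame({"num_admits": [1, np.nan, 3], "age": [np.nan, 40, 31]})

# Flag missing cells as 1/0 and rename the flag columns with a suffix.
flags = toy.isna().astype(int).add_suffix("_missing")

# Attach the indicator columns next to the original features.
toy_flagged = toy.join(flags)
print(toy_flagged)
```

The indicators must be created before any values are filled in; once the gaps are imputed, the information about which cells were missing is gone.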
One of the simplest ways of imputing values is by taking the mean and filling it in. It's possible to do this by using the overall mean, as well as by certain subgroups.
First, we find the mean for each feature.
mean_admits = df_training.num_admits.mean()
mean_longest_sentence = df_training.length_longest_sentence.mean()
mean_age_first = df_training.age_first_admit.mean()
mean_age = df_training.age.mean()
Create a copy of our training data for imputation.
df_training_m = df_training.copy()
We can fill the missing values with the means we computed above.
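# Assign the result back instead of calling fillna(..., inplace=True) on a column,
# which newer pandas versions warn about and may not apply to the DataFrame
df_training_m["num_admits"] = df_training_m["num_admits"].fillna(mean_admits)
df_training_m["length_longest_sentence"] = df_training_m["length_longest_sentence"].fillna(mean_longest_sentence)
df_training_m["age_first_admit"] = df_training_m["age_first_admit"].fillna(mean_age_first)
df_training_m["age"] = df_training_m["age"].fillna(mean_age)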
df_training_m.tail(10)
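scikit-learn offers a comparable shortcut: `SimpleImputer` can fill in column means and, with `add_indicator=True`, append missing-value indicators in the same step. The sketch below runs on a made-up array and is only meant to show the mechanics.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with one gap per column; the values are made up.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# strategy="mean" fills each gap with its column mean;
# add_indicator=True appends one 0/1 column per feature that had gaps.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)
print(X_out)
```

The first two output columns are the imputed features and the last two are the indicators. Because the fitted imputer stores the column means, the same object can later apply the training means to new data via `transform`.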
We could also do this by subgroup. So, for example, we can compute the mean age for each value of number of admits.
mean_age_by_admits = df_training[["num_admits", "age"]].groupby('num_admits').mean()
mean_age_by_admits
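Subgroup means like these can be used to fill missing values directly. The following is a minimal sketch on made-up data, assuming we want each missing age replaced by the mean age of people with the same number of admits.

```python
import numpy as np
import pandas as pd

# Toy data; the values are made up.
toy = pd.DataFrame({
    "num_admits": [1, 1, 1, 2, 2, 2],
    "age": [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# transform("mean") returns, for every row, the mean age of that row's group.
group_mean_age = toy.groupby("num_admits")["age"].transform("mean")

# Fill only the missing ages with the matching group mean.
toy["age_imputed"] = toy["age"].fillna(group_mean_age)
print(toy)
```

Rows whose `num_admits` is itself missing do not belong to any group, so they would most likely still need a fallback such as the overall mean.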
Here we select our (mean-imputed) features, the outcome variable and create X- and y-training objects.
sel_features = ['num_admits', 'length_longest_sentence', 'age_first_admit', 'age' ,
'num_admits_missing', 'length_longest_sentence_missing', 'age_first_admit_missing', 'age_missing']
sel_label = 'recidivism'
X_train = df_training_m[sel_features].values
y_train = df_training_m[sel_label].values
We can now build a prediction model that learns the relationship between our predictors (X_train) and recidivism (y_train), e.g., by using logistic regression.
# penalty=None (no regularization) requires scikit-learn >= 1.2; older versions used penalty='none'
model = LogisticRegression(penalty = None)
model.fit( X_train, y_train )
print(model)
Let's take a look at the regression coefficients that were learned based on the imputed data.
model.coef_[0]
model.intercept_
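A bare coefficient array is hard to read, so it can help to label each coefficient with its feature name. The sketch below fits a model on made-up data with two features; it uses scikit-learn's default (regularized) settings for simplicity, which is an assumption that differs from the unpenalized model above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data; the values are made up for illustration.
feature_names = ["num_admits", "age"]
X = np.array([[1, 50], [2, 45], [3, 30], [5, 25], [4, 22], [1, 60]])
y = np.array([0, 0, 1, 1, 1, 0])

# Default settings (L2 penalty), chosen here only to keep the sketch short.
toy_model = LogisticRegression().fit(X, y)

# coef_ has shape (1, n_features) for a binary problem, in column order of X.
coefs = pd.Series(toy_model.coef_[0], index=feature_names)
print(coefs)
```

In this notebook, the same pattern would pair `model.coef_[0]` with `sel_features`.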
We can also use regression to try to get more accurate values. We build a regression equation from the observations for which we know the age, then use the equation to essentially predict the missing values. This is, in effect, an extension of the mean imputation by subgroup. Here, we will use the number of admits as well as the length of the longest sentence in order to impute age.
# Remove missing rows first
df_training_nona = df_training.dropna()
The model creation process for a linear regression is very similar to that of the ML models when using scikit-learn. We create the model object, then give it the data, then use the model object to generate our predictions. The model object essentially contains all of the instructions on how to fit the model, and when we give it the data, it fits the model to that data.
# Create model object
ols = LinearRegression()
# Predictors and Outcome
predictors = df_training_nona[["num_admits", "length_longest_sentence"]]
outcome = df_training_nona.age
# Fit the model
ols.fit(X = predictors, y = outcome)
Now that we've fit our model, we can find the predicted values for age.
# Rows where age is missing but both predictors are observed
to_impute = df_training.age.isna() & df_training.num_admits.notna() & df_training.length_longest_sentence.notna()
missing_x = df_training.loc[to_impute, ['num_admits','length_longest_sentence']]
missing_id = df_training.loc[to_impute, 'INMATE_DOC_NUMBER']
missing_ages = pd.DataFrame({'INMATE_DOC_NUMBER':missing_id, 'age':ols.predict(missing_x)})
missing_ages.head()
Create a copy of our training data for imputation.
df_training_r = df_training_m.copy()
Now we can fill the missing values of age with the predicted age values.
# Assign by row index so each prediction lands on the row it was computed for
df_training_r.loc[missing_ages.index, 'age'] = missing_ages.age
df_training_r.tail(10)
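The whole regression-imputation procedure can be condensed into a few lines. This is a sketch on made-up data in which age is an exact linear function of the two predictors, so we can check that the imputed value recovers the age we hid.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data; the values and the linear relationship are made up.
toy = pd.DataFrame({
    "num_admits": [1, 2, 3, 4, 5, 2],
    "length_longest_sentence": [100.0, 300.0, 200.0, 500.0, 400.0, 600.0],
})
toy["age"] = 20 + 2 * toy["num_admits"] + 0.01 * toy["length_longest_sentence"]

# Hide one known age so we can compare it with the imputed value.
true_age = toy.loc[5, "age"]
toy.loc[5, "age"] = np.nan

predictors = ["num_admits", "length_longest_sentence"]
observed = toy[toy["age"].notna()]   # rows used to fit the regression
to_fill = toy["age"].isna()          # rows whose age we predict

# Fit on complete rows, then predict age where it is missing.
ols_toy = LinearRegression().fit(observed[predictors], observed["age"])
toy.loc[to_fill, "age"] = ols_toy.predict(toy.loc[to_fill, predictors])
print(toy)
```

With real data the relationship is noisy, so imputed ages will be closer to the center of the distribution than the true values tend to be.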
To impute values, we can also use the machine learning algorithm called K-nearest Neighbors. The principle behind it is quite simple: the missing values can be imputed by values of "closest neighbors" - as approximated by other, known, features.
For each row with a missing value, the algorithm computes the distance to the other rows using the features that are observed, and then fills in the missing value based on the values of the K closest rows (for example, their average).
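One way to try this is scikit-learn's `KNNImputer`. The sketch below uses a made-up two-column array and `n_neighbors=2`; both the data and the choice of K are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with columns (num_admits, age); the values are made up.
# The last row is missing its age.
X = np.array([
    [1.0, 20.0],
    [1.0, 22.0],
    [5.0, 50.0],
    [5.0, 54.0],
    [1.0, np.nan],
])

# Distances are computed on the features both rows have observed;
# the gap is filled with the mean age of the 2 nearest rows.
knn_imputer = KNNImputer(n_neighbors=2)
X_imputed = knn_imputer.fit_transform(X)
print(X_imputed)
```

The two rows with one admit are closest to the incomplete row, so its age is filled with their average. Since the method relies on distances, features on very different scales are usually standardized first so that no single feature dominates.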
As before, we select our (now mean- and regression-imputed) features, the outcome variable and create X- and y-training objects.
sel_features = ['num_admits', 'length_longest_sentence', 'age_first_admit', 'age' ,
'num_admits_missing', 'length_longest_sentence_missing', 'age_first_admit_missing', 'age_missing']
sel_label = 'recidivism'
X_train = df_training_r[sel_features].values
y_train = df_training_r[sel_label].values
Again, fit the logistic regression model.
# penalty=None (no regularization) requires scikit-learn >= 1.2; older versions used penalty='none'
model = LogisticRegression(penalty = None)
model.fit( X_train, y_train )
print(model)
Print the regression coefficients that were learned based on the newly imputed data.
model.coef_[0]
model.intercept_