In [1]:

import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization, MaxPool1D, Flatten, Conv1D
from keras.utils import to_categorical
import numpy as np

Using TensorFlow backend.

In [2]:

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

The Data and Question of Interest¶

Let's take a look at the UCI Adult Data Set. This data set was extrated from Census data with the goal of prediction who makes over $50,000.

I would like to use these data as a means of exploring various machine learning algorithms that will increase in complexity to see how the compare on various evaluation metrics. Additonally, it will be interesting to see how much there is to gain by spending some time fine-tuning these algorithms.

We will look at the following algorithms:

And evaluate them with the following metrics:

Let's go ahead and read in the data and take a look.

In [4]:

names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
        'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
train_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                      header=None, names=names)
test_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
                      header=None, names=names, skiprows=[0])
all_df = pd.concat([train_df, test_df])

In [5]:

all_df.head()

Out[5]:

	age	workclass	fnlwgt	education	educationnum	maritalstatus	occupation	relationship	race	sex	capitalgain	hoursperweek	nativecountry	label
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

It looks like we have 14 columns to help us predict our classification. We will drop fnlwgt and education and then convert our categorical features to dummy variables. We will also convert our label to 0 and 1 where 1 means the person made more than $50k

In [6]:

all_df.shape

Out[6]:

(48842, 15)

In [7]:

drop_columns = ['fnlwgt', 'education']
continuous_features = ['age', 'capitalgain', 'capitalloss', 'hoursperweek']
cat_features =['educationnum', 'workclass', 'maritalstatus', 'occupation', 'relationship', 'race', 'sex', 'nativecountry']

In [8]:

all_df_dummies = pd.get_dummies(all_df, columns=cat_features)

In [9]:

all_df_dummies.drop(drop_columns, 1, inplace=True)

In [10]:

y = all_df_dummies['label'].apply(lambda x: 0 if '<' in x else 1)
X = all_df_dummies.drop(['label'], 1)

In [11]:

y.value_counts(normalize=True)

Out[11]:

0    0.760718
1    0.239282
Name: label, dtype: float64

Looks like we don't have balanced classes, so good thing we are looking at other metrics than accuracy. Now let's split into training and testing with 1/3 for testing.

In [12]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [13]:

X_train.shape

Out[13]:

(32724, 106)

Cleaning Pipeline¶

The goal of this project is not to focus on cleaning / data exploration / feature engineering. So we will define a very simple cleaning pipeline that fills any missing values with the median and then scales ever column.

In [14]:

clean_pipeline = Pipeline([('imputer', preprocessing.Imputer(strategy="median")),
                           ('std_scaler', preprocessing.StandardScaler()),])

In [15]:

X_train_clean = clean_pipeline.fit_transform(X_train)

In [16]:

X_test_clean = clean_pipeline.transform(X_test)

Metrics¶

A simple function to calculate our metrics of interest

In [17]:

def evaluate(true, pred):
    f1 = metrics.f1_score(true, pred)
    roc_auc = metrics.roc_auc_score(true, pred)
    accuracy = metrics.accuracy_score(true, pred)
    print("F1: {0}\nROC_AUC: {1}\nACCURACY: {2}".format(f1, roc_auc, accuracy))
    return f1, roc_auc, accuracy

Logistic Regression¶

The first model up is a simple logistic regression with the default hyperparameters.

In [18]:

clf = LogisticRegression()
clf.fit(X_train, y_train)

Out[18]:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [19]:

lr_predictions = clf.predict(X_test)

In [20]:

lr_f1, lr_roc_auc, lr_acc = evaluate(y_test, lr_predictions)

F1: 0.6507094739859539
ROC_AUC: 0.7574953226590644
ACCURACY: 0.8488025809653803

Tuned Logistic Regression¶

Now lets spend a bit of time tuning our regularization.

In [21]:

lr_grid = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }
tuned_lr = GridSearchCV(LogisticRegression(), lr_grid, scoring='f1', n_jobs=10)
tuned_lr.fit(X_train, y_train)

Out[21]:

GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=10,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='f1', verbose=0)

Here are our best parameters

In [25]:

tuned_lr.best_params_

Out[25]:

{'C': 1, 'penalty': 'l1'}

In [26]:

tuned_lr_predictions = tuned_lr.predict(X_test)
tuned_lr_f1, tuned_lr_roc_auc, tuned_lr_acc = evaluate(y_test, tuned_lr_predictions)

F1: 0.6512027491408934
ROC_AUC: 0.7578833983412963
ACCURACY: 0.8488646234024072

Gradient Boosted Trees¶

Now an out of the box boosted tree

In [27]:

gbt = GradientBoostingClassifier()
gbt.fit(X_train, y_train)
gbt_predictions = clf.predict(X_test)
gbt_f1, gbt_roc_auc, gbt_acc = evaluate(y_test, gbt_predictions)

F1: 0.6507094739859539
ROC_AUC: 0.7574953226590644
ACCURACY: 0.8488025809653803

GBT Tuned¶

And now a tuned boosted tree. I ran the grid shown below to get my final parameters, but for speed's sake I now just show the best.

In [28]:

#gbt_grid = {'learning_rate': [.01], 'n_estimators': [250, 500, 1000], 'max_depth': [3, 4, 5]}
gbt_tuned = GradientBoostingClassifier(learning_rate=.01, n_estimators=1000, max_depth=5)
gbt_tuned.fit(X_train, y_train)

Out[28]:

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.01, loss='deviance', max_depth=5,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=1000, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)

In [29]:

gbt_tuned_predictions = gbt_tuned.predict(X_test)
gbt_tuned_f1, gbt_tunded_roc_auc, gbt_tuned_acc = evaluate(y_test, gbt_tuned_predictions)

F1: 0.7042577675489067
ROC_AUC: 0.7885511539729889
ACCURACY: 0.8724407494726393

Deep Learning Simple¶

Now we have all heard the amazing power of deep learning. So let's take a look at how well it fares with our task. There are a fair amout of hyperparameters with deep nets, but I will pick some reasonable values as our starting point.

In [30]:

model_simple = Sequential()
model_simple.add(Dense(1024, activation='relu' , input_dim = X_train.shape[1]))
model_simple.add(Dropout(0.5))
model_simple.add(Dense(2, activation='softmax', name='softmax'))

In [31]:

y_train_cat = to_categorical(y_train.values, 2)

In [32]:

model_simple.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [33]:

model_simple.fit(X_train.values, y_train_cat, batch_size=32, epochs=25)

Epoch 1/25
32724/32724 [==============================] - 2s - loss: 1.3490 - acc: 0.7843     
Epoch 2/25
32724/32724 [==============================] - 1s - loss: 1.3434 - acc: 0.7985     
Epoch 3/25
32724/32724 [==============================] - 1s - loss: 1.4843 - acc: 0.7937     
Epoch 4/25
32724/32724 [==============================] - 1s - loss: 1.4806 - acc: 0.7947     
Epoch 5/25
32724/32724 [==============================] - 1s - loss: 1.4770 - acc: 0.7953     
Epoch 6/25
32724/32724 [==============================] - 1s - loss: 1.4761 - acc: 0.7964     
Epoch 7/25
32724/32724 [==============================] - 1s - loss: 1.4755 - acc: 0.7977     
Epoch 8/25
32724/32724 [==============================] - 1s - loss: 1.4750 - acc: 0.7977     
Epoch 9/25
32724/32724 [==============================] - 1s - loss: 1.4744 - acc: 0.7968     
Epoch 10/25
32724/32724 [==============================] - 1s - loss: 1.4738 - acc: 0.7981     
Epoch 11/25
32724/32724 [==============================] - 1s - loss: 1.4722 - acc: 0.7975     
Epoch 12/25
32724/32724 [==============================] - 1s - loss: 1.4725 - acc: 0.7995     
Epoch 13/25
32724/32724 [==============================] - 1s - loss: 1.4714 - acc: 0.7986     
Epoch 14/25
32724/32724 [==============================] - 1s - loss: 1.4709 - acc: 0.7988     
Epoch 15/25
32724/32724 [==============================] - 1s - loss: 1.4713 - acc: 0.7979     
Epoch 16/25
32724/32724 [==============================] - 1s - loss: 1.4698 - acc: 0.7982     
Epoch 17/25
32724/32724 [==============================] - 1s - loss: 1.4704 - acc: 0.7988     
Epoch 18/25
32724/32724 [==============================] - 1s - loss: 1.4705 - acc: 0.7994     
Epoch 19/25
32724/32724 [==============================] - 1s - loss: 1.4707 - acc: 0.7981     
Epoch 20/25
32724/32724 [==============================] - 1s - loss: 1.4695 - acc: 0.8000     
Epoch 21/25
32724/32724 [==============================] - 1s - loss: 1.4700 - acc: 0.8002     
Epoch 22/25
32724/32724 [==============================] - 1s - loss: 1.4687 - acc: 0.8006     
Epoch 23/25
32724/32724 [==============================] - 1s - loss: 1.4689 - acc: 0.8006     
Epoch 24/25
32724/32724 [==============================] - 1s - loss: 1.4692 - acc: 0.7994     
Epoch 25/25
32724/32724 [==============================] - 1s - loss: 1.4672 - acc: 0.8003

Out[33]:

<keras.callbacks.History at 0x7f92117bfe10>

In [34]:

deep_predictions_simple = model_simple.predict(X_test.values)
deep_simple_f1, deep_simple_roc_auc, deep_simple_acc = evaluate(np.argmax(deep_predictions_simple, 1), y_test)

F1: 0.4076755973931933
ROC_AUC: 0.753604522328225
ACCURACY: 0.7969971460478967

Deep Learning Tuned A Bit¶

Then I spent about 30 minutes playing with different architectures so see how far I could push a deep net and this is what I got. Note: this is not to say that there isn't a better or even much better architecture, but after trying a fair amount of normal options, nothing better appeared.

In [35]:

model = Sequential()
model.add(Dense(1024, activation='elu', kernel_initializer='glorot_normal', input_dim = X_train.shape[1]))
model.add(BatchNormalization())
model.add(Dense(128, activation='elu', kernel_initializer='glorot_normal'))
model.add(BatchNormalization())
model.add(Dense(64, activation='elu', kernel_initializer='glorot_normal'))
model.add(Dropout(0.2))
model.add(Dense(2, activation='softmax', name='softmax'))

In [36]:

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [37]:

model.fit(X_train.values, y_train_cat, batch_size=512, epochs=40)

Epoch 1/40
32724/32724 [==============================] - 0s - loss: 0.3969 - acc: 0.8129     
Epoch 2/40
32724/32724 [==============================] - 0s - loss: 0.3418 - acc: 0.8409     
Epoch 3/40
32724/32724 [==============================] - 0s - loss: 0.3360 - acc: 0.8401     
Epoch 4/40
32724/32724 [==============================] - 0s - loss: 0.3305 - acc: 0.8430     
Epoch 5/40
32724/32724 [==============================] - 0s - loss: 0.3270 - acc: 0.8458     
Epoch 6/40
32724/32724 [==============================] - 0s - loss: 0.3274 - acc: 0.8465     
Epoch 7/40
32724/32724 [==============================] - 0s - loss: 0.3223 - acc: 0.8495     
Epoch 8/40
32724/32724 [==============================] - 0s - loss: 0.3165 - acc: 0.8524     
Epoch 9/40
32724/32724 [==============================] - 0s - loss: 0.3231 - acc: 0.8477     
Epoch 10/40
32724/32724 [==============================] - 0s - loss: 0.3177 - acc: 0.8527     
Epoch 11/40
32724/32724 [==============================] - 0s - loss: 0.3158 - acc: 0.8522     
Epoch 12/40
32724/32724 [==============================] - 0s - loss: 0.3174 - acc: 0.8505     
Epoch 13/40
32724/32724 [==============================] - 0s - loss: 0.3148 - acc: 0.8527     
Epoch 14/40
32724/32724 [==============================] - 0s - loss: 0.3114 - acc: 0.8539     
Epoch 15/40
32724/32724 [==============================] - 0s - loss: 0.3106 - acc: 0.8552     
Epoch 16/40
32724/32724 [==============================] - 0s - loss: 0.3094 - acc: 0.8555     
Epoch 17/40
32724/32724 [==============================] - 0s - loss: 0.3074 - acc: 0.8548     
Epoch 18/40
32724/32724 [==============================] - 0s - loss: 0.3089 - acc: 0.8559     
Epoch 19/40
32724/32724 [==============================] - 0s - loss: 0.3088 - acc: 0.8555     
Epoch 20/40
32724/32724 [==============================] - 0s - loss: 0.3097 - acc: 0.8564     
Epoch 21/40
32724/32724 [==============================] - 0s - loss: 0.3088 - acc: 0.8554     
Epoch 22/40
32724/32724 [==============================] - 0s - loss: 0.3085 - acc: 0.8547     
Epoch 23/40
32724/32724 [==============================] - 0s - loss: 0.3037 - acc: 0.8589     
Epoch 24/40
32724/32724 [==============================] - 0s - loss: 0.3076 - acc: 0.8555     
Epoch 25/40
32724/32724 [==============================] - 0s - loss: 0.3056 - acc: 0.8581     
Epoch 26/40
32724/32724 [==============================] - 0s - loss: 0.3037 - acc: 0.8587     
Epoch 27/40
32724/32724 [==============================] - 0s - loss: 0.3056 - acc: 0.8567     
Epoch 28/40
32724/32724 [==============================] - 0s - loss: 0.3021 - acc: 0.8588     
Epoch 29/40
32724/32724 [==============================] - 0s - loss: 0.3026 - acc: 0.8584     
Epoch 30/40
32724/32724 [==============================] - 0s - loss: 0.3033 - acc: 0.8600     
Epoch 31/40
32724/32724 [==============================] - 0s - loss: 0.3027 - acc: 0.8574     
Epoch 32/40
32724/32724 [==============================] - 0s - loss: 0.3019 - acc: 0.8585     
Epoch 33/40
32724/32724 [==============================] - 0s - loss: 0.3002 - acc: 0.8593     
Epoch 34/40
32724/32724 [==============================] - 0s - loss: 0.3002 - acc: 0.8603     
Epoch 35/40
32724/32724 [==============================] - 0s - loss: 0.2968 - acc: 0.8619     
Epoch 36/40
32724/32724 [==============================] - 0s - loss: 0.3010 - acc: 0.8598     
Epoch 37/40
32724/32724 [==============================] - 0s - loss: 0.2998 - acc: 0.8609     
Epoch 38/40
32724/32724 [==============================] - 0s - loss: 0.2980 - acc: 0.8619     
Epoch 39/40
32724/32724 [==============================] - 0s - loss: 0.2974 - acc: 0.8616     
Epoch 40/40
32724/32724 [==============================] - 0s - loss: 0.2970 - acc: 0.8623

Out[37]:

<keras.callbacks.History at 0x7f91f4692780>

In [38]:

deep_predictions = model.predict(X_test.values)

In [39]:

deep_f1, deep_roc_auc, deep_acc = evaluate(np.argmax(deep_predictions, 1), y_test)

F1: 0.6730386300278773
ROC_AUC: 0.795070458358415
ACCURACY: 0.8471894776026803

Final Results¶

So what did we end up with and what did we learn?

In [43]:

model_names = ["LR", "Tuned LR", "GBT", "Tuned GBT", "Deep", "Deep Tuned"]
metrics_of_interest = ["F1", "ROC_AUC", "ACCURACY"]
f1s = [lr_f1, tuned_lr_f1, gbt_f1, gbt_tuned_f1, deep_simple_f1, deep_f1]
roc_aucs = [lr_roc_auc, tuned_lr_roc_auc, gbt_roc_auc, gbt_tunded_roc_auc, deep_simple_roc_auc, deep_roc_auc]
accuracy = [lr_acc, tuned_lr_acc, gbt_acc, gbt_tuned_acc, deep_simple_acc, deep_acc]

In [44]:

results_df = pd.DataFrame(columns=metrics_of_interest, index=model_names, data=np.array([f1s, roc_aucs, accuracy]).T)

In [45]:

results_df

Out[45]:

	F1	ROC_AUC	ACCURACY
LR	0.650709	0.757495	0.848803
Tuned LR	0.651203	0.757883	0.848865
GBT	0.650709	0.757495	0.848803
Tuned GBT	0.704258	0.788551	0.872441
Deep	0.407676	0.753605	0.796997
Deep Tuned	0.673039	0.795070	0.847189

First off, the out of the box logistic regression does basically as well as the tuned version. Tuning helped a bit, but didn't make much of a difference. The out of the box GBT did slightly worse, but basically as well as the tuned logistic regression. Which for some might seem surprising given the successes of XGBoost on Kaggle. That being said, once you spend a bit of time tuning, GBTs do significantly better with a jump across the board and about a a 7.5% increase in F1.

The deep networks are interesting indeed. The first naive pass does very poorly. The ROC_AUC and Accuracy look okay, but the F1 score points to the issue: it learned that most things are a 0 and overfit to that. As we can see below:

In [50]:

from collections import Counter

In [51]:

Counter(np.argmax(deep_predictions_simple, 1))

Out[51]:

Counter({0: 14508, 1: 1610})

In [52]:

Counter(y_test)

Out[52]:

Counter({0: 12204, 1: 3914})

That being said, after spending some time tuning, we are able to boost the deep net's performance a lot. Even geting to the best ROC_AUC score and a competitve F1 and accuracy. So what are the main take aways?

Logistic regression is a nice baseline that may not require a lot of tuning and even if it does need some it is very fast to train
GBTs are powerful algorithms, but without tuning may not beat a baseline by much. That being said with a fairly standard grid search across a few values one can see good improvements. This grid search can take some time, though, as GBTs are slower to train the logistic regression.
Deep nets can achieve competitve results even outside of text, image, and audio fields. Training a "standard deep net", though, without any tuning can lead to very poor results. To really maximize the value of deep network time needs to be spent experiment with architectures. For example, how deep? how wide? regularization? normalization? what kind of initalization? etc. There are tons of options and perhaps the path to tuning is less clear than GBTs. In addition, deep nets can be slow to train, so all of this iteration takes time.

In conclusion, there really doesn't seeem to be a free lunch. You can get better results with more complex models, but those models do take time and understanding to tune and even then might not provide significant improvements. Lastly, this is clearly just one data set and may not genearlize at all. It would be interesting to run similar tests on other data sets to see if there is a trend.