import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization, MaxPool1D, Flatten, Conv1D
from keras.utils import to_categorical
import numpy as np
Using TensorFlow backend.
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
Let's take a look at the UCI Adult Data Set. This data set was extrated from Census data with the goal of prediction who makes over $50,000.
I would like to use these data as a means of exploring various machine learning algorithms that will increase in complexity to see how the compare on various evaluation metrics. Additonally, it will be interesting to see how much there is to gain by spending some time fine-tuning these algorithms.
We will look at the following algorithms:
And evaluate them with the following metrics:
Let's go ahead and read in the data and take a look.
names = ['age', 'workclass', 'fnlwgt', 'education', 'educationnum', 'maritalstatus', 'occupation', 'relationship', 'race',
'sex', 'capitalgain', 'capitalloss', 'hoursperweek', 'nativecountry', 'label']
train_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
header=None, names=names)
test_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
header=None, names=names, skiprows=[0])
all_df = pd.concat([train_df, test_df])
all_df.head()
age | workclass | fnlwgt | education | educationnum | maritalstatus | occupation | relationship | race | sex | capitalgain | capitalloss | hoursperweek | nativecountry | label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
It looks like we have 14 columns to help us predict our classification. We will drop fnlwgt and education and then convert our categorical features to dummy variables. We will also convert our label to 0 and 1 where 1 means the person made more than $50k
all_df.shape
(48842, 15)
drop_columns = ['fnlwgt', 'education']
continuous_features = ['age', 'capitalgain', 'capitalloss', 'hoursperweek']
cat_features =['educationnum', 'workclass', 'maritalstatus', 'occupation', 'relationship', 'race', 'sex', 'nativecountry']
all_df_dummies = pd.get_dummies(all_df, columns=cat_features)
all_df_dummies.drop(drop_columns, 1, inplace=True)
y = all_df_dummies['label'].apply(lambda x: 0 if '<' in x else 1)
X = all_df_dummies.drop(['label'], 1)
y.value_counts(normalize=True)
0 0.760718 1 0.239282 Name: label, dtype: float64
Looks like we don't have balanced classes, so good thing we are looking at other metrics than accuracy. Now let's split into training and testing with 1/3 for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
X_train.shape
(32724, 106)
The goal of this project is not to focus on cleaning / data exploration / feature engineering. So we will define a very simple cleaning pipeline that fills any missing values with the median and then scales ever column.
clean_pipeline = Pipeline([('imputer', preprocessing.Imputer(strategy="median")),
('std_scaler', preprocessing.StandardScaler()),])
X_train_clean = clean_pipeline.fit_transform(X_train)
X_test_clean = clean_pipeline.transform(X_test)
A simple function to calculate our metrics of interest
def evaluate(true, pred):
f1 = metrics.f1_score(true, pred)
roc_auc = metrics.roc_auc_score(true, pred)
accuracy = metrics.accuracy_score(true, pred)
print("F1: {0}\nROC_AUC: {1}\nACCURACY: {2}".format(f1, roc_auc, accuracy))
return f1, roc_auc, accuracy
The first model up is a simple logistic regression with the default hyperparameters.
clf = LogisticRegression()
clf.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
lr_predictions = clf.predict(X_test)
lr_f1, lr_roc_auc, lr_acc = evaluate(y_test, lr_predictions)
F1: 0.6507094739859539 ROC_AUC: 0.7574953226590644 ACCURACY: 0.8488025809653803
Now lets spend a bit of time tuning our regularization.
lr_grid = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }
tuned_lr = GridSearchCV(LogisticRegression(), lr_grid, scoring='f1', n_jobs=10)
tuned_lr.fit(X_train, y_train)
GridSearchCV(cv=None, error_score='raise', estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False), fit_params={}, iid=True, n_jobs=10, param_grid={'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}, pre_dispatch='2*n_jobs', refit=True, return_train_score=True, scoring='f1', verbose=0)
Here are our best parameters
tuned_lr.best_params_
{'C': 1, 'penalty': 'l1'}
tuned_lr_predictions = tuned_lr.predict(X_test)
tuned_lr_f1, tuned_lr_roc_auc, tuned_lr_acc = evaluate(y_test, tuned_lr_predictions)
F1: 0.6512027491408934 ROC_AUC: 0.7578833983412963 ACCURACY: 0.8488646234024072
Now an out of the box boosted tree
gbt = GradientBoostingClassifier()
gbt.fit(X_train, y_train)
gbt_predictions = clf.predict(X_test)
gbt_f1, gbt_roc_auc, gbt_acc = evaluate(y_test, gbt_predictions)
F1: 0.6507094739859539 ROC_AUC: 0.7574953226590644 ACCURACY: 0.8488025809653803
And now a tuned boosted tree. I ran the grid shown below to get my final parameters, but for speed's sake I now just show the best.
#gbt_grid = {'learning_rate': [.01], 'n_estimators': [250, 500, 1000], 'max_depth': [3, 4, 5]}
gbt_tuned = GradientBoostingClassifier(learning_rate=.01, n_estimators=1000, max_depth=5)
gbt_tuned.fit(X_train, y_train)
GradientBoostingClassifier(criterion='friedman_mse', init=None, learning_rate=0.01, loss='deviance', max_depth=5, max_features=None, max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=1000, presort='auto', random_state=None, subsample=1.0, verbose=0, warm_start=False)
gbt_tuned_predictions = gbt_tuned.predict(X_test)
gbt_tuned_f1, gbt_tunded_roc_auc, gbt_tuned_acc = evaluate(y_test, gbt_tuned_predictions)
F1: 0.7042577675489067 ROC_AUC: 0.7885511539729889 ACCURACY: 0.8724407494726393
Now we have all heard the amazing power of deep learning. So let's take a look at how well it fares with our task. There are a fair amout of hyperparameters with deep nets, but I will pick some reasonable values as our starting point.
model_simple = Sequential()
model_simple.add(Dense(1024, activation='relu' , input_dim = X_train.shape[1]))
model_simple.add(Dropout(0.5))
model_simple.add(Dense(2, activation='softmax', name='softmax'))
y_train_cat = to_categorical(y_train.values, 2)
model_simple.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model_simple.fit(X_train.values, y_train_cat, batch_size=32, epochs=25)
Epoch 1/25 32724/32724 [==============================] - 2s - loss: 1.3490 - acc: 0.7843 Epoch 2/25 32724/32724 [==============================] - 1s - loss: 1.3434 - acc: 0.7985 Epoch 3/25 32724/32724 [==============================] - 1s - loss: 1.4843 - acc: 0.7937 Epoch 4/25 32724/32724 [==============================] - 1s - loss: 1.4806 - acc: 0.7947 Epoch 5/25 32724/32724 [==============================] - 1s - loss: 1.4770 - acc: 0.7953 Epoch 6/25 32724/32724 [==============================] - 1s - loss: 1.4761 - acc: 0.7964 Epoch 7/25 32724/32724 [==============================] - 1s - loss: 1.4755 - acc: 0.7977 Epoch 8/25 32724/32724 [==============================] - 1s - loss: 1.4750 - acc: 0.7977 Epoch 9/25 32724/32724 [==============================] - 1s - loss: 1.4744 - acc: 0.7968 Epoch 10/25 32724/32724 [==============================] - 1s - loss: 1.4738 - acc: 0.7981 Epoch 11/25 32724/32724 [==============================] - 1s - loss: 1.4722 - acc: 0.7975 Epoch 12/25 32724/32724 [==============================] - 1s - loss: 1.4725 - acc: 0.7995 Epoch 13/25 32724/32724 [==============================] - 1s - loss: 1.4714 - acc: 0.7986 Epoch 14/25 32724/32724 [==============================] - 1s - loss: 1.4709 - acc: 0.7988 Epoch 15/25 32724/32724 [==============================] - 1s - loss: 1.4713 - acc: 0.7979 Epoch 16/25 32724/32724 [==============================] - 1s - loss: 1.4698 - acc: 0.7982 Epoch 17/25 32724/32724 [==============================] - 1s - loss: 1.4704 - acc: 0.7988 Epoch 18/25 32724/32724 [==============================] - 1s - loss: 1.4705 - acc: 0.7994 Epoch 19/25 32724/32724 [==============================] - 1s - loss: 1.4707 - acc: 0.7981 Epoch 20/25 32724/32724 [==============================] - 1s - loss: 1.4695 - acc: 0.8000 Epoch 21/25 32724/32724 [==============================] - 1s - loss: 1.4700 - acc: 0.8002 Epoch 22/25 32724/32724 [==============================] - 1s - loss: 1.4687 - acc: 0.8006 Epoch 23/25 32724/32724 [==============================] - 1s - loss: 1.4689 - acc: 0.8006 Epoch 24/25 32724/32724 [==============================] - 1s - loss: 1.4692 - acc: 0.7994 Epoch 25/25 32724/32724 [==============================] - 1s - loss: 1.4672 - acc: 0.8003
<keras.callbacks.History at 0x7f92117bfe10>
deep_predictions_simple = model_simple.predict(X_test.values)
deep_simple_f1, deep_simple_roc_auc, deep_simple_acc = evaluate(np.argmax(deep_predictions_simple, 1), y_test)
F1: 0.4076755973931933 ROC_AUC: 0.753604522328225 ACCURACY: 0.7969971460478967
Then I spent about 30 minutes playing with different architectures so see how far I could push a deep net and this is what I got. Note: this is not to say that there isn't a better or even much better architecture, but after trying a fair amount of normal options, nothing better appeared.
model = Sequential()
model.add(Dense(1024, activation='elu', kernel_initializer='glorot_normal', input_dim = X_train.shape[1]))
model.add(BatchNormalization())
model.add(Dense(128, activation='elu', kernel_initializer='glorot_normal'))
model.add(BatchNormalization())
model.add(Dense(64, activation='elu', kernel_initializer='glorot_normal'))
model.add(Dropout(0.2))
model.add(Dense(2, activation='softmax', name='softmax'))
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
model.fit(X_train.values, y_train_cat, batch_size=512, epochs=40)
Epoch 1/40 32724/32724 [==============================] - 0s - loss: 0.3969 - acc: 0.8129 Epoch 2/40 32724/32724 [==============================] - 0s - loss: 0.3418 - acc: 0.8409 Epoch 3/40 32724/32724 [==============================] - 0s - loss: 0.3360 - acc: 0.8401 Epoch 4/40 32724/32724 [==============================] - 0s - loss: 0.3305 - acc: 0.8430 Epoch 5/40 32724/32724 [==============================] - 0s - loss: 0.3270 - acc: 0.8458 Epoch 6/40 32724/32724 [==============================] - 0s - loss: 0.3274 - acc: 0.8465 Epoch 7/40 32724/32724 [==============================] - 0s - loss: 0.3223 - acc: 0.8495 Epoch 8/40 32724/32724 [==============================] - 0s - loss: 0.3165 - acc: 0.8524 Epoch 9/40 32724/32724 [==============================] - 0s - loss: 0.3231 - acc: 0.8477 Epoch 10/40 32724/32724 [==============================] - 0s - loss: 0.3177 - acc: 0.8527 Epoch 11/40 32724/32724 [==============================] - 0s - loss: 0.3158 - acc: 0.8522 Epoch 12/40 32724/32724 [==============================] - 0s - loss: 0.3174 - acc: 0.8505 Epoch 13/40 32724/32724 [==============================] - 0s - loss: 0.3148 - acc: 0.8527 Epoch 14/40 32724/32724 [==============================] - 0s - loss: 0.3114 - acc: 0.8539 Epoch 15/40 32724/32724 [==============================] - 0s - loss: 0.3106 - acc: 0.8552 Epoch 16/40 32724/32724 [==============================] - 0s - loss: 0.3094 - acc: 0.8555 Epoch 17/40 32724/32724 [==============================] - 0s - loss: 0.3074 - acc: 0.8548 Epoch 18/40 32724/32724 [==============================] - 0s - loss: 0.3089 - acc: 0.8559 Epoch 19/40 32724/32724 [==============================] - 0s - loss: 0.3088 - acc: 0.8555 Epoch 20/40 32724/32724 [==============================] - 0s - loss: 0.3097 - acc: 0.8564 Epoch 21/40 32724/32724 [==============================] - 0s - loss: 0.3088 - acc: 0.8554 Epoch 22/40 32724/32724 [==============================] - 0s - loss: 0.3085 - acc: 0.8547 Epoch 23/40 32724/32724 [==============================] - 0s - loss: 0.3037 - acc: 0.8589 Epoch 24/40 32724/32724 [==============================] - 0s - loss: 0.3076 - acc: 0.8555 Epoch 25/40 32724/32724 [==============================] - 0s - loss: 0.3056 - acc: 0.8581 Epoch 26/40 32724/32724 [==============================] - 0s - loss: 0.3037 - acc: 0.8587 Epoch 27/40 32724/32724 [==============================] - 0s - loss: 0.3056 - acc: 0.8567 Epoch 28/40 32724/32724 [==============================] - 0s - loss: 0.3021 - acc: 0.8588 Epoch 29/40 32724/32724 [==============================] - 0s - loss: 0.3026 - acc: 0.8584 Epoch 30/40 32724/32724 [==============================] - 0s - loss: 0.3033 - acc: 0.8600 Epoch 31/40 32724/32724 [==============================] - 0s - loss: 0.3027 - acc: 0.8574 Epoch 32/40 32724/32724 [==============================] - 0s - loss: 0.3019 - acc: 0.8585 Epoch 33/40 32724/32724 [==============================] - 0s - loss: 0.3002 - acc: 0.8593 Epoch 34/40 32724/32724 [==============================] - 0s - loss: 0.3002 - acc: 0.8603 Epoch 35/40 32724/32724 [==============================] - 0s - loss: 0.2968 - acc: 0.8619 Epoch 36/40 32724/32724 [==============================] - 0s - loss: 0.3010 - acc: 0.8598 Epoch 37/40 32724/32724 [==============================] - 0s - loss: 0.2998 - acc: 0.8609 Epoch 38/40 32724/32724 [==============================] - 0s - loss: 0.2980 - acc: 0.8619 Epoch 39/40 32724/32724 [==============================] - 0s - loss: 0.2974 - acc: 0.8616 Epoch 40/40 32724/32724 [==============================] - 0s - loss: 0.2970 - acc: 0.8623
<keras.callbacks.History at 0x7f91f4692780>
deep_predictions = model.predict(X_test.values)
deep_f1, deep_roc_auc, deep_acc = evaluate(np.argmax(deep_predictions, 1), y_test)
F1: 0.6730386300278773 ROC_AUC: 0.795070458358415 ACCURACY: 0.8471894776026803
So what did we end up with and what did we learn?
model_names = ["LR", "Tuned LR", "GBT", "Tuned GBT", "Deep", "Deep Tuned"]
metrics_of_interest = ["F1", "ROC_AUC", "ACCURACY"]
f1s = [lr_f1, tuned_lr_f1, gbt_f1, gbt_tuned_f1, deep_simple_f1, deep_f1]
roc_aucs = [lr_roc_auc, tuned_lr_roc_auc, gbt_roc_auc, gbt_tunded_roc_auc, deep_simple_roc_auc, deep_roc_auc]
accuracy = [lr_acc, tuned_lr_acc, gbt_acc, gbt_tuned_acc, deep_simple_acc, deep_acc]
results_df = pd.DataFrame(columns=metrics_of_interest, index=model_names, data=np.array([f1s, roc_aucs, accuracy]).T)
results_df
F1 | ROC_AUC | ACCURACY | |
---|---|---|---|
LR | 0.650709 | 0.757495 | 0.848803 |
Tuned LR | 0.651203 | 0.757883 | 0.848865 |
GBT | 0.650709 | 0.757495 | 0.848803 |
Tuned GBT | 0.704258 | 0.788551 | 0.872441 |
Deep | 0.407676 | 0.753605 | 0.796997 |
Deep Tuned | 0.673039 | 0.795070 | 0.847189 |
First off, the out of the box logistic regression does basically as well as the tuned version. Tuning helped a bit, but didn't make much of a difference. The out of the box GBT did slightly worse, but basically as well as the tuned logistic regression. Which for some might seem surprising given the successes of XGBoost on Kaggle. That being said, once you spend a bit of time tuning, GBTs do significantly better with a jump across the board and about a a 7.5% increase in F1.
The deep networks are interesting indeed. The first naive pass does very poorly. The ROC_AUC and Accuracy look okay, but the F1 score points to the issue: it learned that most things are a 0 and overfit to that. As we can see below:
from collections import Counter
Counter(np.argmax(deep_predictions_simple, 1))
Counter({0: 14508, 1: 1610})
Counter(y_test)
Counter({0: 12204, 1: 3914})
That being said, after spending some time tuning, we are able to boost the deep net's performance a lot. Even geting to the best ROC_AUC score and a competitve F1 and accuracy. So what are the main take aways?
In conclusion, there really doesn't seeem to be a free lunch. You can get better results with more complex models, but those models do take time and understanding to tune and even then might not provide significant improvements. Lastly, this is clearly just one data set and may not genearlize at all. It would be interesting to run similar tests on other data sets to see if there is a trend.