License

Copyright 2020 Patrick Hall and the H2O.ai team

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

DISCLAIMER: This notebook is not legal compliance advice.

Building the Case for Complexity

With automatic parsimonious hybrids (autoPH)

This notebook uses an automated process to train and select accurate and interpretable machine learning models.

It is roughly based on two recent white papers that propose the use of Shapley values in credit underwriting, and on a model selection process introduced at the 2004 KDD Cup:

KDD-Cup 2004: Results and Analysis

The notebook first trains a penalized GLM for initial feature selection, then performs forward feature selection with a more complex monotonically constrained GBM, and then trains an even more complex unconstrained GBM, also with forward feature selection. The notebook ends with a bonus section that illustrates an automated, heuristic method for selecting monotonicity constraints for certain features and automatically training a parsimonious hybrid of the constrained and unconstrained GBMs. For each trained model, the notebook displays detailed assessment and diagnostic information, enabling practitioners to make deliberate, informed tradeoffs between accuracy, explainability, interpretability, and fairness.
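
Forward step-wise training throughout this notebook is handled by helpers in the accompanying auto_ph module. The following is only a rough sketch of the general idea, not the auto_ph implementation; train_and_score is a hypothetical stand-in for fitting one model on the listed features and returning its validation AUC.

# rough sketch of forward step-wise feature selection:
# start from the GLM-selected features, add one candidate at a time
# (ordered by absolute Pearson correlation with the target),
# retrain, and record validation performance at each step
def forward_select_sketch(base_features, candidate_features, train_and_score):
    selected = list(base_features)
    results = []
    for feature in candidate_features:
        selected.append(feature)               # grow the input set by one feature
        auc = train_and_score(selected)        # hypothetical: fit a model, return validation AUC
        results.append((list(selected), auc))  # keep every step for later ranking
    return results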

Contents

  1. Download, Explore, and Prepare UCI Credit Card Default Data
  2. Investigate Pair-wise Pearson Correlations for DEFAULT_NEXT_MONTH
  3. Train Elastic Net Logistic GLM for Initial Feature Selection
    • Elastic Net Forward Step-wise Training
    • Model Details for Model Documentation
    • Compare Global Model Weights for Alternative Models
    • Partial Dependence and ICE for Model Documentation
    • Local Model Weights (for Adverse Action Notices)
    • Discrimination (Fair Lending) Testing
    • Estimate Business Impact
  4. Train Monotonic GBM with Forward Feature Selection
    • Forward Step-wise Training
    • Compare Global Model Weights for Alternative Models
    • Perform Cross-validated Ranking to Select Best MGBM Against Alternative Models
    • Model Details for Model Documentation
    • Partial Dependence and ICE for Model Documentation
    • Compare Local Model Weights (for Adverse Action Notices)
    • Discrimination (Fair Lending) Testing
    • Estimate Business Impact
  5. Train GBM with Forward Feature Selection
    • Forward Step-wise Training
    • Compare Global Model Weights for Alternative Models
    • Perform Cross-validated Ranking to Select Best GBM Against Alternative Models
    • Model Details for Model Documentation
    • Partial Dependence and ICE for Model Documentation
    • Compare Local Model Weights (for Adverse Action Notices)
    • Discrimination (Fair Lending) Testing
    • Estimate Business Impact
  6. Bonus: Automatically Training a Parsimonious Hybrid of Previous Models
    • Select Monotonicity Constraints Automatically
    • Forward Step-wise Training
    • Compare Global Model Weights for Alternative Models
    • Perform Cross-validated Ranking to Select Best Hybrid Against Alternative Models
    • Model Details for Model Documentation
    • Partial Dependence and ICE for Model Documentation
    • Compare Local Model Weights (for Adverse Action Notices)
    • Discrimination (Fair Lending) Testing
    • Estimate Business Impact

Global hyperparameters

In [1]:
SEED                    = 12345   # global random seed for better reproducibility
GLM_SELECTION_THRESHOLD = 0.001   # threshold above which a GLM coefficient is considered "selected"
MONO_THRESHOLD          = 6       # lower == more monotone constraints
TRUE_POSITIVE_AMOUNT    = 0       # revenue for rejecting a defaulting customer
TRUE_NEGATIVE_AMOUNT    = 20000   # revenue for accepting a paying customer, ~ customer LTV
FALSE_POSITIVE_AMOUNT   = -20000  # revenue for rejecting a paying customer, ~ -customer LTV 
FALSE_NEGATIVE_AMOUNT   = -100000 # revenue for accepting a defaulting customer, ~ -mean(LIMIT_BAL)
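
These dollar amounts are combined with each model's confusion matrix later in the notebook to estimate business impact:

estimated impact = TP × TRUE_POSITIVE_AMOUNT + TN × TRUE_NEGATIVE_AMOUNT + FP × FALSE_POSITIVE_AMOUNT + FN × FALSE_NEGATIVE_AMOUNT

where TP, TN, FP, and FN are the counts from a confusion matrix computed at a chosen probability cutoff.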

Python imports and inits

In [2]:
import auto_ph                                                    # simple module for training and eval
import h2o                                                        # import h2o python bindings to java server
import numpy as np                                                # array, vector, matrix calculations
import operator                                                   # for sorting dictionaries
import pandas as pd                                               # DataFrame handling
import time                                                       # for timers

import matplotlib.pyplot as plt      # plotting
pd.options.display.max_columns = 999 # enable display of all columns in notebook

# enables display of plots in notebook
%matplotlib inline

np.random.seed(SEED)                     # set random seed for better reproducibility

h2o.init(max_mem_size='24G', nthreads=4) # start h2o with plenty of memory and threads
h2o.remove_all()                         # clears h2o memory
h2o.no_progress()                        # turn off h2o progress indicators    
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_252"; OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09); OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)
  Starting server from /home/patrickh/Workspace/interpretable_machine_learning_with_python/env_iml/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpwmreop0t
  JVM stdout: /tmp/tmpwmreop0t/h2o_patrickh_started_from_python.out
  JVM stderr: /tmp/tmpwmreop0t/h2o_patrickh_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
Warning: Your H2O cluster version is too old (9 months and 9 days)! Please download and install the latest version from http://h2o.ai/download/
H2O cluster uptime: 01 secs
H2O cluster timezone: America/New_York
H2O data parsing timezone: UTC
H2O cluster version: 3.26.0.3
H2O cluster version age: 9 months and 9 days !!!
H2O cluster name: H2O_from_python_patrickh_txnqwv
H2O cluster total nodes: 1
H2O cluster free memory: 21.33 Gb
H2O cluster total cores: 24
H2O cluster allowed cores: 4
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy: None
H2O internal security: False
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4
Python version: 3.6.9 final

Start global timer

In [3]:
big_tic = time.time()

1. Download, Explore, and Prepare UCI Credit Card Default Data

UCI credit card default data: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

The UCI credit card default data contains demographic and payment information about credit card customers in Taiwan in the year 2005. The data set contains 23 input features:

  • LIMIT_BAL: Amount of given credit (NT dollar)
  • SEX: 1 = male; 2 = female
  • EDUCATION: 1 = graduate school; 2 = university; 3 = high school; 4 = others
  • MARRIAGE: 1 = married; 2 = single; 3 = others
  • AGE: Age in years
  • PAY_0, PAY_2 - PAY_6: History of past payment; PAY_0 = the repayment status in September, 2005; PAY_2 = the repayment status in August, 2005; ...; PAY_6 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
  • BILL_AMT1 - BILL_AMT6: Amount of bill statement (NT dollar). BILL_AMT1 = amount of bill statement in September, 2005; BILL_AMT2 = amount of bill statement in August, 2005; ...; BILL_AMT6 = amount of bill statement in April, 2005.
  • PAY_AMT1 - PAY_AMT6: Amount of previous payment (NT dollar). PAY_AMT1 = amount paid in September, 2005; PAY_AMT2 = amount paid in August, 2005; ...; PAY_AMT6 = amount paid in April, 2005.

Import data and clean

In [4]:
# import XLS file
path = 'default_of_credit_card_clients.xls'
data = pd.read_excel(path,
                     skiprows=1) # skip the first row of the spreadsheet

# remove spaces from target column name 
data = data.rename(columns={'default payment next month': 'DEFAULT_NEXT_MONTH'}) 

Recode categorical features into strings

In [5]:
def recode_cc_data(frame):
    
    """ Recodes numeric categorical variables into categorical character variables
    with more transparent values. 
    
    Args:
        frame: Pandas DataFrame version of UCI credit card default data.
        
    Returns: 
        H2OFrame with recoded values.
        
    """
    
    # define recoded values
    sex_dict = {1:'male', 2:'female'}
    education_dict = {0:'other', 1:'graduate school', 2:'university', 3:'high school', 
                      4:'other', 5:'other', 6:'other'}
    marriage_dict = {0:'other', 1:'married', 2:'single', 3:'divorced'}
    
    # recode values using Pandas apply() and anonymous function
    frame['SEX'] = frame['SEX'].apply(lambda i: sex_dict[i])
    frame['EDUCATION'] = frame['EDUCATION'].apply(lambda i: education_dict[i])    
    frame['MARRIAGE'] = frame['MARRIAGE'].apply(lambda i: marriage_dict[i])           
                
    return frame

data = recode_cc_data(data)

Assign modeling roles

In [6]:
# assign target and inputs for models
y_name = 'DEFAULT_NEXT_MONTH'
x_names = [name for name in data.columns if name not in [y_name, 'ID', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE']]
print('y_name =', y_name)
print('x_names =', x_names)
y_name = DEFAULT_NEXT_MONTH
x_names = ['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

Display descriptive statistics

In [7]:
data[x_names + [y_name]].describe() 
Out[7]:
LIMIT_BAL PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6 BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 DEFAULT_NEXT_MONTH
count 30000.000000 30000.000000 30000.000000 30000.000000 30000.000000 30000.000000 30000.000000 30000.000000 30000.000000 3.000000e+04 30000.000000 30000.000000 30000.000000 30000.000000 3.000000e+04 30000.00000 30000.000000 30000.000000 30000.000000 30000.000000
mean 167484.322667 -0.016700 -0.133767 -0.166200 -0.220667 -0.266200 -0.291100 51223.330900 49179.075167 4.701315e+04 43262.948967 40311.400967 38871.760400 5663.580500 5.921163e+03 5225.68150 4826.076867 4799.387633 5215.502567 0.221200
std 129747.661567 1.123802 1.197186 1.196868 1.169139 1.133187 1.149988 73635.860576 71173.768783 6.934939e+04 64332.856134 60797.155770 59554.107537 16563.280354 2.304087e+04 17606.96147 15666.159744 15278.305679 17777.465775 0.415062
min 10000.000000 -2.000000 -2.000000 -2.000000 -2.000000 -2.000000 -2.000000 -165580.000000 -69777.000000 -1.572640e+05 -170000.000000 -81334.000000 -339603.000000 0.000000 0.000000e+00 0.00000 0.000000 0.000000 0.000000 0.000000
25% 50000.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 3558.750000 2984.750000 2.666250e+03 2326.750000 1763.000000 1256.000000 1000.000000 8.330000e+02 390.00000 296.000000 252.500000 117.750000 0.000000
50% 140000.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 22381.500000 21200.000000 2.008850e+04 19052.000000 18104.500000 17071.000000 2100.000000 2.009000e+03 1800.00000 1500.000000 1500.000000 1500.000000 0.000000
75% 240000.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 67091.000000 64006.250000 6.016475e+04 54506.000000 50190.500000 49198.250000 5006.000000 5.000000e+03 4505.00000 4013.250000 4031.500000 4000.000000 0.000000
max 1000000.000000 8.000000 8.000000 8.000000 8.000000 8.000000 8.000000 964511.000000 983931.000000 1.664089e+06 891586.000000 927171.000000 961664.000000 873552.000000 1.684259e+06 896040.00000 621000.000000 426529.000000 528666.000000 1.000000

2. Investigate Pair-wise Pearson Correlations for DEFAULT_NEXT_MONTH

Calculate Pearson correlation

In [8]:
# Pearson correlation between inputs and target
# is last column of correlation matrix
corr = pd.DataFrame(data[x_names + [y_name]].corr()[y_name]).iloc[:-1]
corr.columns = ['Pearson Correlation Coefficient']
corr
Out[8]:
Pearson Correlation Coefficient
LIMIT_BAL -0.153520
PAY_0 0.324794
PAY_2 0.263551
PAY_3 0.235253
PAY_4 0.216614
PAY_5 0.204149
PAY_6 0.186866
BILL_AMT1 -0.019644
BILL_AMT2 -0.014193
BILL_AMT3 -0.014076
BILL_AMT4 -0.010156
BILL_AMT5 -0.006760
BILL_AMT6 -0.005372
PAY_AMT1 -0.072929
PAY_AMT2 -0.058579
PAY_AMT3 -0.056250
PAY_AMT4 -0.056827
PAY_AMT5 -0.055124
PAY_AMT6 -0.053183

Plot Pearson correlation

In [9]:
# display correlation to target as barchart
fig, ax_ = plt.subplots(figsize=(8, 6))
_ = pd.DataFrame(data[x_names + [y_name]].corr()[y_name]).iloc[:-1].plot(kind='barh', ax=ax_, colormap='gnuplot')

3. Train Elastic Net Logistic GLM for Initial Feature Selection

3.1 Elastic Net Forward Step-wise Training

Split data into training and validation partitions

In [10]:
split_ratio = 0.7 # 70%/30% train/validation split

# execute split
split = np.random.rand(len(data)) < split_ratio
train = data[split]
valid = data[~split]

# summarize split
print('Train data rows = %d, columns = %d' % (train.shape[0], train.shape[1]))
print('Validation data rows = %d, columns = %d' % (valid.shape[0], valid.shape[1]))
Train data rows = 20946, columns = 25
Validation data rows = 9054, columns = 25

Train penalized GLM for initial benchmark and feature selection

In [11]:
# train penalized GLM w/ alpha and lambda grid search
best_glm = auto_ph.glm_grid(x_names, y_name, h2o.H2OFrame(train),
                            h2o.H2OFrame(valid), SEED)

# output results
print('Best penalized GLM AUC: %.2f' % 
      best_glm.auc(valid=True))

# print selected coefficients
print('Best penalized GLM coefficients:')
for c_name, c_val in sorted(best_glm.coef().items(), key=operator.itemgetter(1)):
    if abs(c_val) > GLM_SELECTION_THRESHOLD:
        print('%s %s' % (str(c_name + ':').ljust(25), c_val))
Best penalized GLM AUC: 0.73
Best penalized GLM coefficients:
Intercept:                -1.0553055885519234
PAY_6:                    0.012282515281903404
PAY_4:                    0.02548663430338296
PAY_5:                    0.04613937054483111
PAY_3:                    0.07909158701433701
PAY_2:                    0.08471364623597981
PAY_0:                    0.5371954715199951
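
The glm_grid helper above is part of auto_ph, and its exact grid settings are not shown in this notebook. A minimal sketch of a comparable alpha/lambda grid search using H2O's public grid-search API follows; the alpha values are illustrative assumptions only.

from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.grid.grid_search import H2OGridSearch

# convert to H2OFrames and treat the 0/1 target as categorical for binomial GLM
htrain, hvalid = h2o.H2OFrame(train), h2o.H2OFrame(valid)
htrain[y_name] = htrain[y_name].asfactor()
hvalid[y_name] = hvalid[y_name].asfactor()

# search over alpha; lambda is searched internally because lambda_search=True
hyper_params = {'alpha': [0.01, 0.25, 0.5, 0.99]}
glm_grid_sketch = H2OGridSearch(
    H2OGeneralizedLinearEstimator(family='binomial', lambda_search=True, seed=SEED),
    hyper_params=hyper_params)
glm_grid_sketch.train(x=x_names, y=y_name, training_frame=htrain, validation_frame=hvalid)

# rank candidate models by validation AUC and keep the best one
sketch_best_glm = glm_grid_sketch.get_grid(sort_by='auc', decreasing=True).models[0]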

3.2 Model Details for Model Documentation

Display best GLM information

In [12]:
best_glm 
Model Details
=============
H2OGeneralizedLinearEstimator :  Generalized Linear Modeling
Model Key:  Grid_GLM_Key_Frame__upload_a63e007f1d8a0eed8507dfc545944fcb.hex_model_python_1591114897536_1_model_1


GLM Model: summary
family link regularization lambda_search number_of_predictors_total number_of_active_predictors number_of_iterations training_frame
0 binomial logit Elastic Net (alpha = 0.01, lambda = 0.005908 ) nlambda = 100, lambda.max = 13.333, lambda.min = 0.005908, lambda.... 19 19 109 Key_Frame__upload_a63e007f1d8a0eed8507dfc545944fcb.hex
ModelMetricsBinomialGLM: glm
** Reported on train data. **

MSE: 0.14649158915954694
RMSE: 0.3827421967324049
LogLoss: 0.4685812636002607
Null degrees of freedom: 20945
Residual degrees of freedom: 20926
Null deviance: 22178.75361964548
Residual deviance: 19629.80629474212
AIC: 19669.80629474212
AUC: 0.7182752479663853
pr_auc: 0.5010322496049208
Gini: 0.4365504959327706

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.2498310673824375: 
0 1 Error Rate
0 0 13778.0 2518.0 0.1545 (2518.0/16296.0)
1 1 2168.0 2482.0 0.4662 (2168.0/4650.0)
2 Total 15946.0 5000.0 0.2237 (4686.0/20946.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
0 max f1 0.249831 0.514404 205.0
1 max f2 0.054654 0.594059 377.0
2 max f0point5 0.399178 0.567555 137.0
3 max accuracy 0.418922 0.817053 128.0
4 max precision 0.706802 0.797414 34.0
5 max recall 0.001281 1.000000 399.0
6 max specificity 0.989212 0.999570 0.0
7 max absolute_mcc 0.399178 0.396395 137.0
8 max min_per_class_accuracy 0.221641 0.658925 237.0
9 max mean_per_class_accuracy 0.245211 0.690483 210.0
Gains/Lift Table: Avg response rate: 22.20 %, avg score: 22.20 %
group cumulative_data_fraction lower_threshold lift cumulative_lift response_rate score cumulative_response_rate cumulative_score capture_rate cumulative_capture_rate gain cumulative_gain
0 1 0.010026 7.192943e-01 3.539263 3.539263 0.785714 0.816889 0.785714 0.816889 0.035484 0.035484 253.926267 253.926267
1 2 0.020004 6.109553e-01 2.952721 3.246692 0.655502 0.659815 0.720764 0.738540 0.029462 0.064946 195.272110 224.669182
2 3 0.030030 5.904669e-01 3.303312 3.265595 0.733333 0.600522 0.724960 0.692461 0.033118 0.098065 230.331183 226.559516
3 4 0.040008 5.652261e-01 3.469986 3.316571 0.770335 0.576997 0.736277 0.663664 0.034624 0.132688 246.998611 231.657094
4 5 0.050033 5.378222e-01 3.024461 3.258037 0.671429 0.552092 0.723282 0.641307 0.030323 0.163011 202.446083 225.803743
5 6 0.100019 4.442181e-01 2.916965 3.087582 0.647564 0.484561 0.685442 0.562971 0.145806 0.308817 191.696460 208.758242
6 7 0.150005 3.537487e-01 2.090922 2.755468 0.464183 0.405265 0.611712 0.510419 0.104516 0.413333 109.092153 175.546785
7 8 0.200038 2.655100e-01 1.427003 2.423193 0.316794 0.294731 0.537947 0.456471 0.071398 0.484731 42.700320 142.319316
8 9 0.300010 2.382281e-01 1.015345 1.954060 0.225406 0.248630 0.433800 0.387213 0.101505 0.586237 1.534461 95.405967
9 10 0.400029 2.224902e-01 0.672990 1.633754 0.149403 0.229921 0.362692 0.347885 0.067312 0.653548 -32.701024 63.375397
10 11 0.500000 2.030962e-01 0.537788 1.414624 0.119389 0.213314 0.314046 0.320979 0.053763 0.707312 -46.221154 41.462366
11 12 0.600019 1.763763e-01 0.485929 1.259817 0.107876 0.190730 0.279679 0.299267 0.048602 0.755914 -51.407129 25.981653
12 13 0.699990 1.357259e-01 0.686218 1.177896 0.152340 0.157265 0.261492 0.278986 0.068602 0.824516 -31.378193 17.789625
13 14 0.800010 1.134964e-01 0.724593 1.121223 0.160859 0.123743 0.248911 0.259578 0.072473 0.896989 -27.540719 12.122318
14 15 0.900076 6.434736e-02 0.477100 1.049612 0.105916 0.093489 0.233013 0.241113 0.047742 0.944731 -52.289953 4.961223
15 16 1.000000 1.134804e-08 0.553111 1.000000 0.122790 0.049839 0.221999 0.222000 0.055269 1.000000 -44.688932 0.000000
ModelMetricsBinomialGLM: glm
** Reported on validation data. **

MSE: 0.14363466020798762
RMSE: 0.3789916360660056
LogLoss: 0.4617337838828072
Null degrees of freedom: 9053
Residual degrees of freedom: 9034
Null deviance: 9526.71172610569
Residual deviance: 8361.075358549871
AIC: 8401.075358549871
AUC: 0.7303402396287311
pr_auc: 0.5061676572465437
Gini: 0.46068047925746214

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.26072851982555734: 
0 1 Error Rate
0 0 6083.0 985.0 0.1394 (985.0/7068.0)
1 1 921.0 1065.0 0.4637 (921.0/1986.0)
2 Total 7004.0 2050.0 0.2105 (1906.0/9054.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
0 max f1 0.260729 0.527750 193.0
1 max f2 0.116430 0.592436 329.0
2 max f0point5 0.400670 0.576923 134.0
3 max accuracy 0.433434 0.822288 120.0
4 max precision 0.572358 0.743386 68.0
5 max recall 0.007410 1.000000 397.0
6 max specificity 0.989045 0.999859 0.0
7 max absolute_mcc 0.370892 0.413902 147.0
8 max min_per_class_accuracy 0.225699 0.672205 230.0
9 max mean_per_class_accuracy 0.246401 0.699498 206.0
Gains/Lift Table: Avg response rate: 21.94 %, avg score: 22.66 %
group cumulative_data_fraction lower_threshold lift cumulative_lift response_rate score cumulative_response_rate cumulative_score capture_rate cumulative_capture_rate gain cumulative_gain
0 1 0.010051 7.207631e-01 3.106072 3.106072 0.681319 0.813734 0.681319 0.813734 0.031219 0.031219 210.607218 210.607218
1 2 0.020102 6.229843e-01 3.156170 3.131121 0.692308 0.663736 0.686813 0.738735 0.031722 0.062941 215.617011 213.112114
2 3 0.030042 5.993452e-01 3.545821 3.268338 0.777778 0.607299 0.716912 0.695245 0.035247 0.098187 254.582075 226.833792
3 4 0.040093 5.753741e-01 3.757345 3.390927 0.824176 0.587697 0.743802 0.668284 0.037764 0.135952 275.734537 239.092657
4 5 0.050033 5.509826e-01 3.241893 3.361317 0.711111 0.562950 0.737307 0.647357 0.032226 0.168177 224.189325 236.131730
5 6 0.100066 4.545088e-01 2.968828 3.165073 0.651214 0.496126 0.694260 0.571741 0.148540 0.316717 196.882815 216.507273
6 7 0.149989 3.741888e-01 2.208854 2.846802 0.484513 0.414835 0.624448 0.519516 0.110272 0.426989 120.885357 184.680243
7 8 0.200022 2.740947e-01 1.489446 2.507276 0.326711 0.315894 0.549972 0.468582 0.074522 0.501511 48.944599 150.727595
8 9 0.299978 2.407348e-01 0.997420 2.004176 0.218785 0.253042 0.439617 0.396762 0.099698 0.601208 -0.258049 100.417577
9 10 0.400044 2.249573e-01 0.719563 1.682845 0.157837 0.232429 0.369133 0.355656 0.072004 0.673212 -28.043657 68.284535
10 11 0.500000 2.058437e-01 0.539010 1.454179 0.118232 0.215713 0.318975 0.327680 0.053877 0.727090 -46.099047 45.417925
11 12 0.599956 1.811296e-01 0.443298 1.285761 0.097238 0.194132 0.282032 0.305430 0.044310 0.771400 -55.670244 28.576100
12 13 0.700022 1.408759e-01 0.533383 1.178211 0.116998 0.162818 0.258441 0.285044 0.053374 0.824773 -46.661731 17.821055
13 14 0.799978 1.142562e-01 0.800958 1.131074 0.175691 0.125883 0.248102 0.265157 0.080060 0.904834 -19.904191 13.107353
14 15 0.899934 6.452901e-02 0.438260 1.054123 0.096133 0.094816 0.231222 0.246237 0.043807 0.948640 -56.173991 5.412260
15 16 1.000000 1.276339e-12 0.513255 1.000000 0.112583 0.049514 0.219351 0.226552 0.051360 1.000000 -48.674496 0.000000
Scoring History: 
timestamp duration iteration lambda predictors deviance_train deviance_test
0 2020-06-02 12:21:42 0.000 sec 1 .13E2 1 1.058854 1.052210
1 2020-06-02 12:21:42 0.046 sec 2 .12E2 2 1.058595 1.051942
2 2020-06-02 12:21:42 0.076 sec 3 .11E2 2 1.058312 1.051649
3 2020-06-02 12:21:42 0.112 sec 4 .1E2 3 1.057846 1.051165
4 2020-06-02 12:21:42 0.136 sec 5 .92E1 4 1.057200 1.050494
5 2020-06-02 12:21:42 0.166 sec 6 .84E1 5 1.056328 1.049589
6 2020-06-02 12:21:42 0.186 sec 7 .76E1 6 1.055193 1.048412
7 2020-06-02 12:21:43 0.203 sec 8 .7E1 7 1.053849 1.047023
8 2020-06-02 12:21:43 0.256 sec 9 .63E1 7 1.052407 1.045532
9 2020-06-02 12:21:43 0.270 sec 10 .58E1 8 1.050810 1.043879
10 2020-06-02 12:21:43 0.282 sec 11 .53E1 8 1.049070 1.042077
11 2020-06-02 12:21:43 0.299 sec 12 .48E1 8 1.047225 1.040165
12 2020-06-02 12:21:43 0.315 sec 13 .44E1 8 1.045279 1.038148
13 2020-06-02 12:21:43 0.329 sec 14 .4E1 8 1.043223 1.036017
14 2020-06-02 12:21:43 0.345 sec 15 .36E1 8 1.041064 1.033777
15 2020-06-02 12:21:43 0.364 sec 16 .33E1 8 1.038800 1.031428
16 2020-06-02 12:21:43 0.385 sec 18 .3E1 9 1.036416 1.028957
17 2020-06-02 12:21:43 0.409 sec 20 .27E1 9 1.033911 1.026363
18 2020-06-02 12:21:43 0.435 sec 22 .25E1 9 1.031311 1.023671
19 2020-06-02 12:21:43 0.462 sec 24 .23E1 9 1.028635 1.020899
See the whole table with table.as_data_frame()
Out[12]:

Plot penalized GLM coefficient regularization path

In [13]:
# collect regularization paths from dict in DataFrame
reg_path_dict = best_glm.getGLMRegularizationPath(best_glm)
reg_path_frame = pd.DataFrame(columns=reg_path_dict['coefficients'][0].keys())
for i in range(0, len(reg_path_dict['coefficients'])): 
    reg_path_frame = reg_path_frame.append(reg_path_dict['coefficients'][i], 
                                           ignore_index=True)

###########################################    
# establish benchmark feature selection:  #
#           glm_selected                  #
# used frequently in further calculations #
###########################################

glm_selected = list(reg_path_frame.iloc[-1, :][reg_path_frame.iloc[-1, :] > GLM_SELECTION_THRESHOLD].index)

# plot regularization paths
fig, ax_ = plt.subplots(figsize=(8, 6))
_ = reg_path_frame[glm_selected].plot(kind='line', ax=ax_, title='Penalized GLM Regularization Paths',
                                      colormap='gnuplot')
_ = ax_.set_xlabel('Iteration')
_ = ax_.set_ylabel('Coefficient Value')
_ = ax_.axhline(c='k', lw=1, xmin=0.045, xmax=0.955)
_ = plt.legend(bbox_to_anchor=(1.05, 0),
               loc=3, 
               borderaxespad=0.)

3.3 Compare Global Model Weights for Alternative Models

In [14]:
"""
# collect Pearson correlation and GLM coefficients into same DataFrame
glm_selected_coef = pd.DataFrame.from_dict(best_glm.coef(), orient='index', columns=['Penalized GLM Coefficient'])
zcorr_glm = pd.concat([corr, glm_selected_coef.iloc[1:]], axis=1)

# plot
fig, ax_ = plt.subplots(figsize=(8, 6))
_ = corr_glm.plot(kind='barh', ax=ax_, colormap='gnuplot')
"""
In [15]:
# collect Pearson correlation and GLM contributions into same DataFrame
glm_contrib_frame = pd.concat([valid[x_names].abs().mean(axis=0), 
                               pd.DataFrame.from_dict(best_glm.coef(), orient='index', 
                                                      columns=['Penalized GLM Coefficient']).drop('Intercept')],
                              axis=1, sort=True)
glm_contrib_frame['Penalized GLM Contribution'] = glm_contrib_frame.iloc[:, 0] * glm_contrib_frame.iloc[:, 1] # mean(|x_j|) * beta_j
corr_glm = pd.concat([corr.abs(), glm_contrib_frame.iloc[:, 2]], axis=1, sort=True)
corr_glm.columns = ['Absolute ' + name for name in corr_glm.columns]
# another approach is to calculate Shapley values for GLM directly


# plot
fig, ax_ = plt.subplots(figsize=(8, 6))
_ = corr_glm.plot(kind='barh', ax=ax_, colormap='gnuplot')

3.4 Partial Dependence and ICE for Model Documentation

Calculate partial dependence for each feature in best GLM

In [16]:
# init dict to hold partial dependence and ICE values
# for each feature
# for glm
glm_pd_ice_dict = {}

# calculate partial dependence for each selected feature
for xs in glm_selected: 
    glm_pd_ice_dict[xs] = auto_ph.pd_ice(xs, valid, best_glm)
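
The pd_ice helper above also comes from auto_ph. As a simplified sketch of one-dimensional partial dependence under its usual definition (hold the feature of interest at each grid value, score the whole frame, and average the predicted probability), assuming the model's prediction frame exposes a 'p1' probability column as H2O binomial models do:

def simple_pd(xs, frame, model, bins=None, res=20):
    # grid of values over which the feature of interest is swept
    if bins is None:
        bins = np.linspace(frame[xs].min(), frame[xs].max(), res)
    temp = frame.copy(deep=True)
    pd_vals = []
    for b in bins:
        temp[xs] = b                                              # hold the feature at one value
        preds = model.predict(h2o.H2OFrame(temp))                 # score the modified frame
        pd_vals.append(preds['p1'].as_data_frame()['p1'].mean())  # average predicted probability
    return pd.DataFrame({xs: list(bins), 'partial_dependence': pd_vals})

This sketch is slower than necessary (one scoring pass per bin) but makes the definition explicit; ICE curves are the same calculation applied to a single row instead of the full frame.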

Find some percentiles of yhat in the validation data

In [17]:
# merge GLM predictions onto validation data
glm_yhat_valid = pd.concat([valid.reset_index(drop=True),
                            best_glm.predict(h2o.H2OFrame(valid))['p1'].as_data_frame()],
                           axis=1)

# rename yhat column
glm_yhat_valid = glm_yhat_valid.rename(columns={'p1':'p_DEFAULT_NEXT_MONTH'})

# find percentiles of predictions
glm_percentile_dict = auto_ph.get_percentile_dict('p_DEFAULT_NEXT_MONTH', glm_yhat_valid, 'ID')

# display percentiles dictionary
# key=percentile, val=row_id
glm_percentile_dict
Out[17]:
{0: 28717,
 99: 13713,
 10: 27519,
 20: 29714,
 30: 1012,
 40: 12100,
 50: 2849,
 60: 3518,
 70: 26302,
 80: 5763,
 90: 7083}
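
get_percentile_dict is another auto_ph helper. A plausible sketch consistent with the output above (assumed behavior, not the actual implementation) sorts the frame by the prediction column and returns the row ID found at each percentile of interest:

def simple_percentile_dict(yhat_name, frame, id_name):
    # sort by predicted probability, then look up the row ID at each percentile of interest
    sorted_frame = frame.sort_values(by=yhat_name).reset_index(drop=True)
    percentile_dict = {}
    for p in [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 99]:
        row = int(round((p / 100) * (sorted_frame.shape[0] - 1)))
        percentile_dict[p] = int(sorted_frame.loc[row, id_name])
    return percentile_dict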

Calculate ICE curve values

In [18]:
# loop through selected variables
for xs in glm_selected: 

    # collect bins used in partial dependence
    bins = list(glm_pd_ice_dict[xs][xs])
    
    # calculate ICE at percentiles 
    # using partial dependence bins
    # for each selected feature
    for i in sorted(glm_percentile_dict.keys()):
        col_name = 'Percentile_' + str(i)
        glm_pd_ice_dict[xs][col_name] = auto_ph.pd_ice(xs, # x_names used below b/c the penalized GLM keeps a small nonzero coefficient for every feature
                                                       valid[valid['ID'] == int(glm_percentile_dict[i])][x_names], 
                                                       best_glm, 
                                                       bins=bins)['partial_dependence']
       

Assess partial dependence and ICE for each feature in best GLM

In [19]:
for xs in glm_selected: 
    auto_ph.hist_mean_pd_ice_plot(xs, y_name, valid, glm_pd_ice_dict)

3.5 Local Model Weights (for Adverse Action Notices)

Create global data structure for local coefficients

In [20]:
local_coef_dict = {10: pd.DataFrame(columns = ['GLM Contribution'], index=x_names),
                   50: pd.DataFrame(columns = ['GLM Contribution'], index=x_names),
                   90: pd.DataFrame(columns = ['GLM Contribution'], index=x_names)}

Calculate local contributions for best GLM at three percentiles of p_DEFAULT_NEXT_MONTH

(Another option would be to calculate Shapley values for GLM directly.)

In [21]:
for name in x_names:
    for percentile in [10, 50, 90]:
    
        # local contributions = beta_j * x_i,j
        local_coef_dict[percentile].loc[name, 'GLM Contribution'] =\
            best_glm.coef()[name] *\
            valid[valid['ID'] == int(glm_percentile_dict[percentile])][name].values[0]
    

Plot best GLM local contributions at three percentiles of p_DEFAULT_NEXT_MONTH

In [22]:
fig, (ax0, ax1, ax2) = plt.subplots(ncols=3, sharey=True)
plt.tight_layout()
plt.subplots_adjust(left=0, right=2, wspace=0.1)

_ = local_coef_dict[10].plot(kind='bar', color='#ffff00', ax=ax0,
                             title='10th PCTL of p_DEFAULT_NEXT_MONTH')

_ = local_coef_dict[50].plot(kind='bar', color='#ffff00', ax=ax1,
                             title='50th PCTL of p_DEFAULT_NEXT_MONTH')

_ = local_coef_dict[90].plot(kind='bar', color='#ffff00', ax=ax2,
                             title='90th PCTL of p_DEFAULT_NEXT_MONTH')

3.6 Discrimination (Fair Lending) Testing

Standardized mean difference for SEX = male and SEX = female

In [23]:
print('Standardized mean difference: %.2f' % auto_ph.smd(glm_yhat_valid, 'SEX', 'p_DEFAULT_NEXT_MONTH', 'male', 'female'))
Male mean yhat: 0.23
Female mean yhat: 0.22
P_Default_Next_Month std. dev.:  0.15
Standardized mean difference: -0.08
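
auto_ph.smd prints the group means and the overall standard deviation before returning the statistic. A minimal sketch consistent with that output, assuming the standardized mean difference is defined here as the protected-group mean minus the reference-group mean, divided by the overall standard deviation of the predictions:

def simple_smd(frame, demo_name, yhat_name, reference_level, protected_level):
    # standardized mean difference between predictions for two groups
    reference_mean = frame.loc[frame[demo_name] == reference_level, yhat_name].mean()
    protected_mean = frame.loc[frame[demo_name] == protected_level, yhat_name].mean()
    overall_std = frame[yhat_name].std()
    return (protected_mean - reference_mean) / overall_std

Under that assumption, simple_smd(glm_yhat_valid, 'SEX', 'p_DEFAULT_NEXT_MONTH', 'male', 'female') should give roughly the value printed above, up to rounding.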

Determine a probability cutoff

In [24]:
best_glm_cut = best_glm.mcc(valid=True)[0][0] # threshold that maximizes MCC (Matthews correlation coefficient) on validation data
best_glm_cut
Out[24]:
0.37089169526976923

Calculate confusion matrices

In [25]:
glm_male_cm = auto_ph.get_confusion_matrix(glm_yhat_valid, y_name, 'p_DEFAULT_NEXT_MONTH', by='SEX',
                                           level='male', cutoff=best_glm_cut)

glm_female_cm = auto_ph.get_confusion_matrix(glm_yhat_valid, y_name, 'p_DEFAULT_NEXT_MONTH', by='SEX',
                                             level='female', cutoff=best_glm_cut)

glm_cm_dict = {'male': glm_male_cm, 'female': glm_female_cm}

Confusion matrix by SEX = male

In [26]:
glm_male_cm
Out[26]:
actual: 1 actual: 0
predicted: 1 379 211
predicted: 0 464 2538

Confusion matrix by SEX = female

In [27]:
glm_female_cm
Out[27]:
actual: 1 actual: 0
predicted: 1 480 308
predicted: 0 663 4011
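
The two matrices above come from auto_ph.get_confusion_matrix. A simplified sketch of what such a helper may do (the row/column labels and their ordering here are assumptions): filter to one level of the demographic column, binarize predictions at the cutoff, and cross-tabulate against the actual outcome.

def simple_group_cm(frame, y_name, yhat_name, by=None, level=None, cutoff=0.5):
    # optionally restrict to one demographic group
    temp = frame if by is None else frame[frame[by] == level]
    # binarize predicted probabilities at the cutoff
    pred = (temp[yhat_name] > cutoff).astype(int)
    cm = pd.crosstab(pred, temp[y_name])  # rows: predicted class, columns: actual class
    cm.index = ['predicted: 0', 'predicted: 1']
    cm.columns = ['actual: 0', 'actual: 1']
    return cm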

Adverse impact ratio for SEX = male and SEX = female

In [28]:
print('Adverse impact ratio: %.2f' % auto_ph.air(glm_cm_dict, 'male', 'female'))
Male proportion accepted: 0.836
Female proportion accepted: 0.856
Adverse impact ratio: 1.02

Marginal effect for SEX = male and SEX = female

In [29]:
print('Marginal effect: %.2f%%' % auto_ph.marginal_effect(glm_cm_dict, 'male', 'female'))
Male accepted: 83.57%
Female accepted: 85.57%
Marginal effect: -2.00%
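
Both the adverse impact ratio and the marginal effect can be derived from the per-group confusion matrices. A sketch consistent with the acceptance rates printed above, assuming "accepted" means predicted not to default (the predicted: 0 row of each matrix, as labeled in the sketch in the previous section):

def acceptance_rate(cm):
    # fraction of a group predicted not to default (accepted)
    return cm.loc['predicted: 0'].sum() / cm.values.sum()

def simple_air(cm_dict, reference_level, protected_level):
    # adverse impact ratio: protected acceptance rate / reference acceptance rate
    return acceptance_rate(cm_dict[protected_level]) / acceptance_rate(cm_dict[reference_level])

def simple_marginal_effect(cm_dict, reference_level, protected_level):
    # marginal effect: percentage-point difference in acceptance rates
    return 100 * (acceptance_rate(cm_dict[reference_level]) -
                  acceptance_rate(cm_dict[protected_level]))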

3.7 Estimate Business Impact

Calculate overall confusion matrix

In [30]:
glm_cm = auto_ph.get_confusion_matrix(glm_yhat_valid, y_name, 'p_DEFAULT_NEXT_MONTH', cutoff=best_glm_cut)
glm_cm
Out[30]:
actual: 1 actual: 0
predicted: 1 859 519
predicted: 0 1127 6549

Estimate business impact

In [31]:
glm_business_impact = glm_cm.iloc[0, 0]*TRUE_POSITIVE_AMOUNT +\
                      glm_cm.iloc[0, 1]*FALSE_POSITIVE_AMOUNT +\
                      glm_cm.iloc[1, 0]*FALSE_NEGATIVE_AMOUNT +\
                      glm_cm.iloc[1, 1]*TRUE_NEGATIVE_AMOUNT

print('Estimated business impact $%.2f' % glm_business_impact)
Estimated business impact $7900000.00

4. Train Monotonic GBM with Forward Feature Selection

4.1 Forward Step-wise Training

In [33]:
# initialize data structures needed to compare correlation coefficients,
# penalized glm coefficients, and MGBM Shapley values
# as features are added into the MGBM
abs_corr = corr.copy(deep=True)
abs_corr['Pearson Correlation Coefficient'] = corr['Pearson Correlation Coefficient'].abs()

# create a list of features to add into MGBM
# list is ordered by correlation between X_j and y
next_list = [name for name in list(abs_corr.sort_values(by='Pearson Correlation Coefficient',
                                                        ascending=False).index) if name not in glm_selected]

# create a DataFrame to store new MGBM SHAP values
# for comparison to correlation and penalized glm coefficients
abs_corr_glm_mgbm_shap = corr_glm.copy(deep=True).abs()
#abs_corr_glm_mgbm_shap.columns = ['Absolute ' + name for name in abs_corr_glm_mgbm_shap.columns]
abs_corr_glm_mgbm_shap['Monotonic GBM Mean SHAP Value'] = 0

# start local timer
tic = time.time()

# forward stepwise MGBM training
mgbm_train_results = auto_ph.gbm_forward_select_train(glm_selected, 
                                                      y_name, 
                                                      train, 
                                                      valid, 
                                                      SEED, 
                                                      next_list,
                                                      abs_corr_glm_mgbm_shap, 
                                                      'Monotonic GBM Mean SHAP Value',
                                                      monotone=True)

mgbm_models = mgbm_train_results['MODELS']
corr_glm_mgbm_shap_coefs = mgbm_train_results['GLOBAL_COEFS']
mgbm_shap = mgbm_train_results['LOCAL_COEFS']

# end local timer
toc = time.time()-tic
print('Task completed in %.2f s.' % (toc))

# 2 threads  - 695 s
# 4 threads  - 691 s
# 8 threads  - 692 s
Starting grid search 1/14 ...
Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1}
Completed grid search 1/14 with AUC: 0.74 ...
--------------------------------------------------------------------------------
Starting grid search 2/14 ...
Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL']
Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1}
Completed grid search 2/14 with AUC: 0.76 ...
--------------------------------------------------------------------------------
Starting grid search 3/14 ...
Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1']
Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1}
Completed grid search 3/14 with AUC: 0.77 ...
--------------------------------------------------------------------------------
Starting grid search 4/14 ...
Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2']
Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1}
Completed grid search 4/14 with AUC: 0.77 ...
--------------------------------------------------------------------------------
Starting grid search 5/14 ...
Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4']
Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1}
Completed grid search 5/14 with AUC: 0.78 ...
--------------------------------------------------------------------------------
Starting grid search 6/14 ...
Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3']
Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1, 'PAY_AMT3': -1}
Completed grid search 6/14 with AUC: 0.78 ...
--------------------------------------------------------------------------------
Starting grid search 7/14 ...
Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5']
Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1, 'PAY_AMT3': -1, 'PAY_AMT5': -1}
Completed grid search 7/14 with AUC: 0.78 ...
--------------------------------------------------------------------------------
Starting grid search 8/14 ...
Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6']
Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1, 'PAY_AMT3': -1, 'PAY_AMT5': -1, 'PAY_AMT6': -1}
Completed grid search 8/14 with AUC: 0.78 ...
--------------------------------------------------------------------------------
Starting grid search 9/14 ...
Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1']
Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1, 'PAY_AMT3': -1, 'PAY_AMT5': -1, 'PAY_AMT6': -1, 'BILL_AMT1': -1}
Completed grid search 9/14 with AUC: 0.78 ...
--------------------------------------------------------------------------------
Starting grid search 10/14 ...
Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1', 'BILL_AMT2']
Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1, 'PAY_AMT3': -1, 'PAY_AMT5': -1, 'PAY_AMT6': -1, 'BILL_AMT1': -1, 'BILL_AMT2': -1}
Completed grid search 10/14 with AUC: 0.78 ...
--------------------------------------------------------------------------------
Starting grid search 11/14 ...
Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3']
Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1, 'PAY_AMT3': -1, 'PAY_AMT5': -1, 'PAY_AMT6': -1, 'BILL_AMT1': -1, 'BILL_AMT2': -1, 'BILL_AMT3': -1}
Completed grid search 11/14 with AUC: 0.78 ...
--------------------------------------------------------------------------------
Starting grid search 12/14 ...
Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4']
Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1, 'PAY_AMT3': -1, 'PAY_AMT5': -1, 'PAY_AMT6': -1, 'BILL_AMT1': -1, 'BILL_AMT2': -1, 'BILL_AMT3': -1, 'BILL_AMT4': -1}
Completed grid search 12/14 with AUC: 0.78 ...
--------------------------------------------------------------------------------
Starting grid search 13/14 ...
Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5']
Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1, 'PAY_AMT3': -1, 'PAY_AMT5': -1, 'PAY_AMT6': -1, 'BILL_AMT1': -1, 'BILL_AMT2': -1, 'BILL_AMT3': -1, 'BILL_AMT4': -1, 'BILL_AMT5': -1}
Completed grid search 13/14 with AUC: 0.78 ...
--------------------------------------------------------------------------------
Starting grid search 14/14 ...
Input features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'LIMIT_BAL', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT4', 'PAY_AMT3', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']
Monotone constraints = {'PAY_0': 1, 'PAY_2': 1, 'PAY_3': 1, 'PAY_4': 1, 'PAY_5': 1, 'PAY_6': 1, 'LIMIT_BAL': -1, 'PAY_AMT1': -1, 'PAY_AMT2': -1, 'PAY_AMT4': -1, 'PAY_AMT3': -1, 'PAY_AMT5': -1, 'PAY_AMT6': -1, 'BILL_AMT1': -1, 'BILL_AMT2': -1, 'BILL_AMT3': -1, 'BILL_AMT4': -1, 'BILL_AMT5': -1, 'BILL_AMT6': -1}
Completed grid search 14/14 with AUC: 0.78 ...
--------------------------------------------------------------------------------
Done.
Task completed in 745.83 s.
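
Each grid search above runs inside auto_ph.gbm_forward_select_train, and the monotone constraint dictionaries printed in the log are ultimately passed to H2O's GBM. As a minimal stand-alone illustration of that mechanism (the hyperparameter values below are assumptions for demonstration, not the settings used by the grid search):

from h2o.estimators.gbm import H2OGradientBoostingEstimator

# treat the 0/1 target as categorical for classification
htrain, hvalid = h2o.H2OFrame(train), h2o.H2OFrame(valid)
htrain[y_name] = htrain[y_name].asfactor()
hvalid[y_name] = hvalid[y_name].asfactor()

# +1 forces predictions to rise with PAY_0; -1 forces them to fall with LIMIT_BAL
demo_mgbm = H2OGradientBoostingEstimator(ntrees=100,
                                         max_depth=5,
                                         learn_rate=0.05,
                                         monotone_constraints={'PAY_0': 1, 'LIMIT_BAL': -1},
                                         seed=SEED)
demo_mgbm.train(x=['PAY_0', 'LIMIT_BAL'], y=y_name,
                training_frame=htrain, validation_frame=hvalid)
print('Demo monotonic GBM validation AUC: %.2f' % demo_mgbm.auc(valid=True))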

4.2 Compare Global Model Weights for Alternative Models

In [34]:
auto_ph.plot_coefs(corr_glm_mgbm_shap_coefs,
                   mgbm_models, 
                   'MGBM',
                   ['Absolute Pearson Correlation Coefficient',
                    'Monotonic GBM Mean SHAP Value',
                    'Absolute Penalized GLM Contribution'])