The RuleFit algorithm aims for a compromise between the interpretability and the complexity of the resulting model. While simpler ML algorithms usually miss interaction effects or require advanced methods to uncover them, RuleFit learns a sparse linear model that includes automatically detected interaction effects in the form of decision rules: new features are created from the extracted rules, and a transparent linear model is built using these features.
Example: IF the number of rooms > 2 AND the age of the house < 15 THEN 1 ELSE 0 (lower than medium)
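Such a rule can be sketched as a plain Python predicate that yields a new binary feature (a minimal illustration of the idea, not H2O's internal representation):

```python
def rule_feature(rooms, age):
    """IF number of rooms > 2 AND age of the house < 15 THEN 1 ELSE 0."""
    return 1 if (rooms > 2 and age < 15) else 0

# A 3-room, 10-year-old house satisfies the rule; a 2-room house does not
print(rule_feature(3, 10), rule_feature(2, 10))
```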
The general algorithm flow: a tree ensemble is fitted to the data, decision rules are extracted from its trees and turned into binary features, and a sparse linear model (LASSO) is fitted on the original variables together with these rule features.
The response variable is the price of the houses, and the goal is to produce a model with significant variables that can predict the house price from the given explanatory variables. Each record describes a Boston suburb or town. The data was created from the Boston Standard Metropolitan Statistical Area (SMSA) in the 1970s. The attribute names can be seen in the train.head() output below.
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
H2O_cluster_uptime: | 08 secs |
H2O_cluster_timezone: | Europe/Prague |
H2O_data_parsing_timezone: | UTC |
H2O_cluster_version: | 3.36.1.99999 |
H2O_cluster_version_age: | 8 minutes |
H2O_cluster_name: | zuzanaolajcova |
H2O_cluster_total_nodes: | 1 |
H2O_cluster_free_memory: | 3.548 Gb |
H2O_cluster_total_cores: | 12 |
H2O_cluster_allowed_cores: | 12 |
H2O_cluster_status: | locked, healthy |
H2O_connection_url: | http://localhost:54321 |
H2O_connection_proxy: | {"http": null, "https": null} |
H2O_internal_security: | False |
Python_version: | 3.8.1 final |
train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
x = train.columns
y = "medv"
x.remove(y)
x.remove("b")
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
train['medv'].hist()
breaks | counts | mids_true | mids | widths |
---|---|---|---|---|
9.5 | nan | nan | nan | nan |
14 | 21 | 2.5 | 11.75 | 4.5 |
18.5 | 55 | 4.75 | 16.25 | 4.5 |
23 | 82 | 7 | 20.75 | 4.5 |
27.5 | 154 | 9.25 | 25.25 | 4.5 |
32 | 84 | 11.5 | 29.75 | 4.5 |
36.5 | 41 | 13.75 | 34.25 | 4.5 |
41 | 30 | 16 | 38.75 | 4.5 |
45.5 | 8 | 18.25 | 43.25 | 4.5 |
50 | 31 | 20.65 | 47.75 | 4.5 |
train.head()
crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.00632 | 18 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24 |
0.02731 | 0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
0.02729 | 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
0.03237 | 0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
0.06905 | 0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |
0.02985 | 0 | 2.18 | 0 | 0.458 | 6.43 | 58.7 | 6.0622 | 3 | 222 | 18.7 | 394.12 | 5.21 | 28.7 |
0.08829 | 12.5 | 7.87 | 0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5 | 311 | 15.2 | 395.6 | 12.43 | 22.9 |
0.14455 | 12.5 | 7.87 | 0 | 0.524 | 6.172 | 96.1 | 5.9505 | 5 | 311 | 15.2 | 396.9 | 19.15 | 27.1 |
0.21124 | 12.5 | 7.87 | 0 | 0.524 | 5.631 | 100 | 6.0821 | 5 | 311 | 15.2 | 386.63 | 29.93 | 16.5 |
0.17004 | 12.5 | 7.87 | 0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5 | 311 | 15.2 | 386.71 | 17.1 | 18.9 |
Before training a RuleFit model, we will train a tree-based model for comparison. By default, RuleFit uses H2ORandomForestEstimator with the following configuration:
drf = H2ORandomForestEstimator(ntrees=50, seed=123, max_depth=3)
drf.train(y=y,x=x,training_frame=train)
print("RMSE: ")
print(drf.training_model_metrics()['RMSE'])
drf.varimp_plot()
drf Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
RMSE: 
4.239620141024194
(variable importance plot)
As we can see, RM and LSTAT are the most significant features affecting the house price. Now, let's create a RuleFit model to derive custom rules for more explainability and interpretability:
from h2o.estimators.rulefit import H2ORuleFitEstimator
rf = H2ORuleFitEstimator(seed=123, lambda_=0.5)
rf.train(y=y, x=x, training_frame=train)
print("RMSE: ")
print(rf.training_model_metrics()['RMSE'])
rf.rule_importance()
rulefit Model Build progress: |██████████████████████████████████████████████████| (done) 100%
RMSE: 
4.816685889071532
Rule Importance:
 | variable | coefficient | support | rule |
---|---|---|---|---|
0 | linear.rm | 3.555456 | 1.000000 | |
1 | linear.chas | 1.750286 | 1.000000 | |
2 | M0T41N18 | 1.455534 | 0.132411 | (age < 95.2587890625 or age is NA) & (nox < 0.6593242287635803 or ... |
3 | M0T42N16 | 1.200294 | 0.084980 | (crim < 7.391515254974365 or crim is NA) & (lstat < 5.127500057220... |
4 | linear.ptratio | -0.632223 | 1.000000 | |
5 | linear.lstat | -0.520512 | 1.000000 | |
6 | M0T31N21 | -0.395165 | 0.264822 | (dis >= 1.2369916439056396 or dis is NA) & (ptratio >= 19.90244102... |
7 | M0T20N21 | -0.277795 | 0.199605 | (indus >= 6.66726541519165 or indus is NA) & (lstat >= 4.278124809... |
8 | M0T27N22 | -0.210425 | 0.199605 | (indus >= 6.66726541519165 or indus is NA) & (lstat >= 4.702812671... |
9 | M0T22N14 | 0.163021 | 0.092885 | (crim < 6.696437835693359 or crim is NA) & (ptratio < 18.346485137... |
10 | linear.dis | -0.083196 | 1.000000 | |
11 | linear.crim | -0.025105 | 1.000000 | |
RuleFit's rule importance table shows all the predictors with non-zero LASSO coefficients. Predictors can be linear (variable name prefixed with "linear.") or a rule (the rule identifier serves as the variable name, e.g. M0T43N16 means the rule comes from tree-based model no. 0, tree no. 43, node no. 16). The table holds the LASSO coefficient and the support of the underlying predictor as factors of predictor importance. For rule predictors, the language representation of the rule is also shown.
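Based on this naming scheme, a small helper (hypothetical, not part of the H2O API) can decode a rule identifier into its model, tree, and node indices:

```python
import re

def decode_rule_id(rule_id):
    # "M0T41N18" -> model index 0, tree index 41, node index 18
    m = re.fullmatch(r"M(\d+)T(\d+)N(\d+)", rule_id)
    if m is None:
        raise ValueError(f"not a rule identifier: {rule_id}")
    return tuple(int(g) for g in m.groups())

print(decode_rule_id("M0T41N18"))  # (0, 41, 18)
```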
Let's take a closer look at the resulting predictors. From the table we can see that the rules selected by RuleFit as most important all contain the LSTAT or RM variables, which were by far the top two most important variables of the previous H2ORandomForestEstimator model.
Let's try to interpret significant rules:
The full text of M0T41N18 is:
rf.rule_importance()[4][2]
'(age < 95.2587890625 or age is NA) & (nox < 0.6593242287635803 or nox is NA) & (rm >= 6.940098762512207)'
This means that the most expensive houses have 7 or more rooms and are located in areas with a low nitric oxides concentration, possibly with a high proportion of historical buildings.
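To double-check this reading, the rule can be evaluated by hand on the first rows of the train.head() output above (the "or ... is NA" branches are omitted here since none of these values are missing):

```python
# (age, nox, rm) triples from the first five rows of train.head() above
rows = [
    (65.2, 0.538, 6.575),
    (78.9, 0.469, 6.421),
    (61.1, 0.469, 7.185),
    (45.8, 0.458, 6.998),
    (54.2, 0.458, 7.147),
]

def m0t41n18(age, nox, rm):
    # thresholds copied from the full rule text above
    return int(age < 95.2587890625 and nox < 0.6593242287635803
               and rm >= 6.940098762512207)

fired = [m0t41n18(*r) for r in rows]
print(fired)  # [0, 0, 1, 1, 1] — only the rows with rm >= ~6.94 fire
```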
The full text of M0T42N16 is:
rf.rule_importance()[4][3]
'(crim < 7.391515254974365 or crim is NA) & (lstat < 5.127500057220459 or lstat is NA) & (rm >= 6.940098762512207)'
This means that big houses (7 or more rooms) in areas with a very low to zero crime rate and a very low percentage of lower-status population are more expensive to live in.
The full text of M0T31N21 is:
rf.rule_importance()[4][6]
'(dis >= 1.2369916439056396 or dis is NA) & (ptratio >= 19.902441024780273) & (rm >= 5.864699363708496 or rm is NA)'
This can be interpreted as: "with increasing distance from Boston employment centers and increasing pupil-teacher ratio, the price of the majority of the houses decreases" (since the majority of houses in our dataset have 6 or more rooms).
We can also find out for which specific rows a given rule applies:
predicted_rules = rf.predict_rules(train, ["M0T31N21", "M0T42N16", "linear.rm"])
predicted_rules.head()
M0T31N21 | M0T42N16 | linear.rm |
---|---|---|
0 | 0 | 1 |
0 | 0 | 1 |
0 | 1 | 1 |
0 | 1 | 1 |
0 | 0 | 1 |
0 | 0 | 1 |
0 | 0 | 1 |
0 | 0 | 1 |
0 | 0 | 1 |
0 | 0 | 1 |
Please note that a linear predictor applies to all observations, and that the column-wise sums of this output divided by the number of observations give the support of each rule.
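For instance, taking the M0T42N16 column of the ten rows shown above as a toy sample, its mean is an estimate of that rule's support (the true support, 0.084980, is computed over all rows of the dataset):

```python
# First ten values of the M0T42N16 indicator column from predict_rules above
m0t42n16 = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]

# Support ~ fraction of observations the rule fires on
support_estimate = sum(m0t42n16) / len(m0t42n16)
print(support_estimate)  # estimate from these ten rows only
```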
Apart from that, Friedman and Popescu define (https://arxiv.org/abs/0811.1679) a RuleFit-specific global measure of predictor importance as follows:
import math

def FPimportance(data, lasso_coef, support, is_rule):
    if is_rule:
        # rule term: |coefficient| * sqrt(s * (1 - s)), where s is the rule's support
        return abs(lasso_coef) * math.sqrt(support * (1 - support))
    else:
        # linear term: |coefficient| * standard deviation of the variable
        return abs(lasso_coef) * data.sd()[0]
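As a sanity check, plugging the coefficient and support of the M0T41N18 rule from the importance table above into this formula reproduces its value in the sorted importance list below:

```python
import math

# Coefficient and support of M0T41N18, copied from the rule importance table
coef, support = 1.455534, 0.132411

# Friedman & Popescu importance for a rule term
importance = abs(coef) * math.sqrt(support * (1 - support))
print(importance)  # ~0.4933, matching the sorted_FPimportances entry
```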
Hence, we can get global importances calculated out of combined importance factors like:
def calculate_FPimportance(input, data):
    # input is the rule importance table: input[1] holds the variable names,
    # input[2] the LASSO coefficients, and input[3] the supports
    result = dict()
    for x in range(len(input[1])):
        if input[1][x].startswith('linear.'):
            # linear predictor: look up the original column to get its standard deviation
            result[input[1][x]] = FPimportance(data[input[1][x][len("linear."):]], input[2][x], input[3][x], False)
        else:
            # rule predictor: only coefficient and support are needed
            result[input[1][x]] = FPimportance(None, input[2][x], input[3][x], True)
    return result
FPimportances = calculate_FPimportance(rf.rule_importance(), train)
import operator
sorted_FPimportances = sorted(FPimportances.items(), key=operator.itemgetter(1), reverse=True)
Which gives us a slightly reordered list of importances:
sorted_FPimportances
[('linear.lstat', 3.7170067016740407), ('linear.rm', 2.498124117997926), ('linear.ptratio', 1.368729326964665), ('M0T41N18', 0.49333452353437895), ('linear.chas', 0.4445622162226629), ('M0T42N16', 0.33470481934298546), ('linear.crim', 0.21593853846749478), ('linear.dis', 0.1751857632048295), ('M0T31N21', 0.1743620600544106), ('M0T20N21', 0.11103558646592059), ('M0T27N22', 0.0841076264935881), ('M0T22N14', 0.04732040851755857)]
Stay tuned for future improvements, mainly RuleFit-specific tools for importance and interaction examination!