The RuleFit algorithm aims for a compromise between the interpretability and the complexity of the resulting model. While simpler ML algorithms usually miss interaction effects or require advanced methods to uncover them, RuleFit learns a sparse linear model that includes automatically detected interaction effects in the form of decision rules: new features are created from the extracted rules, and a transparent linear model is built using these features.
Example: IF the number of rooms > 2 AND the age of the house < 15 THEN 1 ELSE 0 (lower than medium)
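Such a rule can be sketched as a plain Python predicate that yields a new binary feature (a minimal illustration of the idea, not H2O's internal representation):

```python
def rule_feature(rooms, age):
    """IF number of rooms > 2 AND age of the house < 15 THEN 1 ELSE 0."""
    return 1 if (rooms > 2 and age < 15) else 0

# A 3-room, 10-year-old house satisfies the rule; a 2-room house does not
print(rule_feature(3, 10), rule_feature(2, 10))
```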
The general algorithm flow: a tree ensemble is fitted to the data, decision rules are extracted from its trees and turned into binary features, and a sparse linear model (LASSO) is fitted on the original variables together with these rule features.
The response variable is the price of the houses, and the goal is to produce a model with significant variables that can predict the house price from the given explanatory variables. Each record describes a Boston suburb or town. The data was created from the Boston Standard Metropolitan Statistical Area (SMSA) in the 1970s. The attribute names can be seen in the train.head() output below.
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
H2O_cluster_uptime: | 08 secs |
H2O_cluster_timezone: | Europe/Prague |
H2O_data_parsing_timezone: | UTC |
H2O_cluster_version: | 3.36.1.99999 |
H2O_cluster_version_age: | 8 minutes |
H2O_cluster_name: | zuzanaolajcova |
H2O_cluster_total_nodes: | 1 |
H2O_cluster_free_memory: | 3.548 Gb |
H2O_cluster_total_cores: | 12 |
H2O_cluster_allowed_cores: | 12 |
H2O_cluster_status: | locked, healthy |
H2O_connection_url: | http://localhost:54321 |
H2O_connection_proxy: | {"http": null, "https": null} |
H2O_internal_security: | False |
Python_version: | 3.8.1 final |
train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
x = train.columns
y = "medv"
x.remove(y)
x.remove("b")
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
train['medv'].hist()
breaks | counts | mids_true | mids | widths |
---|---|---|---|---|
9.5 | nan | nan | nan | nan |
14 | 21 | 2.5 | 11.75 | 4.5 |
18.5 | 55 | 4.75 | 16.25 | 4.5 |
23 | 82 | 7 | 20.75 | 4.5 |
27.5 | 154 | 9.25 | 25.25 | 4.5 |
32 | 84 | 11.5 | 29.75 | 4.5 |
36.5 | 41 | 13.75 | 34.25 | 4.5 |
41 | 30 | 16 | 38.75 | 4.5 |
45.5 | 8 | 18.25 | 43.25 | 4.5 |
50 | 31 | 20.65 | 47.75 | 4.5 |
train.head()
crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.00632 | 18 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24 |
0.02731 | 0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
0.02729 | 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
0.03237 | 0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
0.06905 | 0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |
0.02985 | 0 | 2.18 | 0 | 0.458 | 6.43 | 58.7 | 6.0622 | 3 | 222 | 18.7 | 394.12 | 5.21 | 28.7 |
0.08829 | 12.5 | 7.87 | 0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5 | 311 | 15.2 | 395.6 | 12.43 | 22.9 |
0.14455 | 12.5 | 7.87 | 0 | 0.524 | 6.172 | 96.1 | 5.9505 | 5 | 311 | 15.2 | 396.9 | 19.15 | 27.1 |
0.21124 | 12.5 | 7.87 | 0 | 0.524 | 5.631 | 100 | 6.0821 | 5 | 311 | 15.2 | 386.63 | 29.93 | 16.5 |
0.17004 | 12.5 | 7.87 | 0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5 | 311 | 15.2 | 386.71 | 17.1 | 18.9 |
Before training a RuleFit model, we will train a tree-based model for comparison. By default, RuleFit uses H2ORandomForestEstimator with the following configuration:
drf = H2ORandomForestEstimator(ntrees=50, seed=123, max_depth=3)
drf.train(y=y,x=x,training_frame=train)
print("RMSE: ")
print(drf.training_model_metrics()['RMSE'])
drf.varimp_plot()
drf Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
RMSE: 
4.239620141024194
(variable importance plot)
As we can see, RM and LSTAT are the most significant features affecting the house price. Now, let's create a RuleFit model to derive custom rules for more explainability and interpretability:
from h2o.estimators.rulefit import H2ORuleFitEstimator
rf = H2ORuleFitEstimator(seed=123, lambda_=0.5)
rf.train(y=y, x=x, training_frame=train)
print("RMSE: ")
print(rf.training_model_metrics()['RMSE'])
rf.rule_importance()
rulefit Model Build progress: |██████████████████████████████████████████████████| (done) 100%
RMSE: 
4.816685889071532
Rule Importance:
 | variable | coefficient | support | rule |
---|---|---|---|---|
0 | linear.rm | 3.555456 | 1.000000 | |
1 | linear.chas | 1.750286 | 1.000000 | |
2 | M0T41N18 | 1.455534 | 0.132411 | (age < 95.2587890625 or age is NA) & (nox < 0.6593242287635803 or ... |
3 | M0T42N16 | 1.200294 | 0.084980 | (crim < 7.391515254974365 or crim is NA) & (lstat < 5.127500057220... |
4 | linear.ptratio | -0.632223 | 1.000000 | |
5 | linear.lstat | -0.520512 | 1.000000 | |
6 | M0T31N21 | -0.395165 | 0.264822 | (dis >= 1.2369916439056396 or dis is NA) & (ptratio >= 19.90244102... |
7 | M0T20N21 | -0.277795 | 0.199605 | (indus >= 6.66726541519165 or indus is NA) & (lstat >= 4.278124809... |
8 | M0T27N22 | -0.210425 | 0.199605 | (indus >= 6.66726541519165 or indus is NA) & (lstat >= 4.702812671... |
9 | M0T22N14 | 0.163021 | 0.092885 | (crim < 6.696437835693359 or crim is NA) & (ptratio < 18.346485137... |
10 | linear.dis | -0.083196 | 1.000000 | |
11 | linear.crim | -0.025105 | 1.000000 | |
RuleFit's rule importance table shows all the predictors with non-zero LASSO coefficients. Predictors can be linear (variable name prefixed with "linear.") or a rule (the rule identifier serves as the variable name, e.g. M0T43N16 means the rule comes from tree-based model no. 0, tree no. 43, node no. 16). The table holds the LASSO coefficient and the support of the underlying predictor as factors of predictor importance. For rule predictors, the language representation of the rule is also shown.
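Based on this naming scheme, a small helper (hypothetical, not part of the H2O API) can decode a rule identifier into its model, tree, and node indices:

```python
import re

def decode_rule_id(rule_id):
    # "M0T41N18" -> model index 0, tree index 41, node index 18
    m = re.fullmatch(r"M(\d+)T(\d+)N(\d+)", rule_id)
    if m is None:
        raise ValueError(f"not a rule identifier: {rule_id}")
    return tuple(int(g) for g in m.groups())

print(decode_rule_id("M0T41N18"))  # (0, 41, 18)
```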
Let's take a closer look at the resulting predictors. From the table we can see that the rules selected by RuleFit as most important all contain the LSTAT or RM variables, which were by far the top two most important variables of the previous H2ORandomForestEstimator model.
Let's try to interpret significant rules:
The full text of M0T41N18 is:
rf.rule_importance()[4][2]
'(age < 95.2587890625 or age is NA) & (nox < 0.6593242287635803 or nox is NA) & (rm >= 6.940098762512207)'
This means that the most expensive houses have 7 or more rooms and are located in areas with a low nitric oxides concentration, possibly with a high proportion of historical buildings.
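To double-check this reading, the rule can be evaluated by hand on the first rows of the train.head() output above (the "or ... is NA" branches are omitted here since none of these values are missing):

```python
# (age, nox, rm) triples from the first five rows of train.head() above
rows = [
    (65.2, 0.538, 6.575),
    (78.9, 0.469, 6.421),
    (61.1, 0.469, 7.185),
    (45.8, 0.458, 6.998),
    (54.2, 0.458, 7.147),
]

def m0t41n18(age, nox, rm):
    # thresholds copied from the full rule text above
    return int(age < 95.2587890625 and nox < 0.6593242287635803
               and rm >= 6.940098762512207)

fired = [m0t41n18(*r) for r in rows]
print(fired)  # [0, 0, 1, 1, 1] — only the rows with rm >= ~6.94 fire
```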
The full text of M0T42N16 is:
rf.rule_importance()[4][3]
'(crim < 7.391515254974365 or crim is NA) & (lstat < 5.127500057220459 or lstat is NA) & (rm >= 6.940098762512207)'
This means that big houses (7 or more rooms) in areas with a very low to zero crime rate and a very low percentage of lower-status population are more expensive to live in.
The full text of M0T31N21 is:
rf.rule_importance()[4][6]
'(dis >= 1.2369916439056396 or dis is NA) & (ptratio >= 19.902441024780273) & (rm >= 5.864699363708496 or rm is NA)'
This can be interpreted as: "with increasing distance from Boston employment centers and increasing pupil-teacher ratio, the price of the majority of the houses decreases" (since the majority of houses in our dataset have 6 or more rooms).
We can also find out for which specific rows a given rule applies:
predicted_rules = rf.predict_rules(train, ["M0T31N21", "M0T42N16", "linear.rm"])
predicted_rules.head()
M0T31N21 | M0T42N16 | linear.rm |
---|---|---|
0 | 0 | 1 |
0 | 0 | 1 |
0 | 1 | 1 |
0 | 1 | 1 |
0 | 0 | 1 |
0 | 0 | 1 |
0 | 0 | 1 |
0 | 0 | 1 |
0 | 0 | 1 |
0 | 0 | 1 |
Please note that a linear predictor applies to all observations, and that the column-wise sums of this output divided by the number of observations give the support of each rule.
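For instance, taking the M0T42N16 column of the ten rows shown above as a toy sample, its mean is an estimate of that rule's support (the true support, 0.084980, is computed over all rows of the dataset):

```python
# First ten values of the M0T42N16 indicator column from predict_rules above
m0t42n16 = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]

# Support ~ fraction of observations the rule fires on
support_estimate = sum(m0t42n16) / len(m0t42n16)
print(support_estimate)  # estimate from these ten rows only
```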
Apart from that, Friedman and Popescu define (https://arxiv.org/abs/0811.1679) a RuleFit-specific global measure of predictor importance as follows:
import math

def FPimportance(data, lasso_coef, support, is_rule):
    if is_rule:
        # rule term: |coefficient| * sqrt(s * (1 - s)), where s is the rule's support
        return abs(lasso_coef) * math.sqrt(support * (1 - support))
    else:
        # linear term: |coefficient| * standard deviation of the variable
        return abs(lasso_coef) * data.sd()[0]
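As a sanity check, plugging the coefficient and support of the M0T41N18 rule from the importance table above into this formula reproduces its value in the sorted importance list below:

```python
import math

# Coefficient and support of M0T41N18, copied from the rule importance table
coef, support = 1.455534, 0.132411

# Friedman & Popescu importance for a rule term
importance = abs(coef) * math.sqrt(support * (1 - support))
print(importance)  # ~0.4933, matching the sorted_FPimportances entry
```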
Hence, we can get global importances calculated out of combined importance factors like:
def calculate_FPimportance(input, data):
    # input is the rule importance table: input[1] holds the variable names,
    # input[2] the LASSO coefficients, and input[3] the supports
    result = dict()
    for x in range(len(input[1])):
        if input[1][x].startswith('linear.'):
            # linear predictor: look up the original column to get its standard deviation
            result[input[1][x]] = FPimportance(data[input[1][x][len("linear."):]], input[2][x], input[3][x], False)
        else:
            # rule predictor: only coefficient and support are needed
            result[input[1][x]] = FPimportance(None, input[2][x], input[3][x], True)
    return result
FPimportances = calculate_FPimportance(rf.rule_importance(), train)
import operator
sorted_FPimportances = sorted(FPimportances.items(), key=operator.itemgetter(1), reverse=True)
Which gives us a slightly reordered list of importances:
sorted_FPimportances
[('linear.lstat', 3.7170067016740407), ('linear.rm', 2.498124117997926), ('linear.ptratio', 1.368729326964665), ('M0T41N18', 0.49333452353437895), ('linear.chas', 0.4445622162226629), ('M0T42N16', 0.33470481934298546), ('linear.crim', 0.21593853846749478), ('linear.dis', 0.1751857632048295), ('M0T31N21', 0.1743620600544106), ('M0T20N21', 0.11103558646592059), ('M0T27N22', 0.0841076264935881), ('M0T22N14', 0.04732040851755857)]
Stay tuned for future improvements, mainly RuleFit-specific tools for importance and interaction examination!