Inverse probability weighting (IPW) is a basic model for estimating average treatment effects.
It estimates the probability of each sample belonging to its treatment group,
and uses the inverse of that probability as the sample's weight:
$$
w_i = \frac{1}{\Pr[A=a_i | X_i]}
$$
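Translated to code, this weighting step amounts to inverting each sample's predicted probability of its observed treatment. A minimal standalone sketch with toy data and plain scikit-learn (illustrative only, not causallib's internal implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 6 samples, 2 covariates, binary treatment assignment.
X = np.array([[0.5, 1.0], [1.5, 0.2], [0.3, 0.8],
              [2.0, 1.1], [1.2, 0.4], [0.7, 1.9]])
a = np.array([0, 1, 0, 1, 1, 0])

model = LogisticRegression().fit(X, a)

# Pr[A = a_i | X_i]: probability of the treatment each sample actually received.
prob_all = model.predict_proba(X)                # shape (n, 2): columns for A=0, A=1
prob_received = prob_all[np.arange(len(a)), a]

weights = 1.0 / prob_received                    # w_i = 1 / Pr[A = a_i | X_i]
```

Since probabilities are at most 1, every weight is at least 1; rarely-treated-looking samples that were nevertheless treated get the largest weights.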
%matplotlib inline
from causallib.datasets import load_nhefs
from causallib.estimation import IPW
from causallib.evaluation import evaluate
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot as plt
We will estimate the effect of quitting smoking on weight gain.
The data example is taken from Hernán and Robins' Causal Inference book.
data = load_nhefs()
data.X.join(data.a).join(data.y).head()
 | age | race | sex | smokeintensity | smokeyrs | wt71 | active_1 | active_2 | education_2 | education_3 | education_4 | education_5 | exercise_1 | exercise_2 | age^2 | wt71^2 | smokeintensity^2 | smokeyrs^2 | qsmk | wt82_71
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 42 | 1 | 0 | 30 | 29 | 79.04 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1764 | 6247.3216 | 900 | 841 | 0 | -10.093960 |
1 | 36 | 0 | 0 | 20 | 24 | 58.63 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1296 | 3437.4769 | 400 | 576 | 0 | 2.604970 |
2 | 56 | 1 | 1 | 20 | 26 | 56.81 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 3136 | 3227.3761 | 400 | 676 | 0 | 9.414486 |
3 | 68 | 1 | 0 | 3 | 53 | 59.42 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4624 | 3530.7364 | 9 | 2809 | 0 | 4.990117 |
4 | 40 | 0 | 0 | 20 | 19 | 87.09 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1600 | 7584.6681 | 400 | 361 | 0 | 4.989251 |
The causal model has a machine learning model at its core, provided as the learner
parameter.
This ML model is used to predict the probability of quitting smoking given the covariates.
These probabilities are then used to obtain the weights $w_i$.
Next, we estimate the average outcome in the balanced pseudo-population using the Horvitz–Thompson estimator:
$$
\hat{E}[Y^a] = \frac{1}{\sum_{i:A_i=a}w_i} \cdot \sum_{i:A_i=a}w_i y_i
$$
Lastly, we'll use these average counterfactual outcome estimates to compute the effect: $$ \hat{E}[Y^1] - \hat{E}[Y^0] $$
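The Horvitz–Thompson step and the effect contrast can be sketched directly with NumPy on toy arrays (illustrative only, not the causallib API):

```python
import numpy as np

# Toy treatment assignment, outcomes, and IP-weights.
a = np.array([0, 0, 1, 1, 1])
y = np.array([1.0, 3.0, 2.0, 4.0, 6.0])
w = np.array([1.5, 2.0, 1.2, 1.8, 1.0])

def ht_mean(y, w, a, value):
    """Weighted average outcome among samples with A == value."""
    mask = (a == value)
    return np.sum(w[mask] * y[mask]) / np.sum(w[mask])

# E-hat[Y^1] - E-hat[Y^0]
effect = ht_mean(y, w, a, 1) - ht_mean(y, w, a, 0)
```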
# Train:
learner = LogisticRegression(solver="liblinear")
ipw = IPW(learner)
ipw.fit(data.X, data.a)
IPW(clip_max=None, clip_min=None, use_stabilized=False, verbose=False, learner=LogisticRegression(solver='liblinear'))
# We can now predict the weight of each individual:
ipw.compute_weights(data.X, data.a).head()
0    1.149089
1    1.198450
2    1.200264
3    1.959673
4    1.336338
dtype: float64
# Estimate average outcome
outcomes = ipw.estimate_population_outcome(data.X, data.a, data.y)
outcomes
0    1.750404
1    5.261713
dtype: float64
# Estimate the effect:
effect = ipw.estimate_effect(outcomes[1], outcomes[0])
effect
diff    3.511308
dtype: float64
We can see that, on average, individuals who quit smoking gained about 3.5 kg
over the course of 11 years.
The simple example above hides many of the model's parameters.
We now dig a bit deeper into each stage.
Machine learning model:
Any scikit-learn model can be specified (even pipelines)
learner = LogisticRegression(penalty="l1", C=0.01, max_iter=500, solver='liblinear')
The IPW model has two additional parameters:

* `clip_min`, `clip_max`: caliper values to trim very small or very large probabilities
* `use_stabilized`: whether to scale the weights by the treatment prevalence

clip_min = 0.2
clip_max = 0.8
ipw = IPW(learner, clip_min=clip_min, clip_max=clip_max, use_stabilized=False)
ipw.fit(data.X, data.a);
Now we can predict the probability of quitting smoking (`treatment_values=1`)
and validate that our truncation worked:
probs = ipw.compute_propensity(data.X, data.a, treatment_values=1)
probs.between(clip_min, clip_max).all()
True
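The truncation itself is just probability clipping before inversion; a minimal sketch with NumPy (not causallib's internals):

```python
import numpy as np

propensity = np.array([0.05, 0.30, 0.75, 0.95])
clipped = np.clip(propensity, 0.2, 0.8)   # trim extreme probabilities
weights = 1.0 / clipped                   # weights are now bounded in [1/0.8, 1/0.2]
```

Clipping bounds the weights, preventing a few near-zero propensities from dominating the weighted average.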
During the "predict" phase (i.e., computing weights or probabilities),
we can override the parameters we set during initialization:
probs = ipw.compute_propensity(data.X, data.a, treatment_values=1, clip_min=0.0, clip_max=1.0)
probs.between(clip_min, clip_max).all()
False
We can even predict stabilized weights.
However, we will get a warning.
This is because treatment prevalence is estimated from
the training data.
During `fit`, when the model had its initial values, `use_stabilized`
was `False` (the default).
So when calling `compute_weights`
now, the model will use the prevalence of the provided data to estimate treatment prevalence.
This is not a big deal here, since we compute on the same data we trained on, but that need not be the case in general.
(This warning would not exist if we redefined the model with `use_stabilized=True`
and re-trained it.)
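Stabilized weights scale the plain inverse-probability weights by the marginal treatment prevalence, $w_i = \Pr[A=a_i] / \Pr[A=a_i | X_i]$. A toy sketch of that computation (assuming a binary treatment; not causallib's implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.1], [0.9], [0.4], [0.8], [0.2], [0.7]])
a = np.array([0, 1, 0, 1, 0, 1])

model = LogisticRegression().fit(X, a)
prob_received = model.predict_proba(X)[np.arange(len(a)), a]

# Marginal prevalence Pr[A = a_i], estimated from the treatment vector itself.
prevalence = np.where(a == 1, a.mean(), 1 - a.mean())

plain_weights = 1.0 / prob_received
stabilized_weights = prevalence / prob_received
```

Stabilization shrinks the weights (prevalence is at most 1), which typically reduces the variance of the downstream effect estimate.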
stabilized_weights = ipw.compute_weights(data.X, data.a, treatment_values=1,
clip_min=0.0, clip_max=1.0, use_stabilized=True)
weights = ipw.compute_weights(data.X, data.a, treatment_values=1,
clip_min=0.0, clip_max=1.0)
stabilized_weights.eq(weights).all()
causallib/estimation/ipw.py:137: RuntimeWarning: Stabilized is asked, however, the model was not trained using stabilization, and therefore, stabilized weights are taken from the provided treatment assignment.
False
Since IPW utilizes probabilities, we can get a probability (or weight) for each treatment value for every sample:
# ipw.compute_weight_matrix(data.X, data.a).head()
ipw.compute_propensity_matrix(data.X, data.a).head()
 | 0 | 1
---|---|---
0 | 0.800000 | 0.200000 |
1 | 0.800000 | 0.200000 |
2 | 0.715863 | 0.284137 |
3 | 0.521410 | 0.478590 |
4 | 0.765548 | 0.234452 |
We can choose whether we wish for an additive (`diff`) or a multiplicative (`ratio`) effect.
(If the outcome `y` were probabilities, we could also ask for an odds ratio (`or`).)
Providing the weights `w` is optional; if not provided, the weights are simply
calculated again using the provided `X`.
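These effect types are simple arithmetic on the two potential-outcome estimates; a sketch using illustrative values (not tied to the tutorial's exact numbers):

```python
# Potential-outcome estimates (illustrative values).
y1, y0 = 5.26, 1.75

diff = y1 - y0    # additive effect
ratio = y1 / y0   # multiplicative effect

# For a probability-valued outcome, an odds ratio would be:
def odds(p):
    return p / (1 - p)
# odds_ratio = odds(y1) / odds(y0)  # only meaningful when y1, y0 are probabilities
```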
outcomes = ipw.estimate_population_outcome(data.X, data.a, data.y, w=weights)
effects = ipw.estimate_effect(outcomes[1], outcomes[0], effect_types=["diff", "ratio"])
effects
diff     2.753878
ratio    2.141800
dtype: float64
We can also evaluate the performance of the IPW model.
`evaluate` evaluates a fitted model on the provided dataset:
results = evaluate(ipw, data.X, data.a, data.y)
`results` contains `models`, `plots`, and `scores`,
but since we did not ask for plots, and did not refit the model,
our main interest is the scores.
We have both the prediction performance scores and
a Table 1-style summary of standardized mean differences, with and without balancing.
results.evaluated_metrics.prediction_scores
 | accuracy | precision | recall | f1 | roc_auc | avg_precision | hinge | matthews | 0_1 | brier | confusion_matrix | roc_curve | pr_curve
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.743934 | 0.541667 | 0.032258 | 0.06089 | 0.613622 | 0.364679 | 1.116461 | 0.081138 | 0.256066 | 0.184569 | [[1152, 11], [390, 13]] | ([0.0, 0.0008598452278589854, 0.00085984522785... | ([0.25734355044699875, 0.2879799666110184, 0.2... |
results.evaluated_metrics.covariate_balance.head()
covariate | abs_smd (weighted) | abs_smd (unweighted)
---|---|---
age | 0.048381 | 0.199560 |
race | 0.167261 | 0.125194 |
sex | 0.145836 | 0.113323 |
smokeintensity | 0.018808 | 0.153345 |
smokeyrs | 0.021438 | 0.112470 |
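For a single covariate, the absolute standardized mean difference can be sketched as follows (a simplified version; causallib's exact variance pooling may differ):

```python
import numpy as np

def abs_smd(x, a, w=None):
    """Absolute standardized mean difference of covariate x between groups a==1 and a==0,
    with optional sample weights w (defaults to uniform weights)."""
    if w is None:
        w = np.ones_like(x, dtype=float)
    m1 = np.average(x[a == 1], weights=w[a == 1])
    m0 = np.average(x[a == 0], weights=w[a == 0])
    # Pool the (unweighted) group variances for the denominator.
    v1, v0 = x[a == 1].var(ddof=1), x[a == 0].var(ddof=1)
    return abs(m1 - m0) / np.sqrt((v1 + v0) / 2)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
a = np.array([0, 0, 0, 1, 1, 1])
```

Passing the IP-weights as `w` yields the "weighted" column of the table above; passing nothing yields the "unweighted" one.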
We can also check the general model specification by evaluating with cross-validation, refitting the model on each fold:
from sklearn import metrics

metrics_to_evaluate = {"roc_auc": metrics.roc_auc_score,
                       "avg_precision": metrics.average_precision_score}
ipw = IPW(LogisticRegression(solver="liblinear"))
results = evaluate(ipw, data.X, data.a, data.y, cv="auto", metrics_to_evaluate=metrics_to_evaluate)
results.plot_all()
(Returns a dictionary of matplotlib Axes for the 'train' and 'valid' phases, covering the calibration, covariate balance (love and slope), propensity/weight distribution, ROC curve, and PR curve plots.)
Let's focus on a commonly used plot, the Love plot. A Love plot shows the standardized mean difference between groups for each covariate. It also shows the inverse-propensity-weighted version, so we can examine whether the weighting made the two groups more similar (smaller difference in means). We then plot the two values (weighted and unweighted mean differences) for each covariate, hoping the weighted differences fall below some arbitrary threshold (usually 0.1).
fig, ax = plt.subplots(1, 1, figsize=(6, 6))
results.plot_covariate_balance(kind="love", ax=ax, thresh=0.1);
The plot above checks for the standardized difference of each covariate.
It is very intuitive and easy to understand.
However, by definition, the standardized mean difference is a marginal estimate, testing each covariate independently.
Unfortunately, marginal balancing does not guarantee joint balancing in the high-dimensional covariate space (see Figure 1 in this manuscript).
To overcome that, we can add feature-interactions when checking for balancing.
Note that we do not necessarily need to fit our model on data that includes the interactions; we only need to test on such data.
Below we show a simple example of adding some interactions and plotting. The new interaction covariates appear in the plot and remain well balanced. This is further evidence that our model performs well, since it also balances the joint distribution of the covariates.
from patsy import dmatrix
# Define some basic feature interactions using Patsy's formula:
X_interactions = dmatrix(
"0 + age:sex + age:race + smokeyrs:sex + wt71:sex",
data.X,
return_type="dataframe"
)
# Set the entire data to be evaluated, but can also just use the interactions without the original data
results.X = data.X.join(X_interactions)
fig, ax = plt.subplots(1, 1, figsize=(6,6))
results.plot_covariate_balance(kind="love", phase="train", ax=ax, thresh=0.1);
These plots are highly interpretable, but note that we had to explicitly choose the interactions in order to imitate the joint covariate space.
This can be laborious, and the number of terms blows up quickly (our 18-column dataset has 172 possible interactions, and those are just the two-way ones).
Below we will show how to use the ROC curve to assess high-dimensional joint balancing.
We don't have to, but for the sake of order, let's revert the data in our `EvaluationResults`
object back to the original one:
results.X = data.X
Following the discussion above, let's focus on the orange "Weighted" line. This is an inverse-propensity weighted ROC curve. As a reminder, the IP-weights simulate a pseudo-population where treatment is ignorable, meaning the (weighted) covariate distribution should be similar across treatment groups. This means there shouldn't be any information that can help us discriminate the treated units from the control ones. Theoretically, to test that in a high-dimensional setting, we could use a classifier and see whether it can discriminate the treatment assignment. Ideally, it shouldn't be able to. Using an IP-weighted ROC curve is an efficient proxy for that classifier-based test.
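scikit-learn's `roc_auc_score` accepts a `sample_weight` argument, so a weighted AUC can be computed directly; a toy sketch of the idea (not how causallib builds its plot):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

a = np.array([0, 0, 1, 0, 1, 1])                       # treatment assignment
propensity = np.array([0.2, 0.6, 0.4, 0.3, 0.7, 0.5])  # predicted Pr[A=1 | X]

# Inverse probability of the treatment actually received.
w = 1.0 / np.where(a == 1, propensity, 1 - propensity)

auc_unweighted = roc_auc_score(a, propensity)
auc_weighted = roc_auc_score(a, propensity, sample_weight=w)
```

With good balancing, the weighted AUC should move toward 0.5, i.e., the weighted pseudo-population gives the propensity score little discriminative power.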
results.plot_roc_curve();
For more information about how to interpret the ROC curve for a propensity in a high-level intuitive way, see this blog post.
results.evaluated_metrics.prediction_scores
phase | fold | roc_auc | avg_precision
---|---|---|---
train | 0 | 0.658936 | 0.406668
train | 1 | 0.662192 | 0.411784
train | 2 | 0.662382 | 0.411057
train | 3 | 0.661691 | 0.416513
train | 4 | 0.667862 | 0.396350
valid | 0 | 0.660308 | 0.393794
valid | 1 | 0.649410 | 0.351309
valid | 2 | 0.663573 | 0.402053
valid | 3 | 0.583919 | 0.320748
valid | 4 | 0.621328 | 0.414399
print(len(results.models))
results.models[2]
5
IPW(clip_max=None, clip_min=None, use_stabilized=False, verbose=False, learner=LogisticRegression(solver='liblinear'))