Why Causal Analysis is Needed¶

In this example we will perform a quick causal analysis to estimate the causal effect of smoking cessation on weight gain over a period of a decade.
We will compare these results to the observed effect, arising from simple descriptive statistics.
We'll then dig deeper into the data and see why such discrepancies arise.

The Data¶

We will use data from an NHEFS observational study. The NHANS (National Health and Nutrition Examionation Survey) Epidemiologic Followup Study (NHEFS) is a national longitudinal study that was jointly initiated by the American National Center for Health Statistics and the American National Institute on Aging. The study was designed to follow up on smokers, some of whom quit smoking, and record their weight (among other measurements) for a period of 11 years: from 1971 to 1982.
More information about the data can be found in the NHEFS Notebook

In [15]:

from causallib.datasets import load_nhefs

data = load_nhefs()
data.X.join(data.a).join(data.y).head()

Out[15]:

	age	race	sex	smokeintensity	smokeyrs	wt71	active_1	education_2	exercise_1	exercise_2	age^2	wt71^2	smokeintensity^2	smokeyrs^2	wt82_71
0	42	1	0	30	29	79.04	0	0	0	1	1764	6247.3216	900	841	-10.093960
1	36	0	0	20	24	58.63	0	1	0	0	1296	3437.4769	400	576	2.604970
2	56	1	1	20	26	56.81	0	1	0	1	3136	3227.3761	400	676	9.414486
3	68	1	0	3	53	59.42	1	0	0	1	4624	3530.7364	9	2809	4.990117
4	40	0	0	20	19	87.09	1	1	1	0	1600	7584.6681	400	361	4.989251

The data is already divided into three components:

X: the confounders needed to be controlled for:
- Demographics: such as age, race, sex and education level (how many years of schools).
- Smoking habits: how many years an individual has been smoking (smokeyrs) and how many cigarettes per day (smokeintensity).
- Well-being habits: how much active a participant is daily (active) and how much exercise they do in recreation (exercise).
- Weight at the start of the study (wt71).
a: the treatment assignment - whether an individual had quit smoking or not.
y: the outcome - how much weight the individual gained during the follow up period of a decade.

Estimating Causal Effect¶

The Causal Model¶

For this demosntation, we will follow the most simple and well known causal model: Inverse Probability of Treatment Weighting (also known as IPTW or IPW).
This model requires estimating the propensity of each participant to stop smoking based on their individual features.
Modeling this $\Pr[A=1|X]$ requires a machine learning model.
causallib allows users to plug in any arbitrary ML model they want. For the sake of simplicity, we will use a logistic regression model.

In [16]:

from causallib.estimation import IPW
from sklearn.linear_model import LogisticRegression

learner = LogisticRegression(penalty='none',  # No regularization, new in scikit-learn 0.21.*
                             solver='lbfgs',
                             max_iter=500)    # Increased to achieve convergence with 'lbfgs' solver
ipw = IPW(learner)

Applying Causal Analysis¶

Once we defined the causal model, we can move on to calculate the potential outcomes and the effect

In [17]:

ipw.fit(data.X, data.a)
potential_outcomes = ipw.estimate_population_outcome(data.X, data.a, data.y)
causal_effect = potential_outcomes[1] - potential_outcomes[0]
causal_effect

Out[17]:

3.5175400213088306

Our causal analysis concluded that, on average, smoking cessation contributed 3.5 kg (7.7 lbs) to the weight gain of people over a period of a decade.
Put differently, had all the participants stoped smoking, they would have weighed 3.5 kg more than if all participants had continued to smoke.

Estimating Observed Effect¶

Say you are tasked with estimating how much smoking cessation contributes to weight gain but you are not familiar with causal analysis.
On the face of it, it might sound simple. You simply measure the weight-gain in the people who continued smoking, you measure the weight gain in people who quit, and compare the two quantities to see how much the people who quit gained more (or less) than the people who persisted.
Basically you measure $\mathbb{E}[Y|A=1] - \mathbb{E}[Y|A=0]$

In [18]:

observed_outcomes = data.y.groupby(data.a).mean()
observed_effect = observed_outcomes[1] - observed_outcomes[0]
observed_effect

Out[18]:

2.540581454955888

So if you were to avoid causal analysis, you would estimate that smoking cessation contributes only 2.5 kg (5.5 lbs) to the weight gain, as opposed to the 3.5 kg we calculated above. This is an under-estimation of close to 30 (!) percent.
What can cause such a discrepancy between the descriptive-statistics calculation and the causal-based estimation?

Analysis of the difference between causal and observed effect¶

We saw that the observed difference due to smoking cessation is 2.5 kg, while the causal inference analysis suggested it is closer to 3.5.
Causal analysis, in its essence, tries to balance between the treatment and control groups before calculating the difference.
Weight gain can be caused by many factors. For example, males tend to be larger than females on average, and they gain more weight on average. What if most of the people who quit smoking were female? That would bring down the observed effect. Our observed analysis didn't account for the different distributions of such features between the treatment groups, and therefore the observed effect is confounded and most likely "contaminated" with spuriuous correlations. Only when controlling for such correlations and eliminating their effect, we can claim that the differences in weight gain are truly only due to the status of smoking.

What the observed difference indicates is only that people who chose to quit smoking were observed to gain 2.5 Kg more than people who chose not to quit smoking.

To get a clue of how the distribution differ between the treated and the controls, we can calculate the difference in means between groups for each variables. We visualize these differences in a plot called Love plot (named after Dr. Thomas Love).

In [19]:

from causallib.evaluation import evaluate
import matplotlib.pyplot as plt

evaluations = evaluate(ipw, data.X, data.a, data.y, metrics_to_evaluate={})
fig, ax = plt.subplots(figsize=(6, 6))
evaluations.plot_covariate_balance(kind="love", ax=ax);

The orange triangles show the differences in averages between for each covariates before applying balancing.
We see that age differs the most between the treatment groups, and we suspect it might be causing the discrepancy in results between the descriptive statistics and the causal analysis.

To further investigate this, let's look at the weight-gain as a function of the age:

In [20]:

%matplotlib inline
import seaborn as sns

ax = sns.scatterplot(data.X["age"], data.y, hue=data.a, alpha=0.7)
ax.get_figure().set_size_inches(9, 5);

/home/micha/causal/lib/python3.8/site-packages/seaborn-0.11.0-py3.8.egg/seaborn/_decorators.py:36: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation.
  warnings.warn(

We observe that for older age people the weight-gain decreases to the point that it is more frequently negative than positive at very high age.

Let's break down the observed difference by age groups of 5 years each:

In [21]:

import pandas as pd

step_size = 5
age_groups = range(data.X["age"].min(), 
                   data.X["age"].max() + step_size, 
                   step_size)
print(list(age_groups))
age_grouped = pd.cut(data.X["age"], bins=age_groups, include_lowest=True)
age_grouped.head()

[25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75]

Out[21]:

0    (40.0, 45.0]
1    (35.0, 40.0]
2    (55.0, 60.0]
3    (65.0, 70.0]
4    (35.0, 40.0]
Name: age, dtype: category
Categories (10, interval[float64, right]): [(24.999, 30.0] < (30.0, 35.0] < (35.0, 40.0] < (40.0, 45.0] ... (55.0, 60.0] < (60.0, 65.0] < (65.0, 70.0] < (70.0, 75.0]]

Now we examine how the weight gain differ in each age-group:

In [22]:

observed_diff_by_age = data.y.groupby([data.a, age_grouped]).mean()
observed_diff_by_age = observed_diff_by_age.xs(1) - observed_diff_by_age.xs(0)
observed_diff_by_age = observed_diff_by_age.rename("observed_effect")
ax = observed_diff_by_age.plot(kind="barh")
ax.set_xlabel("observed_effect");

We see here two interseting trends:

The oldest age group has an observed effect much lower than the others.
The people between ages 40-55 have larger observed difference than the average 2.5 kg.

It means when we do a simple average across all the treated and controls, we are mixing people with different observed effects: $$ \mathbb{E}\left[Y|A=1,\text{Age}\gt70\right] - \mathbb{E}\left[Y|A=0,\text{Age}\gt70\right] \ll \mathbb{E}\left[Y|A=1,\text{Age}\in[40,55]\right] - \mathbb{E}\left[Y|A=0,\text{Age}\in[40,55]\right] $$

This certainly has to be adjusted for. But how much? the overall effect will depend in the number of people in each age group:

In [23]:

age_frequency = age_grouped.value_counts(sort=False)
age_frequency.rename("counts", inplace=True)
ax = age_frequency.plot(kind="barh")
ax.set_xlabel("counts");

In [24]:

by_age = observed_diff_by_age.to_frame().join(age_frequency)
by_age

Out[24]:

	observed_effect	counts
age
(24.999, 30.0]	2.263298	276
(30.0, 35.0]	3.100541	214
(35.0, 40.0]	2.433750	177
(40.0, 45.0]	3.801950	197
(45.0, 50.0]	3.603730	234
(50.0, 55.0]	5.238368	186
(55.0, 60.0]	2.871668	132
(60.0, 65.0]	1.402340	85
(65.0, 70.0]	3.602345	53
(70.0, 75.0]	-9.985035	12

In fact, for the largest groups, ages 40 to 55, the observed effect is larger than the overall average observed effect (2.5kg).
This is somewhat similar to Simpson's paradox: How come combining groups with a large effect results in a small effect?

To check this, we'll dig even deeper into the distribution of treated vs. control in each of these age groups:

In [25]:

tx_distribution_by_age = data.a.groupby(age_grouped).value_counts()
tx_distribution_by_age = tx_distribution_by_age.unstack("qsmk")
tx_distribution_by_age["propensity"] = tx_distribution_by_age[1] / tx_distribution_by_age.sum(axis="columns")

by_age = by_age.join(tx_distribution_by_age)
by_age

Out[25]:

	observed_effect	counts	0	1	propensity
age
(24.999, 30.0]	2.263298	276	228	48	0.173913
(30.0, 35.0]	3.100541	214	160	54	0.252336
(35.0, 40.0]	2.433750	177	136	41	0.231638
(40.0, 45.0]	3.801950	197	146	51	0.258883
(45.0, 50.0]	3.603730	234	181	53	0.226496
(50.0, 55.0]	5.238368	186	133	53	0.284946
(55.0, 60.0]	2.871668	132	81	51	0.386364
(60.0, 65.0]	1.402340	85	55	30	0.352941
(65.0, 70.0]	3.602345	53	38	15	0.283019
(70.0, 75.0]	-9.985035	12	5	7	0.583333

We see that each age group has different proportions between the treated and control, meaning, individuals in each treatment group have different propensity to quit smoking.

We can go on and visualize the observed effect as a function of the propensity of each age group (blue dots), compared to the overall average (green line) and the effect derived from the causal analysis (orange line):

In [26]:

ax = sns.scatterplot(x="propensity", y="observed_effect", data=by_age.reset_index(),
                     size="counts", size_norm=(20, 200))
ax.axhline(y=causal_effect, linestyle="--", color="C1", label="causal effect", zorder=0)
ax.axhline(y=observed_effect, linestyle=":", color="C2", label="observed effect", zorder=0)
ax.legend();

It can be seen that the ratios of those who quit smoking is smaller (small propensity) in the groups for which the effect is large (the top left of the plot). The groups with small effect contribute disproportionately to the observed effect. The overall observed effect is therefore smaller than what would have been observed if the ratios were the same across all groups. In fact, it would be closer to the 3.5 value that is predicted by the causal analysis.

More generally, both the effect size and the prevalence of treatment changes between age groups, and age is therefore a confounder of the effect. Therefore, the overall population effect can only be calculated by weighting the observed group-differences by the size of each age group.
So if we were to average the observed effect and factoring in the size of each age group we would get:

In [27]:

import numpy as np

np.average(by_age["observed_effect"], weights=by_age["counts"])

Out[27]:

3.1002018883702944

Observed effect of 3.1 is closer to the causal effect of 3.5, than to the simple observed effect of 2.5 we originally got.
What we did was to take the already averaged-within-each-age-group effect and average it once more using the sizes of each group.
This is equivalent to providing each individual with a propensity value based only on their age group, and then applying an inverse-probability-weighting scheme to obtain a slightly-more corrected effect.
Let's do IPW by hand:

Building an age-based IPW model by hand¶

In [28]:

# Give each individual their propensity according their age group:
individual_age_weights = age_grouped.map(by_age["propensity"])

# Give each individual the pobability to be in their original treatment group,
# i.e. the treated get their propensity (=probability to be treated) 
#      and the control get 1 minus that propensity (=probability to by untreated)
individual_age_weights = (data.a * individual_age_weights) + ((1-data.a) * (1-individual_age_weights))

# We then inverse these propensity:
individual_age_weights = 1 / individual_age_weights

# We use the inversed propensities to create a weighted outcome:
weighted_outcome = individual_age_weights * data.y

# Finaly, we take the weighted average of the weighted outcome in each treatment group:
weighted_outcome = weighted_outcome.groupby(data.a).sum() / individual_age_weights.groupby(data.a).sum()

# And we can 
weighted_outcome[1] - weighted_outcome[0]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-28-916d9792bddf> in <module>
      5 # i.e. the treated get their propensity (=probability to be treated)
      6 #      and the control get 1 minus that propensity (=probability to by untreated)
----> 7 individual_age_weights = (data.a * individual_age_weights) + ((1-data.a) * (1-individual_age_weights))
      8 
      9 # We then inverse these propensity:

~/causal/lib/python3.8/site-packages/pandas/core/ops/common.py in new_method(self, other)
     67         other = item_from_zerodim(other)
     68 
---> 69         return method(self, other)
     70 
     71     return new_method

~/causal/lib/python3.8/site-packages/pandas/core/arraylike.py in __mul__(self, other)
    106     @unpack_zerodim_and_defer("__mul__")
    107     def __mul__(self, other):
--> 108         return self._arith_method(other, operator.mul)
    109 
    110     @unpack_zerodim_and_defer("__rmul__")

~/causal/lib/python3.8/site-packages/pandas/core/series.py in _arith_method(self, other, op)
   5524 
   5525         with np.errstate(all="ignore"):
-> 5526             result = ops.arithmetic_op(lvalues, rvalues, op)
   5527 
   5528         return self._construct_result(result, name=res_name)

~/causal/lib/python3.8/site-packages/pandas/core/ops/array_ops.py in arithmetic_op(left, right, op)
    216         # Timedelta/Timestamp and other custom scalars are included in the check
    217         # because numexpr will fail on it, see GH#31457
--> 218         res_values = op(left, right)
    219     else:
    220         # TODO we should handle EAs consistently and move this check before the if/else

~/causal/lib/python3.8/site-packages/pandas/core/arrays/categorical.py in __array_ufunc__(self, ufunc, method, *inputs, **kwargs)
   1466         # for all other cases, raise for now (similarly as what happens in
   1467         # Series.__array_prepare__)
-> 1468         raise TypeError(
   1469             f"Object with dtype {self.dtype} cannot perform "
   1470             f"the numpy op {ufunc.__name__}"

TypeError: Object with dtype category cannot perform the numpy op multiply

Indeed we got the same effect as the weighted-average-of-averages above.

However, remember we had only 10 unique propensity values, because we arbitrarily binned the age variable into 10 age groups.
Meanwhile, using a machine learning model allows us to model the propensity using age as a continuous variable.
Let's examine what happens when using the age continuously, by building an IPW model that uses age as its sole feature when modeling propensity to treat.

In [ ]:

learner = LogisticRegression(penalty='none', solver='lbfgs', max_iter=500)
ipw = IPW(learner)
ipw.fit(data.X[["age"]], data.a)
outcomes = ipw.estimate_population_outcome(data.X[["age"]], data.a, data.y)
outcomes[1] - outcomes[0]

Out[ ]:

3.098787136071221

Indeed we got an almost identical effect, correcting only for age in our IPW model.
We can only assume that when correcting for more variables and eliminating more spuriuous correlations in the process, we will get to the causal effect of 3.5 kg.

Summary¶

We saw that there is a discrepancy between the "effect" using simple descriptive statistics, and the causal effect resulting from a causal analysis.
The discrepancy originated from the groups (quitting or not quitting smoking) having different distributions of charachteristics, and thus, are not comparable.
We discovered that participants' age had the largest distributional shift among the treatment groups and focused on it.
We saw that the observed effect differed across different age groups, and that different age groups had different number of people, suggesting a simple average mixes different effects indifferently.
We demonstrated that when accounting for the number of participants in each age group, the "effect" shifted from the original observed-effect towards the causal-effect.
We saw we get similar result when building a causal model that controls only for age (and not for all confounders).