A Python package for causal inference using Bayesian structural time-series models. It is a port of the R package CausalImpact (https://github.com/google/CausalImpact).
This package implements an approach to estimating the causal effect of a designed intervention on a time series. For example, how many additional daily clicks were generated by an advertising campaign? Answering a question like this can be difficult when a randomized experiment is not available.
Given a response time series (e.g., clicks) and a set of control time series (e.g., clicks in non-affected markets or clicks on other sites), the package constructs a Bayesian structural time-series model. This model is then used to predict the counterfactual, i.e., how the response metric would have evolved after the intervention if the intervention had never occurred. For details, see: Brodersen et al., Annals of Applied Statistics (2015).
As with all non-experimental approaches to causal inference, valid conclusions require strong assumptions. In the case of CausalImpact, we assume that there is a set of control time series that were themselves not affected by the intervention. If they were, we might falsely under- or overestimate the true effect, or falsely conclude that there was an effect even though in reality there wasn't. The model also assumes that the relationship between covariates and treated time series, as established during the pre-period, remains stable throughout the post-period (see `impact.model_args["dynamic_regression"]` for a way of relaxing this assumption). Finally, it's important to be aware of the priors that are part of the model (see `impact.model_args["prior_level_sd"]` in particular).
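As a rough sketch, such settings are typically supplied when the object is constructed. The snippet below assumes a `model_args` dictionary keyword mirroring the attribute referenced above; treat the exact keyword and keys as assumptions rather than a definitive API reference, and note that `data`, `pre_period`, and `post_period` are the objects built in the example later in this document.

impact = CausalImpact(
    data, pre_period, post_period,
    model_args={
        "prior_level_sd": 0.01,       # tighter prior on the local level component (assumed key)
        "dynamic_regression": True,   # allow regression coefficients to vary over time (assumed key)
    },
)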
The package is designed to make counterfactual inference as easy as fitting a regression model, but much more powerful, provided the assumptions above are met. The package has a single entry point, the function `CausalImpact()`. Given a response time series and a set of control time series, the function constructs a time-series model, performs posterior inference on the counterfactual, and returns a `CausalImpact` object. The results can be summarized in terms of a table, a verbal description, or a plot.
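In outline, and anticipating the example developed step by step below, a typical analysis looks roughly like this. The `summary()` calls are an assumption that the port mirrors the R package's summary interface; only `run()`, `plot()`, and the `inferences` dataframe are demonstrated later in this document.

impact = CausalImpact(data, pre_period, post_period)  # response + controls, pre/post periods
impact.run()                                          # fit the model and infer the counterfactual
impact.plot()                                         # visual summary of the estimated impact
impact.summary()                                      # tabular summary (assumed to mirror the R interface)
impact.summary("report")                              # verbal description (assumed)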
To install `causalimpact`, run:

pip install causalimpact
Once installed, the package can be imported using:
from causalimpact import CausalImpact
To illustrate how the package works, we create a simple toy dataset. It consists of a response variable `y` and a predictor `x1`. Note that in practice, we'd strive to include many more predictor variables and let the model choose an appropriate subset. The example data has 100 observations. We create an intervention effect by lifting the response variable by 10 units after timepoint 71.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima_process import arma_generate_sample
import matplotlib
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (15, 6)
np.random.seed(1)
# simulate a predictor series and derive the response from it
x1 = arma_generate_sample(ar=[0.999], ma=[0.9], nsample=100) + 100
y = 1.2 * x1 + np.random.randn(100)
# create the intervention effect: lift the response by 10 units after timepoint 71
y[71:100] = y[71:100] + 10
data = pd.DataFrame(np.array([y, x1]).T, columns=["y", "x1"])
We now have a simple data frame with 100 rows and two columns:
data.shape
data.head()
|   | y | x1 |
|---|---|---|
| 0 | 121.308920 | 101.463374 |
| 1 | 120.563149 | 99.448868 |
| 2 | 119.832495 | 99.524170 |
| 3 | 119.433612 | 99.033362 |
| 4 | 119.840664 | 100.779647 |
We can visualize the generated data using:
data.plot();
To estimate the causal effect, we begin by specifying which period in the data should be used for training the model (pre-intervention period) and which period for computing a counterfactual prediction (post-intervention period).
pre_period = [0, 70]
post_period = [71, 99]
This says that time points 0...70 will be used for training, and time points 71...99 will be used for computing predictions. Alternatively, we could specify the periods in terms of dates rather than integer indices; see the time-series example below.
To perform inference, we run the analysis using:
impact = CausalImpact(data, pre_period, post_period)
This initialises a `CausalImpact` object. We then run the analysis:
impact.run()
This instructs the package to assemble a structural time-series model, fit the model (using maximum likelihood estimation by default), and compute estimates of the causal effect.
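The estimates are exposed as a dataframe on the object's `inferences` attribute (the same attribute inspected in the time-series example below), so one way to view the results is:

impact.inferences.head()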
The easiest way of visualising the results is to use the `plot()` method of the `CausalImpact` object:
impact.plot()
By default, the plot contains three panels. The first panel shows the data and a counterfactual prediction for the post-treatment period. The second panel shows the difference between observed data and counterfactual predictions. This is the pointwise causal effect, as estimated by the model. The third panel adds up the pointwise contributions from the second panel, resulting in a plot of the cumulative effect of the intervention.
Remember, once again, that all of the above inferences depend critically on the assumption that the covariates were not themselves affected by the intervention. The model also assumes that the relationship between covariates and treated time series, as established during the pre-period, remains stable throughout the post-period.
It is often more natural to feed a time-series object into `CausalImpact()` rather than a plain data frame. For example, we might create a date-indexed copy of our data as follows:
date_range = pd.date_range(start="2014-01-01", periods=100)
ts_data = data.copy()
ts_data.index = date_range
ts_data.head()
|   | y | x1 |
|---|---|---|
| 2014-01-01 | 121.308920 | 101.463374 |
| 2014-01-02 | 120.563149 | 99.448868 |
| 2014-01-03 | 119.832495 | 99.524170 |
| 2014-01-04 | 119.433612 | 99.033362 |
| 2014-01-05 | 119.840664 | 100.779647 |
We can now specify the pre- and post-periods in terms of dates rather than integer indices:
ts_pre_period = [pd.to_datetime(date) for date in ["2014-01-01", "2014-03-12"]]
ts_post_period = [pd.to_datetime(date) for date in ["2014-03-13", "2014-04-10"]]
As a result, the x-axis of the plot shows time points instead of indices:
ts_impact = CausalImpact(ts_data, ts_pre_period, ts_post_period)
ts_impact.run()
ts_impact.inferences.head(2)
|   | response | cum_response | point_pred | point_pred_lower | point_pred_upper | cum_pred | cum_pred_lower | cum_pred_upper | point_effect | point_effect_lower | point_effect_upper | cum_effect | cum_effect_lower | cum_effect_upper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014-01-01 | 121.308920 | 121.30892 | 121.994854 | -2804.815502 | 3048.805211 | 121.994854 | -2804.815502 | 3048.805211 | -0.685934 | 2926.124423 | -2927.496290 | 0.0 | 0.0 | 0.0 |
| 2014-01-02 | 120.563149 | 241.87207 | 118.618185 | 116.007655 | 121.228715 | 240.613039 | -2688.807847 | 3170.033926 | 1.944964 | 4.555494 | -0.665566 | 0.0 | 0.0 | 0.0 |
ts_impact.plot()