When performing matching on a sample set, we may want to use non-standard distance measures or faster implementations. The default behavior is to use the scikit-learn `NearestNeighbors` object, but that can be overridden using the `knn_backend` keyword argument when initializing a `Matching`, `PropensityMatching`, or `MatchingTransformer` object.
In this notebook, we show how to use the `faiss`-based backend provided in the `causallib.contrib` module. This leads to a speed-up of 5x or more on the full Lalonde dataset, as shown below. We also show how to use a custom metric function, implementing a log odds ratio of the propensity score at the level of the distance metric and matching on it.
To see the speedup, we load the augmented Lalonde dataset as in the Lalonde notebook. The next few cells are the same as there.
import pandas as pd
import numpy as np
columns = ["training",   # Treatment assignment indicator
           "age",        # Age of participant
           "education",  # Years of education
           "black",      # Indicate whether individual is black
           "hispanic",   # Indicate whether individual is hispanic
           "married",    # Indicate whether individual is married
           "no_degree",  # Indicate if individual has no high-school diploma
           "re74",       # Real earnings in 1974, prior to study participation
           "re75",       # Real earnings in 1975, prior to study participation
           "re78"]       # Real earnings in 1978, after study end
#treated = pd.read_csv("http://www.nber.org/~rdehejia/data/nswre74_treated.txt",
# delim_whitespace=True, header=None, names=columns)
#control = pd.read_csv("http://www.nber.org/~rdehejia/data/nswre74_control.txt",
# delim_whitespace=True, header=None, names=columns)
file_names = ["http://www.nber.org/~rdehejia/data/nswre74_treated.txt",
              "http://www.nber.org/~rdehejia/data/nswre74_control.txt",
              "http://www.nber.org/~rdehejia/data/psid_controls.txt",
              "http://www.nber.org/~rdehejia/data/psid2_controls.txt",
              "http://www.nber.org/~rdehejia/data/psid3_controls.txt",
              "http://www.nber.org/~rdehejia/data/cps_controls.txt",
              "http://www.nber.org/~rdehejia/data/cps2_controls.txt",
              "http://www.nber.org/~rdehejia/data/cps3_controls.txt"]
files = [pd.read_csv(file_name, delim_whitespace=True, header=None, names=columns) for file_name in file_names]
lalonde = pd.concat(files, ignore_index=True)
lalonde = lalonde.sample(frac=1.0, random_state=42) # Shuffle
print(lalonde.shape)
lalonde.head()
(22106, 10)
| | training | age | education | black | hispanic | married | no_degree | re74 | re75 | re78 |
|---|---|---|---|---|---|---|---|---|---|---|
| 16827 | 0.0 | 26.0 | 13.0 | 0.0 | 0.0 | 0.0 | 0.0 | 58.778 | 50.12903 | 31.03226 |
| 5412 | 0.0 | 27.0 | 12.0 | 0.0 | 0.0 | 1.0 | 0.0 | 16297.180 | 13429.21000 | 19562.14000 |
| 15399 | 0.0 | 26.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5217.527 | 3174.24200 | 25564.67000 |
| 13077 | 0.0 | 38.0 | 16.0 | 0.0 | 0.0 | 1.0 | 0.0 | 23713.010 | 9178.98400 | 18814.41000 |
| 2189 | 0.0 | 55.0 | 8.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.000 | 0.00000 | 0.00000 |
print(f'The dataset contains {lalonde.shape[0]} people, out of which {lalonde["training"].sum():.0f} received training')
The dataset contains 22106 people, out of which 185 received training
lalonde = lalonde.join((lalonde[["re74", "re75"]] == 0).astype(int), rsuffix=("=0"))
lalonde.head()
| | training | age | education | black | hispanic | married | no_degree | re74 | re75 | re78 | re74=0 | re75=0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16827 | 0.0 | 26.0 | 13.0 | 0.0 | 0.0 | 0.0 | 0.0 | 58.778 | 50.12903 | 31.03226 | 0 | 0 |
| 5412 | 0.0 | 27.0 | 12.0 | 0.0 | 0.0 | 1.0 | 0.0 | 16297.180 | 13429.21000 | 19562.14000 | 0 | 0 |
| 15399 | 0.0 | 26.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5217.527 | 3174.24200 | 25564.67000 | 0 | 0 |
| 13077 | 0.0 | 38.0 | 16.0 | 0.0 | 0.0 | 1.0 | 0.0 | 23713.010 | 9178.98400 | 18814.41000 | 0 | 0 |
| 2189 | 0.0 | 55.0 | 8.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.000 | 0.00000 | 0.00000 | 1 | 1 |
a = lalonde.pop("training")
y = lalonde.pop("re78")
X = lalonde
X.shape, a.shape, y.shape
((22106, 10), (22106,), (22106,))
In the Lalonde matching notebook, we saw that matching on the full Lalonde dataset is rather slow to execute. This can be sped up substantially by using the `faiss` backend found in the `contrib` module. It uses GPU acceleration if available, falling back to CPU if not. The timings below were generated using CPU only (Intel i7-9750H). Use of this backend requires installing the `faiss-gpu` or `faiss-cpu` package from PyPI.
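For example, the CPU-only variant can be installed from PyPI (a one-time setup step, not part of the notebook itself):

```shell
# CPU-only build; use faiss-gpu instead on CUDA-enabled machines
pip install faiss-cpu
```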
from causallib.estimation import Matching
from causallib.contrib.faissknn import FaissNearestNeighbors
The `"sklearn"` backend is not optimized for speed and takes approximately 2 minutes to fit and match.

⚠️WARNING⚠️: the `%%timeit` blocks may take a long time to execute because they run several trials. If you want to run this notebook for any purpose other than timing the difference in backends, you may want to comment out the first line of the following cells.
%%timeit -n 2 -r 3
m = Matching(knn_backend="sklearn")
m.fit(X,a,y)
y_potential_outcomes = m.estimate_population_outcome(X,a)
1min 51s ± 13.2 s per loop (mean ± std. dev. of 3 runs, 2 loops each)
The `knn_backend` argument can be a callable returning a `NearestNeighbors`-like object, or such an object directly. If it is an object, that object will be copied and fit for each treatment value in the data. Here we use the class name as a callable:
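The callable-vs-instance convention can be illustrated outside causallib with a hypothetical helper (`make_backend` is not part of the library; it only mirrors the described behavior using scikit-learn's `clone`):

```python
from sklearn.base import clone
from sklearn.neighbors import NearestNeighbors

def make_backend(knn_backend):
    """Hypothetical helper mirroring the knn_backend convention:
    a callable is invoked to build a fresh object; an instance is
    cloned so each treatment group gets its own unfitted copy."""
    if callable(knn_backend):
        return knn_backend()
    return clone(knn_backend)

knn_a = make_backend(NearestNeighbors)                  # class used as a callable
knn_b = make_backend(NearestNeighbors(n_neighbors=7))   # instance, cloned with its settings
assert knn_b.n_neighbors == 7
assert knn_a is not knn_b
```

Cloning (rather than reusing) the instance matters because a fitted `NearestNeighbors` object holds the data it was fit on, and each treatment group must be indexed separately.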
%%timeit -n 2 -r 3
m = Matching(knn_backend=FaissNearestNeighbors)
m.fit(X,a,y)
y_potential_outcomes = m.estimate_population_outcome(X,a)
20.3 s ± 1.51 s per loop (mean ± std. dev. of 3 runs, 2 loops each)
Here we use an instance of the `FaissNearestNeighbors` class. To see the supported options, see the documentation for `FaissNearestNeighbors`.
%%timeit -n 2 -r 3
m = Matching(knn_backend=FaissNearestNeighbors(index_type="ivfflat"))
m.fit(X,a,y)
y_potential_outcomes = m.estimate_population_outcome(X,a)
18.3 s ± 97.2 ms per loop (mean ± std. dev. of 3 runs, 2 loops each)
When comparing the difference between two samples using their propensity scores, a raw difference may be misleading: the difference between 0.01 and 0.05 is much more meaningful than the difference between 0.51 and 0.55. In *Causal Inference for Statistics, Social, and Biomedical Sciences*, section 18.5, Imbens and Rubin recommend taking the log odds ratio $$l(x) = \ln\left(\frac{x}{1-x}\right)$$ and comparing differences in propensity scores on that scale. This is not the default behavior of `Matching`, but it is easy to implement.
def logodds(x):
    return np.log(x / (1 - x))

def logodds_distance(x, y):
    return np.abs(logodds(x) - logodds(y))

def check_difference(x, y):
    print("({x:.2f},{y:.2f}): original distance: {d1:.2f} logodds distance: {d2:.2f}"
          .format_map({"x": x, "y": y, "d1": np.abs(x - y), "d2": logodds_distance(x, y)}))
check_difference(0.01,0.05)
check_difference(0.51,0.55)
(0.01,0.05): original distance: 0.04 logodds distance: 1.65
(0.51,0.55): original distance: 0.04 logodds distance: 0.16
This is useful for matching because matching a sample with propensity 0.51 to one with propensity 0.55 may represent a better match than matching a sample with propensity 0.01 to one with 0.05.
We implement this by passing the `logodds_distance` function to `NearestNeighbors` and using that as the `knn_backend` for `PropensityMatching`.
from causallib.estimation import PropensityMatching
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression
logodds_knn = NearestNeighbors(metric=logodds_distance)
For speed and ease of comparison, we load the NHEFS data that is provided with `causallib`.
from causallib.datasets import load_nhefs
data_nhefs = load_nhefs(augment=False,onehot=False)
X_nhefs, a_nhefs, y_nhefs = data_nhefs.X, data_nhefs.a, data_nhefs.y
We can estimate the population outcome by doing propensity score matching with the log odds ratio distance, where the propensity model is a logistic regression.
pm_nhefs_log = PropensityMatching(
learner=LogisticRegression(solver="liblinear"),
knn_backend=logodds_knn,
).fit(X_nhefs, a_nhefs, y_nhefs)
pm_nhefs_log.estimate_population_outcome(X_nhefs, a_nhefs)
0    1.802614
1    4.560399
dtype: float64
For comparison, we fit the same data with a standard Euclidean metric.
pm_nhefs_lin = PropensityMatching(
learner=LogisticRegression(solver="liblinear"),
knn_backend="sklearn"
).fit(X_nhefs,a_nhefs,y_nhefs)
pm_nhefs_lin.estimate_population_outcome(X_nhefs,a_nhefs)
0    1.802614
1    4.548021
dtype: float64
We could similarly use the log odds ratio on the full Lalonde dataset, with a caliper to address the imbalances discussed in the Lalonde matching notebook:
pm_lalonde_log = PropensityMatching(
learner=LogisticRegression(solver="liblinear"),
knn_backend=logodds_knn,
caliper=0.01,
).fit(X, a, y)
pm_lalonde_log.estimate_population_outcome(X, a)
0.0    6340.629027
1.0    7201.100363
dtype: float64
The `FaissNearestNeighbors` object does not currently support metrics other than Mahalanobis and Euclidean.

NOTE: In practice, alternative metrics that can be expressed as transforms on the covariates, such as the log odds ratio, should be implemented as a `propensity_transform` object, not inside the metric function. This substantially speeds up run time, because custom functions are slower than the built-in metrics for `NearestNeighbors`, and are not supported by the `FaissNearestNeighbors` backend.
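The transform-based approach can be checked with plain scikit-learn (a sketch with hypothetical propensity values, using `NearestNeighbors` directly rather than causallib's `propensity_transform` machinery): since $|l(x) - l(y)|$ is exactly the Euclidean distance between log-odds-transformed scores, transforming once and matching with the default metric returns the same neighbors as the custom metric function.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def logodds(x):
    return np.log(x / (1 - x))

# Hypothetical propensity scores for illustration
ps = np.array([[0.01], [0.05], [0.51], [0.55], [0.90]])

# Slow path: custom Python metric evaluated on raw propensities
def logodds_distance(x, y):
    return abs(logodds(x[0]) - logodds(y[0]))

knn_custom = NearestNeighbors(n_neighbors=2, metric=logodds_distance).fit(ps)
_, idx_custom = knn_custom.kneighbors(ps)

# Fast path: transform once, then use the default (Euclidean) metric
knn_fast = NearestNeighbors(n_neighbors=2).fit(logodds(ps))
_, idx_fast = knn_fast.kneighbors(logodds(ps))

assert (idx_custom == idx_fast).all()  # identical neighbors, built-in metric speed
```

Because the neighbors are identical, moving the log odds into a transform loses nothing while making fast built-in (or faiss) metrics available.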