Matching with Custom Backends

When performing matching on a sample set, we may want to use non-standard distance measures or faster implementations. The default behavior is to use the scikit-learn NearestNeighbors object, but this can be overridden via the knn_backend keyword argument when initializing a Matching, PropensityMatching or MatchingTransformer object.

In this notebook, we show how to use the faiss-based backend provided in the causallib.contrib module. This yields a speed-up of 5x or more on the full Lalonde dataset, as shown below.

We also show how to use a custom metric function, implementing the log odds ratio of the propensity score at the level of the distance metric and matching on that scale.

Performance of Matching with Faiss vs Sklearn

To see the speedup, we load the augmented Lalonde dataset as in the Lalonde matching notebook. The next few cells are the same as in that notebook.

In [1]:
import pandas as pd
import numpy as np
columns = ["training",   # Treatment assignment indicator
           "age",        # Age of participant
           "education",  # Years of education
           "black",      # Indicate whether individual is black
           "hispanic",   # Indicate whether individual is hispanic
           "married",    # Indicate whether individual is married
           "no_degree",  # Indicate if individual has no high-school diploma
           "re74",       # Real earnings in 1974, prior to study participation
           "re75",       # Real earnings in 1975, prior to study participation
           "re78"]       # Real earnings in 1978, after study end

file_names = ["http://www.nber.org/~rdehejia/data/nswre74_treated.txt",
              "http://www.nber.org/~rdehejia/data/nswre74_control.txt",
              "http://www.nber.org/~rdehejia/data/psid_controls.txt",
              "http://www.nber.org/~rdehejia/data/psid2_controls.txt",
              "http://www.nber.org/~rdehejia/data/psid3_controls.txt",
              "http://www.nber.org/~rdehejia/data/cps_controls.txt",
              "http://www.nber.org/~rdehejia/data/cps2_controls.txt",
              "http://www.nber.org/~rdehejia/data/cps3_controls.txt"]
files = [pd.read_csv(file_name, delim_whitespace=True, header=None, names=columns) for file_name in file_names]
lalonde = pd.concat(files, ignore_index=True)
lalonde = lalonde.sample(frac=1.0, random_state=42)  # Shuffle

print(lalonde.shape)
lalonde.head()
(22106, 10)
Out[1]:
training age education black hispanic married no_degree re74 re75 re78
16827 0.0 26.0 13.0 0.0 0.0 0.0 0.0 58.778 50.12903 31.03226
5412 0.0 27.0 12.0 0.0 0.0 1.0 0.0 16297.180 13429.21000 19562.14000
15399 0.0 26.0 12.0 0.0 0.0 0.0 0.0 5217.527 3174.24200 25564.67000
13077 0.0 38.0 16.0 0.0 0.0 1.0 0.0 23713.010 9178.98400 18814.41000
2189 0.0 55.0 8.0 0.0 0.0 1.0 1.0 0.000 0.00000 0.00000
In [2]:
print(f'The dataset contains {lalonde.shape[0]} people, out of which {lalonde["training"].sum():.0f} received training')
The dataset contains 22106 people, out of which 185 received training
In [3]:
lalonde = lalonde.join((lalonde[["re74", "re75"]] == 0).astype(int), rsuffix="=0")
lalonde.head()
Out[3]:
training age education black hispanic married no_degree re74 re75 re78 re74=0 re75=0
16827 0.0 26.0 13.0 0.0 0.0 0.0 0.0 58.778 50.12903 31.03226 0 0
5412 0.0 27.0 12.0 0.0 0.0 1.0 0.0 16297.180 13429.21000 19562.14000 0 0
15399 0.0 26.0 12.0 0.0 0.0 0.0 0.0 5217.527 3174.24200 25564.67000 0 0
13077 0.0 38.0 16.0 0.0 0.0 1.0 0.0 23713.010 9178.98400 18814.41000 0 0
2189 0.0 55.0 8.0 0.0 0.0 1.0 1.0 0.000 0.00000 0.00000 1 1
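The join in the cell above appends indicator columns re74=0 and re75=0 that flag zero earnings in the pre-study years. A minimal standalone sketch of the same pattern, on a toy frame with made-up values:

```python
import pandas as pd

# Toy earnings data: the first row has zero earnings in both years.
df = pd.DataFrame({"re74": [0.0, 120.5], "re75": [0.0, 0.0]})

# Comparing the sub-frame to zero yields booleans; joining with
# rsuffix="=0" appends them back as integer indicator columns.
df = df.join((df[["re74", "re75"]] == 0).astype(int), rsuffix="=0")
print(list(df.columns))  # ['re74', 're75', 're74=0', 're75=0']
```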
In [4]:
a = lalonde.pop("training")
y = lalonde.pop("re78")
X = lalonde
X.shape, a.shape, y.shape
Out[4]:
((22106, 10), (22106,), (22106,))

In the Lalonde matching notebook, we saw that the full Lalonde dataset is rather slow to execute matching on. This can be sped up substantially by using the faiss backend found in the contrib module. It uses GPU acceleration if available, falling back to CPU otherwise. The timings below were generated using CPU only (Intel i7-9750H). Using this backend requires installing the faiss-gpu or faiss-cpu package from PyPI.

In [5]:
from causallib.estimation import Matching
from causallib.contrib.faissknn import FaissNearestNeighbors

The "sklearn" backend is not optimized for speed and takes approximately 2 minutes to fit and match.

⚠️WARNING⚠️: the %%timeit blocks may take a long time to execute because they run several trials. If you want to run this notebook for any purpose other than timing the difference between backends, you may want to comment out the first line of the following cells.

In [6]:
%%timeit -n 2 -r 3 
m = Matching(knn_backend="sklearn")
m.fit(X,a,y)
y_potential_outcomes = m.estimate_population_outcome(X,a)
1min 51s ± 13.2 s per loop (mean ± std. dev. of 3 runs, 2 loops each)

The knn_backend argument can be either a callable that returns a NearestNeighbors-like object, or such an object directly. If it is an object, it will be copied and fit separately for each treatment value in the data. Here we pass the class itself as the callable:
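This resolution behaviour can be sketched generically (resolve_backend and DummyKNN below are hypothetical illustrations of the pattern, not causallib's actual internals):

```python
from copy import deepcopy

class DummyKNN:
    """Stand-in for a NearestNeighbors-like backend object."""
    def fit(self, X):
        self.X_ = X
        return self

def resolve_backend(knn_backend):
    # A callable (such as a class) is invoked to build a fresh backend;
    # an instance is deep-copied so each treatment arm gets its own copy.
    return knn_backend() if callable(knn_backend) else deepcopy(knn_backend)

shared = DummyKNN()
per_arm = {a_value: resolve_backend(shared) for a_value in (0, 1)}
per_arm[0].fit([[1.0]])
per_arm[1].fit([[2.0]])
print(per_arm[0] is per_arm[1])  # False: the arms hold independent copies
```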

In [7]:
%%timeit -n 2 -r 3 
m = Matching(knn_backend=FaissNearestNeighbors)
m.fit(X,a,y)
y_potential_outcomes = m.estimate_population_outcome(X,a)
20.3 s ± 1.51 s per loop (mean ± std. dev. of 3 runs, 2 loops each)

Here we use an instance of the FaissNearestNeighbors class; for the supported options, see the documentation of FaissNearestNeighbors.

In [8]:
%%timeit -n 2 -r 3 
m = Matching(knn_backend=FaissNearestNeighbors(index_type="ivfflat"))
m.fit(X,a,y)
y_potential_outcomes = m.estimate_population_outcome(X,a)
18.3 s ± 97.2 ms per loop (mean ± std. dev. of 3 runs, 2 loops each)

Custom metric function: Log odds ratio for propensity comparison

When comparing the difference between two samples using their propensity scores, a raw difference may be misleading. This is because the difference between 0.01 and 0.05 is much more meaningful than a difference between 0.51 and 0.55. In Causal Inference for Statistics, Social, and Biomedical Sciences, section 18.5, Imbens and Rubin recommend taking the "log odds ratio" $$l(x) = \ln\left(\frac{x}{1-x}\right)$$ and comparing differences in propensity scores on that scale. This is not the default behavior of Matching, but it is easy to implement.
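The name "log odds ratio" reflects the fact that a difference of log odds equals the log of an odds ratio:

$$l(x) - l(y) = \ln\left(\frac{x/(1-x)}{y/(1-y)}\right)$$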

In [9]:
def logodds(x):
    return np.log(x / (1 - x))

def logodds_distance(x, y):
    return np.abs(logodds(x) - logodds(y))

def check_difference(x, y):
    print("({x:.2f},{y:.2f}): original distance: {d1:.2f} logodds distance: {d2:.2f}"
          .format_map({"x": x, "y": y, "d1": np.abs(x - y), "d2": logodds_distance(x, y)}))

check_difference(0.01, 0.05)
check_difference(0.51, 0.55)
(0.01,0.05): original distance: 0.04 logodds distance: 1.65
(0.51,0.55): original distance: 0.04 logodds distance: 0.16

This is useful for matching because matching a sample with propensity 0.51 to one with propensity 0.55 may represent a better match than matching a sample with propensity 0.01 to one with 0.05.

We implement this by passing the logodds_distance function to NearestNeighbors and using that as the knn_backend for PropensityMatching.

In [10]:
from causallib.estimation import PropensityMatching
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression

logodds_knn = NearestNeighbors(metric=logodds_distance)
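As a standalone sanity check with made-up propensity scores, a custom-metric NearestNeighbors can pick a different neighbor than the raw Euclidean distance would. (The float(...).sum() wrapping below is only to guarantee the callable metric returns a scalar, since sklearn passes the rows as 1-D arrays.)

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def logodds(x):
    return np.log(x / (1 - x))

def logodds_distance(x, y):
    return float(np.abs(logodds(x) - logodds(y)).sum())

# A query at 0.05: 0.01 is closer in raw difference (0.04 vs 0.05),
# but 0.10 is much closer on the log-odds scale.
candidates = np.array([[0.01], [0.10]])
nn = NearestNeighbors(n_neighbors=1, metric=logodds_distance).fit(candidates)
_, idx = nn.kneighbors([[0.05]])
print(candidates[idx[0, 0], 0])  # 0.1
```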

For speed and ease of comparison, we will load the NHEFS data that is provided with causallib.

In [11]:
from causallib.datasets import load_nhefs

data_nhefs = load_nhefs(augment=False,onehot=False)
X_nhefs, a_nhefs, y_nhefs = data_nhefs.X, data_nhefs.a, data_nhefs.y

We can estimate the population outcome by doing propensity-score matching with the log odds ratio distance, where the propensity model is a logistic regression.

In [12]:
pm_nhefs_log = PropensityMatching(
    learner=LogisticRegression(solver="liblinear"),
    knn_backend=logodds_knn,
    ).fit(X_nhefs, a_nhefs, y_nhefs)
pm_nhefs_log.estimate_population_outcome(X_nhefs, a_nhefs)
Out[12]:
0    1.802614
1    4.560399
dtype: float64

For comparison, we fit the same data with a standard Euclidean metric.

In [13]:
pm_nhefs_lin = PropensityMatching(
    learner=LogisticRegression(solver="liblinear"),
    knn_backend="sklearn"
    ).fit(X_nhefs,a_nhefs,y_nhefs)
pm_nhefs_lin.estimate_population_outcome(X_nhefs,a_nhefs)
Out[13]:
0    1.802614
1    4.548021
dtype: float64

We could similarly use the log odds ratio for the full Lalonde dataset, with a caliper to address the imbalances discussed in the Lalonde matching notebook:

In [14]:
pm_lalonde_log = PropensityMatching(
    learner=LogisticRegression(solver="liblinear"),
    knn_backend=logodds_knn,
    caliper=0.01,
    ).fit(X, a, y)
pm_lalonde_log.estimate_population_outcome(X, a)
Out[14]:
0.0    6340.629027
1.0    7201.100363
dtype: float64

The FaissNearestNeighbors object does not currently support metrics other than Mahalanobis and Euclidean.

NOTE: In practice, alternative metrics that can be expressed as transforms on the covariates, such as the log odds ratio, should be implemented in a propensity_transform object rather than inside the metric function. This substantially speeds up the run time, because custom metric functions are slower than the built-in metrics of NearestNeighbors and are not supported by the FaissNearestNeighbors backend.
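Because $l$ is strictly increasing, computing $|l(x) - l(y)|$ inside the metric is equivalent to transforming the scores by $l$ once and then using the ordinary Euclidean distance, which is exactly what moving the transform out of the metric achieves. A standalone numpy sketch of the equivalence, with made-up scores:

```python
import numpy as np

def logodds(x):
    return np.log(x / (1 - x))

rng = np.random.default_rng(42)
scores = rng.uniform(0.01, 0.99, size=50)    # candidate propensity scores
queries = rng.uniform(0.01, 0.99, size=10)

transformed = logodds(scores)  # transform once, up front
for q in queries:
    # Route A: custom metric, calling logodds inside every comparison.
    a_idx = np.argmin(np.abs(logodds(scores) - logodds(q)))
    # Route B: Euclidean distance on the pre-transformed scores.
    b_idx = np.argmin(np.abs(transformed - logodds(q)))
    assert a_idx == b_idx  # same match, but route B is far cheaper
print("matches agree")
```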