When performing matching on a sample set, we may want to use non-standard distance measures or faster implementations. The default behavior is to use the scikit-learn `NearestNeighbors` object, but that can be overridden using the `knn_backend` keyword argument when initializing a `Matching`, `PropensityMatching`, or `MatchingTransformer` object.

In this notebook, we show how to use the `faiss`-based backend provided in the `causallib.contrib` module. This leads to a speed-up of 5x or more on the full Lalonde dataset, as shown below.

We also show how to use a custom metric function, by implementing the log odds ratio of the propensity score as a distance metric and matching on it.

To see the speed-up, we load the augmented Lalonde dataset as in the Lalonde matching notebook. The next few cells are the same as in that notebook.

In [1]:

```
import pandas as pd
import numpy as np

columns = ["training",   # Treatment assignment indicator
           "age",        # Age of participant
           "education",  # Years of education
           "black",      # Indicates whether individual is black
           "hispanic",   # Indicates whether individual is hispanic
           "married",    # Indicates whether individual is married
           "no_degree",  # Indicates whether individual has no high-school diploma
           "re74",       # Real earnings in 1974, prior to study participation
           "re75",       # Real earnings in 1975, prior to study participation
           "re78"]       # Real earnings in 1978, after study end

file_names = ["http://www.nber.org/~rdehejia/data/nswre74_treated.txt",
              "http://www.nber.org/~rdehejia/data/nswre74_control.txt",
              "http://www.nber.org/~rdehejia/data/psid_controls.txt",
              "http://www.nber.org/~rdehejia/data/psid2_controls.txt",
              "http://www.nber.org/~rdehejia/data/psid3_controls.txt",
              "http://www.nber.org/~rdehejia/data/cps_controls.txt",
              "http://www.nber.org/~rdehejia/data/cps2_controls.txt",
              "http://www.nber.org/~rdehejia/data/cps3_controls.txt"]
files = [pd.read_csv(file_name, delim_whitespace=True, header=None, names=columns)
         for file_name in file_names]
lalonde = pd.concat(files, ignore_index=True)
lalonde = lalonde.sample(frac=1.0, random_state=42)  # Shuffle
print(lalonde.shape)
lalonde.head()
```

Out[1]:

In [2]:

```
print(f'The dataset contains {lalonde.shape[0]} people, out of which {lalonde["training"].sum():.0f} received training')
```

In [3]:

```
lalonde = lalonde.join((lalonde[["re74", "re75"]] == 0).astype(int), rsuffix="=0")
lalonde.head()
```

Out[3]:

In [4]:

```
a = lalonde.pop("training")
y = lalonde.pop("re78")
X = lalonde
X.shape, a.shape, y.shape
```

Out[4]:

In the Lalonde matching notebook, we saw that the full Lalonde dataset is rather slow to execute matching on. This can be sped up substantially by using the `faiss` backend found in the `contrib` module. It will use GPU acceleration if available, falling back to CPU if not. The timings below were generated using CPU only (Intel i7-9750H). Use of this backend requires installing the `faiss-gpu` or `faiss-cpu` package from PyPI.
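As a setup step (assuming a standard pip environment; faiss is not installed as a causallib dependency):

```shell
# Install the CPU build of faiss; use faiss-gpu instead for GPU support
pip install faiss-cpu
```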

In [5]:

```
from causallib.estimation import Matching
from causallib.contrib.faissknn import FaissNearestNeighbors
```

The `"sklearn"`

backend is not optimized for speed and takes approximately 2 minutes to fit and match.

⚠️**WARNING**⚠️: the `%%timeit`

blocks may take a long time to execute because they run several trials. If you want to run this notebook for any purpose other than timing the difference in backends you may want to comment the first line of the following cells out.

In [6]:

```
%%timeit -n 2 -r 3
m = Matching(knn_backend="sklearn")
m.fit(X,a,y)
y_potential_outcomes = m.estimate_population_outcome(X,a)
```

The `knn_backend` argument can be a callable returning a `NearestNeighbors`-like object, or such an object directly. If it is an object, it will be copied and fit for each treatment value in the data. Here we use the class name as a callable:
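To illustrate what "`NearestNeighbors`-like" means here, the sketch below implements the minimal `fit`/`kneighbors` interface by brute force. The class name `BruteForceNN` is hypothetical and only for illustration; in practice you would pass `NearestNeighbors`, `FaissNearestNeighbors`, or a similarly optimized implementation.

```python
import numpy as np

class BruteForceNN:
    """Hypothetical minimal backend exposing the fit/kneighbors interface."""
    def __init__(self, n_neighbors=1):
        self.n_neighbors = n_neighbors

    def fit(self, X):
        # Store the reference points to search against
        self._X = np.asarray(X, dtype=float)
        return self

    def kneighbors(self, X, n_neighbors=None):
        k = n_neighbors or self.n_neighbors
        X = np.asarray(X, dtype=float)
        # Pairwise Euclidean distances, computed by brute force
        d = np.linalg.norm(X[:, None, :] - self._X[None, :, :], axis=-1)
        idx = np.argsort(d, axis=1)[:, :k]
        return np.take_along_axis(d, idx, axis=1), idx

dist, idx = BruteForceNN().fit([[0.0], [1.0], [2.0]]).kneighbors([[0.9]])
```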

In [7]:

```
%%timeit -n 2 -r 3
m = Matching(knn_backend=FaissNearestNeighbors)
m.fit(X,a,y)
y_potential_outcomes = m.estimate_population_outcome(X,a)
```

Here we use an instance of the `FaissNearestNeighbors` class. For the supported options, see the documentation of `FaissNearestNeighbors`.

In [8]:

```
%%timeit -n 2 -r 3
m = Matching(knn_backend=FaissNearestNeighbors(index_type="ivfflat"))
m.fit(X,a,y)
y_potential_outcomes = m.estimate_population_outcome(X,a)
```

When comparing the difference between two samples using their propensity scores, a raw difference may be misleading: a difference between 0.01 and 0.05 is much more meaningful than a difference between 0.51 and 0.55. In *Causal Inference for Statistics, Social, and Biomedical Sciences*, section 18.5, Imbens and Rubin recommend taking the log odds ratio $$l(x) = \ln\left(\frac{x}{1 - x}\right)$$ and comparing differences in propensity scores on that scale. This is not the default behavior of `Matching`, but it is easy to implement.

In [9]:

```
def logodds(x):
    return np.log(x / (1 - x))


def logodds_distance(x, y):
    return np.abs(logodds(x) - logodds(y))


def check_difference(x, y):
    print("({x:.2f},{y:.2f}): original distance: {d1:.2f} logodds distance: {d2:.2f}"
          .format_map({"x": x, "y": y, "d1": np.abs(x - y), "d2": logodds_distance(x, y)}))


check_difference(0.01, 0.05)
check_difference(0.51, 0.55)
```

This is useful for matching because a sample with propensity 0.51 may be a better match for a sample with propensity 0.55 than a sample with propensity 0.01 is for one with propensity 0.05.

We implement this by passing the `logodds_distance` function to `NearestNeighbors` and using that as the `knn_backend` for `PropensityMatching`.

In [10]:

```
from causallib.estimation import PropensityMatching
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression
logodds_knn = NearestNeighbors(metric=logodds_distance)
```

For speed and ease of comparison, we will load the NHEFS data that is provided with `causallib`.

In [11]:

```
from causallib.datasets import load_nhefs
data_nhefs = load_nhefs(augment=False, onehot=False)
X_nhefs, a_nhefs, y_nhefs = data_nhefs.X, data_nhefs.a, data_nhefs.y
```

We can estimate the population outcome by propensity score matching with the log odds ratio distance, where the propensity model is logistic regression.

In [12]:

```
pm_nhefs_log = PropensityMatching(
    learner=LogisticRegression(solver="liblinear"),
    knn_backend=logodds_knn,
).fit(X_nhefs, a_nhefs, y_nhefs)
pm_nhefs_log.estimate_population_outcome(X_nhefs, a_nhefs)
```

Out[12]:

For comparison, we fit the same data with a standard Euclidean metric.

In [13]:

```
pm_nhefs_lin = PropensityMatching(
    learner=LogisticRegression(solver="liblinear"),
    knn_backend="sklearn",
).fit(X_nhefs, a_nhefs, y_nhefs)
pm_nhefs_lin.estimate_population_outcome(X_nhefs, a_nhefs)
```

Out[13]:

We could similarly use the log odds ratio on the full Lalonde dataset, with a caliper to address the imbalances discussed in the Lalonde matching notebook:

In [14]:

```
pm_lalonde_log = PropensityMatching(
    learner=LogisticRegression(solver="liblinear"),
    knn_backend=logodds_knn,
    caliper=0.01,
).fit(X, a, y)
pm_lalonde_log.estimate_population_outcome(X, a)
```

Out[14]:

The `FaissNearestNeighbors` object does not currently support metrics other than Mahalanobis and Euclidean.

**NOTE**: In practice, alternative metrics that can be expressed as transforms on the covariates, such as the log odds ratio, should be implemented in a `propensity_transform` object, not inside the metric function. This substantially speeds up run time, because custom metric functions are slower than the built-in metrics of `NearestNeighbors` and are not supported by the `FaissNearestNeighbors` backend.
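To make the note concrete, the sketch below shows the equivalence that justifies it: applying the log odds as a transform on the propensity scores and then using plain Euclidean (absolute) distance reproduces `logodds_distance` exactly, so the transform can live outside the metric. The use of `FunctionTransformer` here is purely illustrative, not the `propensity_transform` API itself.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Illustrative: transform propensity scores to the log odds scale up front
logodds_transform = FunctionTransformer(lambda p: np.log(p / (1 - p)))

p = np.array([[0.01], [0.05], [0.51], [0.55]])
t = logodds_transform.transform(p)

# |l(x) - l(y)| computed on the transformed scores ...
d_transformed = np.abs(t[0, 0] - t[1, 0])
# ... equals the custom log odds distance on the raw scores
d_custom = np.abs(np.log(0.01 / 0.99) - np.log(0.05 / 0.95))
```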