#!/usr/bin/env python
# coding: utf-8

# # Matching Model
# To find the expected effect of the intervention on the population, we match each treated individual with one or more untreated individuals which are "almost the same" as him or her.

# In[1]:


from causallib.estimation import IPW, Matching
import matplotlib.pyplot as plt
import seaborn as sb
import pandas as pd
import numpy as np
from causallib.evaluation.weight_evaluator import calculate_covariate_balance
from sklearn.linear_model import LogisticRegression
from causallib.preprocessing.transformers import PropensityTransformer, MatchingTransformer
from causallib.evaluation import PropensityEvaluator
from causallib.datasets import load_nhefs
get_ipython().run_line_magic('matplotlib', 'inline')


# #### Data:
# The effect of quitting to smoke on weight loss.  
# Data example is taken from [Hernan and Robins Causal Inference Book](https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/)

# When we are looking for nearby data points to match against, the one hot encoding may not be the best choice. Augmented features are also not needed and may introduce bias (eg,  if we inlude `age^2`, we will call one pair of subjects more distant in the age variable than another pair if the base age is older, even though the difference is the same ). So we do not augment the continuous variables, and instead of one hot encoding the categorical variables, we binarize them to "high/low" values.

# In[2]:


def binarize(df, column_name):
    df = df.copy()
    m = df[column_name].median()
    def balance(i): return np.abs(0.5 - (df[column_name] < i).sum()/len(df))
    mstar = min([m-1, m, m+1], key=balance)
    df = df.assign(**{column_name: (df[column_name] < mstar).astype(int)})
    df = df.rename(columns={column_name: column_name + f"<{mstar}"})
    return df


def get_matching_data():
    data = load_nhefs(onehot=False, augment=False)
    data.X = binarize(data.X, "education")
    data.X = binarize(data.X, "exercise")
    data.X = binarize(data.X, "active")
    return data


binarized_data = get_matching_data()
X, a, y = binarized_data.X, binarized_data.a, binarized_data.y


# In[3]:


binarized_data.X.join(binarized_data.a).join(binarized_data.y).head()


# We can run either a Euclidean or a Mahalanobis metric match and predict the individual outcome using `MatchingIndividualOutcomeEstimator`:

# In[4]:


m_euclid = Matching(metric="euclidean").fit(X, a, y)
m_mahalanobis = Matching(metric="mahalanobis").fit(X, a, y)
Y_euclid = m_euclid.estimate_individual_outcome(X, a)
Y_mahalanobis = m_mahalanobis.estimate_individual_outcome(X, a)


# We can see that the two metrics lead to very similar results on a population level.

# In[5]:


Y_euclid.assign(ATE=Y_euclid[1]-Y_euclid[0]).mean()


# In[6]:


Y_mahalanobis.assign(ATE=Y_mahalanobis[1]-Y_mahalanobis[0]).mean()


# If we inspect the individual counterfactuals, we find, as expected that both metrics return the same value for the observed outcome but differ in the unobserved outcome:

# In[7]:


Y_euclid.join(Y_mahalanobis, lsuffix="_euclidean",
              rsuffix="_mahalanobis").join(a).sample(10)


# # Propensity Matching
# 
# To do propensity score matching, we can supply a transformer that replaces the covariates with a learned propensity model, using a given learner.

# In[8]:


propensity_transform = PropensityTransformer(
    learner=LogisticRegression(
        solver="liblinear",
        class_weight="balanced"),
    include_covariates=False)


# In this case we will want to use the augmented data to improve the accuracy of the propensity model.  We can calculate the ATE:

# In[9]:


augmented_data = load_nhefs()
X, a, y = augmented_data.X, augmented_data.a, augmented_data.y
matcher = Matching(propensity_transform=propensity_transform)
matcher.fit(X, a, y)
matcher.estimate_population_outcome(X, a)


# We have also provided a convenience subclass `PropensityMatching` which makes this common task straightforward:

# In[10]:


from causallib.estimation import PropensityMatching

pm = PropensityMatching(learner=LogisticRegression(
    solver="liblinear",
    class_weight="balanced"))
pm.fit(X, a, y)
pm.estimate_population_outcome(X, a)


# ## Multiple neighbor match (with replacement)
# As long as we permit replacement, we can allow multiple neighbors to match. We now check how the number of neighbors impacts the ATE.

# In[11]:


for n in range(1, 10):
    matcher.n_neighbors = n
    matcher.fit(X, a, y)
    Y = matcher.estimate_population_outcome(X, a)
    print(f"Using {n} neighbors, the effect is: {(Y[1] - Y[0]):.3f}")


# ## Replacement 
# 
# Until now, we have executed all of the matching with replacement, meaning that we can select the same treated sample as a match for multiple control samples or vice versa. If we want to only allow each sample to be used once, we must disallow replacement. 
# 
# If we mix in-sample and out-of-sample data, we would end up generating different estimated counterfactuals for a set of samples if they were checked all at once compared to if they were checked in subsets. Because of this, we have restricted no-replacement matching to operate on a single dataset only as a `PopulationOutcomeEstimator` called `Matching`.

# In[12]:


matcher = Matching(with_replacement=True, propensity_transform=propensity_transform)
matcher.fit(X, a, y)
match_df_with = matcher.match(X, a)
ATE_with_replacement = matcher.estimate_population_outcome(X, a).diff()[1]


# In[13]:


matcher = Matching(with_replacement=False, propensity_transform=propensity_transform)
matcher.fit(X, a, y)
match_df_without = matcher.match(X, a)
ATE_without_replacement = matcher.estimate_population_outcome(X, a).diff()[1]


# In[14]:


print(
    f"With replacement we find:\n{ATE_with_replacement:.3f}\nWithout replacement we find:\n{ATE_without_replacement:.3f}")


# In[15]:


ipw = IPW(LogisticRegression(solver="liblinear"))
ipw.fit(augmented_data.X, augmented_data.a)
Yipw = ipw.estimate_population_outcome(
    augmented_data.X, augmented_data.a, augmented_data.y)
ATE_ipw = Yipw[1] - Yipw[0]
ATE_naive = y[a == 1].mean() - y[a == 0].mean()

print(
    f"With IPW we find:\n{ATE_ipw:.3f}\nand the naive estimate is:\n{ATE_naive:.3f}")


# ## Caliper
# 
# Often we want to impose a restriction on the proximity of examples so that we do not permit any match if there is more than a distance $\kappa$ between the samples. We call this a "caliper" and we can see how it impacts the predicted effect here:

# In[16]:


caliper = np.logspace(-3, 0, 20)


def check_caliper(c, with_replacement=True):
    matcher = Matching(propensity_transform=propensity_transform,
                       caliper=c, with_replacement=with_replacement)
    matcher.fit(augmented_data.X, augmented_data.a, augmented_data.y)
    Y = matcher.estimate_population_outcome(
        augmented_data.X, augmented_data.a,)
    p = matcher.samples_used_.sum() / len(augmented_data.y)
    return p, (Y[1] - Y[0])


p_with, ATE_with = zip(
    *[check_caliper(c, with_replacement=True) for c in caliper])
p_without, ATE_without = zip(
    *[check_caliper(c, with_replacement=False) for c in caliper])


# In[17]:


with sb.axes_style("dark") as s:
    f, ax = plt.subplots(1, 2, figsize=(16, 6))

    ax[0].semilogx(caliper, p_with, "blue", label="with replacement")
    ax[0].semilogx(caliper, p_without, "purple", label="no replacement")
    ax[0].set_ylabel("fraction of data matched ", color="blue")
    ax[0].legend()
    ax[0].set_xlabel("caliper")

    ax[1].semilogx(caliper, ATE_with, "green",
                   label="matching (with replacement)")
    ax[1].semilogx(caliper, ATE_without, "orange",
                   label="matching (no replacement)")
    ax[1].set_ylabel("ATE", color="green")
    ax[1].hlines(xmin=caliper.min(), xmax=caliper.max(),
                 y=ATE_ipw, ls="--", color="green", label="ipw")
    ax[1].hlines(xmin=caliper.min(), xmax=caliper.max(),
                 y=ATE_naive, ls=":", color="green", label="naive")
    ax[1].legend(loc=4)
    ax[1].set_xlabel("caliper");


# # Intermediate results
# The `Matching` object implements `IndividualOutcomeEstimator`, specifically the `fit` and `estimate_individual_outcome` methods. Because matching has many uses, `Matching` also outputs a DataFrame of matches and a vector of weights. These can be used for filtering data before applying regression, for example, or for assessing the quality of the covariate balancing (as shown above).

# In[18]:


matcher = Matching(with_replacement=False, propensity_transform=propensity_transform)
matcher.fit(X, a, y)
match_df = matcher.match(X, a)
match_df


# The matching DataFrame can be understood in the following way: when sample `sample_id` searched for a match with treatment value `match_to_treatment` it found the samples indexed in `matches` which were located at distances according to the `distances` entry. If `matches` is empty it means `sample_id` had no match in treatment class `match_to_treatment`. This allows us to handle the case with multiple matches, as well as with uneven numbers of matches due to the caliper constraint:

# In[19]:


matcher = Matching(with_replacement=True, 
                   propensity_transform=propensity_transform, 
                   n_neighbors=3,
                   caliper=0.001).fit(X, a, y)
match_df_3 = matcher.match(X, a)
match_df_3


# We can also output weights which can be useful for comparing with other methods or preparing the data for further processing:
# 

# In[20]:


weights_df = matcher.matches_to_weights(match_df)
weights_df


# We distinguish between matching from 0 to 1 and from 1 to 0 because in general they are distinct processes which can have different weights. For the case with no replacement the two columns are identical and always only 1 or 0. We see this when we examine the weights obtained with replacement, caliper and 3 neighbors.

# In[21]:


weights_df_3 = matcher.matches_to_weights(match_df_3)
weights_df_3


# The columns no longer match because a given treatment sample may be selected multiple times when matched from the set of control samples. Each selection increases its weight. However, if it is one of $n$ chosen samples it is only increased by $1/n$. 

# ## Compare Covariate Balancing with IPW
# Using the weights we can compare how well the matching algorithm balances the covariate distributions, compared to IPW. Even though IPW is a population effect estimator and matching is an individual outcome estimator, their results can both be expressed as weights and compared using `calculate_covariate_balance`.

# In[22]:


ipw = IPW(LogisticRegression(solver="liblinear"))
ipw.fit(X, a)
ipw_binarized_weights = ipw.compute_weights(X, a)


# In[23]:


matcher = Matching(with_replacement=True, 
                   propensity_transform=propensity_transform, 
                   caliper=0.01).fit(X, a, y)
match_df = matcher.match(X, a)
weights_df = matcher.matches_to_weights(match_df)
covbal = {}
covbal["match_control_to_treatment"] = calculate_covariate_balance(
    X, a, w=weights_df.control_to_treatment)
covbal["match_treatment_to_control"] = calculate_covariate_balance(
    X, a, w=weights_df.treatment_to_control)
covbal["ipw_binarized"] = calculate_covariate_balance(
    X, a, w=ipw_binarized_weights)
covbal["match_both"] = calculate_covariate_balance(
    X, a, w=weights_df.sum(axis=1))


for k in covbal:
    covbal[k] = covbal[k].drop(
        columns="unweighted") if not "both" in k else covbal[k]

covbal_df = pd.concat(covbal, axis=1)
covbal_df.columns = list(covbal.keys()) + ["unweighted"]
covbal_df


# In[24]:


import matplotlib.pyplot as plt
import seaborn as sb
sb.set("notebook")
f, axes = plt.subplots(figsize=(8, 6))
sb.violinplot(data=covbal_df, ax=axes, cut=0)
axes.set_xticklabels(axes.get_xticklabels(), rotation=90);


# In[25]:


sb.heatmap(data=covbal_df);


# Comparing the results of the IPW weights and the matching weights, we see that while both lead to substantially better weighted covariates, the IPW does a better job in general. In the Lalonde matching notebook, we will show how matching and IPW an be used together to obtain even better covariate distribution balancing.