"In this article we discuss differences between experimental and observational data and pitfalls in using the latter for data-driven decision-making."

- toc: true
- branch: master
- badges: true
- comments: true
- categories: [python, dowhy, causal inference]

Wow, growth hacking *and* upselling all in the same article? Also Python.

Okay, let's start at the beginning. Imagine the following scenario: You're responsible for increasing the amount of money users spend on your e-commerce platform.

You and your team come up with different measures you could implement to achieve your goal. Two of these measures could be:

- Provide a discount on your best-selling items,
- Implement a rewards program that incentivices repeat purchases.

Both of these measures are fairly complex with each incurring a certain, probably known, amount of cost and an unknown effect on your customers' spending behaviour.

To decide which of these two possible measures is worth both the effort and incurred cost you need to estimate their effect on customer spend.

A natural way of estimating this effect is computing the following:

$\textrm{avg}(\textrm{spend} | \textrm{treatment} = 1) - \textrm{avg}(\textrm{spend} | \textrm{treatment} = 0) = \textrm{ATE}$.

Essentially you would compute the average spend of users who received the treatment (received a discount or signed up for rewards) and subtract from that the average spend of users who didn't receive the treatment.

Without discussing the details of the underlying potential outcomes framework, the above expression is called the average treatment effect (ATE).

So now we'll just analyze our e-commerce data of treated and untreated customers and compute the average treatment effect (ATE) for each proposed measure, right? Right?

Before you rush ahead with your ATE computations - now is a good time to take a step back and contemplate how your data was generated in the first place(data-generating process).

Before we continue: My example here is based on a tutorial by the authors of the excellent DoWhy library. You can find the original tutorial here:

And more on DoWhy here: https://microsoft.github.io/dowhy/

In [1]:

```
!pip install dowhy --quiet
```

In [2]:

```
import random
import pandas as pd
import numpy as np
np.random.seed(42)
random.seed(42)
```

So where were we ... ah right! Where does our e-commerece data come from?

Since we don't actually run an e-commerce operation here we will have to simulate our data (remember: these ideas are based on the above DoWhy tutorial).

Imagine we observe the monthly spend of each of our 10,000 users over the course of a year. Each user will spend with a certain distribution (here, a Poisson distribution) and there are both high and low spenders with different mean spends.

Over the course of the year, each user can sign up to our rewards program in any month and once they have signed up their spend goes up by 50% relative to what they would've spent without signing up.

So far so mundane: Different customers show different spending behaviour and signing up to our rewards program increases their spend.

Now the big question is: How are treatment assignment (rewards program signup) and outcome (spending behaviour) related?

If treatment and outcome, interpreted as random variables, are independent of one another then according to the potential outcome framework we can compute the ATE as easily as shown above:

$\textrm{ATE} = \textrm{avg}(\textrm{spend} | \textrm{treatment} = 1) - \textrm{avg}(\textrm{spend} | \textrm{treatment} = 0)$

When are treatment and outcome independent? The gold standard for achieving their independence in a data set is the randomized controlled trial (RCT).

In our scenario what an RCT would look like is randomly signing up our users to our rewards program - indepndent of their spending behaviour or any other characteristic.

So we would go through our list of 10,000 users and flip a coin for each of them, sign them up to our program in a random month of the year based on our coin, and send them on their merry way to continue buying stuff in our online shop.

Let's put all of this into a bit of code that simulates the spending behaviour of our users according to our thought experiment:

In [3]:

```
# Creating some simulated data for our example
num_users = 10000
num_months = 12
df = pd.DataFrame({
'user_id': np.repeat(np.arange(num_users), num_months),
'month': np.tile(np.arange(1, num_months+1), num_users), # months are from 1 to 12
'high_spender': np.repeat(np.random.randint(0, 2, size=num_users), num_months),
})
df['spend'] = None
df.loc[df['high_spender'] == 0, 'spend'] = np.random.poisson(250, df.loc[df['high_spender'] == 0].shape[0])
df.loc[df['high_spender'] == 1, 'spend'] = np.random.poisson(750, df.loc[df['high_spender'] == 1].shape[0])
df["spend"] = df["spend"] - df["month"] * 10
signup_months = np.random.choice(
np.arange(1, num_months),
num_users
) * np.random.randint(0, 2, size=num_users) # signup_months == 0 means customer did not sign up
df['signup_month'] = np.repeat(signup_months, num_months)
# A customer is in the treatment group if and only if they signed up
df["treatment"] = df["signup_month"] > 0
# Simulating a simple treatment effect of 50%
after_signup = (df["signup_month"] < df["month"]) & (df["treatment"])
df.loc[after_signup, "spend"] = df[after_signup]["spend"] * 1.5
```

Let's look at user `0`

and their treatment assignment as well as spend (since we're sampling random variables here you'll see something different from me):

In [4]:

```
df.loc[df['user_id'] == 0]
```

Out[4]:

The effect we're interested in is the impact of rewards signup on spending behaviour - i.e. the effect on post-signup spend.

Since customers can sign up any month of the year, we'll choose one month at random and compute the effect with respect to that one month.

So let's create a new table from our time series where we collect post-signup spend for those customers that signed up in `month = 6`

alongside the spend of customers who never signed up.

In [5]:

```
month = 6
post_signup_spend = (
df[df.signup_month.isin([0, month])]
.groupby(["user_id", "signup_month", "treatment"])
.apply(
lambda x: pd.Series(
{
"post_spend": x.loc[x.month > month, "spend"].mean(),
}
)
)
.reset_index()
)
print(post_signup_spend)
```

To get the average treatment effect (ATE) of our rewards signup treatment we now compute the average post-signup spend of the customers who signed up and subtract from that the average spend of users who didn't sign up:

In [6]:

```
post_spend = post_signup_spend\
.groupby('treatment')\
.agg({'post_spend': 'mean'})
post_spend
```

Out[6]:

So the ATE of rewards signup on post-signup spend is:

In [7]:

```
post_spend.loc[True, 'post_spend'] - post_spend.loc[False, 'post_spend']
```

Out[7]:

Since we simulated the treatment effect ourselves (50% post-signup spend increase) let's see if we can recover this effect from our data:

In [8]:

```
post_spend.loc[True, 'post_spend'] / post_spend.loc[False, 'post_spend']
```

Out[8]:

The post-signup spend for treated customers is roughly 50% greater than the spend for untreated customers - exactly the treatment effect we simulated!

Remember, however, that we are dealing with clean experimental data from a randomized controlled trial (RCT) here! The potential outcome framework tells us that for data from an RCT the simple ATE formula we used here yields the correct treatment effect due to independence of treatment assignment and outcome.

So the fact that we recovered the actual (simulated) treatment effect is nice to see but not surprising.

Our above thought experiment where we randomly assigned our customers to our rewards program isn't very realistic.

Randomly signing up paying customers to rewards programs without their consent may upset some and may not even be permissible. The same issue with randomized treatment assignment pops up everywhere - clean randomized controlled trials are oftentimes too expensive, infeasible to implement, unethical, or not permitted.

But since we still need to experiment with our shop to drive spending behaviour we'll still go ahead and implement our rewards program. Only that this time we'll place a regular signup page in our shop where our customers can decide for themselves if they want to sign up or not.

Activating our signup page and simply observing how users and their spend behaves gives us **observational data**.

We usually call "observational data" just "data" without giving much thought to where they came from. I mean we've all dealt with lots of different kinds of data (marketing data, R&D measurements, HR data, etc.) and all these data were simply "observed" and didn't come out of a carefully set up experiment.

Simulating our observational data we've got the same 10,000 customers over a span of a year. We still have the same high and low spenders.

Only that now our high spenders are far more likely to sign up to our rewards program than our low spenders. My reasoning for this is that customers who spend more are also more likely to show greater brand loyalty towards us and our rewards program. Further, they visit our shop more frequently hence are more likely to notice our new rewards program and the signup page. We could also add this behaviour as random variables to our simulation below but just take a shortcut and give low spenders a 5% chance of signing up and high spenders a 95% chance.

In [9]:

```
num_users = 10000
num_months = 12
df = pd.DataFrame({
'user_id': np.repeat(np.arange(num_users), num_months),
'month': np.tile(np.arange(1, num_months+1), num_users), # months are from 1 to 12
'high_spender': np.repeat(np.random.randint(0, 2, size=num_users), num_months),
})
df['spend'] = None
df.loc[df['high_spender'] == 0, 'spend'] = np.random.poisson(250, df.loc[df['high_spender'] == 0].shape[0])
df.loc[df['high_spender'] == 1, 'spend'] = np.random.poisson(750, df.loc[df['high_spender'] == 1].shape[0])
signup_months = df[['user_id', 'high_spender']].drop_duplicates().copy()
signup_months['signup_month'] = None
signup_months.loc[signup_months['high_spender'] == 0, 'signup_month'] = np.random.choice(
np.arange(1, num_months),
(signup_months['high_spender'] == 0).sum()
) * np.random.binomial(1, .05, size=(signup_months['high_spender'] == 0).sum())
signup_months.loc[signup_months['high_spender'] == 1, 'signup_month'] = np.random.choice(
np.arange(1, num_months),
(signup_months['high_spender'] == 1).sum()
) * np.random.binomial(1, .95, size=(signup_months['high_spender'] == 1).sum())
df = df.merge(signup_months)
df["treatment"] = df["signup_month"] > 0
after_signup = (df["signup_month"] < df["month"]) & (df["treatment"])
df.loc[after_signup, "spend"] = df[after_signup]["spend"] * 1.5
df
```

Out[9]:

Now imagine you weren't aware of causality, confounders, high / low spenders, and all that. You simply published your rewards signup page and observed your customers' spending behaviour over a span of a year. Chances are you'll compute the average treatment effect the exact same way we did above for our randomized controlled trial:

In [10]:

```
month = 6
post_signup_spend = (
df[df.signup_month.isin([0, month])]
.groupby(["user_id", "signup_month", "treatment"])
.apply(
lambda x: pd.Series(
{
"post_spend": x.loc[x.month > month, "spend"].mean(),
"pre_spend": x.loc[x.month < month, "spend"].mean(),
}
)
)
.reset_index()
)
print(post_signup_spend)
```

In [11]:

```
post_spend = post_signup_spend\
.groupby('treatment')\
.agg({'post_spend': 'mean'})
post_spend
```

Out[11]:

In [12]:

```
post_spend.loc[True, 'post_spend'] - post_spend.loc[False, 'post_spend']
```

Out[12]:

In [13]:

```
post_spend.loc[True, 'post_spend'] / post_spend.loc[False, 'post_spend']
```

Out[13]:

Performing the exact same computation as above, now we're estimating an average treatment effect of almost 400% instead of the actual 50%!

So what went wrong here?

Observational data got us!

Realize that in our observational data the outcome (spend) is not indepndent of treatment assignmnet (rewards program signup): High spenders are far more likely to sign up hence are overrepresented in our treatment group while low spenders are overrepresented in our control group (users that didn't sign up).

So when we compute the above difference or ratio we don't just see the average treatment effect of rewards signup we also see the inherent difference in spending between high and low spenders.

So if we ignore how our observational data are generated we'll overestimate the effect our rewards program has and likely make decisions that seem to be supported by data but in reality aren't.

Also notice that we often make this same mistake when training machine learning algorithms on observational data. Chances are someone will ask you to train a regression model to predict the effectiveness of the rewards program and your model will end up with the same inflated estimate as above.

So how do we fix this? And how can we estimate the true treatment effect from our observational data?

Generally, we know from experience in e-commerece that people who tend to spend more are more likely to sign up to our rewards program. So we could segment our users into spend buckets and compute the treatment effect within each bucket to try and breeak this confounding link in our observational data.

Notice that in practice we won't have a `high spender`

flag for our customers so we'll have to go by our customers' observed spending behaviour.

The causal inference framework offers an established approach here: Relying on our domain knowledge, we define a causal model that describes how we believe our observational data were generated.

Let's draw this as a graph with nodes and edges:

In [14]:

```
import os, sys
sys.path.append(os.path.abspath("../../../"))
import dowhy
causal_graph = """digraph {
treatment[label="Program Signup in month i"];
pre_spend;
post_spend;
U[label="Unobserved Confounders"];
pre_spend -> treatment;
pre_spend -> post_spend;
treatment->post_spend;
U->treatment; U->pre_spend; U->post_spend;
}"""
model = dowhy.CausalModel(
data=post_signup_spend,
graph=causal_graph.replace("\n", " "),
treatment="treatment",
outcome="post_spend"
)
model.view_model()
```

Our causal model states what we described above: Pre-signup spend influences both rewards signup (treatment assignment) and post-signup spend. This is the story about our high and low spenders.

Treatment (rewards signup) influences post-signup spending behaviour - this is the effect we're actually interested in.

We also added a node `U`

to signify possible other confounding factors that may exist in reality but weren't observed as part of our data.

We will now apply do-calculus to our causal model from above to figure out ways to cleanly estimate the treatment effect we're after:

In [15]:

```
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
```

Very broadly and sloppily stated there a three ways to segment (or slice and dice) our observational data to get to subsets of our data within which we can cleanly compute the average treatment effect:

- Backdoor adjustment,
- Frontdoor adjustment, and
- Instrumental variables.

The above printout tells us that based on both the causal model we constructed and our observational data there is a backdoor-adjusted estimator for our desired treatment effect.

This backdoor adjustment actually follows closely what we already said above: We'll compute the post-spend given pre-spend (segment our customers based on their spending behaviour).

There are various ways to perform the backdoor adjustment that DoWhy identified for us above. One of them is called propensity score matching:

In [16]:

```
estimate = model.estimate_effect(
identified_estimand,
method_name="backdoor1.propensity_score_matching",
target_units="ate"
)
print(estimate)
```

DoWhy provides us with an estimated ATE for our observational data that is pretty close to the ATE we computed for our experimental data from our randomized controlled trial.

Even if the ATE estimate DoWhy provides doesn't match exactly our experimental ATE we're now in a much better position to take a decision regarding our rewards program based on our observational data.

So next time we want to base decisions on observational data it'll be worthwhile defining a causal model of the underlying data-generating process and using a library such as DoWhy that helps us identify and apply adjustment strategies.