Get a general overview of the challenges of giving a causal interpretation to regressions of observational data
Get an overview of several techniques that can be used to try to give a causal interpretation to regression models of observational data
The field of causal inference is large and growing. This lab aims to give a small taste of the field and the problems that it is trying to address.
At the root of all estimation of causal inference is the issue sometimes known as the fundamental problem of causal inference. This is the idea that to truly be able to estimate a causal effect, you need to have information on an inherently unobservable counterfactual. So if we want to estimate the effect of education on future earnings (a popular topic in labor economics), we would ideally want to compare a situation where the same individual both receives an education and does not receive an education. This, of course, is impossible.
But when we instead try to substitute a comparison of people who have an education with those who don't, all sorts of problems pop up. People who take higher education are likely to be different, in both observable and unobservable ways, from people who don't. Some economists have gone so far as to say that a university education is really only an elaborate signalling mechanism: going to a prestigious university signals to future employers that you are a smart, hard-working person. This, it is argued, is what really drives the income difference between university graduates and those without a degree in the labor market.
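A tiny simulation (illustrative only, not the lab data) makes the selection problem concrete: when unobserved ability drives both the education choice and earnings, the naive comparison of group means badly overstates the true effect of education:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Unobserved ability drives both the education choice and earnings
ability = rng.normal(0, 1, n)
educ = (ability + rng.normal(0, 1, n) > 0).astype(int)  # selection on ability

true_effect = 2.0
earnings = 5 + true_effect * educ + 3 * ability + rng.normal(0, 1, n)

# Naive comparison of group means mixes the true effect with selection bias
naive = earnings[educ == 1].mean() - earnings[educ == 0].mean()
print(f"true effect: {true_effect}, naive comparison: {naive:.2f}")
```

The naive difference comes out around 5.4 here, far above the true effect of 2, because the educated group is also the high-ability group.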
The gold standard of causal modelling is the randomized controlled trial, where subjects are randomly assigned to either a treatment or a control group. Such randomized experiments have become more common in economics and business (see for example the Nobel Prize-winning work of Esther Duflo and Abhijit Banerjee, who use randomized field experiments to gain insights into development economics. The authors have written a popular-science book on their work: Poor Economics).
Ch. 18 and 19 in ROS discuss techniques for analysing data that comes from randomized controlled trials. If you are interested, you can skim through these chapters. But for the most part we will focus on techniques for trying to establish causal relationships in observational data. Chapters 20 and 21 in ROS cover several such techniques. We will not try to cover all of these in this lab. Rather, we will go through one longer example that demonstrates some principal ideas. You are encouraged to skim through chapters 20 and 21 and make use of one or several of the techniques and ideas in your course project.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as spt
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
plt.rcParams['xtick.labelsize'] = 14
plt.rcParams['ytick.labelsize'] = 14
plt.rcParams['axes.facecolor'] ="white"
plt.rcParams['grid.color'] ="grey"
plt.rcParams['grid.linestyle'] = "dotted"
plt.rcParams["axes.labelsize"]= 14
#plt.rcParams['figure.savefig.dpi'] = 100
plt.rcParams['savefig.edgecolor'] = "#f2f2f2"
plt.rcParams['savefig.facecolor'] ="white"
plt.rcParams["figure.figsize"] = [15,8]
plt.rcParams['savefig.bbox'] = "tight"
plt.rcParams['font.size'] = 14
greens = ['#66c2a4','#41ae76','#238b45','#006d2c','#00441b']
multi =['#66c2a4','#1f78b4','#a6cee3','#b2df8a','#33a02c','#fb9a99','#e31a1c','#fdbf6f']
Auditors have important informational and quality-control roles in modern economies. They ensure that the information companies release through their annual financial accounts is reliable and accurate. This in turn gives confidence to third parties who have a stake in a given company: banks and other lenders, suppliers, investors and even consumers.
But auditing also comes with substantial costs - especially for small firms. Up until 2011, all registered Norwegian firms - no matter their size - needed to obtain an audit of their yearly financial accounts. But from 2011 a new policy was put in place where firms with operating income under 5 million kroner and total assets under 20 million kroner no longer needed to obtain an audit.
In this lab we want to try to answer the question of what the effect of auditing is on the dividend pay-out of small private firms. The dividend is the main way that owners of firms extract capital from their firms. We might hypothesize that without the disciplining effect of an auditor, firms might be tempted to extract more from their firms.
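For reference, the eligibility rule itself is straightforward to encode. A minimal sketch on toy data, assuming raw-kroner columns named like the lab's driftsinntekter (operating income) and Eiendeler (total assets):

```python
import pandas as pd

# Toy data in kroner; column names follow the lab's data set (assumed raw scale)
firms = pd.DataFrame({
    "driftsinntekter": [3_000_000, 6_500_000, 4_800_000],
    "Eiendeler": [10_000_000, 8_000_000, 25_000_000],
})

# Eligible to forgo an audit: operating income < 5M AND total assets < 20M
firms["eligible"] = (
    (firms.driftsinntekter < 5_000_000) & (firms.Eiendeler < 20_000_000)
).astype(int)
print(firms.eligible.tolist())  # → [1, 0, 0]
```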
We start by loading in our data set:
aq_long = pd.read_csv("https://jmaurit.github.io/anv_statistikk/data/aq_long.csv")
This data has already been partially filtered and cleaned. Very small firms and specialty firms that are under special auditing rules - like law firms - have been removed.
We do some extra cleaning of our data - revising our year variable to just show the year:
aq_long.year
0        2005-01-01
1        2006-01-01
2        2007-01-01
3        2008-01-01
4        2009-01-01
            ...
31521    2011-01-01
31522    2012-01-01
31523    2013-01-01
31524    2014-01-01
31525    2015-01-01
Name: year, Length: 31526, dtype: object
aq_long.loc[:,"year"] = pd.to_datetime(aq_long.year)
aq_long["year"] = aq_long.year.dt.year
We already have an indicator variable for whether a firm received an auditor or not: no_auditor, which is equal to 1 if a firm did not obtain an audit. Thus the treatment here is the absence of an audit.
We also have a variable called vol_auditor, which indicates that a firm (after 2011) was below the thresholds for being able to forgo an auditor (an audit becomes voluntary).
Below we will create both a label variable for whether a firm got audited and an indicator variable choseAudit, which indicates that a firm did not need to obtain an audit but chose to obtain one anyway:
aq_long["no_auditor_lab"] = "Auditor"
aq_long.loc[aq_long.no_auditor==1, "no_auditor_lab"]="No auditor"
aq_long["choseAudit"] = aq_long.vol_auditor - aq_long.no_auditor
aq_long.head()
  | index | cid | nkode_15 | year | driftsinntekter | AM | EK | Varelager | OM | KF | ... | antall_ansatte | leverage | endring_Eiendeler | risk_roa | cash_flow | cash_flow_mu | cash_flow_sd | Arb.sats | no_auditor_lab | choseAudit
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 14157 | 810130822 | 46.739 | 2005 | 5859000 | 541000 | 334000 | 1440000 | 1946000 | 466000 | ... | 5 | 1.307197 | NaN | 0.039061 | 72000 | 241818.181818 | 144472.708967 | 0.106 | Auditor | 0 |
1 | 14158 | 810130822 | 46.739 | 2006 | 6254000 | 527000 | 469000 | 1355000 | 1924000 | 499000 | ... | 5 | 1.337005 | -36000.0 | 0.039061 | 175000 | 241818.181818 | 144472.708967 | 0.106 | Auditor | 0 |
2 | 14159 | 810130822 | 46.739 | 2007 | 6876000 | 2830000 | 697000 | 1362000 | 1785000 | 328000 | ... | 5 | 1.257638 | 2164000.0 | 0.039061 | 360000 | 241818.181818 | 144472.708967 | 0.106 | Auditor | 0 |
3 | 14160 | 810130822 | 46.739 | 2008 | 6461000 | 2933000 | 705000 | 1158000 | 1320000 | 117000 | ... | 5 | 0.691747 | -362000.0 | 0.039061 | 168000 | 241818.181818 | 144472.708967 | 0.106 | Auditor | 0 |
4 | 14161 | 810130822 | 46.739 | 2009 | 6608000 | 2911000 | 617000 | 953000 | 1077000 | 84000 | ... | 5 | 0.668004 | -265000.0 | 0.039061 | 108000 | 241818.181818 | 144472.708967 | 0.106 | Auditor | 0 |
5 rows × 45 columns
Now we will create a graphic showing the incidence of firms that have forgone an audit after 2010:
year_lab = ["2010", "2011", "2012", "2013", "2014", "2015"]
aq_long["Eiendeler_lagM"] = aq_long.Eiendeler_lag/1000000
aq_long["driftsinntekter_lagM"] = aq_long.driftsinntekter_lag/1000000
aud_scatter = sns.FacetGrid(aq_long[(aq_long.year > 2009)],col="year", hue="no_auditor_lab" , col_wrap=2, legend_out=True)
aud_scatter = (aud_scatter.map(plt.scatter, "Eiendeler_lagM", "driftsinntekter_lagM", alpha=.05)
               .set_axis_labels("", ""))
#aud_scatter.set_titles = "year_lab"
aud_scatter.fig.set_size_inches([15,9])
aud_scatter.add_legend()
aud_scatter._legend.set_title("")
for ax in aud_scatter.axes:
    ax.axvline(20, color="red")
    ax.axhline(5, color="red")
for i, ax in enumerate(aud_scatter.axes):
    ax.set_title(year_lab[i])
aud_scatter.fig.text(0.5, 0, 'Total assets (t-1), million NOK', ha='center')
aud_scatter.fig.text(0.08, 0.5, 'Operating income (t-1), million NOK', va='center', rotation='vertical')
Text(0.08, 0.5, 'Operating income (t-1), million NOK')
The red lines represent the threshold values.
One thing we can see from the data is that operating income seems to be the constraining threshold for being eligible to forgo an audit. That is, there are very few firms that are under the operating income threshold but over the total assets threshold. We will therefore focus on this threshold in our analysis.
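A quick way to check which threshold binds is to cross-tabulate indicator variables for being under each threshold; on the real data, the "under the income threshold but over the assets threshold" cell would be nearly empty. A sketch on toy numbers:

```python
import pandas as pd

# Classify firms by whether they are under each threshold, then cross-tabulate
# (column names as in the lab's data set, raw kroner scale assumed)
firms = pd.DataFrame({
    "driftsinntekter": [3e6, 4e6, 6e6, 7e6, 4.5e6],
    "Eiendeler": [10e6, 25e6, 15e6, 30e6, 12e6],
})
under_income = firms.driftsinntekter < 5e6
under_assets = firms.Eiendeler < 20e6
print(pd.crosstab(under_income, under_assets))
```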
We'll follow the general advice from ROS, and transform our variables in our analysis to be standardized - we subtract the mean and divide by the standard deviation.
Here we'll create a function to do the standardization:
#Rescaling
def rescale_series(ser):
    ser_rs = (ser - ser.mean())/ser.std()
    return ser_rs
We indicate which variables we wish to transform:
#Transform all continuous variables
transformed_vars = ['AM', 'EK',
'Varelager', 'OM', 'KF', 'LG', 'KG', 'Varekost', 'Netincome',
'Avskrivninger', 'ordres', 'rentekostnad', 'Driftsresultat',
'Varigedm', 'age', 'cash', 'Finam', 'konsernbidrag', 'utbytte',
'rentekost_foretak', 'annen_finanskost','Eiendeler', 'ROA', 'driftsinntekter_lag',
'Eiendeler_lag', 'antall_ansatte',
'leverage', 'endring_Eiendeler', 'risk_roa', 'cash_flow', "cash_flow_mu", "cash_flow_sd"]
We'll limit our data to 2011 to 2015, after the implementation of the policy:
#limit to 2011 to 2015
aq_rs = aq_long.loc[aq_long.year>2010,:].copy()
We will save some values for standard deviations and means for use later
#standard deviations and means to save
operIncome_std = aq_rs.driftsinntekter.std() #operating income std
dividend_std = aq_rs.utbytte.std() #dividend std
dividend_mu = aq_rs.utbytte.mean() #dividend mean
We will rescale our continuous variables
#rescale variables
aq_rs.loc[:, transformed_vars]=aq_rs.loc[:,transformed_vars].apply(rescale_series)
We will make one exception to our rescaling. For our operating income variable (driftsinntekter), we will subtract 5 million - the threshold value for being able to forgo an auditor - before dividing by the standard deviation. Thus a company with a rescaled value of 0 will be right at the threshold.
#rescale driftsinntekter:
aq_rs.loc[:,"driftsinntekter"] = (aq_rs.driftsinntekter - 5000000)/operIncome_std
We can look at a histogram of the distribution of the rescaled operating income variable
aq_rs.driftsinntekter.hist(bins=50)
<AxesSubplot:>
Now let's run a simple, naive regression model, where we regress the dividend on the no_auditor indicator variable and a set of year indicator variables - sometimes called year fixed effects.
rd_mod1 = smf.ols("utbytte~no_auditor + C(year)",
data=aq_rs).fit()
rd_mod1.summary()
Dep. Variable: | utbytte | R-squared: | 0.009 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.009 |
Method: | Least Squares | F-statistic: | 25.68 |
Date: | Tue, 01 Mar 2022 | Prob (F-statistic): | 6.76e-26 |
Time: | 08:32:25 | Log-Likelihood: | -20269. |
No. Observations: | 14330 | AIC: | 4.055e+04 |
Df Residuals: | 14324 | BIC: | 4.060e+04 |
Df Model: | 5 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -0.0199 | 0.019 | -1.071 | 0.284 | -0.056 | 0.017 |
C(year)[T.2012] | 0.0587 | 0.026 | 2.218 | 0.027 | 0.007 | 0.110 |
C(year)[T.2013] | 0.0742 | 0.027 | 2.799 | 0.005 | 0.022 | 0.126 |
C(year)[T.2014] | 0.0740 | 0.027 | 2.787 | 0.005 | 0.022 | 0.126 |
C(year)[T.2015] | 0.0814 | 0.027 | 3.063 | 0.002 | 0.029 | 0.134 |
no_auditor | -0.2745 | 0.025 | -11.204 | 0.000 | -0.323 | -0.226 |
Omnibus: | 17008.853 | Durbin-Watson: | 1.061 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 3192772.769 |
Skew: | 6.160 | Prob(JB): | 0.00 |
Kurtosis: | 75.080 | Cond. No. | 5.96 |
The coefficient on the no_auditor variable is estimated to be negative. We could interpret it to mean that not having an auditor reduces the dividend by .27 standard deviations, which is estimated rather precisely, with a confidence interval roughly between -.32 and -.23.
But it would be a mistake to interpret this causally. First and foremost, we likely have a lurking or omitted variable bias in the estimation (ROS p. 385). That is, the coefficient on no_auditor is picking up the effect of another variable.
In this case, we might expect that this is the operating income variable, which is both the forcing variable that decides eligibility for forgoing an audit and a variable correlated with the dividend - a higher operating income can be expected to go together with a higher dividend.
So now let's run a second regression where we control for operating income, as well as other confounding variables (ROS p. 386) that may be correlated with both the indicator variable no_auditor and the left-hand-side variable, dividend ("utbytte"). These include the number of employees ("antall_ansatte") and total assets ("Eiendeler").
We might also be concerned that the choice to forgo auditing is correlated with a firm's riskiness, and that this might be skewing the results. So we include a measure of firm risk (based on return on assets) as well as a measure of firm leverage:
regForm = "utbytte~no_auditor+C(year)+driftsinntekter+driftsinntekter**2+antall_ansatte+Eiendeler+risk_roa+leverage"
rs_mod2 = smf.ols(regForm, data=aq_rs).fit()
rs_mod2.summary()
Dep. Variable: | utbytte | R-squared: | 0.079 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.078 |
Method: | Least Squares | F-statistic: | 122.8 |
Date: | Tue, 01 Mar 2022 | Prob (F-statistic): | 6.83e-247 |
Time: | 09:00:52 | Log-Likelihood: | -19743. |
No. Observations: | 14330 | AIC: | 3.951e+04 |
Df Residuals: | 14319 | BIC: | 3.959e+04 |
Df Model: | 10 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -0.1317 | 0.019 | -7.040 | 0.000 | -0.168 | -0.095 |
C(year)[T.2012] | 0.0135 | 0.026 | 0.529 | 0.597 | -0.037 | 0.064 |
C(year)[T.2013] | 0.0191 | 0.026 | 0.745 | 0.456 | -0.031 | 0.069 |
C(year)[T.2014] | 0.0022 | 0.026 | 0.085 | 0.933 | -0.048 | 0.053 |
C(year)[T.2015] | -0.0028 | 0.026 | -0.109 | 0.913 | -0.053 | 0.048 |
no_auditor | 0.0722 | 0.027 | 2.720 | 0.007 | 0.020 | 0.124 |
driftsinntekter | 0.2260 | 0.011 | 21.469 | 0.000 | 0.205 | 0.247 |
antall_ansatte | -0.0882 | 0.009 | -10.075 | 0.000 | -0.105 | -0.071 |
Eiendeler | 0.1287 | 0.010 | 13.305 | 0.000 | 0.110 | 0.148 |
risk_roa | -0.0232 | 0.009 | -2.670 | 0.008 | -0.040 | -0.006 |
leverage | 0.0792 | 0.009 | 8.739 | 0.000 | 0.061 | 0.097 |
Omnibus: | 16751.549 | Durbin-Watson: | 1.094 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 3323873.499 |
Skew: | 5.959 | Prob(JB): | 0.00 |
Kurtosis: | 76.653 | Cond. No. | 7.97 |
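A side note on the quadratic term: in patsy-style (Wilkinson) formulas, ** operates on sets of terms rather than on numeric values, so driftsinntekter**2 reduces to driftsinntekter itself - which is why no squared coefficient appears in the summary above. To include a genuine quadratic term, it must be wrapped in I(). A small demonstration on synthetic data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.normal(size=200)})
df["y"] = 1 + 2 * df.x - 0.5 * df.x**2 + rng.normal(size=200)

m_star = smf.ols("y ~ x + x**2", data=df).fit()     # ** acts on terms: no square
m_quad = smf.ols("y ~ x + I(x**2)", data=df).fit()  # I() evaluates the square
print(len(m_star.params), len(m_quad.params))  # → 2 3
```

The first model ends up with only an intercept and x; the second adds the true quadratic column.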
Now that we have included measures of firm size and riskiness, the coefficient on no_auditor has flipped to be positive, with a point estimate of .0722 and a rough 95% confidence interval of between .02 and .12.
One variable we have not included, but which we might expect to be correlated with both the no_auditor indicator and the dividend, is cash flow. But, following ROS (p. 374-379), we should be careful about including post-treatment and mediating variables.
In the context of our estimation problem, a firm's cash flow for a certain year is jointly determined with the dividend. That is, if we imagine that auditing can have an effect on the dividend, it can be imagined to also have an effect on cash flow, and the decision on the dividend can in turn affect the reported cash flow. Controlling for cash flow would then in effect control away some of the variation we are interested in estimating.
Instead, in our data, we have created more aggregated measures of firm cash flow based on all years of data: a mean cash flow across years (cash_flow_mu), and a standard deviation of cash flow across years (cash_flow_sd), which gives an indication of the volatility of a firm's cash flow.
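Such per-firm aggregates can be built with a groupby-transform pattern; a minimal sketch on toy data, with the firm identifier column assumed to be cid as in our data set:

```python
import pandas as pd

# Toy panel: two firms (cid), three years of cash flow each
df = pd.DataFrame({
    "cid": [1, 1, 1, 2, 2, 2],
    "cash_flow": [100.0, 120.0, 80.0, 10.0, 30.0, 20.0],
})
# transform() broadcasts the per-firm statistic back to every row of that firm
df["cash_flow_mu"] = df.groupby("cid")["cash_flow"].transform("mean")
df["cash_flow_sd"] = df.groupby("cid")["cash_flow"].transform("std")
print(df.cash_flow_mu.tolist())  # → [100.0, 100.0, 100.0, 20.0, 20.0, 20.0]
```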
regForm = "utbytte~no_auditor+C(year)+driftsinntekter+driftsinntekter**2+ antall_ansatte+Eiendeler+risk_roa+leverage + cash_flow_mu + cash_flow_sd"
rs_mod3 = smf.ols(regForm, data=aq_rs).fit()
rs_mod3.summary()
Dep. Variable: | utbytte | R-squared: | 0.312 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.311 |
Method: | Least Squares | F-statistic: | 541.1 |
Date: | Tue, 01 Mar 2022 | Prob (F-statistic): | 0.00 |
Time: | 09:14:39 | Log-Likelihood: | -17653. |
No. Observations: | 14330 | AIC: | 3.533e+04 |
Df Residuals: | 14317 | BIC: | 3.543e+04 |
Df Model: | 12 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -0.0578 | 0.016 | -3.569 | 0.000 | -0.090 | -0.026 |
C(year)[T.2012] | 0.0236 | 0.022 | 1.068 | 0.286 | -0.020 | 0.067 |
C(year)[T.2013] | 0.0311 | 0.022 | 1.406 | 0.160 | -0.012 | 0.075 |
C(year)[T.2014] | 0.0248 | 0.022 | 1.117 | 0.264 | -0.019 | 0.068 |
C(year)[T.2015] | 0.0270 | 0.022 | 1.212 | 0.225 | -0.017 | 0.071 |
no_auditor | 0.0517 | 0.023 | 2.252 | 0.024 | 0.007 | 0.097 |
driftsinntekter | 0.0576 | 0.010 | 6.055 | 0.000 | 0.039 | 0.076 |
antall_ansatte | 0.0172 | 0.008 | 2.228 | 0.026 | 0.002 | 0.032 |
Eiendeler | 0.0017 | 0.009 | 0.183 | 0.855 | -0.016 | 0.019 |
risk_roa | -0.0461 | 0.009 | -5.278 | 0.000 | -0.063 | -0.029 |
leverage | 0.1023 | 0.008 | 12.966 | 0.000 | 0.087 | 0.118 |
cash_flow_mu | 0.5182 | 0.008 | 67.821 | 0.000 | 0.503 | 0.533 |
cash_flow_sd | 0.0606 | 0.009 | 6.727 | 0.000 | 0.043 | 0.078 |
Omnibus: | 16501.229 | Durbin-Watson: | 1.357 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 5041003.611 |
Skew: | 5.612 | Prob(JB): | 0.00 |
Kurtosis: | 94.196 | Cond. No. | 8.46 |
We should probably still be in doubt about whether we can consider our no_auditor indicator ignorable (ROS p. 388) - that is, whether the sorting into audited and non-audited groups is close to random, conditional on the control variables in the regression.
One way we could potentially improve our estimation is by excluding the firms that could have forgone an auditor but still chose to obtain one. Excluding these firms is a form of sub-classification (ROS p. 397), where we now compare the group under the threshold that has forgone an auditor with the group above it that was not eligible.
We could divide the groups as in the figure below. Ideally, we would like to compare those firms below the threshold who would forgo an auditor if possible (and thus do) with firms above the threshold who cannot forgo an auditor but would if they could. Unfortunately, we only observe which firms would forgo an auditor below the threshold:
rs_mod4 = smf.ols("utbytte~no_auditor + C(year) + driftsinntekter + driftsinntekter**2 + antall_ansatte + Eiendeler + leverage + risk_roa + cash_flow_mu + cash_flow_sd",
data=aq_rs.loc[aq_rs.choseAudit==0, ]).fit()
rs_mod4.summary()
Dep. Variable: | utbytte | R-squared: | 0.327 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.326 |
Method: | Least Squares | F-statistic: | 449.5 |
Date: | Tue, 01 Mar 2022 | Prob (F-statistic): | 0.00 |
Time: | 09:50:22 | Log-Likelihood: | -14476. |
No. Observations: | 11110 | AIC: | 2.898e+04 |
Df Residuals: | 11097 | BIC: | 2.907e+04 |
Df Model: | 12 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -0.1456 | 0.024 | -6.074 | 0.000 | -0.193 | -0.099 |
C(year)[T.2012] | 0.0378 | 0.028 | 1.346 | 0.178 | -0.017 | 0.093 |
C(year)[T.2013] | 0.0544 | 0.028 | 1.950 | 0.051 | -0.000 | 0.109 |
C(year)[T.2014] | 0.0509 | 0.028 | 1.821 | 0.069 | -0.004 | 0.106 |
C(year)[T.2015] | 0.0519 | 0.028 | 1.860 | 0.063 | -0.003 | 0.107 |
no_auditor | 0.1537 | 0.030 | 5.077 | 0.000 | 0.094 | 0.213 |
driftsinntekter | 0.0986 | 0.013 | 7.348 | 0.000 | 0.072 | 0.125 |
antall_ansatte | 0.0225 | 0.009 | 2.579 | 0.010 | 0.005 | 0.040 |
Eiendeler | -0.0145 | 0.011 | -1.330 | 0.183 | -0.036 | 0.007 |
leverage | 0.1403 | 0.011 | 13.257 | 0.000 | 0.120 | 0.161 |
risk_roa | -0.0830 | 0.013 | -6.207 | 0.000 | -0.109 | -0.057 |
cash_flow_mu | 0.5371 | 0.009 | 62.619 | 0.000 | 0.520 | 0.554 |
cash_flow_sd | 0.0792 | 0.011 | 7.236 | 0.000 | 0.058 | 0.101 |
Omnibus: | 12249.409 | Durbin-Watson: | 1.348 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 3188696.954 |
Skew: | 5.217 | Prob(JB): | 0.00 |
Kurtosis: | 85.337 | Cond. No. | 9.90 |
Once we exclude the firms that voluntarily obtained an audit, the coefficient on no_auditor increases to .15, with a rough confidence interval of .09 to .21.
A last step in terms of trying to improve the comparisons in the model is to add indicator variables for the industry, as represented by the nkode variable - so-called industry fixed effects. In effect, we can then interpret the no_auditor variable as a comparison between firms in the same industry that do and do not obtain an audit.
rs_mod5 = smf.ols("utbytte~ C(nkode_agg) + no_auditor + C(year) + driftsinntekter + driftsinntekter**2 + antall_ansatte + Eiendeler + risk_roa + leverage + cash_flow_mu + cash_flow_sd",
data=aq_rs.loc[aq_rs.choseAudit==0,:]).fit()
rs_mod5.summary()
Dep. Variable: | utbytte | R-squared: | 0.339 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.335 |
Method: | Least Squares | F-statistic: | 83.15 |
Date: | Tue, 01 Mar 2022 | Prob (F-statistic): | 0.00 |
Time: | 10:43:54 | Log-Likelihood: | -14379. |
No. Observations: | 11110 | AIC: | 2.890e+04 |
Df Residuals: | 11041 | BIC: | 2.940e+04 |
Df Model: | 68 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -0.3385 | 0.109 | -3.108 | 0.002 | -0.552 | -0.125 |
C(nkode_agg)[T.2] | 0.4998 | 0.247 | 2.020 | 0.043 | 0.015 | 0.985 |
C(nkode_agg)[T.8] | 0.1144 | 0.149 | 0.766 | 0.444 | -0.178 | 0.407 |
C(nkode_agg)[T.10] | 0.2489 | 0.130 | 1.908 | 0.056 | -0.007 | 0.505 |
C(nkode_agg)[T.11] | 0.0392 | 0.456 | 0.086 | 0.931 | -0.854 | 0.933 |
C(nkode_agg)[T.13] | 0.1375 | 0.142 | 0.966 | 0.334 | -0.141 | 0.416 |
C(nkode_agg)[T.14] | 0.3256 | 0.300 | 1.086 | 0.278 | -0.262 | 0.914 |
C(nkode_agg)[T.15] | 0.1142 | 0.410 | 0.278 | 0.781 | -0.690 | 0.919 |
C(nkode_agg)[T.16] | 0.1160 | 0.124 | 0.936 | 0.349 | -0.127 | 0.359 |
C(nkode_agg)[T.17] | 0.3187 | 0.260 | 1.224 | 0.221 | -0.191 | 0.829 |
C(nkode_agg)[T.18] | 0.2752 | 0.121 | 2.279 | 0.023 | 0.038 | 0.512 |
C(nkode_agg)[T.20] | 0.1419 | 0.196 | 0.723 | 0.470 | -0.243 | 0.526 |
C(nkode_agg)[T.22] | 0.2399 | 0.138 | 1.735 | 0.083 | -0.031 | 0.511 |
C(nkode_agg)[T.23] | 0.4545 | 0.129 | 3.515 | 0.000 | 0.201 | 0.708 |
C(nkode_agg)[T.24] | 0.2799 | 0.277 | 1.010 | 0.312 | -0.263 | 0.823 |
C(nkode_agg)[T.25] | 0.2908 | 0.118 | 2.475 | 0.013 | 0.060 | 0.521 |
C(nkode_agg)[T.26] | 0.2191 | 0.202 | 1.087 | 0.277 | -0.176 | 0.614 |
C(nkode_agg)[T.27] | 0.1544 | 0.171 | 0.901 | 0.367 | -0.181 | 0.490 |
C(nkode_agg)[T.28] | 0.3713 | 0.127 | 2.918 | 0.004 | 0.122 | 0.621 |
C(nkode_agg)[T.29] | 0.0920 | 0.300 | 0.307 | 0.759 | -0.496 | 0.680 |
C(nkode_agg)[T.30] | 0.0492 | 0.252 | 0.195 | 0.846 | -0.446 | 0.544 |
C(nkode_agg)[T.31] | 0.3125 | 0.147 | 2.132 | 0.033 | 0.025 | 0.600 |
C(nkode_agg)[T.32] | 0.3042 | 0.136 | 2.234 | 0.026 | 0.037 | 0.571 |
C(nkode_agg)[T.33] | 0.2974 | 0.126 | 2.366 | 0.018 | 0.051 | 0.544 |
C(nkode_agg)[T.41] | 0.1229 | 0.128 | 0.964 | 0.335 | -0.127 | 0.373 |
C(nkode_agg)[T.42] | 0.3049 | 0.179 | 1.704 | 0.088 | -0.046 | 0.656 |
C(nkode_agg)[T.43] | 0.2483 | 0.109 | 2.278 | 0.023 | 0.035 | 0.462 |
C(nkode_agg)[T.45] | 0.1918 | 0.111 | 1.732 | 0.083 | -0.025 | 0.409 |
C(nkode_agg)[T.46] | 0.0916 | 0.109 | 0.840 | 0.401 | -0.122 | 0.306 |
C(nkode_agg)[T.47] | 0.1320 | 0.108 | 1.222 | 0.222 | -0.080 | 0.344 |
C(nkode_agg)[T.49] | 0.3229 | 0.143 | 2.265 | 0.024 | 0.043 | 0.602 |
C(nkode_agg)[T.50] | -0.8718 | 0.304 | -2.866 | 0.004 | -1.468 | -0.276 |
C(nkode_agg)[T.52] | -0.0817 | 0.268 | -0.305 | 0.760 | -0.607 | 0.444 |
C(nkode_agg)[T.55] | 0.2566 | 0.121 | 2.115 | 0.034 | 0.019 | 0.494 |
C(nkode_agg)[T.56] | 0.1926 | 0.121 | 1.594 | 0.111 | -0.044 | 0.429 |
C(nkode_agg)[T.58] | 0.5080 | 0.164 | 3.102 | 0.002 | 0.187 | 0.829 |
C(nkode_agg)[T.59] | 0.2545 | 0.411 | 0.620 | 0.536 | -0.551 | 1.060 |
C(nkode_agg)[T.61] | 0.7920 | 0.411 | 1.929 | 0.054 | -0.013 | 1.597 |
C(nkode_agg)[T.62] | 0.0356 | 0.201 | 0.177 | 0.860 | -0.359 | 0.430 |
C(nkode_agg)[T.68] | 0.7992 | 0.232 | 3.452 | 0.001 | 0.345 | 1.253 |
C(nkode_agg)[T.71] | 0.1780 | 0.168 | 1.060 | 0.289 | -0.151 | 0.507 |
C(nkode_agg)[T.73] | 0.1704 | 0.160 | 1.068 | 0.286 | -0.142 | 0.483 |
C(nkode_agg)[T.74] | 0.2109 | 0.184 | 1.145 | 0.252 | -0.150 | 0.572 |
C(nkode_agg)[T.75] | -0.1470 | 0.230 | -0.640 | 0.522 | -0.597 | 0.303 |
C(nkode_agg)[T.77] | 1.1510 | 0.210 | 5.473 | 0.000 | 0.739 | 1.563 |
C(nkode_agg)[T.79] | 0.2505 | 0.260 | 0.964 | 0.335 | -0.259 | 0.760 |
C(nkode_agg)[T.80] | 0.0172 | 0.456 | 0.038 | 0.970 | -0.876 | 0.910 |
C(nkode_agg)[T.81] | -0.0965 | 0.171 | -0.564 | 0.573 | -0.432 | 0.239 |
C(nkode_agg)[T.82] | 0.1319 | 0.204 | 0.647 | 0.518 | -0.268 | 0.532 |
C(nkode_agg)[T.85] | 0.3344 | 0.186 | 1.795 | 0.073 | -0.031 | 0.700 |
C(nkode_agg)[T.86] | 0.5185 | 0.148 | 3.492 | 0.000 | 0.227 | 0.810 |
C(nkode_agg)[T.88] | 0.2733 | 0.123 | 2.230 | 0.026 | 0.033 | 0.514 |
C(nkode_agg)[T.90] | 0.0946 | 0.523 | 0.181 | 0.857 | -0.931 | 1.121 |
C(nkode_agg)[T.91] | 0.4255 | 0.302 | 1.409 | 0.159 | -0.166 | 1.018 |
C(nkode_agg)[T.93] | 0.6556 | 0.151 | 4.330 | 0.000 | 0.359 | 0.952 |
C(nkode_agg)[T.95] | 0.1478 | 0.182 | 0.811 | 0.418 | -0.210 | 0.505 |
C(nkode_agg)[T.96] | 0.4313 | 0.130 | 3.326 | 0.001 | 0.177 | 0.685 |
C(year)[T.2012] | 0.0363 | 0.028 | 1.299 | 0.194 | -0.018 | 0.091 |
C(year)[T.2013] | 0.0525 | 0.028 | 1.893 | 0.058 | -0.002 | 0.107 |
C(year)[T.2014] | 0.0489 | 0.028 | 1.759 | 0.079 | -0.006 | 0.103 |
C(year)[T.2015] | 0.0499 | 0.028 | 1.799 | 0.072 | -0.004 | 0.104 |
no_auditor | 0.1385 | 0.031 | 4.531 | 0.000 | 0.079 | 0.198 |
driftsinntekter | 0.1050 | 0.014 | 7.703 | 0.000 | 0.078 | 0.132 |
antall_ansatte | 0.0068 | 0.010 | 0.679 | 0.497 | -0.013 | 0.026 |
Eiendeler | -0.0209 | 0.011 | -1.853 | 0.064 | -0.043 | 0.001 |
risk_roa | -0.0784 | 0.014 | -5.776 | 0.000 | -0.105 | -0.052 |
leverage | 0.1463 | 0.011 | 13.733 | 0.000 | 0.125 | 0.167 |
cash_flow_mu | 0.5439 | 0.009 | 60.303 | 0.000 | 0.526 | 0.562 |
cash_flow_sd | 0.0642 | 0.011 | 5.724 | 0.000 | 0.042 | 0.086 |
Omnibus: | 12327.324 | Durbin-Watson: | 1.365 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 3344011.768 |
Skew: | 5.264 | Prob(JB): | 0.00 |
Kurtosis: | 87.338 | Cond. No. | 169. |
With industry fixed effects included, the coefficient on no_auditor stays positive, at about .14.
In this study, since we have a sharp eligibility cut-off that is to some degree arbitrary, we can use a regression discontinuity design. The idea is that whether a firm falls just above or just below the threshold can be considered essentially random; thus, by comparing firms close to the threshold, we can to a higher degree consider the sorting into audited and non-audited groups to be ignorable.
We have the basic set-up already in place; now we just limit our data to firms within +/- 1 million kroner of the operating income threshold:
#narrow results
min_di = (4e6-5e6)/operIncome_std
max_di = (6e6-5e6)/operIncome_std
print(min_di, max_di)
aq_narrow = aq_rs.loc[((aq_rs.choseAudit==0) & (aq_rs.driftsinntekter>min_di) & (aq_rs.driftsinntekter<max_di)) ,:]
-0.2970846713856964 0.2970846713856964
We then run a regression with the same control variables as before. In theory we could have a simpler specification, if we were very confident that our treatment assignment is now truly random, but adding control variables can also improve the efficiency of our results - that is, reduce the variance of our estimates.
rs_mod6 = smf.ols("utbytte~no_auditor + C(year) + driftsinntekter + driftsinntekter**2 + antall_ansatte + Eiendeler + risk_roa + leverage + cash_flow_mu + cash_flow_sd + C(nkode_agg)",
data=aq_narrow).fit()
rs_mod6.summary()
Dep. Variable: | utbytte | R-squared: | 0.213 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.188 |
Method: | Least Squares | F-statistic: | 8.309 |
Date: | Tue, 01 Mar 2022 | Prob (F-statistic): | 8.74e-59 |
Time: | 10:51:12 | Log-Likelihood: | -1765.4 |
No. Observations: | 1837 | AIC: | 3649. |
Df Residuals: | 1778 | BIC: | 3974. |
Df Model: | 58 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -0.4391 | 0.325 | -1.352 | 0.177 | -1.076 | 0.198 |
C(year)[T.2012] | -2.423e-05 | 0.050 | -0.000 | 1.000 | -0.098 | 0.098 |
C(year)[T.2013] | 0.0043 | 0.049 | 0.088 | 0.930 | -0.092 | 0.101 |
C(year)[T.2014] | -0.0378 | 0.050 | -0.750 | 0.453 | -0.137 | 0.061 |
C(year)[T.2015] | 0.0612 | 0.050 | 1.211 | 0.226 | -0.038 | 0.160 |
C(nkode_agg)[T.2] | -0.0836 | 0.732 | -0.114 | 0.909 | -1.519 | 1.351 |
C(nkode_agg)[T.8] | 0.1775 | 0.435 | 0.408 | 0.683 | -0.675 | 1.030 |
C(nkode_agg)[T.10] | 0.0700 | 0.383 | 0.183 | 0.855 | -0.682 | 0.822 |
C(nkode_agg)[T.13] | 0.3290 | 0.350 | 0.941 | 0.347 | -0.357 | 1.015 |
C(nkode_agg)[T.14] | 0.2259 | 0.560 | 0.404 | 0.687 | -0.872 | 1.324 |
C(nkode_agg)[T.15] | 0.2159 | 0.433 | 0.499 | 0.618 | -0.633 | 1.065 |
C(nkode_agg)[T.16] | 0.2391 | 0.344 | 0.694 | 0.488 | -0.436 | 0.914 |
C(nkode_agg)[T.18] | 0.4581 | 0.336 | 1.363 | 0.173 | -0.201 | 1.117 |
C(nkode_agg)[T.20] | 0.0419 | 0.721 | 0.058 | 0.954 | -1.372 | 1.456 |
C(nkode_agg)[T.22] | 0.4945 | 0.376 | 1.313 | 0.189 | -0.244 | 1.233 |
C(nkode_agg)[T.23] | 0.3601 | 0.364 | 0.990 | 0.322 | -0.353 | 1.073 |
C(nkode_agg)[T.24] | 1.0495 | 0.560 | 1.874 | 0.061 | -0.049 | 2.148 |
C(nkode_agg)[T.25] | 0.5668 | 0.332 | 1.705 | 0.088 | -0.085 | 1.219 |
C(nkode_agg)[T.26] | 0.2231 | 0.417 | 0.535 | 0.593 | -0.595 | 1.042 |
C(nkode_agg)[T.27] | 0.2203 | 0.416 | 0.529 | 0.597 | -0.596 | 1.037 |
C(nkode_agg)[T.28] | 0.4150 | 0.349 | 1.190 | 0.234 | -0.269 | 1.099 |
C(nkode_agg)[T.30] | 0.2191 | 0.492 | 0.445 | 0.656 | -0.746 | 1.184 |
C(nkode_agg)[T.31] | 0.3741 | 0.358 | 1.044 | 0.297 | -0.329 | 1.077 |
C(nkode_agg)[T.32] | 0.4098 | 0.351 | 1.166 | 0.244 | -0.280 | 1.099 |
C(nkode_agg)[T.33] | 0.2554 | 0.345 | 0.739 | 0.460 | -0.422 | 0.933 |
C(nkode_agg)[T.41] | 0.3289 | 0.348 | 0.945 | 0.345 | -0.353 | 1.011 |
C(nkode_agg)[T.42] | 0.4427 | 0.397 | 1.116 | 0.264 | -0.335 | 1.220 |
C(nkode_agg)[T.43] | 0.3270 | 0.325 | 1.007 | 0.314 | -0.310 | 0.964 |
C(nkode_agg)[T.45] | 0.3199 | 0.326 | 0.980 | 0.327 | -0.320 | 0.960 |
C(nkode_agg)[T.46] | 0.2897 | 0.325 | 0.891 | 0.373 | -0.348 | 0.928 |
C(nkode_agg)[T.47] | 0.2991 | 0.324 | 0.923 | 0.356 | -0.336 | 0.935 |
C(nkode_agg)[T.49] | 0.4610 | 0.364 | 1.266 | 0.206 | -0.253 | 1.175 |
C(nkode_agg)[T.52] | 0.3147 | 0.494 | 0.637 | 0.524 | -0.655 | 1.284 |
C(nkode_agg)[T.55] | 0.2692 | 0.350 | 0.769 | 0.442 | -0.417 | 0.956 |
C(nkode_agg)[T.56] | 0.4054 | 0.339 | 1.195 | 0.232 | -0.260 | 1.071 |
C(nkode_agg)[T.58] | -0.0192 | 0.378 | -0.051 | 0.959 | -0.760 | 0.721 |
C(nkode_agg)[T.59] | 0.4430 | 0.563 | 0.787 | 0.431 | -0.661 | 1.547 |
C(nkode_agg)[T.62] | 0.1541 | 0.395 | 0.390 | 0.696 | -0.620 | 0.928 |
C(nkode_agg)[T.68] | 0.1892 | 0.560 | 0.338 | 0.736 | -0.909 | 1.287 |
C(nkode_agg)[T.71] | 0.2291 | 0.405 | 0.566 | 0.571 | -0.564 | 1.022 |
C(nkode_agg)[T.73] | 0.4143 | 0.368 | 1.125 | 0.261 | -0.308 | 1.137 |
C(nkode_agg)[T.74] | 0.3854 | 0.367 | 1.051 | 0.293 | -0.334 | 1.104 |
C(nkode_agg)[T.75] | -0.1888 | 0.432 | -0.437 | 0.662 | -1.037 | 0.659 |
C(nkode_agg)[T.81] | -0.0936 | 0.383 | -0.245 | 0.807 | -0.844 | 0.657 |
C(nkode_agg)[T.82] | 0.0322 | 0.433 | 0.074 | 0.941 | -0.817 | 0.881 |
C(nkode_agg)[T.86] | 1.0898 | 0.373 | 2.925 | 0.003 | 0.359 | 1.821 |
C(nkode_agg)[T.88] | 0.3532 | 0.347 | 1.018 | 0.309 | -0.327 | 1.034 |
C(nkode_agg)[T.90] | 0.2538 | 0.560 | 0.453 | 0.650 | -0.844 | 1.352 |
C(nkode_agg)[T.93] | 0.6095 | 0.361 | 1.688 | 0.092 | -0.099 | 1.317 |
C(nkode_agg)[T.95] | 0.4007 | 0.382 | 1.049 | 0.294 | -0.348 | 1.150 |
C(nkode_agg)[T.96] | 0.5757 | 0.347 | 1.658 | 0.098 | -0.105 | 1.257 |
no_auditor | 0.1580 | 0.047 | 3.357 | 0.001 | 0.066 | 0.250 |
driftsinntekter | 0.2828 | 0.110 | 2.563 | 0.010 | 0.066 | 0.499 |
antall_ansatte | -0.0445 | 0.027 | -1.642 | 0.101 | -0.098 | 0.009 |
Eiendeler | 0.0842 | 0.020 | 4.282 | 0.000 | 0.046 | 0.123 |
risk_roa | -0.0089 | 0.024 | -0.373 | 0.709 | -0.056 | 0.038 |
leverage | 0.0338 | 0.012 | 2.750 | 0.006 | 0.010 | 0.058 |
cash_flow_mu | 0.4052 | 0.024 | 16.599 | 0.000 | 0.357 | 0.453 |
cash_flow_sd | -0.0086 | 0.021 | -0.417 | 0.677 | -0.049 | 0.032 |
Omnibus: | 3061.949 | Durbin-Watson: | 1.861 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 3426499.805 |
Skew: | 10.680 | Prob(JB): | 0.00 |
Kurtosis: | 213.500 | Cond. No. | 221. |
This did not end up having a large effect on the estimation of our no_auditor coefficient, other than increasing the standard error, which would be expected when reducing the number of observations. It seems our control variables have done a pretty good job of controlling for the variation in the firms other than their auditing status.
We can illustrate our regression discontinuity with simulated uncertainty. First, we copy our function for simulating regression model uncertainty from lab 9:
def regSim(regMod):
    #extract values from regression model
    nmk = regMod.df_resid #n-k
    sigma_hat = np.sqrt(regMod.mse_resid)
    bs_vcov = regMod.cov_params(scale=1)
    bs = regMod.params
    #create simulated values
    sigma_sim = sigma_hat*np.sqrt((nmk/np.random.chisquare(nmk,1)))
    V_sim = np.array(bs_vcov) * sigma_sim**2
    bs_sim = np.random.multivariate_normal(bs, V_sim, 1)
    return([bs_sim.flatten(), sigma_sim])
For simplicity, we will first run a model that does not include industry fixed effects.
#note: in a patsy formula a bare driftsinntekter**2 is interpreted as just
#driftsinntekter, so we drop it; use I(driftsinntekter**2) for an actual square
rs_mod_fig = smf.ols("utbytte~no_auditor + C(year) + driftsinntekter + antall_ansatte + Eiendeler + risk_roa + leverage + cash_flow_mu + cash_flow_sd",
                     data=aq_narrow).fit()
rs_mod_fig.summary()
Dep. Variable: | utbytte | R-squared: | 0.187 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.182 |
Method: | Least Squares | F-statistic: | 34.96 |
Date: | Tue, 01 Mar 2022 | Prob (F-statistic): | 1.27e-73 |
Time: | 11:08:53 | Log-Likelihood: | -1795.5 |
No. Observations: | 1837 | AIC: | 3617. |
Df Residuals: | 1824 | BIC: | 3689. |
Df Model: | 12 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -0.1000 | 0.038 | -2.621 | 0.009 | -0.175 | -0.025 |
C(year)[T.2012] | 0.0024 | 0.050 | 0.049 | 0.961 | -0.095 | 0.100 |
C(year)[T.2013] | -0.0011 | 0.049 | -0.021 | 0.983 | -0.097 | 0.095 |
C(year)[T.2014] | -0.0408 | 0.050 | -0.815 | 0.415 | -0.139 | 0.057 |
C(year)[T.2015] | 0.0641 | 0.050 | 1.274 | 0.203 | -0.035 | 0.163 |
no_auditor | 0.1396 | 0.045 | 3.080 | 0.002 | 0.051 | 0.228 |
driftsinntekter | 0.2575 | 0.108 | 2.379 | 0.017 | 0.045 | 0.470 |
antall_ansatte | -0.0219 | 0.022 | -0.994 | 0.320 | -0.065 | 0.021 |
Eiendeler | 0.0746 | 0.019 | 3.909 | 0.000 | 0.037 | 0.112 |
risk_roa | -0.0042 | 0.023 | -0.181 | 0.856 | -0.050 | 0.041 |
leverage | 0.0321 | 0.012 | 2.649 | 0.008 | 0.008 | 0.056 |
cash_flow_mu | 0.4086 | 0.022 | 18.357 | 0.000 | 0.365 | 0.452 |
cash_flow_sd | -0.0025 | 0.020 | -0.125 | 0.901 | -0.042 | 0.037 |
Omnibus: | 2992.007 | Durbin-Watson: | 1.828 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 2992577.853 |
Skew: | 10.204 | Prob(JB): | 0.00 |
Kurtosis: | 199.674 | Cond. No. | 11.0 |
sim = regSim(rs_mod_fig)
sim_inter = sim[0][0] #simulated intercept
sim_treat = sim[0][5] #simulated no_auditor coefficient
sim_beta = sim[0][6]  #simulated driftsinntekter coefficient
print(sim_inter, sim_treat, sim_beta)
-0.1149942250768759 0.0907645709571931 0.1847980214329764
nsim = 50
bs_sims = []
sigma_sims = []
for s in range(nsim):
    param_sim = regSim(rs_mod_fig)
    bs_sims.append(param_sim[0])
    sigma_sims.append(param_sim[1])
rs_mod_fig.cov_params()
se = np.sqrt(np.diagonal(rs_mod_fig.cov_params()))
params = rs_mod_fig.params
ses = dict(zip(params.index, se))
#extract the point estimates and standard errors for the coefficients we need in the figure
treat = rs_mod_fig.params["no_auditor"]
treat_se =ses["no_auditor"]
inter = rs_mod_fig.params["Intercept"]
inter_se = ses["Intercept"]
di = rs_mod_fig.params["driftsinntekter"]
di_se = ses["driftsinntekter"]
print("treatment", treat, treat_se)
print("Intercept", inter, inter_se)
print("di", di, di_se)
treatment 0.13958126433061546 0.04531536307535993 Intercept -0.09995932575667374 0.03814291312126054 di 0.25750323354220367 0.10823193503332862
pre_x = np.linspace(-2, 0, 1000)
post_x = np.linspace(0, 2, 1000)
post_y = inter + post_x*di
pre_y = inter + pre_x*di + treat
fig, ax = plt.subplots()
ax.plot(pre_x, pre_y, "k-", lw=2)
ax.plot(post_x, post_y, "k-", lw=2)
ax.axvline(0, color="red", linestyle=":")
#uncertainty
for i in range(50):
    sim = regSim(rs_mod_fig)
    sim_inter = sim[0][0]
    sim_treat = sim[0][5]
    sim_beta = sim[0][6]
    post_y_r = sim_inter + sim_beta*post_x
    pre_y_r = sim_inter + sim_treat + sim_beta*pre_x
    ax.plot(pre_x, pre_y_r, "k-", alpha=.1)
    ax.plot(post_x, post_y_r, "k-", alpha=.1)
#ax.set_ylim(0, 4e5)
#ax.set_xlim(4e6, 6e6)
ax.set_xlabel("Operating Revenue, difference from threshold in std")
ax.set_ylabel("Dividend, in StD")
Text(0, 0.5, 'Dividend, in StD')
The jump, or discontinuity, at the operating-revenue threshold represents the point estimate of the no_auditor coefficient.
Ideally, to get a causal interpretation of our estimation, we should seek to have balance (ROS p.391-394) between our treatment and control groups. In our example, that means we want to compare roughly similar firms, some of which are audited and some of which are not. One set of techniques we can make use of is called matching (ROS p.400-).
Matching consists of two basic steps: first, for each treated observation, find one or more untreated observations with similar characteristics (here, a similar propensity score); second, estimate the treatment effect on the resulting matched sample.
We will do a relatively simple form of matching called propensity score matching.
We start by creating a logistic regression model that predicts which firms below the threshold are likely to forgo auditing. The fitted values of this model - in the form of probabilities - are called propensity scores.
The aim of this model is to give a best possible fit, so we will include multiple explanatory variables and their squares:
ps_fit = smf.logit("no_auditor ~ C(year) + driftsinntekter + I(driftsinntekter**2) + antall_ansatte + antall_ansatte*driftsinntekter + I(antall_ansatte**2) + Eiendeler + I(Eiendeler**2) + leverage + I(leverage**2) + risk_roa + I(risk_roa**2) + cash_flow_mu + I(cash_flow_mu**2) + cash_flow_sd + I(cash_flow_sd**2)",
data=aq_rs.loc[aq_rs.vol_auditor==1, ]).fit()
pscores = ps_fit.predict()
Optimization terminated successfully. Current function value: 0.552025 Iterations 8
Then we restrict the propensity scores to the firms that chose to forgo an auditor:
pscores_noaudit = pscores[ps_fit.model.data.endog==1]
pscores_noaudit
array([0.46292287, 0.60140932, 0.61065614, ..., 0.56262474, 0.60000551, 0.65996885])
Then we use the model to predict the propensity scores for data above the threshold. Thus we are in essence creating a model to estimate which firms above the threshold would forgo an auditor if they could:
predictedP = ps_fit.predict(aq_rs.loc[aq_rs.vol_auditor==0, ])
cid = aq_rs.cid[aq_rs.vol_auditor==0]
year = aq_rs.year[aq_rs.vol_auditor==0]
d = {'value': predictedP, 'cid': cid, "year": year}
controls = pd.DataFrame(data=d)
Now we can show the underlying imbalance, as represented by the propensity scores, of the firms above and below the threshold:
fig, ax = plt.subplots(2)
ax[0].hist(pscores_noaudit)
ax[0].set_xlim(0,.8)
ax[0].set_ylabel("Below threshold")
ax[1].hist(predictedP)
ax[1].set_xlim(0,.8)
ax[1].set_ylabel("Above threshold")
plt.show()
Now we create a simple algorithm that, for each observation in the treatment group, finds the control firm with the closest propensity score. There is a growing literature on the optimal design of such matching algorithms - it is one of the places where we see a clear intersection of machine learning techniques and traditional statistical techniques.
def matching(treatments, controls):
    """
    treatments: list of propensity scores for the treated observations
    controls: list of propensity scores for the potential controls

    Algorithm: for each treatment score (outer loop), scan all
    control scores (inner loop) and keep the closest one.
    """
    matchedValues = [] #list of matched Pscores
    matchedIndex = [] #list of positions of matched Pscores
    for tindex, treat in enumerate(treatments):
        #restart the search for each treated observation
        matchValue = controls[0]
        matchIndex = 0
        for cindex, cont in enumerate(controls):
            if (treat-cont)**2 < (treat-matchValue)**2:
                matchValue = cont
                matchIndex = cindex
        #add closest match; we match with replacement, so the same
        #control can be matched to several treated observations
        matchedValues.append(matchValue)
        matchedIndex.append(matchIndex)
    matchedDict = {"matchedValues": matchedValues, "matchedIndex": matchedIndex}
    matchedDF = pd.DataFrame(matchedDict)
    return(matchedDF)
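As an aside, the nearest-neighbour search in the loop above can also be written with NumPy broadcasting. The sketch below is illustrative only; the function name and toy scores are ours, not part of the lab's data:

```python
import numpy as np

def matching_vectorized(treatments, controls):
    """Nearest-neighbour matching with replacement.
    treatments, controls: 1-d arrays/lists of propensity scores.
    Returns, for each treated score, the position and value of the
    closest control score."""
    treatments = np.asarray(treatments)
    controls = np.asarray(controls)
    # |t_i - c_j| for every treated/control pair, via broadcasting
    dist = np.abs(treatments[:, None] - controls[None, :])
    idx = dist.argmin(axis=1)  # closest control for each treated unit
    return idx, controls[idx]

# toy example with made-up propensity scores
idx, vals = matching_vectorized([0.45, 0.60], [0.10, 0.44, 0.62, 0.90])
print(idx, vals)  # -> [1 2] [0.44 0.62]
```

Like the loop version, this matches with replacement, so a single control can serve as the match for several treated firms.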
matchedControl = matching(pscores_noaudit.tolist(), predictedP.tolist())
matchedControl.head()
matchedValues | matchedIndex | |
---|---|---|
0 | 0.463696 | 6463 |
1 | 0.590797 | 7032 |
2 | 0.626549 | 2234 |
3 | 0.675170 | 7595 |
4 | 0.333044 | 5125 |
mindex = matchedControl.matchedIndex.values
potentialControls = aq_rs[aq_rs.vol_auditor==0]
mcontrols = potentialControls.iloc[mindex,]
Now we check the balance of our new, synthetic, control group:
fig, ax = plt.subplots(3)
ax[0].hist(pscores_noaudit)
ax[0].set_xlim(0,.8)
ax[0].set_ylabel("Treated")
ax[1].hist(predictedP)
ax[1].set_xlim(0,.8)
ax[1].set_ylabel("Unmatched Control")
ax[2].hist(matchedControl.matchedValues)
ax[2].set_xlim(0,.8)
ax[2].set_ylabel("Matched Control")
ax[2].set_xlabel("Propensity Score")
plt.show()
It looks much better!
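Histograms give a visual impression; balance is also often summarized numerically with standardized mean differences. A minimal sketch (the function and toy numbers are ours, not the lab's data):

```python
import numpy as np

def std_mean_diff(x_treat, x_control):
    """Standardized mean difference for one covariate: the difference
    in group means divided by the pooled standard deviation.
    Values near 0 indicate balance; |SMD| > 0.1 is a common warning flag."""
    x_treat = np.asarray(x_treat, dtype=float)
    x_control = np.asarray(x_control, dtype=float)
    pooled_sd = np.sqrt((x_treat.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treat.mean() - x_control.mean()) / pooled_sd

# toy covariate values for a treated and a control group
print(round(std_mean_diff([1, 2, 3, 4], [1.5, 2.5, 3.5, 4.5]), 3))  # -> -0.387
```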
We now run a regression with our matched control group
matched_results = []
#treatments are those that were eligible for voluntary auditing, but did not choose to be audited
treatments = aq_rs[((aq_rs.vol_auditor==1) & (aq_rs.choseAudit==0))]
matched_sample =pd.concat([mcontrols, treatments], axis=0)
matched_sample["treated"] = 1*(matched_sample.vol_auditor==1)
matched_mod1 = smf.ols("utbytte~treated + C(year)",
data=matched_sample).fit(cov_type='HC3')
matched_results.append(matched_mod1)
matched_mod1.summary()
Dep. Variable: | utbytte | R-squared: | 0.007 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.006 |
Method: | Least Squares | F-statistic: | 6.199 |
Date: | Tue, 01 Mar 2022 | Prob (F-statistic): | 9.86e-06 |
Time: | 13:07:56 | Log-Likelihood: | -2026.9 |
No. Observations: | 3902 | AIC: | 4066. |
Df Residuals: | 3896 | BIC: | 4104. |
Df Model: | 5 | ||
Covariance Type: | HC3 |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -0.3005 | 0.030 | -10.179 | 0.000 | -0.358 | -0.243 |
C(year)[T.2012] | -0.0038 | 0.029 | -0.131 | 0.895 | -0.060 | 0.053 |
C(year)[T.2013] | 0.0439 | 0.030 | 1.453 | 0.146 | -0.015 | 0.103 |
C(year)[T.2014] | -0.0020 | 0.029 | -0.068 | 0.946 | -0.060 | 0.056 |
C(year)[T.2015] | 0.0402 | 0.033 | 1.217 | 0.224 | -0.025 | 0.105 |
treated | 0.0560 | 0.015 | 3.720 | 0.000 | 0.026 | 0.085 |
Omnibus: | 6771.205 | Durbin-Watson: | 1.717 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 12240465.458 |
Skew: | 11.766 | Prob(JB): | 0.00 |
Kurtosis: | 276.374 | Cond. No. | 19.7 |
Notice that this is the same naive model we started with, but this time with a matched control group. We again get a positive coefficient on our treatment indicator, treated, which plays the same role as the earlier no_auditor indicator.
If we add operating revenue (driftsinntekter) as a control variable, we get results that start approaching our earlier results from the regression discontinuity design:
matched_mod1p1 = smf.ols("utbytte~treated + C(year) + driftsinntekter",
data=matched_sample).fit(cov_type='HC3')
matched_results.append(matched_mod1p1)
matched_mod1p1.summary()
Dep. Variable: | utbytte | R-squared: | 0.017 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.015 |
Method: | Least Squares | F-statistic: | 8.251 |
Date: | Tue, 01 Mar 2022 | Prob (F-statistic): | 6.75e-09 |
Time: | 13:10:07 | Log-Likelihood: | -2008.4 |
No. Observations: | 3902 | AIC: | 4031. |
Df Residuals: | 3895 | BIC: | 4075. |
Df Model: | 6 | ||
Covariance Type: | HC3 |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -0.3025 | 0.029 | -10.346 | 0.000 | -0.360 | -0.245 |
C(year)[T.2012] | -0.0259 | 0.029 | -0.905 | 0.365 | -0.082 | 0.030 |
C(year)[T.2013] | 0.0431 | 0.030 | 1.449 | 0.147 | -0.015 | 0.101 |
C(year)[T.2014] | 0.0072 | 0.029 | 0.248 | 0.805 | -0.050 | 0.065 |
C(year)[T.2015] | 0.0435 | 0.033 | 1.322 | 0.186 | -0.021 | 0.108 |
treated | 0.1036 | 0.019 | 5.421 | 0.000 | 0.066 | 0.141 |
driftsinntekter | 0.0742 | 0.013 | 5.786 | 0.000 | 0.049 | 0.099 |
Omnibus: | 6769.805 | Durbin-Watson: | 1.721 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 12354519.311 |
Skew: | 11.756 | Prob(JB): | 0.00 |
Kurtosis: | 277.656 | Cond. No. | 20.6 |
Minimum wage laws have been controversial in several countries, especially the US, where minimum wages are set at the federal level and, in some states, at higher state levels. A very basic economic model would suggest that raising the minimum wage leads to lower employment, and this has been one of the main arguments against having or raising a minimum wage.
In the 1990s, two researchers, David Card and Alan Krueger, decided to test this hypothesis. New Jersey was set to increase its minimum wage, so they had research assistants call fast-food restaurants to enquire about how many employees they had, along with several other variables. They also called fast-food restaurants in neighboring Pennsylvania. Then, after the implementation of the policy, the researchers called the restaurants again. The paper became well known because its results contradicted the standing economic wisdom: using a difference-in-differences methodology, the researchers found that the higher minimum wage increased employment.
In this exercise we will do a simplified replication of some of the paper's results and see how and why to implement a difference-in-differences methodology.
a. Run a regression of employment on the state indicator (STATE, 1 = New Jersey) using only data from after the introduction of the new minimum wage law. Discuss problems with interpreting this as a causal effect. Then run a regression, using only New Jersey data, on the policy indicator (afterImplim). Discuss problems with interpreting this causally.
b. Now take the difference between restaurant employment before and after the implementation of the law, then compare the mean differences between New Jersey and Pennsylvania.
c. Now run a regression on all the data (before and after implementation of the new law) and include dummies for state, for whether the law had been implemented, and their interaction. The interaction effect can be considered our difference-in-differences (DinD) estimator. Interpret the results and discuss how this resolves some of the problems in a.)
d. Add covariates to the DinD regression for the chain indicator and ownership type (franchise or company-owned).
e. What assumptions are necessary in order to interpret the DinD estimator as causal?
To start, run the following code to load in and set up the data
colNames = ["CHAIN", "CO_OWNED","STATE", "SOUTHJ", "CENTRALJ","NORTHJ", "PA1", "PA2", "SHORE", "NCALLS", "EMPFT", "EMPPT", "NMGRS", "WAGE_ST", "INCTIME", "FIRSTINC","BONUS", "PCTAFF", "MEALS", "OPEN", "HRSOPEN", "PSODA", "PFRY", "PENTREE", "NREGS", "NREGS11", "TYPE2", "STATUS2", "DATE2", "NCALLS2", "EMPFT2", "EMPPT2", "NMGRS2", "WAGE_ST2","INCTIME2","FIRSTIN2","SPECIAL2","MEALS2", "OPEN2R", "HRSOPEN2","PSODA2", "PFRY2", "PENTREE2","NREGS2", "NREGS112"]
minWage = pd.read_table("minWage.dat", sep=r"\s+", names=colNames)
minWage
CHAIN | CO_OWNED | STATE | SOUTHJ | CENTRALJ | NORTHJ | PA1 | PA2 | SHORE | NCALLS | ... | FIRSTIN2 | SPECIAL2 | MEALS2 | OPEN2R | HRSOPEN2 | PSODA2 | PFRY2 | PENTREE2 | NREGS2 | NREGS112 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
46 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0.08 | 1 | 2 | 6.50 | 16.50 | 1.03 | . | 0.94 | 4 | 4 |
49 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0.05 | 0 | 2 | 10.00 | 13.00 | 1.01 | 0.89 | 2.35 | 4 | 4 |
506 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0.25 | . | 1 | 11.00 | 11.00 | 0.95 | 0.74 | 2.33 | 4 | 3 |
56 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0.15 | 0 | 2 | 10.00 | 12.00 | 0.92 | 0.79 | 0.87 | 2 | 2 |
61 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0.15 | 0 | 2 | 10.00 | 12.00 | 1.01 | 0.84 | 0.95 | 2 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
423 | 2 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | ... | 0.50 | 0 | 1 | 11.00 | 11.00 | 1.05 | 0.84 | 2.32 | 3 | 2 |
424 | 2 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0.50 | 0 | 1 | 11.00 | 14.00 | 1.05 | 0.94 | 2.32 | 5 | 3 |
426 | 3 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0.25 | 1 | 2 | 6.00 | 18.00 | 1.11 | 1.05 | 1.05 | 6 | 5 |
427 | 4 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | ... | . | 1 | 2 | 10.50 | 12.50 | 1.11 | 1.09 | 2.07 | 2 | 2 |
428 | 4 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | ... | 0.10 | 0 | 2 | 10.50 | 12.50 | 1.05 | 0.89 | 2.11 | 2 | 2 |
410 rows × 45 columns
#Split the data into two data sets: before and after
#(.copy() gives independent DataFrames, avoiding pandas' SettingWithCopyWarning below)
minWage1 = minWage.loc[:, ['CHAIN', 'CO_OWNED', 'STATE', 'SOUTHJ', 'CENTRALJ', 'NORTHJ', 'PA1',
                           'PA2', 'SHORE', 'NCALLS', 'EMPFT', 'EMPPT', 'NMGRS', 'WAGE_ST',
                           'INCTIME', 'FIRSTINC', 'BONUS', 'PCTAFF', 'MEALS', 'OPEN', 'HRSOPEN',
                           'PSODA', 'PFRY', 'PENTREE', 'NREGS', 'NREGS11']].copy()
minWage2 = minWage.loc[:, ['CHAIN', 'CO_OWNED', 'STATE', 'SOUTHJ', 'CENTRALJ', 'NORTHJ', 'PA1',
                           'PA2', 'SHORE', 'NCALLS2', 'EMPFT2', 'EMPPT2', 'NMGRS2', 'WAGE_ST2',
                           'INCTIME2', 'FIRSTIN2', 'BONUS', 'PCTAFF', 'MEALS2', 'OPEN2R', 'HRSOPEN2',
                           'PSODA2', 'PFRY2', 'PENTREE2', 'NREGS2', 'NREGS112']].copy()
#indicator variable for before or after implementation
minWage1["afterImplim"] = 0
minWage2["afterImplim"] = 1
#rename columns so consistent with minWage1
minWage2.columns = ['CHAIN', 'CO_OWNED', 'STATE', 'SOUTHJ', 'CENTRALJ', 'NORTHJ', 'PA1',
'PA2', 'SHORE', 'NCALLS', 'EMPFT', 'EMPPT', 'NMGRS', 'WAGE_ST',
'INCTIME', 'FIRSTINC','BONUS', 'PCTAFF', 'MEALS', 'OPEN', 'HRSOPEN',
'PSODA', 'PFRY', 'PENTREE', 'NREGS', 'NREGS11', 'afterImplim']
minWage_long = pd.concat([minWage1, minWage2])
minWage_long
CHAIN | CO_OWNED | STATE | SOUTHJ | CENTRALJ | NORTHJ | PA1 | PA2 | SHORE | NCALLS | ... | PCTAFF | MEALS | OPEN | HRSOPEN | PSODA | PFRY | PENTREE | NREGS | NREGS11 | afterImplim | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
46 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | . | 2 | 6.5 | 16.5 | 1.03 | 1.03 | 0.52 | 3 | 3 | 0 |
49 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | . | 2 | 10.0 | 13.0 | 1.01 | 0.90 | 2.35 | 4 | 3 | 0 |
506 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 30.0 | 2 | 11.0 | 10.0 | 0.95 | 0.74 | 2.33 | 3 | 3 | 0 |
56 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0.0 | 2 | 10.0 | 12.0 | 0.87 | 0.82 | 1.79 | 2 | 2 | 0 |
61 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0.0 | 3 | 10.0 | 12.0 | 0.87 | 0.77 | 1.65 | 2 | 2 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
423 | 2 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 33.0 | 1 | 11.00 | 11.00 | 1.05 | 0.84 | 2.32 | 3 | 2 | 1 |
424 | 2 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 45.0 | 1 | 11.00 | 14.00 | 1.05 | 0.94 | 2.32 | 5 | 3 | 1 |
426 | 3 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | . | ... | 75.0 | 2 | 6.00 | 18.00 | 1.11 | 1.05 | 1.05 | 6 | 5 | 1 |
427 | 4 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 4 | ... | 40.0 | 2 | 10.50 | 12.50 | 1.11 | 1.09 | 2.07 | 2 | 2 | 1 |
428 | 4 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | . | ... | 100.0 | 2 | 10.50 | 12.50 | 1.05 | 0.89 | 2.11 | 2 | 2 | 1 |
820 rows × 27 columns
#convert some series from character/text to numeric
minWage_long["EMPFT"] = pd.to_numeric(minWage_long.EMPFT, errors='coerce')
minWage_long["EMPPT"] = pd.to_numeric(minWage_long.EMPPT, errors='coerce')
minWage_long["NMGRS"] = pd.to_numeric(minWage_long.NMGRS , errors='coerce')
minWage_long["WAGE_ST"] = pd.to_numeric(minWage_long.WAGE_ST, errors='coerce')
minWage_long["INCTIME"] = pd.to_numeric(minWage_long.INCTIME, errors='coerce')
#Part-time employee is counted as 50% of a full-time employee
minWage_long["EMP"] = minWage_long["EMPFT"] + .5*minWage_long["EMPPT"]
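Before turning to the exercises, the mechanics of the DinD estimator in c.) can be sketched on simulated data. All numbers below are invented for illustration; the true effect is set to 1.5 so we can see the estimator recover it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

group = rng.integers(0, 2, n)    # 1 = treated state (like NJ)
period = rng.integers(0, 2, n)   # 1 = after implementation
true_effect = 1.5

# outcome with a fixed group difference, a common time trend,
# and the treatment effect only for the treated group after the change
y = (2.0 * group + 0.5 * period
     + true_effect * group * period
     + rng.normal(0, 1, n))

def cell_mean(g, p):
    return y[(group == g) & (period == p)].mean()

# difference-in-differences of the four cell means
dind = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
print(dind)  # close to 1.5
```

This is exactly the quantity the interaction coefficient estimates in the regression version of c.): it differences away both the fixed state gap and the common time trend.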
(Based on ROS Exercise 21.2 and 21.3, p.450-451)
Consider a hypothetical randomized encouragement design. In particular, imagine a study in which 1000 students entering any undergraduate degree program in Business in a major university are randomly assigned to one of two conditions.
One group is encouraged via an email from the chair of their department to participate in a one-week math boot camp just before the start of their first semester. Students in the other group are also allowed to participate but receive no special encouragement. These students would have had to discover on their own the existence of the program on the university website.
The outcome variable 𝑦 that you will simulate below is the score on the final exam for a required math course, expressed as the difference from the threshold for passing. Thus a negative value reflects the fact that the student would not have passed the course. Below we will generate data for a sample of 1000 individuals.
(a) We will simulate compliance status. We assume that 25% of individuals are compliers, who will attend the math boot camp if and only if they receive the email; 60% are never-takers, who will never attend the boot camp whether or not they receive the email; and 15% are always-takers, who will always attend.
We generate 𝑑0 and 𝑑1 vectors to reflect this. We also generate a vector indicating compliance type, 𝑐.
(b) We simulate the potential outcomes in a way that meets the following criteria:
The average effect of 𝑧 on 𝑦 for the compliers is 4.
The average 𝑦(𝑧 = 0) for never-takers is 0; the average 𝑦(𝑧 = 0) for compliers is 3; the average 𝑦(𝑧 = 0) for always-takers is 6.
The residual standard deviation is 1 for everyone in the sample (generated independently for each potential outcome).
(c) We put 𝑇0, 𝑇1, 𝑦0, 𝑦1 into one dataset called IV_df, along with a variable, 𝑐, indicating compliance group. We then also create the observed variables: y (actual test scores), T (whether a student went to the math boot camp) and z (whether they were randomly picked to receive an email encouragement).
#a.)
#compliance
# compliance types: 250 compliers, 600 never-takers, 150 always-takers
c = np.array(np.concatenate([["Complier"]*250, ["Never Taker"]*600, ["Always Taker"]*150]))
d0 = np.array(np.concatenate([[0]*850, [1]*150])) #dummy for always takers
d1 = np.array(np.concatenate([[1]*250, [0]*750])) #dummy for complier
#z is the encouragement
z=np.array(np.concatenate([np.zeros(500), np.ones(500)]))
#now let us randomize z
np.random.shuffle(z)
#Potential attendance outcomes
# Would attend if not encouraged
T_i0 = d0
# Would attend if encouraged
T_i1 = d0+d1
#Potential test outcomes
#test score given that they were NOT encouraged, y(z=0)
y0_complier = np.random.normal(3,1,250)
y0_neverTaker = np.random.normal(0,1,600) # y(z=0)
y0_alwaysTaker = np.random.normal(6,1,150)
y0 = np.concatenate([y0_complier, y0_neverTaker, y0_alwaysTaker])
complierEffect = np.random.normal(4,1,1000)
#potential outcome if encouraged: only compliers change attendance, so only they get the effect
y1 = y0 + d1*complierEffect
IV_df = pd.DataFrame({
"complierStatus":c,
"z":z,
"d0":d0,
"d1":d1,
"T0":T_i0,
"T1":T_i1,
"y0": y0,
"y1": y1,
"encourEffect": y1-y0
})
IV_df
complierStatus | z | d0 | d1 | T0 | T1 | y0 | y1 | encourEffect | |
---|---|---|---|---|---|---|---|---|---|
0 | Complier | 0.0 | 0 | 1 | 0 | 1 | 0.568668 | 4.387267 | 3.818599 |
1 | Complier | 1.0 | 0 | 1 | 0 | 1 | 1.327189 | 5.027651 | 3.700463 |
2 | Complier | 1.0 | 0 | 1 | 0 | 1 | 2.846374 | 7.192827 | 4.346453 |
3 | Complier | 1.0 | 0 | 1 | 0 | 1 | 3.547111 | 8.005861 | 4.458751 |
4 | Complier | 1.0 | 0 | 1 | 0 | 1 | 2.333885 | 3.735922 | 1.402037 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | Always Taker | 1.0 | 1 | 0 | 1 | 1 | 5.144561 | 5.144561 | 0.000000 |
996 | Always Taker | 0.0 | 1 | 0 | 1 | 1 | 7.603954 | 7.603954 | 0.000000 |
997 | Always Taker | 0.0 | 1 | 0 | 1 | 1 | 4.933091 | 4.933091 | 0.000000 |
998 | Always Taker | 0.0 | 1 | 0 | 1 | 1 | 5.806916 | 5.806916 | 0.000000 |
999 | Always Taker | 1.0 | 1 | 0 | 1 | 1 | 7.521124 | 7.521124 | 0.000000 |
1000 rows × 9 columns
IV_df.groupby("complierStatus")["encourEffect"].aggregate(np.mean)
complierStatus Always Taker 0.000000 Complier 4.062673 Never Taker 0.000000 Name: encourEffect, dtype: float64
#actual attendance outcome
IV_df["T"] = d1*z + d0
#observed outcome
IV_df["y"] = y0*(1-z) + y1*(z)
Now that we have created our data set, let's consider the perspective of the analyst. They clearly do not get to see the counterfactuals (y0, y1, T0, T1); they only get to see whether a student attended (T), the score on the test (y) and whether the student was randomly part of the group that was encouraged (z).
a.) Start with a naive regression of y on T ($y_i = \alpha + \beta T_i + \epsilon_i$). What are the results? Are they biased? Explain.
b.) Use two-stage least squares / instrumental variables to estimate the effect of T on y.
(You might want to see the two stage least squares estimation tool in statsmodels)
c.) Discuss why instrumental variables worked in this case / the assumptions necessary for IV to work.
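For intuition on b.) and c.), the simplest IV estimator with a binary instrument is the Wald estimator: the effect of z on y divided by the effect of z on T. The sketch below uses freshly simulated toy data (not IV_df), with an unobserved confounder u and a true treatment effect of 4:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10000

z = rng.integers(0, 2, n)               # randomized encouragement
u = rng.normal(0, 1, n)                 # unobserved confounder
# attendance responds to both encouragement and the confounder
T = ((0.8 * z + u) > 0.5).astype(float)
# outcome: true effect of T is 4; u also raises y directly
y = 4 * T + 2 * u + rng.normal(0, 1, n)

# naive OLS slope of y on T is biased upward (T is correlated with u)
naive = np.polyfit(T, y, 1)[0]

# Wald / IV estimate: ratio of the two intention-to-treat effects
wald = (y[z == 1].mean() - y[z == 0].mean()) / (T[z == 1].mean() - T[z == 0].mean())
print(naive, wald)  # naive well above 4, wald close to 4
```

The ratio works here because z is randomized (independent of u), affects y only through T, and shifts attendance for some individuals - the same assumptions you are asked to discuss in c.).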
Consider a regression you have run in a previous lab.
Discuss the extent to which you can consider the results causal.
Can you apply any of the methods discussed in the lab or in ROS in order to better approximate a causal result?
If possible, apply methods in order to get closer to a causal effect (for example: including extra covariates, fixed effects, matching, regression discontinuity, instrumental variables). If not possible, discuss why not.