Simulating and Estimating Spillovers in Stata

This notebook simulates and then estimates spillover effects where potential outcomes are described by the function $$ y(\tau_{i,t}, v(\tau_{-i,t})) $$

for firm-level treatment $\tau_{i,t}$. Outcomes depend on peer firm treatment status through the scalar function $v(\tau_{-i,t})$, where $\tau_{-i,t}$ denotes treatments outside of the focal firm. The scalar function was first devised by Hong and Raudenbush (2006 JASA) to address the dimensionality of $\tau_{-i,t}$ (the function $y(\tau_{i,t},\tau_{-i,t})$ has $2^N$ potential outcomes). The scalar function $v(\cdot)$ is specified as $$ v(\tau_{-i,t}) := \frac{1}{|C|}\sum_{i' \in C} \tau_{i',t} $$ for $C := \{i'\mid\text{Ind. }i'=\text{Ind. }i\}$. This follows Ferracci, Jolivet, and van den Berg (2014 REStat). As shorthand, we will write $v(\tau_{-i,t})$ as $\rho_{j,t}$, where $j$ indexes the industry.
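Strictly, $v(\tau_{-i,t})$ is a mean over peers only, while the `egen` mean used in the cells below includes the focal firm; with 50 firms per industry the difference is negligible. A leave-one-out version could be computed as follows — a sketch using the `tau`, `j`, and `t` variables constructed below, with `tau_sum`, `n_jt`, and `rho_loo` as hypothetical names:

```stata
// leave-one-out peer treatment intensity: excludes firm i itself
bys j t: egen tau_sum = total(tau)   // treated count in the industry-quarter
bys j t: gen n_jt = _N               // number of firms in the industry-quarter
gen rho_loo = (tau_sum - tau)/(n_jt - 1)
```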

We begin by constructing a panel with $40$ industries, $50$ firms per industry, and $40$ quarters of data. A handful of true $\beta$ parameters are also specified.

In [1]:
// clear everything from Stata's memory
clear

// create industries
qui set obs 40
gen j = _n

// for each industry, create a set of firms
qui expand 50
bys j: gen i = j*100+_n

// for each industry-firm, create a set of time periods
qui expand 40
bys j i: gen t = _n

// true model parameters
scalar b0 = 0
scalar b1 = -3
scalar b2 = 2
scalar b3 = -2

Exogenous Treatment Example

Suppose first that treatment assignment is random and that spillover effects are linear. This implies a model $$ y_{i,t} = \beta_0 + \beta_1\tau_{i,t} + \beta_2\tau_{i,t}\rho_{j,t} + \beta_3(1-\tau_{i,t})\rho_{j,t} + \epsilon_{i,t} $$

In [2]:
// reproducibility
set seed 0

// create firm and industry treatment
gen tau = rbinomial(1, .05)
bys j t: egen rho = mean(tau)

// create a true model
gen u = rnormal()
gen y = b0 + b1*tau + b2*1.tau*rho + b3*0.tau*rho + u

When treatment is exogenous and spillover effects are linear in $\rho$, treatment effects are easy to estimate.

In [3]:
reg y tau i.tau#c.rho
      Source |       SS           df       MS      Number of obs   =    80,000
-------------+----------------------------------   F(3, 79996)     =   9892.12
       Model |  29701.2931         3  9900.43102   Prob > F        =    0.0000
    Residual |  80063.2094    79,996  1.00084016   R-squared       =    0.2706
-------------+----------------------------------   Adj R-squared   =    0.2706
       Total |  109764.502    79,999  1.37207343   Root MSE        =    1.0004

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         tau |  -2.952744   .0389518   -75.80   0.000    -3.029089   -2.876398
             |
   tau#c.rho |
          0  |  -2.218683   .1155108   -19.21   0.000    -2.445084   -1.992283
          1  |   .8730354   .4987173     1.75   0.080    -.1044473    1.850518
             |
       _cons |   .0107455   .0066978     1.60   0.109    -.0023821     .023873
------------------------------------------------------------------------------

The Stable Unit Treatment Value Assumption (SUTVA) imposes the condition that $y(\tau_{i,t},\tau_{-i,t})=y(\tau_{i,t})$ so that there are no spillover effects. This implies the model $$ y_{i,t} = \theta_0 + \theta_1\tau_{i,t} + u_{i,t} $$

Theoretically, $\hat{\theta}_1$ converges to $\beta_1 + (\beta_2-\beta_3)\bar{\rho}$, where $\bar{\rho}$ is the sample mean of $\rho_{j,t}$. Within the simulated sample, the observed bias is close to this asymptotic bias.
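This follows from taking expectations of the true model within each treatment group; under random assignment, $E[\rho_{j,t}\mid\tau_{i,t}=1]\approx E[\rho_{j,t}\mid\tau_{i,t}=0]\approx\bar{\rho}$, so the short regression's slope picks up the differential spillover term:

```latex
\begin{aligned}
E[y\mid\tau=1] - E[y\mid\tau=0]
  &= \beta_1 + \beta_2\, E[\rho\mid\tau=1] - \beta_3\, E[\rho\mid\tau=0] \\
  &\approx \beta_1 + (\beta_2 - \beta_3)\,\bar{\rho}
\end{aligned}
```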

In [4]:
reg y tau

qui su rho
di "Theoretical bias `=(b2-b3)*`r(mean)''"
di "Observed bias `=_b[tau]-b1'"
      Source |       SS           df       MS      Number of obs   =    80,000
-------------+----------------------------------   F(1, 79998)     =  29169.45
       Model |  29328.9848         1  29328.9848   Prob > F        =    0.0000
    Residual |  80435.5176    79,998  1.00546911   R-squared       =    0.2672
-------------+----------------------------------   Adj R-squared   =    0.2672
       Total |  109764.502    79,999  1.37207343   Root MSE        =    1.0027

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         tau |  -2.783437   .0162974  -170.79   0.000    -2.815379   -2.751494
       _cons |  -.0973879   .0036369   -26.78   0.000    -.1045162   -.0902596
------------------------------------------------------------------------------


Theoretical bias .1991999967209996

Observed bias .2165633739491448

Endogenous Treatment Example

Now suppose that treatment assignment is endogenous. In particular, there is an industry-level covariate $w$ with which a firm-level covariate $x$ is correlated. We allow the firm-level covariate $x$ to predict treatment status.

We also let $x$ affect outcomes directly through the error term (with a small coefficient); this direct channel is what makes treatment assignment confounded rather than merely correlated with $x$.

In [5]:
// reproducibility
set seed 0

// hold on only to the panel identifiers
keep i j t

// create dummy for unique industry-quarters
bys j t: gen _first = _n == 1

// generate industry-quarter data
qui gen w = runiform(1, 3) if _first == 1
sort j t i
qui by j t: carryforward w, replace   // user-written: ssc install carryforward

// generate firm data
gen x = rnormal(w, 1)

// simulate treatment
gen pr = 1/(1+exp(2*x))

// create treatment dummy
qui gen tau = rbinomial(1,pr)

// rbinomial yields some missing values for pr very close to zero or one
qui replace tau = 0 if mi(tau) & pr < .5
qui replace tau = 1 if mi(tau) & pr >= .5

// industry treatment intensity
bys j t: egen rho = mean(tau)

// generate outcome of interest
gen u = rnormal() + .05*x
gen y = b0 + b1*tau + b2*1.tau*rho + b3*0.tau*rho + u





(39,503 missing values generated)

(39,503 missing values generated)





(39,503 missing values generated)

(39,503 missing values generated)

As one would expect, the model that we ran before is biased when endogenous treatment assignment goes unaddressed.

In [6]:
// without addressing endogeneity
reg y tau i.tau#c.rho
      Source |       SS           df       MS      Number of obs   =    40,497
-------------+----------------------------------   F(3, 40493)     =   4911.02
       Model |  14675.0009         3  4891.66695   Prob > F        =    0.0000
    Residual |  40333.4624    40,493  .996060118   R-squared       =    0.2668
-------------+----------------------------------   Adj R-squared   =    0.2667
       Total |  55008.4632    40,496  1.35836782   Root MSE        =    .99803

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         tau |  -3.071996   .0404585   -75.93   0.000    -3.151296   -2.992697
             |
   tau#c.rho |
          0  |   -2.02764   .0225631   -89.87   0.000    -2.071864   -1.983415
          1  |   2.036824   .0798152    25.52   0.000     1.880384    2.193263
             |
       _cons |   .1169907   .0099473    11.76   0.000     .0974938    .1364876
------------------------------------------------------------------------------

The simplest approach is to use regression adjustment and include $x$ as a covariate.

In [7]:
// regression adjustment
reg y tau i.tau#c.rho x
      Source |       SS           df       MS      Number of obs   =    40,497
-------------+----------------------------------   F(4, 40492)     =   3713.66
       Model |  14763.8806         4  3690.97015   Prob > F        =    0.0000
    Residual |  40244.5826    40,492  .993889722   R-squared       =    0.2684
-------------+----------------------------------   Adj R-squared   =    0.2683
       Total |  55008.4632    40,496  1.35836782   Root MSE        =    .99694

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         tau |   -2.99079   .0413167   -72.39   0.000    -3.071772   -2.909809
             |
   tau#c.rho |
          0  |  -2.006762   .0226464   -88.61   0.000    -2.051149   -1.962374
          1  |   2.053711   .0797482    25.75   0.000     1.897403     2.21002
             |
           x |   .0452351   .0047835     9.46   0.000     .0358594    .0546108
       _cons |   .0105126   .0150171     0.70   0.484    -.0189213    .0399465
------------------------------------------------------------------------------

Now suppose that spillover effects are nonlinear in $\rho$.

In this scenario, regression adjustment alone isn't sufficient.

In [8]:
// nonlinear spillovers
replace y = b0 + b1*tau + b2*1.tau*log(.1+rho) + b3*0.tau*log(.1+rho) + u

// regression adjustment
reg y tau i.tau#c.rho x
(40,497 real changes made)


      Source |       SS           df       MS      Number of obs   =    40,497
-------------+----------------------------------   F(4, 40492)     =  36948.53
       Model |  159000.108         4  39750.0271   Prob > F        =    0.0000
    Residual |  43562.1663    40,492  1.07582155   R-squared       =    0.7849
-------------+----------------------------------   Adj R-squared   =    0.7849
       Total |  202562.275    40,496  5.00203167   Root MSE        =    1.0372

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         tau |  -9.582966   .0429859  -222.93   0.000    -9.667219   -9.498712
             |
   tau#c.rho |
          0  |  -4.528388   .0235614  -192.20   0.000    -4.574568   -4.482207
          1  |   4.051024   .0829701    48.83   0.000     3.888401    4.213648
             |
           x |   .0640858   .0049767    12.88   0.000     .0543313    .0738403
       _cons |   3.418386   .0156238   218.79   0.000     3.387763     3.44901
------------------------------------------------------------------------------

Begin by estimating each firm's probability of treatment.

Ideally, this would be done industry by industry or industry-quarter by industry-quarter. To save computation time, I'll fit a single pooled logit. This is fine here because the treatment propensity simulation above does not use industry-specific parameters to calculate treatment likelihoods.
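For reference, the industry-by-industry version would look something like the following sketch (`phat_j` and `ptmp` are hypothetical variable names; groups where the logit fails to converge are skipped):

```stata
// per-industry propensity models (slower than the pooled logit)
qui gen phat_j = .
qui levelsof j, local(jlevels)
foreach jj of local jlevels {
    cap qui logit tau x if j == `jj'
    if _rc == 0 {
        qui predict ptmp if e(sample), pr
        qui replace phat_j = ptmp if j == `jj'
        qui drop ptmp
    }
}
```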

In [9]:
// logit
logit tau x

// store probabilities
predict phat, pr

// store regression weight
gen wt = tau/phat + (1-tau)/(1-phat)
Iteration 0:   log likelihood = -11341.123  
Iteration 1:   log likelihood = -8458.2127  
Iteration 2:   log likelihood = -7160.0203  
Iteration 3:   log likelihood = -7088.6004  
Iteration 4:   log likelihood = -7088.2951  
Iteration 5:   log likelihood = -7088.2951  

Logistic regression                                    Number of obs =  40,497
                                                       LR chi2(1)    = 8505.66
                                                       Prob > chi2   =  0.0000
Log likelihood = -7088.2951                            Pseudo R2     =  0.3750

------------------------------------------------------------------------------
         tau | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |   -1.99339   .0298281   -66.83   0.000    -2.051852   -1.934928
       _cons |  -.0128757   .0311895    -0.41   0.680    -.0740061    .0482546
------------------------------------------------------------------------------

(39,503 missing values generated)

(39,503 missing values generated)

Impose overlap (common support) with a min-max rule on the estimated treatment probabilities: weights are dropped wherever $\hat{p}$ lies outside the range observed in both treatment groups.

In [10]:
// overlap
qui su phat if tau == 0
local min0 = `r(min)'
local max0 = `r(max)'
qui su phat if tau == 1
local min1 = `r(min)'
local max1 = `r(max)'
replace wt = . if phat < max(`min0',`min1') | phat > min(`max0',`max1')





(4,770 real changes made, 4,770 to missing)

Now, within a particular industry-quarter, the treatment intensity $\rho$ is fixed. This allows us to freely estimate $E[y(\tau=1,\rho)|\rho]$ and $E[y(\tau=0,\rho)|\rho]$. We will include $x$ as a covariate in the industry-quarter by industry-quarter regressions so that the expected outcomes we estimate are IPWRA estimators and thus doubly robust.

In [11]:
// within-group expected outcomes
qui gen y1 = .
qui gen y0 = .
egen grp = group(j t)
qui levelsof grp, local(glevels)
foreach g of local glevels{
    qui{
    cap reg y tau x [pw = wt] if grp == `g'
    if _rc ~= 0{
        continue
    }
    gen tau_orig = tau
    replace tau = 1
    predict y1tmp if e(sample)
    replace tau = 0
    predict y0tmp if e(sample)
    replace tau = tau_orig
    replace y1 = y1tmp if grp == `g'
    replace y0 = y0tmp if grp == `g'
    drop tau_orig y*tmp
    }
}
// mean group outcomes
bys grp: egen Ey0 = mean(y0)
bys grp: egen Ey1 = mean(y1)




(2,700 missing values generated)

(2,700 missing values generated)

To estimate the dose response functions $E[y(1,\rho)]$ and $E[y(0,\rho)]$ over industry-quarter level data, one could use either $\rho$ itself as the continuous treatment variable or a standardized transformation of it. In the former case, because $\rho$ is bounded on $[0,1]$ and likely to be skewed, special care needs to be taken in specifying the link function used in the dose response estimation. In practice it is more effective to de-skew and standardize $\rho$ (done below) and then un-standardize it post-estimation, since the usual Gaussian distribution with identity link then performs well.
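Concretely, the lnskew0 transformation applied in the next cell, and the inverse used post-estimation when plotting, are:

```latex
z = \frac{\ln(-\rho - k) - \mu}{\sigma},
\qquad
\rho = -k - \exp(\sigma z + \mu)
```

where $k$ is the shift chosen by lnskew0 (here $k \approx -4.43$, so $-\rho-k > 0$ over $\rho \in [0,1]$) and $\mu$, $\sigma$ are the mean and standard deviation of the de-skewed variable before standardization.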

In [12]:
// normalize rho
su rho if _first, d

lnskew0 rho_deskewed = rho
scalar lnskew0_k = `r(gamma)'
qui su rho_deskewed, d
scalar lnskew_sd = `r(sd)'
scalar lnskew_mu = `r(mean)'
replace rho_deskewed = (rho_deskewed-lnskew_mu)/lnskew_sd

su rho_deskewed if _first, d
                             rho
-------------------------------------------------------------
      Percentiles      Smallest
 1%          .04              0
 5%           .1              0
10%          .16              0       Obs               1,600
25%           .3              0       Sum of wgt.       1,600

50%          .54                      Mean             .53455
                        Largest       Std. dev.      .2712384
75%          .76            .98
90%           .9            .98       Variance       .0735703
95%          .96              1       Skewness      -.0927423
99%          .98              1       Kurtosis       1.900762



       Transform |          k    [95% conf. interval]       Skewness
-----------------+--------------------------------------------------
      ln(-rho-k) |  -4.434525      (not calculated)        -.0000362





(80,000 real changes made)


                      ln(-rho+4.434525)
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -1.709036      -1.792517
 5%    -1.626036      -1.792517
10%    -1.379873      -1.709036       Obs               1,600
25%    -.8213747      -1.709036       Sum of wgt.       1,600

50%     .0146534                      Mean           4.94e-09
                        Largest       Std. dev.      1.000306
75%     .8744471       1.881574
90%     1.353232       1.881574       Variance       1.000613
95%     1.553642       1.881574       Skewness      -.0000364
99%     1.751296       1.881574       Kurtosis       1.888488

We can now estimate a dose response function for $E[y(0,\rho)]$ with covariate $w$, using the user-written drf command. Recall that $x$ predicts firm-level treatment and is correlated with $w$, so $w$ should predict industry-quarter treatment intensity $\rho$. To speed things along, I've specified flag(0) to turn off the covariate balancing tests (since we know that, by construction, $w$ is sufficient to balance over $\rho$).

In [13]:
// drf on y0
qui{
cap restore
preserve
keep if _first == 1
drf w, outcome(Ey0) treatment(rho_deskewed) method(iwkernel) npoints(100) common(1) flag(0)
matrix x = e(tvec)
local npoints `=rowsof(x)'
matrix y0_hat = e(b)
mat y0_hat = y0_hat[1,1..`npoints']'
svmat x, name(rho_points)
svmat y0_hat, name(y0_points)
replace rho_points1 = -1*lnskew0_k-exp(rho_points1*lnskew_sd+lnskew_mu)
gen ytrue = b0 + b3*log(.1+rho_points1)
twoway (scatter y0_points rho_points1) (scatter ytrue rho_points1)
restore
}
[graph: estimated $E[y(0,\rho)]$ plotted against the true dose response $\beta_0 + \beta_3\log(.1+\rho)$]

We do the same for $E[y(1,\rho)]$.

In [14]:
// drf on y1
qui{
preserve
keep if _first == 1
drf w, outcome(Ey1) treatment(rho_deskewed) method(iwkernel) npoints(100) common(1) flag(0)
matrix x = e(tvec)
local npoints `=rowsof(x)'
matrix y1_hat = e(b)
mat y1_hat = y1_hat[1,1..`npoints']'
svmat x, name(rho_points)
svmat y1_hat, name(y1_points)
replace rho_points1 = -1*lnskew0_k-exp(rho_points1*lnskew_sd+lnskew_mu)
gen ytrue = b0 + b1 + b2*log(.1+rho_points1)
twoway (scatter y1_points rho_points1) (scatter ytrue rho_points1)
restore
}
[graph: estimated $E[y(1,\rho)]$ plotted against the true dose response $\beta_0 + \beta_1 + \beta_2\log(.1+\rho)$]

The estimated effects for $\tau=0$ and for $\tau=1$ firms line up very well with their true values. The only failure is in the tails of the distribution of $\rho$. Given that the dose response function is estimated nonparametrically, failures near the edges of the distribution are expected.