This problem set is about the Rigdon and Hudgens (2015) and Li and Ding (2016) approaches to constructing a confidence interval for the average treatment effect $\bar{\tau}$ in binary experiments with binary outcomes.
The notebook includes python implementations of building blocks, including `N_generator` and `filter_table`.
In this assignment, you will use those building blocks, together with functions you write, to implement the Li and Ding method.
Provide unit tests (using `pytest`) and test coverage reports for all the functions you implement in this assignment.
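For instance, a minimal pytest layout might look like the following sketch (the file name and the function under test are hypothetical, not part of the provided notebook):

```python
# test_building_blocks.py -- illustrative pytest file; names are hypothetical
import numpy as np

def tau_hat(n):
    """Difference in observed response proportions, treatment minus control."""
    n = np.asarray(n)
    return n[1, 1] / n[1].sum() - n[0, 1] / n[0].sum()

def test_tau_hat_is_zero_when_groups_match():
    assert tau_hat([[2, 2], [2, 2]]) == 0.0

def test_tau_hat_is_one_in_the_extreme_case():
    assert tau_hat([[4, 0], [0, 4]]) == 1.0
```

Coverage reports can be produced with the `pytest-cov` plugin, e.g. `pytest --cov=<your_module> --cov-report=term-missing`.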
Recall that the data $n_{wk}$ is the number of subjects assigned to treatment $w$ whose response was $k$, for $w, k \in \{0, 1\}$, and that $N_{ij}$, the summary population potential outcome table, is the number of subjects whose response to control would be $i$ and response to treatment would be $j$. The matrix $\mathbf{N}$ fully specifies the study population; the observed response matrix $\mathbf{n} := (n_{wk})_{w, k \in \{0, 1\}}$ represents the data that would be observed for a given assignment of subjects to treatment or control, $\{W_j\}_{j=1}^N$.
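To make the notation concrete, here is a toy population (invented purely for illustration) showing how subject-level potential outcomes determine $\mathbf{N}$, and how one assignment determines the observed table $\mathbf{n}$:

```python
import numpy as np

# toy population of N = 5 subjects; row = (response to control, response to treatment)
pot = np.array([[0, 1], [0, 0], [1, 1], [0, 1], [1, 0]])

# summary table: N[i, j] = # subjects with control response i and treatment response j
N = np.zeros((2, 2), dtype=int)
for i, j in pot:
    N[i, j] += 1
# N == [[1, 2], [1, 1]], so tau_bar = (N01 - N10) / 5 = (2 - 1) / 5 = 0.2

# one assignment: W[s] = 1 if subject s is treated; n[w, k] counts (assignment, response)
W = np.array([1, 0, 0, 1, 0])
n = np.zeros((2, 2), dtype=int)
for w, (i, j) in zip(W, pot):
    n[w, j if w else i] += 1   # only one potential response per subject is observed
# n == [[1, 2], [0, 2]]
```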
Write a function that simulates observed response tables from a hypothesized 2x2 summary potential outcome table $\mathbf{N}$,
taking as input the potential outcome table $\mathbf{N}$, the number $n$ of subjects to be assigned to the active treatment,
the number of replications, and an object of class `RandomState` used to generate the pseudo-random samples.
The function should generate observed response tables $\mathbf{n}$ from pseudo-random samples of subjects' responses,
randomly allocating $n$ subjects to treatment and the other $m = N - n$ to control.
The pseudo-random sampling could be implemented in many ways, including "exploding" the full outcome table
$\mathbf{N}$ into a 2x$N$ table with a row for each subject, as described in the chapter.
To save memory, however, implement it in a way that does not require explicitly storing all $2N$ responses:
construct the data table "directly" from the 2x2 summary table $\mathbf{N}$, without ever listing all
the theoretical or sampled outcomes.
Justify your choice of random sampling algorithm.
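One memory-frugal possibility (a sketch under my own design choices, not the required implementation) is to draw how the $n$ treated subjects split across the four potential-outcome categories with nested hypergeometric draws, which is equivalent to sampling $n$ subjects without replacement from the population:

```python
import numpy as np

def sim_tables(N, n, reps, prng):
    """Sketch: simulate observed tables n_{wk} from summary table N without
    exploding N into per-subject rows. prng is a numpy RandomState."""
    counts = np.asarray(N).ravel()          # [N00, N01, N10, N11]
    tables = []
    for _ in range(reps):
        # multivariate hypergeometric via sequential hypergeometric draws:
        # how many of the n treated subjects come from each category
        treated = np.zeros(4, dtype=int)
        draw = n
        for c in range(3):
            nbad = int(counts[c + 1:].sum())  # subjects left in later categories
            treated[c] = prng.hypergeometric(counts[c], nbad, draw) if draw > 0 else 0
            draw -= treated[c]
        treated[3] = draw
        control = counts - treated
        # a treated subject in category (i, j) responds j; a control responds i
        tab = np.array([[control[0] + control[1], control[2] + control[3]],
                        [treated[0] + treated[2], treated[1] + treated[3]]])
        tables.append(tab)
    return tables
```

Nested hypergeometric draws give exactly the distribution of simple random sampling without replacement, while storing only the four category counts.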
Write a function that tests the hypothesis that a summary potential outcome table $\mathbf{N}$ is (statistically) consistent with a table $\mathbf{n} = (n_{wk})$ of observed responses.
The test should allow an arbitrary test statistic $T$ to be passed as an argument.
The test statistic should take as input a 2x2 "data" table of observed responses and
a hypothesized table of potential outcomes $\mathbf{N}$, as well as a `RandomState` object
and a number of replications for the simulation.
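One shape such a simulation-based test might take (a sketch; for brevity this simulator explodes the table into per-subject rows, which the assignment's sampler should avoid):

```python
import numpy as np

def p_value(n_obs, N, T, reps, prng):
    """Sketch: estimated p-value for the null that summary table N generated
    the observed table n_obs. T(n, N) is the test statistic; larger values
    count as more extreme. prng is a numpy RandomState."""
    n_obs, N = np.asarray(n_obs), np.asarray(N)
    n_treated = int(n_obs[1].sum())
    # per-subject potential outcome pairs (control response, treatment response)
    pairs = np.array([(i, j) for i in (0, 1) for j in (0, 1)
                      for _ in range(N[i, j])])
    t_obs = T(n_obs, N)
    hits = 0
    for _ in range(reps):
        perm = prng.permutation(len(pairs))
        treated, control = pairs[perm[:n_treated]], pairs[perm[n_treated:]]
        # tally observed responses: controls reveal column 0, treated column 1
        sim = np.array([[np.sum(control[:, 0] == 0), np.sum(control[:, 0] == 1)],
                        [np.sum(treated[:, 1] == 0), np.sum(treated[:, 1] == 1)]])
        hits += T(sim, N) >= t_obs
    return hits / reps
```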
Write a function that constructs a confidence set for the average treatment effect, allowing an arbitrary test statistic to be passed to the function. The default test statistic should be $T = | \hat{\tau} - \bar{\tau}|$. Use simulations to estimate the tail probabilities. The function should take as input the number of iterations to use (default $10^4$), the confidence level (default 0.95), and a seed or instance of a `RandomState` object, defaulting to an instance of a numpy `RandomState` object initialized without specifying a seed. To improve efficiency, the function should only test whether a potential response table is consistent with the data if the average treatment effect for that table is greater than the largest ATE already in the confidence set or smaller than the smallest ATE already in the confidence set. (Bonus: consider re-using random samples across nulls to improve efficiency. What are the tradeoffs?)
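One way the inversion loop and the efficiency shortcut might be organized (a sketch; `p_value_fn` stands in for whatever test you implemented, and a real implementation would also skip tables that `filter_table` rules out as combinatorially inconsistent with the data):

```python
import numpy as np
from itertools import product

def ate_confidence_bounds(n_obs, p_value_fn, Ntot, level=0.95, prng=None):
    """Sketch: invert tests over all 2x2 summary tables with Ntot subjects.
    p_value_fn(n_obs, N, prng) returns an estimated p-value for table N."""
    prng = np.random.RandomState() if prng is None else prng
    alpha = 1 - level
    lo, hi = np.inf, -np.inf
    for a, b, c in product(range(Ntot + 1), repeat=3):
        d = Ntot - a - b - c
        if d < 0:
            continue
        N = np.array([[a, b], [c, d]])
        tau = (b - c) / Ntot               # tau_bar = (N01 - N10) / N
        if lo <= tau <= hi:
            continue                        # shortcut: tau already inside the set
        if p_value_fn(n_obs, N, prng) >= alpha:
            lo, hi = min(lo, tau), max(hi, tau)
    return lo, hi
```

The shortcut only helps extend the endpoints; it relies on reporting the interval $[\mathrm{lo}, \mathrm{hi}]$ rather than the (possibly non-convex) confidence set itself.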
Recall Sterne's method in the discussion of inverting tests to get confidence intervals for the binomial parameter $p$ and for the hypergeometric parameter $G$, and the difference between tests that exclude the same probability from both tails versus tests that minimize the number of outcomes in the acceptance region. How might you use those ideas to get tighter confidence intervals for the average treatment effect in binary experiments with binary outcomes? Implement three other test statistics and perform some simulations using different "true" response tables to compare the expected lengths of the confidence intervals produced by the different test statistics. Try to provide some intuition for when each method works well, and for which is best overall.
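For example, two candidate statistics (illustrative only; the studentized variant is my own suggestion, and you will need to choose and justify your own set of three) might be:

```python
import numpy as np

def abs_tau_diff(n, N):
    """The assignment's default statistic: |tau_hat - tau_bar|."""
    n, N = np.asarray(n), np.asarray(N)
    tau_bar = (N[0, 1] - N[1, 0]) / N.sum()
    tau_hat = n[1, 1] / n[1].sum() - n[0, 1] / n[0].sum()
    return abs(tau_hat - tau_bar)

def studentized_tau_diff(n, N):
    """Illustrative variant (an assumption, not from the assignment): the same
    difference divided by a plug-in standard-error estimate, so the reference
    distribution is more nearly comparable across nulls."""
    n = np.asarray(n)
    p1, p0 = n[1, 1] / n[1].sum(), n[0, 1] / n[0].sum()
    se = np.sqrt(p1 * (1 - p1) / n[1].sum() + p0 * (1 - p0) / n[0].sum())
    return abs_tau_diff(n, N) / max(se, 1 / n.sum())  # floor avoids division by 0
```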