Describe the randomization model and some of its advantages and limitations. State the strong null hypothesis for the randomization model (the Neyman model) and the typical weak null hypothesis.
For the randomization model, state some pros and cons of the Wilcoxon rank-sum test versus a permutation test based on the sample mean or on the permutation distribution of the $t$ statistic, compared with the parametric two-sample Student $t$-test. For each test, state the null hypothesis under which its nominal $P$-value is its true $P$-value.
Show mathematically that the Wilcoxon rank-sum test (which uses the sum of the ranks of the treatment group as the test statistic) is equivalent to the Mann-Whitney test (which uses the number of (control, treatment) pairs for which the control response is less than the treatment response as the test statistic), in the sense that they reject the null for the same data sets. (It's enough to show that their relationship is monotonic.) You may assume that there are no ties in the data, so the (mid-)ranks are $\{1, \ldots, n\}$.
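For reference, one route through the algebra, writing $m$ for the number of treatment subjects: let $r_{(1)} < \cdots < r_{(m)}$ be the sorted ranks of the treatment responses among all $n$ observations, so the rank-sum statistic is $W = \sum_{i=1}^m r_{(i)}$. The number of control responses below the $i$-th smallest treatment response is its rank minus the number of treatment responses at or below it, namely $r_{(i)} - i$. Summing over $i$ gives the Mann-Whitney count:

$$U = \sum_{i=1}^m \left( r_{(i)} - i \right) = W - \frac{m(m+1)}{2}.$$

Since $U$ and $W$ differ by a constant, each is a strictly increasing function of the other.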
Consider an experiment involving 9 subjects, 5 assigned at random to treatment and 4 to control. We want to test the null hypothesis that "treatment makes no difference" at significance level 10%. For each individual, we measure a quantitative response. Think about the omnibus alternative, about the one-sided shift alternative that treatment increases the response, and about the two-sided shift alternative that treatment increases or decreases the response. Consider four tests: the Wilcoxon rank-sum test (using mid-ranks for ties), a permutation test based on the difference in the sample means for the control and the treatment groups, the Smirnov test, and a 2-sample $t$-test based on the difference in the sample means for the control and treatment groups.
Explain the assumptions of each test, including a precise statement of the null hypothesis under which the nominal significance level is the true significance level.
Which test would you use in this situation, absent any additional information about the nature of the experiment? Why?
Find or estimate by simulation the power of each test against the one-sided shift alternative that treatment increases each individual's response by 1 unit. If the power depends on additional unspecified features of the problem, such as the baseline responses of the subjects, explain what those features are, and estimate the power for a few different values of those features.
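A minimal simulation sketch for the permutation test based on the difference in sample means, assuming hypothetical baseline responses (the `baseline` values below are illustrative choices, not part of the problem statement); the other tests can be dropped in by swapping the $P$-value function:

```python
import itertools

import numpy as np

def perm_pvalue_mean_diff(treat, ctrl):
    """Exact one-sided permutation P-value for mean(treat) - mean(ctrl),
    enumerating all assignments of the pooled data to a treatment group
    of the observed size."""
    pooled = np.concatenate([treat, ctrl])
    n, m = len(pooled), len(treat)
    observed = treat.mean() - ctrl.mean()
    stats = []
    for idx in itertools.combinations(range(n), m):
        mask = np.zeros(n, dtype=bool)
        mask[list(idx)] = True
        stats.append(pooled[mask].mean() - pooled[~mask].mean())
    return np.mean(np.array(stats) >= observed)

def estimate_power(baseline, shift=1.0, m=5, alpha=0.10, reps=1000, seed=12345):
    """Estimate power against the shift alternative: assign m subjects to
    treatment at random, add `shift` to their responses, and record how
    often the permutation test rejects at level alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        perm = rng.permutation(len(baseline))
        treat = baseline[perm[:m]] + shift
        ctrl = baseline[perm[m:]]
        rejections += perm_pvalue_mean_diff(treat, ctrl) <= alpha
    return rejections / reps

# the power depends on the baseline responses; vary their spread to see how
for scale in [0.5, 1.0, 2.0]:
    baseline = np.arange(9.0) * scale   # hypothetical baseline responses
    print(scale, estimate_power(baseline))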
For one-sided versions of the Wilcoxon rank-sum test, the permutation test using the sample mean, and the $t$-test, and for the Smirnov test, find the (nominal) $P$-values for the following hypothetical data, and the power against a shift alternative that treatment increases the response by one unit:
treatment | 1 | 2 | 3 | 3 | 4 |
---|---|---|---|---|---|
control | 0 | 1.5 | 2.5 | 3.5 |
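A sketch of the exact enumeration for the one-sided Wilcoxon rank-sum version, using `scipy.stats.rankdata` for mid-ranks; the mean-difference and Smirnov versions follow the same pattern with a different test statistic:

```python
import itertools

import numpy as np
from scipy.stats import rankdata

treatment = [1, 2, 3, 3, 4]
control = [0, 1.5, 2.5, 3.5]
pooled = np.array(treatment + control, dtype=float)
ranks = rankdata(pooled)              # mid-ranks for the tied 3s
n, m = len(pooled), len(treatment)

observed = ranks[:m].sum()            # rank sum of the treatment group
stats = np.array([ranks[list(idx)].sum()
                  for idx in itertools.combinations(range(n), m)])

# one-sided: treatment is claimed to increase the response, so large
# rank sums are the extreme ones
p_value = np.mean(stats >= observed)
print(p_value)   # exact over all 9-choose-5 = 126 equally likely assignments
```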
Consider the Smirnov test for an experiment involving 5 subjects, 2 assigned at random to treatment and 3 to control.
Enumerate all possible values of the test statistic and their probabilities under the strong null hypothesis of no treatment effect, assuming no ties among the data.
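A sketch of the enumeration, using the ranks $1, \ldots, 5$ as stand-in data (with no ties, the Smirnov statistic depends only on the relative order of the responses) and exact arithmetic via `Fraction`:

```python
import itertools
from collections import Counter
from fractions import Fraction

def smirnov(treat, ctrl):
    """Max absolute difference between the two empirical CDFs, evaluated
    at the pooled data points (exact arithmetic via Fractions)."""
    def ecdf(sample, x):
        return Fraction(sum(v <= x for v in sample), len(sample))
    return max(abs(ecdf(treat, x) - ecdf(ctrl, x))
               for x in list(treat) + list(ctrl))

ranks = range(1, 6)   # no ties, so only the relative order matters
dist = Counter()
for treat in itertools.combinations(ranks, 2):
    ctrl = tuple(r for r in ranks if r not in treat)
    dist[smirnov(treat, ctrl)] += Fraction(1, 10)   # 5-choose-2 = 10 assignments

for value, prob in sorted(dist.items()):
    print(value, prob)
```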
Now suppose that the observed responses are as follows. Treatment: 1, 2. Control: 0, 2, 4. Estimate the probabilities in the previous part by simulation, using 10,000 replications. Implement the simulation two ways: one by taking the first $n$ elements of a pseudorandom permutation (the PIKK algorithm) and the other by directly constructing pseudorandom samples. Provide unit tests and a coverage report for the functions you write.
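A sketch of the two sampling routes, with PIKK implemented as "permute the indices and keep the first $n$" and the direct route via `numpy.random.Generator.choice` without replacement; the seed and function names are illustrative, and the required unit tests (e.g., with `pytest`) and coverage report (e.g., with `coverage.py`) are left to you:

```python
import numpy as np

def sample_pikk(n_pool, n, rng):
    """Simple random sample of n indices: permute all the indices and
    keep the first n (the PIKK algorithm)."""
    return rng.permutation(n_pool)[:n]

def sample_direct(n_pool, n, rng):
    """Simple random sample of n indices, drawn directly without replacement."""
    return rng.choice(n_pool, size=n, replace=False)

def smirnov(treat, ctrl):
    """Max absolute difference between the two empirical CDFs (float
    version of the helper sketched in the enumeration above)."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    return max(abs(ecdf(treat, x) - ecdf(ctrl, x))
               for x in list(treat) + list(ctrl))

def simulate_null(treat, ctrl, sampler, reps=10_000, seed=20240401):
    """Simulated null distribution of the Smirnov statistic under random
    re-assignment of the pooled responses to groups of the observed sizes."""
    rng = np.random.default_rng(seed)
    pooled = np.array(list(treat) + list(ctrl), dtype=float)
    stats = []
    for _ in range(reps):
        idx = sampler(len(pooled), len(treat), rng)
        mask = np.zeros(len(pooled), dtype=bool)
        mask[idx] = True
        stats.append(smirnov(pooled[mask], pooled[~mask]))
    return np.array(stats)

for sampler in (sample_pikk, sample_direct):
    stats = simulate_null([1, 2], [0, 2, 4], sampler)
    values, counts = np.unique(np.round(stats, 6), return_counts=True)
    print(sampler.__name__, dict(zip(values, counts / len(stats))))
```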
Calculate the theoretical standard error of the simulated probabilities, on the assumption that the PRNG generates genuine uniform random numbers.
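For reference: if the replications were iid uniform draws, each replication would hit a given value with a fixed probability $p$, so the number of hits in $N = 10^4$ replications would be Binomial$(N, p)$ and the simulated probability $\hat{p} = \mbox{hits}/N$ would have standard error

$$\operatorname{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{N}}.$$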
If the simulations were truly random and uniform, what would be the joint distribution of the number of times the test statistic takes each of its possible values in the simulation? (Identify the parametric distribution and its parameters.)
Sketch how you would test the hypothesis that the true probabilities are equal to the values you calculated in the first part of this question using the empirical frequencies you observed in the second part.
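One standard route is the multinomial chi-square goodness-of-fit test; a sketch using `scipy.stats.chisquare`, with placeholder counts and probabilities (substitute the exact probabilities from your enumeration and the frequencies from your simulation):

```python
from scipy.stats import chisquare

# placeholder counts of each possible value of the Smirnov statistic in
# 10,000 replications, and placeholder null probabilities -- substitute
# the values from your enumeration and simulation
observed = [980, 3020, 3990, 2010]
expected_probs = [0.1, 0.3, 0.4, 0.2]
reps = sum(observed)

stat, p = chisquare(observed, f_exp=[reps * q for q in expected_probs])
print(stat, p)   # a small p would cast doubt on the PRNG's uniformity
```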
What software package did you use for the simulations? What is its default algorithm for calculating pseudo-random numbers? What is the period of that pseudo-random number generator? What is the largest number of objects for which that generator can give you all permutations? Is there an option in the package to use a better pseudo-random number generator? If so, which one? What algorithm does the package use to generate permutations? What algorithm does the package use to generate random samples?
Give two statistical interpretations and theoretical justifications for using $\mbox{hits}/\mbox{reps}$ as the simulation $P$-value and two for using $(\mbox{hits}+1)/(\mbox{reps}+1)$ as the simulation $P$-value. Which do you prefer? Why?
Give three real-world examples of a scientific null hypothesis that can be expressed as the invariance of a probability model for the data under the action of some group. In each case, identify the "scientific" invariance and the corresponding group invariance for the data. Explain how you could use that invariance to test the scientific hypotheses without knowing anything else about the probability model for the data.
Rosenzweig, M.R., E.L. Bennett, and M.C. Diamond, 1972. Brain changes in response to experience, Scientific American, 226, 22–29, report an experiment in which 11 triples of male rats, each triple from the same litter, were assigned at random to three different environments, "enriched" (E), standard, and "impoverished."
After 65 days, they euthanized the rats and measured their cortical masses (mg). Here are the results:
enriched | 689 | 656 | 668 | 660 | 679 | 663 | 664 | 647 | 694 | 633 | 653 |
---|---|---|---|---|---|---|---|---|---|---|---|
impoverished | 657 | 623 | 652 | 654 | 658 | 646 | 600 | 640 | 605 | 635 | 642 |
difference | 32 | 33 | 16 | 6 | 21 | 17 | 64 | 7 | 89 | -2 | 11 |
You may either construct a test that generates permutations at random, or write code that enumerates all permutations (since there are only $2^{11} = 2048$ possible, equally likely, data sets in this case). If you choose to enumerate all permutations, you can use the `itertools` library.
Provide unit tests and coverage reports for the functions you write.
Explain whether a 1-sided or 2-sided test is appropriate.
Under the Neyman model, find the $P$-value of the strong null hypothesis that the treatment makes no difference whatsoever.
Use $10^4$ random assignments or enumerate all possible assignments. Your code should work in the general case (different data, different number of litters), not only the particular set of 22 numbers given. That said, for reference, the $P$-value for a 2-sided test should be $1/2^9$ for these data, so you can check your answer.
For your convenience, the data are below as a list of lists:
```python
brain_wts = [[689, 656, 668, 660, 679, 663, 664, 647, 694, 633, 653],
             [657, 623, 652, 654, 658, 646, 600, 640, 605, 635, 642]]
```
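A minimal enumeration sketch of the two-sided test under the strong null, assuming `brain_wts` as defined just above; under that null each litter's difference is equally likely to take either sign, giving $2^{11} = 2048$ equally likely outcomes, and the sketch reproduces the $1/2^9$ check:

```python
import itertools

import numpy as np

def sign_flip_pvalue(pairs):
    """Exact 2-sided P-value of the strong null for paired data, using the
    sum of within-pair differences as the test statistic and enumerating
    all 2**k equally likely sign assignments."""
    diffs = np.array(pairs[0], dtype=float) - np.array(pairs[1], dtype=float)
    observed = abs(diffs.sum())
    hits = sum(abs(np.dot(signs, diffs)) >= observed
               for signs in itertools.product([-1, 1], repeat=len(diffs)))
    return hits / 2 ** len(diffs)

print(sign_flip_pvalue(brain_wts))   # 0.001953125 == 1/2**9 for these data
```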