#!/usr/bin/env python
# coding: utf-8

# # Tests using the "distance" between the empirical and the null

# 1. Show that the collection of all sets of the form $(-\infty, x] \times (-\infty, y]$
# is a Vapnik-Chervonenkis class (V-C class) over the plane.
# 
# 1. Show that if $\mathcal{V}$ is the set of all closed intervals in $\Re$,
# then the shatter coefficient is $m^{\mathcal{V}}(n) = 1 + n + {{n}\choose{2}}$.
# 
# 1. Show that intersections and finite unions of V-C classes are V-C classes. 
# 
# 1. Show that countable unions of V-C classes need not be V-C classes.
# 
# 1. Code up Romano's approach for testing whether a set of $k$ real-valued random variables is independent
# based on observing $n$ IID $k$-tuples of values, using group invariance (not the bootstrap approach). 
# That is, we observe $\{X_j\}_{j=1}^n$ where each $X_j = (X_{j1}, \ldots, X_{jk})$ takes values in $\Re^k$.
# The null hypothesis is that for each $j$, the components $\{X_{j1}, \ldots, X_{jk}\}$ are independent.
# Explain why you used the particular V-C class you chose.
# Is the relevant group of transformations for the hypothesis a finite or infinite group?
# As usual, provide unit tests and a coverage report for your code.
# 
# 1. Write a program to simulate data from $k$-variate normal distributions with different covariance matrices and apply the
# test you programmed in the previous question.
# Confirm that the level of the test is approximately correct by simulating from a multivariate normal distribution
# with a diagonal covariance matrix with various values of $k$ and $n$.
# Simulate the power of the test for level $\alpha = 0.05$ as a function of $\rho$
# for $k = 3$, $n=10$, $100$, and $1000$, and a covariance matrix of the form 
# \begin{equation}
# \Sigma = \left [ \begin{array}{ccc}
#     1 & \rho & 0\\
#     \rho & 1 & 0 \\
#     0 & 0 & 1 
#     \end{array}
#     \right ]
# \end{equation}
# for $\rho \in \{-1, -.75, -.5, -.25, .25, .5, .75, 1\}$.
# Provide unit tests and a coverage report for your code.
# 
# 1. The file https://www.stat.berkeley.edu/~stark/Java/Data/lomaPrieta.dat contains 221 observations of the times of putative aftershocks of the 17 October 1989 earthquake in Loma Prieta, California.
# There are 222 lines in the file.
# The first is 0, the main shock, which occurred at 4:15:43pm.
# The other lines are the times in days from the main event to the aftershocks,
# defined as earthquakes determined to have magnitude 3.0 and above, focal depth
# of 0–20 km, and epicenter within 40 km of the epicenter of the Loma Prieta earthquake.
# The data are from the UC Berkeley Seismographic Stations, courtesy of
# Dr. Bob Uhrhammer.  
# A common model for earthquakes ("main" shocks, not aftershocks) is that they are a spatially heterogeneous but temporally homogeneous Poisson
# process.
# If so, inter-event times have an exponential distribution, and conditional on the number $n$ of events in the time interval $[0, T]$, the times of the events are IID uniform.  
# Treat the time of the first event as 0, and let $T = 805$.
# Find the $P$-value of the hypothesis that the 222 events are a realization of a Poisson process for three tests:
#     + The Kolmogorov-Smirnov test that the inter-event times are exponentially distributed
#     + The Kolmogorov-Smirnov test that the times of the 221 events after the first are IID uniform on $[0, T]$
#     + A 2-sample permutation test that compares the number of events in the first half of the interval, $[0, T/2)$, to the number in the second half, $[T/2, T]$.  
# Provide unit tests and a coverage report.
# Comment on the test results.
# Which test would you recommend? Why? ("It gave the smallest $P$-value" isn't a good reason: selecting the
# test after peeking at the data introduces multiplicity and selection
# that are hard to take into account in the $P$-value.)
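# 
# For the closed-interval exercise, a brute-force count can serve as a numerical sanity check
# (not a proof) of the formula $1 + n + {{n}\choose{2}}$.
# The sketch below assumes the $n$ points are distinct; the function name `interval_subset_count`
# is hypothetical, introduced only for illustration.

```python
from math import comb


def interval_subset_count(points):
    """Count the distinct subsets of `points` picked out by closed intervals [a, b].

    ASSUMPTION: the points are distinct real numbers.
    """
    pts = sorted(points)
    subsets = {frozenset()}  # an interval that misses every point
    # Shrinking an interval to the smallest and largest points it contains does
    # not change the subset it picks out, so it suffices to try intervals whose
    # endpoints are data points.
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            a, b = pts[i], pts[j]
            subsets.add(frozenset(x for x in pts if a <= x <= b))
    return len(subsets)


# Compare the brute-force count with 1 + n + C(n, 2) for small n.
for n in range(8):
    count = interval_subset_count([float(i) for i in range(n)])
    print(n, count, 1 + n + comb(n, 2))
```

# The two printed counts agree for each small $n$, which is consistent with the claim; the exercise still asks for an argument that holds for all $n$.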
# 
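# For the two independence exercises, the following is a minimal sketch and not necessarily Romano's exact procedure.
# It assumes the V-C class of lower-left orthants $(-\infty, t_1] \times \cdots \times (-\infty, t_k]$,
# approximates the supremum over that class by evaluating only at the $n$ observed points,
# and estimates the $P$-value from random draws from the group that permutes each coordinate
# independently across the $n$ observations, rather than enumerating the group.
# The names `orthant_statistic` and `independence_permutation_test` are hypothetical.

```python
import numpy as np


def orthant_statistic(x):
    """sqrt(n) * max over observed points t of |F_n(t) - prod_i F_{n,i}(t_i)|.

    ASSUMPTION: the supremum over orthants is approximated by the n data points.
    """
    n, k = x.shape
    # leq[a, b, i] is True when x[b, i] <= x[a, i]
    leq = x[None, :, :] <= x[:, None, :]
    joint = leq.all(axis=2).mean(axis=1)       # empirical joint cdf at each data point
    marginals = leq.mean(axis=1).prod(axis=1)  # product of empirical marginal cdfs
    return np.sqrt(n) * np.abs(joint - marginals).max()


def independence_permutation_test(x, reps=1000, seed=None):
    """Return (observed statistic, Monte Carlo permutation P-value)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    observed = orthant_statistic(x)
    exceed = 0
    for _ in range(reps):
        # Permute every column but the first, each independently of the others;
        # holding the first column fixed does not change the null distribution.
        cols = [x[:, 0]] + [rng.permutation(x[:, i]) for i in range(1, x.shape[1])]
        exceed += orthant_statistic(np.column_stack(cols)) >= observed
    # Counting the observed data as one of the draws keeps the P-value valid.
    return observed, (exceed + 1) / (reps + 1)


# Example: k = 3, n = 100, correlation rho between the first two components.
rho = 0.75
sigma = np.array([[1.0, rho, 0.0],
                  [rho, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
rng = np.random.default_rng(12345)
sample = rng.multivariate_normal(np.zeros(3), sigma, size=100)
stat, p_value = independence_permutation_test(sample, reps=200, seed=1)
print(f"statistic = {stat:.3f}, P-value = {p_value:.4f}")
```

# Estimating level or power would repeat the last step over many simulated data sets and record how often the $P$-value is at most $\alpha$.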

# In[ ]:
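
# A minimal sketch of the two Kolmogorov-Smirnov tests from the last exercise, run on
# synthetic event times rather than the Loma Prieta file (loading the file is not shown).
# ASSUMPTIONS: the function name `poisson_ks_tests` is hypothetical, and the exponential
# scale is estimated by the mean inter-event time, which makes the nominal K-S
# P-value for that test only approximate. The permutation test is not sketched here.

```python
import numpy as np
from scipy import stats


def poisson_ks_tests(times, T):
    """K-S P-values for two consequences of a homogeneous Poisson process on [0, T].

    `times` holds the event times after the first event, which is taken as time 0.
    Returns (P-value for exponential inter-event times, P-value for uniform times).
    """
    times = np.sort(np.asarray(times, dtype=float))
    gaps = np.diff(np.concatenate(([0.0], times)))  # inter-event times from time 0
    # ASSUMPTION: scale estimated from the data (mean gap); P-value is approximate.
    p_exponential = stats.kstest(gaps, "expon", args=(0, gaps.mean())).pvalue
    # Conditional on the number of events, the times are IID uniform on [0, T].
    p_uniform = stats.kstest(times, "uniform", args=(0, T)).pvalue
    return p_exponential, p_uniform


T = 805.0
rng = np.random.default_rng(2024)
uniform_times = rng.uniform(0, T, size=221)        # consistent with the null
clustered_times = T * rng.uniform(size=221) ** 4   # events concentrated near time 0

print("uniform:  ", poisson_ks_tests(uniform_times, T))
print("clustered:", poisson_ks_tests(clustered_times, T))
```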