#!/usr/bin/env python
# coding: utf-8

# # Tests using the "distance" between the empirical and the null

# 1. Show that the collection of all sets of the form $(-\infty, x] \times (-\infty, y]$
#    is a Vapnik-Cervonenkis class (V-C class) over the plane.
#
# 1. Show that if $\mathcal{V}$ is the set of all closed intervals in $\Re$,
#    $m^{\mathcal{V}}(n) = 1 + n + {{n}\choose{2}}$.
#
# 1. Show that intersections and finite unions of V-C classes are V-C classes.
#
# 1. Show that countable unions of V-C classes need not be V-C classes.
#
# 1. Code up Romano's approach for testing whether a set of $k$ real-valued random variables is independent
#    based on observing $n$ IID $k$-tuples of values, using group invariance (not the bootstrap approach).
#    That is, we observe $\{X_j\}_{j=1}^n$ where each $X_j = (X_{j1}, \ldots, X_{jk})$ takes values in $\Re^k$.
#    The null hypothesis is that for each $j$, the components $\{X_{j1}, \ldots, X_{jk}\}$ are independent.
#    Explain why you used the particular V-C class you chose.
#    Is the relevant group of transformations for the hypothesis a finite or infinite group?
#    As usual, provide unit tests and a coverage report for your code.
#
# 1. Write a program to simulate data from $k$-variate normal distributions with different covariance matrices and apply the
#    test you programmed in the previous question.
#    Confirm that the level of the test is approximately correct by simulating from a multivariate normal distribution
#    with a diagonal covariance matrix with various values of $k$ and $n$.
#    Simulate the power of the test for level $\alpha = 0.05$ as a function of $\rho$
#    for $k = 3$, $n=10$, $100$, and $1000$, and a covariance matrix of the form
#    \begin{equation}
#    \Sigma = \left [ \begin{array}{ccc}
#    1 & \rho & 0\\
#    \rho & 1 & 0 \\
#    0 & 0 & 1
#    \end{array}
#    \right ]
#    \end{equation}
#    for $\rho \in \{-1, -.75, -.5, -.25, .25, .5, .75, 1\}$.
#    Provide unit tests and a coverage report for your code.
#
# 1. The file https://www.stat.berkeley.edu/~stark/Java/Data/lomaPrieta.dat contains 221 observations of the times of putative aftershocks of the 17 October 1989 earthquake in Loma Prieta, California.
#    There are 222 lines in the file.
#    The first is 0, the main shock, which occurred at 4:15:43pm.
#    The other lines are the times in days from the main event to the aftershocks,
#    defined as earthquakes determined to have magnitude 3.0 and above, focal depth
#    of 0--20 km, and epicenter within 40 km of the epicenter of the Loma Prieta earthquake.
#    The data are from the UC Berkeley Seismographic Stations, courtesy of
#    Dr. Bob Uhrhammer.
#    A common model for earthquakes ("main" shocks, not aftershocks) is that they are a spatially heterogeneous but temporally homogeneous Poisson
#    process.
#    If so, inter-event times have an exponential distribution, and conditional on the number $n$ of events in the time interval $[0, T]$, the times of the events are IID uniform.
#    Treat the time of the first event as 0, and let $T = 805$.
#    Find the $P$-value of the hypothesis that the 222 events are a realization of a Poisson process for three tests:
#    + The Kolmogorov-Smirnov test that the inter-event times are exponentially distributed
#    + The Kolmogorov-Smirnov test that the times of the 221 events after the first are IID uniform on $[0, T]$
#    + A 2-sample permutation test that compares the number of events in the first half of the interval, $[0, T/2)$, to the number in the second half, $[T/2, T]$.
#
#    Provide unit tests and a coverage report.
#    Comment on the test results.
#    Which test would you recommend? Why? ("It gave the smallest $P$-value" isn't a good reason: selecting the
#    test after peeking at the data introduces multiplicity and selection
#    that are hard to take into account in the $P$-value.)
#

# In[ ]:
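
# A quick numerical sanity check (not a proof) of the growth-function claim for closed intervals
# in the second exercise. It enumerates the subsets of a finite point set that closed intervals
# can pick out and compares the count with $1 + n + {{n}\choose{2}}$. The helper name
# `interval_pick_count` is made up for this sketch, and the points are assumed to be distinct.

```python
from math import comb


def interval_pick_count(points):
    """Count the distinct subsets of `points` of the form points ∩ [a, b]."""
    pts = sorted(points)
    # An interval that misses every point picks out the empty set.
    picked = {frozenset()}
    # Any other picked-out subset is a run of consecutive sorted points.
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            picked.add(frozenset(pts[i:j + 1]))
    return len(picked)


# Compare the brute-force count with the claimed formula for small n.
for n in range(8):
    print(n, interval_pick_count(range(n)), 1 + n + comb(n, 2))
```

# Agreement for small $n$ only suggests the pattern; the exercise still asks for an argument
# that holds for every $n$.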
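
# A minimal sketch of a permutation test of independence, in the spirit of the Romano exercise
# above; it is one possible starting point, not a reference solution. Assumptions of this sketch:
# the V-C class is the lower-left orthants, the statistic is evaluated only at the observed points
# (so it may understate the supremum over all orthants), the first column is held fixed while the
# others are permuted independently, and the P-value is estimated from a random sample of such
# permutations. The names `orthant_stat` and `independence_pvalue` are hypothetical.

```python
import numpy as np


def orthant_stat(x):
    """Largest gap, over the observed points, between the joint ECDF and the
    product of the marginal ECDFs."""
    # le[i, j, c] is True when x[j, c] <= x[i, c]
    le = x[None, :, :] <= x[:, None, :]
    joint = le.all(axis=2).mean(axis=1)      # joint ECDF at each observed point
    product = le.mean(axis=1).prod(axis=1)   # product of marginal ECDFs there
    return np.abs(joint - product).max()


def independence_pvalue(x, reps=199, seed=None):
    """Estimated permutation P-value for the null that the columns of x are independent."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    observed = orthant_stat(x)
    hits = 0
    for _ in range(reps):
        shuffled = x.copy()
        # Permuting columns independently keeps the marginals and breaks any dependence.
        for c in range(1, x.shape[1]):
            shuffled[:, c] = rng.permutation(x[:, c])
        hits += orthant_stat(shuffled) >= observed
    # Count the observed data as one of the equally likely arrangements.
    return (hits + 1) / (reps + 1)


rng = np.random.default_rng(0)
z = rng.normal(size=(100, 1))
p_dependent = independence_pvalue(np.hstack([z, z]), seed=1)
p_independent = independence_pvalue(rng.normal(size=(100, 3)), seed=2)
print(p_dependent, p_independent)
```

# The pairwise comparison uses memory of order $n^2 k$, which is fine for $n$ up to about $1000$
# but would need restructuring for much larger samples.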
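
# A hedged sketch of the three tests in the last exercise, run on synthetic data rather than the
# real file so that it needs no download. Assumptions of this sketch: the exponential scale is set
# to the mean inter-event time, which makes the nominal Kolmogorov-Smirnov P-value only
# approximate; and the half-interval comparison is read as a Monte Carlo randomization test in
# which each event falls in either half with probability 1/2 under the null. The function names
# are hypothetical.

```python
import numpy as np
from scipy import stats


def ks_exponential_pvalue(times):
    """KS test that the gaps between successive events are exponential."""
    gaps = np.diff(np.sort(times))
    # Scale estimated from the data (an assumption of this sketch).
    return stats.kstest(gaps, "expon", args=(0, gaps.mean())).pvalue


def ks_uniform_pvalue(times, T):
    """KS test that the events after the first are IID uniform on [0, T]."""
    later = np.sort(times)[1:]
    return stats.kstest(later, "uniform", args=(0, T)).pvalue


def half_count_pvalue(times, T, reps=9999, seed=None):
    """Monte Carlo randomization test on the number of events in [0, T/2)."""
    rng = np.random.default_rng(seed)
    later = np.sort(times)[1:]
    n = later.size
    # Statistic: absolute difference between the two half-interval counts.
    observed = abs(2 * np.sum(later < T / 2) - n)
    simulated = np.abs(2 * rng.binomial(n, 0.5, size=reps) - n)
    return (1 + np.sum(simulated >= observed)) / (reps + 1)


# Synthetic stand-in for the data: a main shock at time 0 followed by
# 221 event times bunched toward the start of [0, T].
rng = np.random.default_rng(0)
T = 805
clustered = np.concatenate([[0.0], T * rng.random(221) ** 3])
p_values = (
    ks_exponential_pvalue(clustered),
    ks_uniform_pvalue(clustered, T),
    half_count_pvalue(clustered, T, seed=1),
)
print(p_values)
```

# With the real data, `clustered` would be replaced by the 222 values read from the file, for
# example with `np.loadtxt` on a local copy.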