In this chapter, you will practice your EDA, parameter estimation, and hypothesis testing skills on the results of the 2015 FINA World Swimming Championships. This is the Summary of lecture "Case Studies in Statistical Thinking", via datacamp.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import dc_stat_think as dcst
plt.rcParams['figure.figsize'] = (10, 5)
In the heats, all contestants swim, the very fast and the very slow. To explore how the swim times are distributed, plot an ECDF of the men's 200 freestyle.
swim = pd.read_csv('./dataset/2015_FINA.csv', skiprows=4)
swim.head()
athleteid | lastname | firstname | birthdate | gender | name | code | eventid | heat | lane | ... | swimtime | split | cumswimtime | splitdistance | daytime | round | distance | relaycount | stroke | splitswimtime | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100784 | BORSHI | NOEL | 1996-02-13 | F | Albania | ALB | 1 | 1 | 4 | ... | 63.65 | 1 | 29.63 | 50 | 930.0 | PRE | 100 | 1 | FLY | 29.63 |
1 | 100784 | BORSHI | NOEL | 1996-02-13 | F | Albania | ALB | 1 | 1 | 4 | ... | 63.65 | 2 | 63.65 | 100 | 930.0 | PRE | 100 | 1 | FLY | 34.02 |
2 | 100784 | BORSHI | NOEL | 1996-02-13 | F | Albania | ALB | 20 | 1 | 8 | ... | 140.28 | 1 | 31.33 | 50 | 1014.0 | PRE | 200 | 1 | FLY | 31.33 |
3 | 100784 | BORSHI | NOEL | 1996-02-13 | F | Albania | ALB | 20 | 1 | 8 | ... | 140.28 | 2 | 66.81 | 100 | 1014.0 | PRE | 200 | 1 | FLY | 35.48 |
4 | 100784 | BORSHI | NOEL | 1996-02-13 | F | Albania | ALB | 20 | 1 | 8 | ... | 140.28 | 3 | 103.29 | 150 | 1014.0 | PRE | 200 | 1 | FLY | 36.48 |
5 rows × 22 columns
mens_200_free_heats_df = swim[(swim['gender'] == 'M') &
(swim['distance'] == 200) &
(swim['stroke'] == 'FREE') &
(swim['round'] == 'PRE') &
(swim['split'] == 4)]
mens_200_free_heats = mens_200_free_heats_df['cumswimtime'].unique()
# Generate x and y values for ECDF: x, y
x, y = dcst.ecdf(mens_200_free_heats)
# Plot the ECDF as dots
_ = plt.plot(x, y, marker='.', linestyle='none')
# Label axes and show plot
_ = plt.xlabel('time (s)')
_ = plt.ylabel('ECDF')
We see that fast swimmers are below 115 seconds, with a smattering of slow swimmers past that, including one very slow swimmer.
Now, you will practice parameter estimation and computation of confidence intervals by computing the mean and median swim time for the men's 200 freestyle heats. The median is useful because it is immune to heavy tails in the distribution of swim times, such as the slow swimmers in the heats.
# Compute mean and median swim times
mean_time = np.mean(mens_200_free_heats)
median_time = np.median(mens_200_free_heats)
# Draw 10,000 bootstrap replicates of the mean and median
bs_reps_mean = dcst.draw_bs_reps(mens_200_free_heats, np.mean, size=10000)
bs_reps_median = dcst.draw_bs_reps(mens_200_free_heats, np.median, size=10000)
# Compute the 95% confidence intervals
conf_int_mean = np.percentile(bs_reps_mean, [2.5, 97.5])
conf_int_median = np.percentile(bs_reps_median, [2.5, 97.5])
# Print the result to the screen
print("""
mean time: {0:.2f} sec.
95% conf int of mean: [{1:.2f}, {2:.2f}] sec.
median time: {3:.2f} sec.
95% conf int of median: [{4:.2f}, {5:.2f}] sec.
""".format(mean_time, *conf_int_mean, median_time, *conf_int_median))
mean time: 111.63 sec. 95% conf int of mean: [110.49, 112.92] sec. median time: 110.04 sec. 95% conf int of median: [108.96, 111.01] sec.
Question : Do swimmers swim faster in the finals than in other rounds?
Question: Do individual female swimmers swim faster in the finals compared to the semifinals?
Fractional improvement
First, you will get an understanding of how athletes' performance changes from the semifinals to the finals by computing the fractional improvement from the semifinals to finals and plotting an ECDF of all of these values.
The arrays final_times
and semi_times
contain the swim times of the respective rounds. The arrays are aligned such that final_times[i]
and semi_times[i]
are for the same swimmer/event. If you are interested in the strokes/events, you can check out the data frame df in your namespace, which has more detailed information, but is not used in the analysis.
women_swim_df = swim[(swim['gender'] == "F") &
(swim['stroke'] != "MEDLEY") &
(swim['distance'].isin([100, 50, 200])) &
(swim['round'].isin(['SEM', 'FIN'])) &
(swim['splitdistance'] == swim['distance'])]
women_swim_df.head(n = 5)
athleteid | lastname | firstname | birthdate | gender | name | code | eventid | heat | lane | ... | swimtime | split | cumswimtime | splitdistance | daytime | round | distance | relaycount | stroke | splitswimtime | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
303 | 100537 | CAMPBELL | BRONTE | 1994-05-14 | F | Australia | AUS | 223 | 2 | 5 | ... | 53.00 | 2 | 53.00 | 100 | 1732.0 | SEM | 100 | 1 | FREE | 27.44 |
305 | 100537 | CAMPBELL | BRONTE | 1994-05-14 | F | Australia | AUS | 123 | 1 | 3 | ... | 52.52 | 2 | 52.52 | 100 | 1732.0 | FIN | 100 | 1 | FREE | 27.37 |
307 | 100537 | CAMPBELL | BRONTE | 1994-05-14 | F | Australia | AUS | 234 | 2 | 6 | ... | 24.32 | 1 | 24.32 | 50 | 1828.0 | SEM | 50 | 1 | FREE | 24.32 |
308 | 100537 | CAMPBELL | BRONTE | 1994-05-14 | F | Australia | AUS | 134 | 1 | 6 | ... | 24.12 | 1 | 24.12 | 50 | 1805.0 | FIN | 50 | 1 | FREE | 24.12 |
315 | 100631 | CAMPBELL | CATE | 1992-05-20 | F | Australia | AUS | 223 | 1 | 4 | ... | 52.84 | 2 | 52.84 | 100 | 1732.0 | SEM | 100 | 1 | FREE | 27.49 |
5 rows × 22 columns
women_swim_df = women_swim_df[['athleteid', 'stroke', 'distance', 'lastname', 'cumswimtime', 'round']]
women_swim_df_fin = women_swim_df.loc[(women_swim_df['round'] == 'FIN')]
women_swim_df_sem = women_swim_df.loc[(women_swim_df['round'] == 'SEM')]
women_swim_df_w = women_swim_df_fin.merge(women_swim_df_sem, how = 'left', on = ['athleteid', 'stroke', 'distance', 'lastname'])
df = women_swim_df_w.rename(index = str, columns = {"cumswimtime_x" : "final_swimtime", "cumswimtime_y" : "semi_swimtime"})
df = df[['athleteid', 'stroke', 'distance', 'lastname', 'final_swimtime', 'semi_swimtime']]
final_times = df['final_swimtime'].values
semi_times = df['semi_swimtime'].values
df
athleteid | stroke | distance | lastname | final_swimtime | semi_swimtime | |
---|---|---|---|---|---|---|
0 | 100537 | FREE | 100 | CAMPBELL | 52.52 | 53.00 |
1 | 100537 | FREE | 50 | CAMPBELL | 24.12 | 24.32 |
2 | 100631 | FREE | 100 | CAMPBELL | 52.82 | 52.84 |
3 | 100631 | FREE | 50 | CAMPBELL | 24.36 | 24.22 |
4 | 100650 | FLY | 100 | MCKEON | 57.67 | 57.59 |
... | ... | ... | ... | ... | ... | ... |
91 | 105595 | BACK | 200 | FRANKLIN | 126.34 | 127.79 |
92 | 105607 | BREAST | 50 | HARDY | 30.20 | 30.25 |
93 | 105640 | FLY | 200 | MCLAUGHLIN | 126.95 | 127.52 |
94 | 105676 | BACK | 100 | BAKER | 59.99 | 59.63 |
95 | 105686 | FLY | 200 | ADAMS | 126.40 | 127.57 |
96 rows × 6 columns
# Compute fractional difference in time between finals and semis
f = (df['semi_swimtime'] - df['final_swimtime']) / df['semi_swimtime']
# Generate x and y values for the ECDF: x, y
x, y = dcst.ecdf(f)
# Make a plot of the ECDF
_ = plt.plot(x, y, marker='.', linestyle='none')
# Label axes
_ = plt.xlabel('f')
_ = plt.ylabel('ECDF')
The median of the ECDF is juuuust above zero. But at first glance, it does not look like there is much of any difference between semifinals and finals.
Compute the mean fractional improvement from the semifinals to finals, along with a 95% confidence interval of the mean.
# Mean fractional time difference: f_mean
f_mean = np.mean(f)
# Get bootstrap reps of means: bs_reps
bs_reps = dcst.draw_bs_reps(f, np.mean, size=10000)
# Compute confidence intervals: conf_int
conf_int = np.percentile(bs_reps, [2.5, 97.5])
# Report
print("""
mean frac. diff.: {0:.5f}
95% conf int of mean frac. diff.: [{1:.5f}, {2:.5f}]""".format(f_mean, *conf_int))
mean frac. diff.: 0.00040 95% conf int of mean frac. diff.: [-0.00093, 0.00173]
It looks like the mean finals time is juuuust faster than the mean semifinal time, and they very well may be the same. We'll test this hypothesis next.
Based on our EDA and parameter estimates, it is tough to discern improvement from the semifinals to finals. In the next exercise, you will test the hypothesis that there is no difference in performance between the semifinals and finals. A permutation test is fitting for this. We'll get test statistics with following strategy:
As you worked out in the last exercise, we need to generate a permutation sample by randomly swapping corresponding entries in the semi_times
and final_times
array. Write a function with signature swap_random(a, b)
that returns arrays where random indices have the entries in a and b swapped.
def swap_random(a, b):
"""Randomly swap entries in two arrays"""
# Indices to swap
swap_inds = np.random.random(size=len(a)) < 0.5
# Make copies of arrays a and b for output
a_out = np.copy(a)
b_out = np.copy(b)
# Swap values
a_out[swap_inds] = b[swap_inds]
b_out[swap_inds] = a[swap_inds]
return a_out, b_out
Test the hypothesis that performance in the finals and semifinals are identical using the mean of the fractional improvement as your test statistic. The test statistic under the null hypothesis is considered to be at least as extreme as what was observed if it is greater than or equal to f_mean
.
# Set up array of permutation replicates
perm_reps = np.empty(1000)
for i in range(1000):
# Generate a permutation sample
semi_perm, final_perm = swap_random(semi_times, final_times)
# Compute f from the permutation sample
f = (semi_perm - final_perm) / semi_perm
# Compute and store permutation replicate
perm_reps[i] = np.mean(f)
# Compute and print p-value
print('p =', np.sum(perm_reps >= f_mean) / 1000)
p = 0.255
The p-value is large, about 0.28, which suggests that the results of the 2015 World Championships are consistent with there being no difference in performance between the finals and semifinals.
To get a graphical overview of a data set, it is often useful to plot all of your data. In this exercise, plot all of the splits for all female swimmers in the 800 meter heats. The data are available in a Numpy arrays split_number
and splits
. The arrays are organized such that splits[i,j]
is the split time for swimmer i for split_number[j]
.
free_800_w = swim.loc[(swim['gender'] == 'F') &
(swim['stroke'] == 'FREE') &
(swim['distance'] == 800) &
(swim['round'].isin(['PRE'])) &
(~swim['split'].isin([1,2,15,16]))]
free_800_w = free_800_w[['split', 'splitswimtime']]
splits = np.reshape(free_800_w['splitswimtime'].values, (-1, 12))
split_number = free_800_w['split'].unique()
# Plot the splits for each swimmer
for splitset in splits:
_ = plt.plot(split_number, splitset, linewidth=1, color='lightgray')
# Compute the mean split times
mean_splits = np.mean(splits, axis=0)
# Plot the mean split time
_ = plt.plot(split_number, mean_splits, marker='.', linewidth=3, markersize=12)
# Label axes
_ = plt.xlabel('split number')
_ = plt.ylabel('split time (s)')
You can see that there is wide variability in the splits among the swimmers, and what appears to be a slight trend toward slower split times.
We will assume that the swimmers slow down in a linear fashion over the course of the 800 m event. The slowdown per split is then the slope of the mean split time versus split number plot. Perform a linear regression to estimate the slowdown per split and compute a pairs bootstrap 95% confidence interval on the slowdown. Also show a plot of the best fit line.
Note: We can compute error bars for the mean split times and use those in the regression analysis, but we will not take those into account here, as that is beyond the scope of this course.
# Perform regression
slowdown, split_3 = np.polyfit(split_number, mean_splits, 1)
# Compute pairs bootstrap
bs_reps, _ = dcst.draw_bs_pairs_linreg(split_number, mean_splits, size=10000)
# Compute confidence interval
conf_int = np.percentile(bs_reps, [2.5, 97.5])
# Plot the data with regressions line
_ = plt.plot(split_number, mean_splits, marker='.', linestyle='none')
_ = plt.plot(split_number, slowdown * split_number + split_3, '-')
# Label axes and show plot
_ = plt.xlabel('split number')
_ = plt.ylabel('split time (s)')
# Print the slowdown per split
print("""
mean slowdown: {0:.3f} sec./split
95% conf int of mean slowdown: [{1:.3f}, {2:.3f}] sec./split""".format(
slowdown, *conf_int))
mean slowdown: 0.065 sec./split 95% conf int of mean slowdown: [0.052, 0.079] sec./split
There is a small (about 6 hundreths of a second), but discernible, slowdown per split. We'll do a hypothesis test next.
Now we will test the null hypothesis that the swimmer's split time is not at all correlated with the distance they are at in the swim. We will use the Pearson correlation coefficient (computed using dcst.pearson_r()) as the test statistic.
# Observed correlation
rho = dcst.pearson_r(split_number, mean_splits)
# Initialize permutation reps
perm_reps_rho = np.empty(10000)
# Make permutation reps
for i in range(10000):
# Scramble_split number array
scrambled_split_number = np.random.permutation(split_number)
# Compute the Pearson correlation coefficient
perm_reps_rho[i] = dcst.pearson_r(scrambled_split_number, mean_splits)
# Compute and print p-value
p_val = np.sum(perm_reps_rho >= rho) / 10000
print('p =', p_val)
p = 0.0
The tiny effect is very real! With 10,000 replicates, we never got a correlation as big as observed under the hypothesis that the swimmers do not change speed as the race progresses. In fact, I did the test with a million replicates, and still never got a single replicate as big as the observed Pearson correlation coefficient.