Uncertainty in the sample statistics
Would it be useful to know the uncertainty in these statistics due to limited sampling?
Bootstrap is a method to assess the uncertainty in a sample statistic by repeated random sampling with replacement.
Assumptions
Limitations
The Bootstrap Approach (Efron, 1982)
Statistical resampling procedure to calculate uncertainty in a calculated statistic from the data itself.
Extremely powerful: we can calculate the uncertainty in any statistic, e.g. the 13th percentile (P13), skewness, etc.
Steps:
1. Assemble a sample set; it must be representative, and it must be reasonable to assume independence between samples.
2. Optional: build a cumulative distribution function (CDF) from the sample set.
3. For $\ell = 1, \ldots, L$ realizations, do the following:
For $i = 1, \ldots, n$ data, draw a random sample with replacement from the original sample set.
Calculate a realization of the summary statistic of interest from the $n$ resampled data, e.g. $m^\ell$, $\sigma^2_{\ell}$. Return to step 3 for another realization.
4. Compile and summarize the $L$ realizations of the statistic of interest.
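The steps above can be sketched with NumPy alone. This is a minimal illustration, assuming a synthetic Gaussian sample set and the mean as the statistic of interest; the dataset and parameter values here are placeholders, not part of the workflow below.

```python
import numpy as np

rng = np.random.default_rng(73073)                  # seeded generator for repeatability
data = rng.normal(loc=50.0, scale=10.0, size=100)   # step 1: an assumed sample set, n = 100

L = 1000                                            # number of bootstrap realizations
means = np.zeros(L)
for ell in range(L):                                # step 3: loop over L realizations
    resample = rng.choice(data, size=data.size, replace=True)  # n draws with replacement
    means[ell] = resample.mean()                    # realization of the statistic of interest

# step 4: compile and summarize the L realizations
print('bootstrap mean:', means.mean(), 'standard error:', means.std())
```

The spread of `means` approximates the sampling uncertainty in the mean; with $n = 100$ and a standard deviation near 10, the bootstrap standard error should be close to $10/\sqrt{100} = 1$.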
This is a very powerful method. Let's try it out.
Below we provide an example and demonstration of this workflow.
Here are the steps to get set up in Python with the GeostatsPy package:
The following code loads the required libraries.
%matplotlib inline
from ipywidgets import interactive # widgets and interactivity
from ipywidgets import widgets
from ipywidgets import Layout
from ipywidgets import Label
from ipywidgets import VBox, HBox
import matplotlib.pyplot as plt # plotting
import numpy as np # working with arrays
import pandas as pd # working with DataFrames
from scipy.stats import triang # parametric distributions
from scipy.stats import binom
from scipy.stats import norm
from scipy.stats import uniform
from scipy import stats # statistical calculations
import random # random drawing / bootstrap realizations of the data
This is an interactive method to:
select a parametric distribution
select the distribution parameters
select the number of samples and visualize the synthetic dataset distribution
# parameters for the synthetic dataset
bins = np.linspace(0,1000,1000)
# interactive calculation of the sample set (control of source parametric distribution and number of samples)
l = widgets.Text(value='  Bootstrap Demonstration, Michael Pyrcz, Associate Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))
dist = widgets.Dropdown(
options=['Triangular', 'Uniform', 'Gaussian'],
value='Gaussian',
description='Dataset Distribution:',
disabled=False,
layout=Layout(width='200px', height='30px')
)
a = widgets.FloatSlider(min=0.0, max = 100.0, value = 50.0, description = 'Dataset: Mean / Mode',orientation='vertical',layout=Layout(width='170px', height='200px'),continuous_update=False)
a.style.handle_color = 'blue'
d = widgets.FloatSlider(min=0.01, max = 30.0, value = 10.0, step = 1.0, description = 'Dataset: St. Deviation',orientation='vertical',layout=Layout(width='130px', height='200px'),continuous_update=False)
d.style.handle_color = 'green'
b = widgets.FloatSlider(min = 0, max = 100.0, value = 0.0, description = 'Dataset: Minimum',orientation='vertical',layout=Layout(width='130px', height='200px'),continuous_update=False)
b.style.handle_color = 'red'
c = widgets.IntSlider(min = 0, max = 100, value = 100, description = 'Dataset: Maximum',orientation='vertical',layout=Layout(width='130px', height='200px'),continuous_update=False)
c.style.handle_color = 'orange'
n = widgets.IntSlider(min = 2, max = 1000, value = 100, description = 'Dataset: Number Samples',orientation='vertical',layout=Layout(width='180px', height='200px'),continuous_update=False)
n.style.handle_color = 'gray'
ui = widgets.HBox([dist,a,d,b,c,n],) # basic widget formatting
ui2 = widgets.VBox([l,ui],)
def f_make(dist, a, b, c, d, n): # function to take parameters, make sample and plot
    global df
    dataset = make_data(dist, a, b, c, d, n)
    if dataset is None: # invalid distribution parameters, nothing to plot
        return None
    df = pd.DataFrame({'DataSet':dataset})
    plt.subplot(111)
    plt.hist(dataset, alpha=0.8, color="darkorange", edgecolor="black", bins=bins)
    plt.xlim(0.0,100.0); plt.title('Synthetic Dataset'); plt.ylabel('Frequency'); plt.xlabel('Data Values')
    plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.6, wspace=0.2, hspace=0.3)
    plt.show()
    return df
def make_data(dist, a, b, c, d, n): # function to check parameters and make sample
    if dist == 'Uniform':
        if b >= c:
            print('Invalid uniform distribution parameters')
            return None
        dataset = uniform.rvs(size=n, loc=b, scale=c-b, random_state=73073).tolist() # scale is the interval, maximum - minimum
        return dataset
    elif dist == 'Triangular':
        interval = c - b
        if b >= a or a >= c or interval <= 0:
            print('Invalid triangular distribution parameters')
            return None
        dataset = triang.rvs(size=n, loc=b, c=(a-b)/interval, scale=interval, random_state=73073).tolist()
        return dataset
    elif dist == 'Gaussian':
        dataset = norm.rvs(size=n, loc=a, scale=d, random_state=73073).tolist()
        return dataset
# connect the function to make the samples and plot to the widgets
interactive_plot = widgets.interactive_output(f_make, {'dist': dist,'a': a, 'd': d, 'b': b, 'c': c, 'n': n})
interactive_plot.clear_output(wait = True) # reduce flickering by delaying plot updating
We display the GUI now. Select the desired parametric distribution and associated parameters.
display(ui2, interactive_plot) # display the interactive plot
We now have a synthetic dataset to work with. Now we can:
assign the samples to a 1D ndarray
make a DataFrame with the samples
check the summary statistics
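These steps can be sketched as follows; a minimal illustration in which a random placeholder array stands in for the interactive dataset built by the GUI above:

```python
import numpy as np
import pandas as pd

dataset = np.random.default_rng(73073).normal(50.0, 10.0, 100)  # placeholder for the GUI dataset

sample = np.asarray(dataset)               # assign the samples to a 1D ndarray
df = pd.DataFrame({'DataSet': sample})     # make a DataFrame with the samples
print(df['DataSet'].describe())            # check the summary statistics
```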
This is the sample set to which we will apply bootstrap.
Now we take our synthetic dataset, sampled from the parametric distribution above, and apply statistical bootstrap.
L = 1000 # set the number of realizations
mean = np.zeros(L); stdev = np.zeros(L) # declare arrays to hold the realizations of the statistics
for l in range(0, L): # loop over realizations
    samples = random.choices(df['DataSet'].values, weights=None, cum_weights=None, k=len(df)) # n draws with replacement
    mean[l] = np.average(samples)
    stdev[l] = np.std(samples)
plt.subplot(121) # plot the distribution for uncertainty in the mean
plt.hist(mean,alpha=0.8,color="darkorange",edgecolor="black",bins=np.linspace(0,100,40)); plt.xlim(0,100); plt.ylabel('Frequency'); plt.xlabel('Values'); plt.title('Bootstrap Uncertainty in the Mean')
plt.subplot(122) # plot the distribution for uncertainty in the standard deviation
plt.hist(stdev,alpha=0.8,color="darkorange",edgecolor="black",bins=np.linspace(0,40,40)); plt.xlim(0,40); plt.ylabel('Frequency'); plt.xlabel('Values'); plt.title('Bootstrap Uncertainty in the Standard Deviation')
plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.2, wspace=0.2, hspace=0.2)
plt.show()
# provide summary statistics, P10, P50 and P90
print('Summary Statistics for Bootstrap for Uncertainty in the Mean:')
print(stats.describe(mean))
print('P10: ' + str(round(np.percentile(mean,10),3)) + ', P50: ' + str(round(np.percentile(mean,50),3)) + ', P90: ' + str(round(np.percentile(mean,90),3)))
print('\nSummary Statistics for Bootstrap for Uncertainty in the Standard Deviation:')
print(stats.describe(stdev))
print('P10: ' + str(round(np.percentile(stdev,10),3)) + ', P50: ' + str(round(np.percentile(stdev,50),3)) + ', P90: ' + str(round(np.percentile(stdev,90),3)))
Summary Statistics for Bootstrap for Uncertainty in the Mean:
DescribeResult(nobs=1000, minmax=(47.80303675385087, 53.748592754849234), mean=50.29150553261752, variance=0.863625322270442, skewness=0.22636075566365849, kurtosis=0.1821324523811434)
P10: 49.096, P50: 50.247, P90: 51.5

Summary Statistics for Bootstrap for Uncertainty in the Standard Deviation:
DescribeResult(nobs=1000, minmax=(6.7253875548101005, 11.500296674885535), mean=9.29309137378717, variance=0.5093060880574539, skewness=-0.042851374811427104, kurtosis=0.005247675496348414)
P10: 8.384, P50: 9.297, P90: 10.189
Let's change the number of data drawn to observe the change in uncertainty:
we will assume the same dataset and not recalculate it each time
we will just sample with replacement, using the new number of samples, for each bootstrap realization
# parameters for the synthetic dataset
bins = np.linspace(0,1000,1000)
l = widgets.Text(value='  Bootstrap Demonstration with Modified Number of Data, Michael Pyrcz, Associate Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))
n = widgets.IntSlider(min = 2, max = 1000, value = 100, description = 'New Number Samples',orientation='horizontal',layout=Layout(width='800px', height='20px'),continuous_update=False)
n.style.handle_color = 'gray'
ui3 = widgets.VBox([l,n],)
def f_rerun(n): # function to resample with a new number of data and plot against the original
    L = 1000 # set the number of realizations
    mean2 = np.zeros(L); stdev2 = np.zeros(L) # declare arrays to hold the realizations of the statistics
    for l in range(0, L): # loop over realizations
        samples = random.choices(df['DataSet'].values, weights=None, cum_weights=None, k=n)
        mean2[l] = np.average(samples)
        stdev2[l] = np.std(samples)
    plt.subplot(121) # plot the distribution for uncertainty in the mean
    plt.hist(mean,alpha=0.5,color="red",edgecolor="black",bins=np.linspace(0,100,40),label='Original'); plt.xlim(0,100); plt.ylabel('Frequency'); plt.xlabel('Values'); plt.title('Bootstrap Uncertainty in the Mean')
    plt.hist(mean2,alpha=0.5,color="blue",edgecolor="black",bins=np.linspace(0,100,40),label='New')
    plt.legend()
    plt.subplot(122) # plot the distribution for uncertainty in the standard deviation
    plt.hist(stdev,alpha=0.5,color="red",edgecolor="black",bins=np.linspace(0,40,40),label='Original'); plt.xlim(0,40); plt.ylabel('Frequency'); plt.xlabel('Values'); plt.title('Bootstrap Uncertainty in the Standard Deviation')
    plt.hist(stdev2,alpha=0.5,color="blue",edgecolor="black",bins=np.linspace(0,40,40),label='New')
    plt.legend()
    plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.2, wspace=0.2, hspace=0.2)
    plt.show()
interactive_plot3 = widgets.interactive_output(f_rerun, {'n': n})
interactive_plot3.clear_output(wait = True) # reduce flickering by delaying plot updating
We display the GUI now. Select the desired number of data.
display(ui3, interactive_plot3) # display the interactive plot
Some observations from changing the number of data:
with less data the uncertainty increases
with more data the uncertainty decreases
the standard deviation is biased low with too few samples, as we fail to observe the full dispersion
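These observations can be checked numerically; a minimal sketch, assuming a fixed Gaussian dataset standing in for the interactive one, comparing bootstrap realizations of the standard deviation for few versus many data drawn:

```python
import numpy as np

rng = np.random.default_rng(73073)
data = rng.normal(50.0, 10.0, 1000)        # assumed fixed dataset, standing in for df above

results = {}
for n in (10, 1000):                       # few vs. many data drawn per realization
    stdevs = np.array([rng.choice(data, size=n, replace=True).std() for _ in range(1000)])
    results[n] = stdevs
    print(f'n = {n:4d}: mean of stdev realizations = {stdevs.mean():.2f}, spread = {stdevs.std():.2f}')
```

With only 10 data per realization the standard deviation realizations center below the dataset standard deviation (the low bias noted above) and show much more spread; with 1000 data they are nearly unbiased and tightly clustered.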
This was a simple demonstration of interactive plots in Jupyter Notebook Python with the ipywidgets and matplotlib packages.
I have many other demonstrations on data analytics and machine learning, e.g. on the basics of working with DataFrames, ndarrays, univariate statistics, plotting data, declustering, data transformations, trend modeling and many other workflows available at https://github.com/GeostatsGuy/PythonNumericalDemos and https://github.com/GeostatsGuy/GeostatsPy.
I hope this was helpful,
Michael
Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions
With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development.
For more about Michael check out these links:
I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.
Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you!
Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!
I can be reached at mpyrcz@austin.utexas.edu.
I'm always happy to discuss,
Michael
Michael Pyrcz, Ph.D., P.Eng. Associate Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin