#!/usr/bin/env python
# coding: utf-8

# # 11 | INTRODUCTION TO DISTRIBUTIONS AND THEIR USE
# This chapter of An Introduction to Health Data Science by Dr JH Klopper is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International.
# ## Introduction

# This chapter introduces the topics of discrete and continuous distributions. Random variable values in a population, and in a sample taken from a population, follow _patterns_ with respect to how often single values (in the case of discrete variables) or intervals of values (in the case of continuous variables) occur. These _patterns_ are called distributions. The distributions of random variables are important because they are used to make inferences about the population from which a sample is taken. Even statistics calculated from sample values follow, or are described by, distributions. This chapter introduces these concepts and discusses the foundations of distributions and how they might be used in inferential statistics.

# ## Packages used in this chapter

# The following packages are used in this notebook. Note that no namespace abbreviations are used. This is a stylistic and grammatical choice.

# In[2]:

import numpy
from plotly import graph_objects, figure_factory
import pandas
from scipy import stats

# ## Variables

# The first chapter introduced the concept of levels of measurement. A review of these levels is given below.
#
# - Categorical data type
#     - __Nominal__: This is a type of data that is used to label variables without providing any quantitative value. It is the simplest form of data, which is categorized rather than measured. Nominal data can include names, labels, or categories, and cannot be arranged in any meaningful order. Examples are the names of diseases (hypertension, diabetes, chronic obstructive airway disease).
#     - __Ordinal__: This is a type of data that includes a set order or scale of values, allowing them to be sorted or ranked. However, the exact differences between the ranks may not be known or consistent. Ordinal data combines the characteristics of both nominal and interval data. An example of ordinal data is a customer satisfaction survey that asks customers to rate a service on a scale of _very unsatisfied_, _unsatisfied_, _neutral_, _satisfied_, or _very satisfied_. The data is categorical as with nominal data, but there is a clear order to the categories.
# - Numerical data type
#     - __Interval__: This is a type of numerical data in which the difference between two values is meaningful and quantifiable, but there is no true zero point. This means that values can be added and subtracted with interval data, but multiplication and division do not have a meaningful interpretation. Examples of interval data include temperature measured in Celsius or Fahrenheit (because $0$ degrees does not imply _no temperature_), or the years on a calendar (because the year $0$ does not imply _no time_).
#     - __Ratio-type__: This is the highest level of data measurement. Ratio-type values can be added, subtracted, multiplied, and divided, and statements about ratios are meaningful. Examples of ratio data include height, weight, age, or distance, where zero indicates the absence of the attribute being measured. For example, a weight of $0$ pounds means there is no weight, and a distance of $0$ miles means there is no distance.

# Chapter 5 introduced the concept of variables. These are the names given to a specific instance of data that is measured or collected. These names are used in the first row of a spreadsheet file that is in long format (tidy data). The data captured for each observation is at the same level of measurement, but the values may differ. For example, the variable _age_ may be used to capture the age of a patient. The variable _age_ is the same for each patient, but the value of the variable is different for each patient.
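# As an aside, these levels of measurement can be made explicit in code. The sketch below (the satisfaction labels are illustrative, not data from this chapter) uses a pandas ordered categorical to encode an ordinal variable, so that sorting and comparisons respect the category order.

# In[ ]:

# Encode illustrative ordinal survey responses as an ordered pandas categorical
satisfaction = pandas.Categorical(
    ['neutral', 'satisfied', 'very unsatisfied'],
    categories=['very unsatisfied', 'unsatisfied', 'neutral', 'satisfied', 'very satisfied'],
    ordered=True
)

# The explicit order makes ranked comparisons meaningful
print(satisfaction.min(), satisfaction.max())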
# A __random variable__ is a function that maps the outcomes of a random process to numerical values. In this sense, the age of a randomly selected person is a random variable: the age is unknown until the person is selected, and it is a numerical value that is mapped to the outcome of a random process (selecting a person at random).

# There are two types of random variables.
#
# - __Discrete random variable__: A discrete random variable is a random variable that can take on a countable number of values. For example, the number of patients in a hospital is a discrete random variable because the number of patients can only be a non-negative integer. The number of patients in a hospital cannot be $1.5$ or $-2$.
# - __Continuous random variable__: A continuous random variable is a random variable that can take on an uncountable number of values. For example, the weight of a patient is a continuous random variable because the weight of a patient can be any positive real number, such as $203.2$ or $123.5$ pounds.

# Discrete random variables are used to model count data. Continuous random variables are used to model measurement data. A nominal variable is discrete because the number of categories is countable, and the same holds for an ordinal variable. An interval variable is continuous because the number of values is uncountable, and the same holds for a ratio-type variable.

# The values of a random variable are called __realizations__. For example, the age of a randomly selected person is a random variable, and the age of that person is a realization of the random variable. Likewise, the number of patients with diabetes in a surgical ward is a realization of a random variable. Note that in this case the variable that might be used in a spreadsheet is _Diabetes_ and the values that are captured in the data might be _Yes_ or _No_. The number of patients with diabetes is a random variable, though.

# Statistical analysis is expressed in terms of random variables. A random process occurs and a random variable maps the outcome of this process to a number. There are two important definitions in relation to random processes.
#
# - __Random experiment__: A random experiment is a process or procedure that leads to one of several possible outcomes. The outcome is not known in advance and will vary when the experiment is repeated under the same conditions. Examples of random experiments include flipping a coin, rolling a die, or drawing a card from a deck.
# - __Trial__: A trial refers to each occasion when a random experiment is carried out. For example, if you flip a coin 10 times, each individual coin flip is considered a trial. The result of each trial is an outcome. In this case, the outcome of each trial could be a head or a tail.

# The __sample space__ of a random experiment is the set of all possible outcomes of that experiment. It is usually denoted by the letter $S$ or the Greek letter $\Omega$. For example, if a coin is flipped, the sample space is the set of values $\Omega = \{\text{Heads}, \text{Tails}\}$. If a six-sided die is rolled, the sample space is $S=\{1, 2, 3, 4, 5, 6\}$. For a more complex experiment, such as the flipping of two coins, the sample space would be $\Omega=\{ (\text{Heads}, \text{Heads}), (\text{Heads}, \text{Tails}), (\text{Tails}, \text{Heads}), (\text{Tails}, \text{Tails})\}$. Each individual outcome in the sample space is called a sample point. The sample space should include every outcome that can possibly occur.
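# To make the definition concrete, a sample space can be enumerated in code. A minimal sketch of the two-coin example above, using only the standard library:

# In[ ]:

from itertools import product

# Enumerate the sample space for flipping two coins
sample_space = list(product(['Heads', 'Tails'], repeat=2))
print(sample_space)
print('Number of sample points:', len(sample_space))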
# ## Distributions

# ### Discrete random variables

# Consider a trial that rolls two fair six-sided dice and adds the two values that land face up. The fact that the dice are fair means that each of the six faces has an equal likelihood of landing face up. The sample space of the sum in each trial is $\Omega = \{ 2,3,4,5,6,7,8,9,10,11,12\}$. A random experiment might be to roll the dice $1000$ times (the trial is repeated $1000$ times). This can be simulated with Python code.

# The code below seeds the numpy pseudo-random number generator with the integer $28$. Seeding makes the pseudo-random values repeatable: for a given seed integer, the same values will be returned every time the code is run. An empty Python list is created and then a for loop is used to create two random values on the interval $[1,6]$ that are added and appended to the list.

# In[3]:

# Seed the random number generator with a known value
numpy.random.seed(28)

# Create an empty list named "results" to store the results
results = []

# Create a for loop to roll the dice 1000 times
for _ in range(1000):
    results.append(numpy.random.randint(1, 7) + numpy.random.randint(1, 7))

# The commented code below shows a more compact way to generate the same list of values. The code uses a list comprehension to generate the list of values.

# ```
# # Seed the random number generator with a known value
# numpy.random.seed(28)
#
# # Select two random integers between 1 and 6 and add them
# def roll_dice():
#     return numpy.random.randint(1, 7) + numpy.random.randint(1, 7)
#
# # Roll the dice 1000 times and store the results
# rolls = [roll_dice() for _ in range(1000)]
# ```

# A dataframe object is created from the results.

# In[4]:

# Convert the list of results to a pandas DataFrame
results = pandas.DataFrame(results, columns=['Value'])

# A second dataframe object is created to hold the results of the `value_counts` method applied to the first dataframe. The `value_counts` method returns a series object that contains the number of times each value occurs in the dataframe. The `sort_index` method is used to sort the series by the index values.

# In[5]:

# Create a DataFrame object from the value counts and sort the index
value_counts = pandas.DataFrame(results['Value'].value_counts()).sort_index()

# Print the dataframe object to the screen
print(value_counts)
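# As an aside, the `value_counts` method can return proportions directly by setting its `normalize` parameter. A small sketch; these proportions reappear below as estimated probabilities.

# In[ ]:

# Proportions rather than raw counts, sorted by dice value
print(results['Value'].value_counts(normalize=True).sort_index())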
# A bar plot shows the number of times each value occurs in the dataframe. The bar plot shows that the value $7$ occurs most often and the values $2$ and $12$ occur least often.

# In[6]:

# Create a bar chart of the value counts
fig = graph_objects.Figure(
    graph_objects.Bar(
        x=value_counts.index.to_list(),
        y=value_counts.values[:, 0],
        text=value_counts.values[:, 0],
        textposition='auto'
    )
)

# Update the layout to improve readability
fig.update_layout(
    title='Value Counts for 1000 Dice Rolls',
    xaxis_title='Dice Value',
    yaxis_title='Count',
    width=700,
)

# Display the plot
fig.show()

# In the case of this experiment the random variable, $X$, is the sum of the two values that land face up. The random variable is discrete because the number of possible values is countable: the set of possible values is finite rather than continuous.

# The random variable $X$ can take on the values $2,3,4,5,6,7,8,9,10,11,12$. The probability that the random variable takes on a specific value is given by the __probability mass function__ (PMF). The PMF is denoted by $f(x)$ and is defined as $P \left( X=x \right)$, where $x$ can take any value $2,3,4,5,6,7,8,9,10,11,12$.

# In this simple case of a discrete random variable, the PMF can be estimated by dividing the number of times a value occurs by the total number of trials. For example, the value $7$ occurs $165$ times in the dataframe. The total number of trials is $1000$. The estimated probability for the value $7$ is calculated below.

# In[7]:

# Experimental probability mass function of the two-dice sum distribution
value_counts

# In[8]:

# Calculate the probability of rolling a 7 in this experiment
print('Probability of rolling a 7:', value_counts.iloc[5].values.tolist()[0] / 1000)

# The bar plot below visualizes the estimated probability mass function.

# In[9]:

# Create a bar chart of the value counts as proportions
fig = graph_objects.Figure(
    graph_objects.Bar(
        x=value_counts.index.to_list(),
        y=value_counts.values[:, 0] / 1000
    )
)

# Update the layout to improve readability
fig.update_layout(
    title='Visualizing the Probabilities',
    xaxis_title='Dice Value',
    yaxis_title='Probability (proportion)',
    width=700
)

# Display the plot
fig.show()

# Note that the probabilities are simply proportions. Each probability is on the interval $[0,1]$ and all the probabilities sum to $1$. An experiment such as this takes a sample ($1000$ trials here). The probabilities are estimates of a theoretical probability mass function. The estimated PMF (using the data from this experiment) is shown below in (1). The hat, $\hat{}$, on the $P$ denotes that the probabilities are estimates.

# $$
# \hat{P}(X=x) = \begin{cases} 0.032 &\text{ if } x = 2 \\ 0.051 &\text{ if } x = 3 \\ 0.074 &\text{ if } x = 4 \\ 0.130 &\text{ if } x = 5 \\ 0.126 &\text{ if } x = 6 \\ 0.165 &\text{ if } x = 7 \\ 0.146 &\text{ if } x = 8 \\ 0.106 &\text{ if } x = 9 \\ 0.090 &\text{ if } x = 10 \\ 0.049 &\text{ if } x = 11 \\ 0.031 &\text{ if } x = 12 \end{cases} \tag{1}
# $$

# There is only a single way to roll a $2$, namely a $1$ and a $1$. There are two ways to roll a $3$: a $1$ and a $2$, or a $2$ and a $1$. The sum of $7$ can be rolled in six ways. In total there are $36$ equally likely possibilities. The theoretical PMF is given below in (2).

# $$
# P(X=x) = \begin{cases} \frac{1}{36} &\text{ if } x = 2 \\ \frac{2}{36} &\text{ if } x = 3 \\ \frac{3}{36} &\text{ if } x = 4 \\ \frac{4}{36} &\text{ if } x = 5 \\ \frac{5}{36} &\text{ if } x = 6 \\ \frac{6}{36} &\text{ if } x = 7 \\ \frac{5}{36} &\text{ if } x = 8 \\ \frac{4}{36} &\text{ if } x = 9 \\ \frac{3}{36} &\text{ if } x = 10 \\ \frac{2}{36} &\text{ if } x = 11 \\ \frac{1}{36} &\text{ if } x = 12 \end{cases} \tag{2}
# $$
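# The theoretical PMF in (2) need not be tallied by hand; it can also be derived by enumerating all $36$ equally likely outcomes. A minimal sketch using the standard library:

# In[ ]:

from collections import Counter
from itertools import product

# Count each possible sum over all 36 equally likely two-dice outcomes
sums = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Convert counts to probabilities; these match the fractions in (2)
for value in sorted(sums):
    print(value, sums[value], '/ 36 =', sums[value] / 36)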
# The theoretical PMF is visualized below.

# In[10]:

# Create a bar chart of the theoretical probabilities
fig = graph_objects.Figure(
    graph_objects.Bar(
        x=numpy.arange(2, 13),
        y=numpy.array([1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]) / 36
    )
)

# Update the layout to improve readability
fig.update_layout(
    title='Visualizing the PMF',
    xaxis_title='Dice Value',
    yaxis_title='Probability (proportion)',
    width=700
)

# Display the plot
fig.show()

# Using the theoretical PMF it is possible to ask questions such as: _What is the probability of rolling a value of_ $10$ _or more?_ Since the events $X=10$, $X=11$, and $X=12$ are mutually exclusive, the individual probabilities can be added together. The probability of rolling a value of $10$ or more is $P(X=10) + P(X=11) + P(X=12)$. The probability sums to $6/36 \approx 0.167$ and is visualized below.

# In[11]:

# Generate bar colors, highlighting the values 10, 11, and 12
colors = ['lightslategray',] * 11
colors[8] = 'crimson'
colors[9] = 'crimson'
colors[10] = 'crimson'

# Create a bar chart of the theoretical probabilities
fig = graph_objects.Figure(
    graph_objects.Bar(
        x=numpy.arange(2, 13),
        y=numpy.array([1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]) / 36,
        marker_color=colors
    )
)

# Update the layout to improve readability
fig.update_layout(
    title='Probability of rolling a 10 or more (red)',
    xaxis_title='Dice Value',
    yaxis_title='Probability (proportion)',
    width=700
)

# Display the plot
fig.show()

# ### Continuous random variables

# Continuous variables can take on an infinite number of values, because these values are real numbers. The probability that a continuous random variable takes on any specific value is zero. For example, the probability that a patient's weight is exactly $150$ pounds is zero, but the probability that a patient's weight is between $150$ and $151$ pounds is not zero. It is only meaningful to consider the probability that a continuous random variable falls within a range of values.

# Instead of a probability mass function, a continuous random variable has a __probability density function__ (PDF), denoted by $f(x)$. Probabilities are defined over intervals: $P \left( a \leq X \leq b \right)$, where $a$ and $b$ are real numbers, is the area under the curve of the PDF between $a$ and $b$. The probability that a patient's weight is between $150$ and $151$ pounds is therefore the area under the PDF between $150$ and $151$ pounds.

# The normal distribution is perhaps the best-known distribution for a continuous random variable. The normal distribution is defined by two parameters, the mean, $\mu$, and the standard deviation, $\sigma$. The mean is the center of the distribution and the standard deviation is a measure of the spread of the distribution. The normal distribution is denoted by $N(\mu, \sigma)$.

# The code cell below creates a numpy array of $10000$ values, taken from a normal distribution, $N(\mu=100, \sigma=10)$, and plots the probability density function.
# In[12]:

# Set the numpy pseudo-random number generator seed
numpy.random.seed(28)

# Create an array of 10000 values from a normal distribution with a mean of 100
# and a standard deviation of 10, assigned to the variable values
values = numpy.random.normal(100, 10, 10000)

# Create a plotly create_distplot figure of the data in values
fig = figure_factory.create_distplot(
    [values],
    ['Values'],
    show_hist=False,
    histnorm='probability density',
    curve_type='normal',
    show_rug=False
)

fig.update_layout(
    title='Probability density function for normal distribution',
    xaxis_title='Variable value',
    yaxis_title='Probability density',
    width=700
)

fig.show()

# More values occur in the array around the mean of $100$ than at the tails of the distribution. The left and right tails are the _zones_ further away to the left and right of the mean.

# For the sake of interest, the equation for the normal distribution is shown in (3).

# $$
# f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{- \frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2}}\tag{3}
# $$

# From the plot it should be clear that $50\%$ of the values are below $100$. The normal distribution is symmetrical. The total area under the curve is $1$. The idea of area is the same as the idea of the area of a circle or the area of a square. Because a proportion is the same as a probability, the area under the curve over an interval is the probability that a value falls within that interval of values.

# The __standard normal distribution__ has a mean of $0$ and a standard deviation of $1$, written as $N(0,1)$. Before the general availability of computers, probabilities (areas under the curve for a given interval) were calculated using tables. These were created for the standard normal distribution. There is an equation for converting from a normal distribution with a mean of $\mu$ and a standard deviation of $\sigma$ to the standard normal distribution. This is shown in (4), where $Z$ is the symbol for the converted value and $x$ is the value that the random variable $X$ takes.

# $$
# Z = \frac{x - \mu}{\sigma}\tag{4}
# $$

# This allows for answering an example question such as the following. _Given that a random variable follows a normal distribution with a mean of $100$ and a standard deviation of $10$, what is the probability that the value is less than $90$?_ Since a probability is a proportion of the total area under the curve, the question really asks what proportion of the area under the curve (the plot above) is to the left of $90$. The answer is $0.1587$.

# The value of $90$ can be converted using (4), shown in (5).

# $$
# Z = \frac{90-100}{10} = -1\tag{5}
# $$

# Instead of asking the probability of a value being less than $90$, written as $P(X<90)$ (given that $X$ follows a normal distribution with a mean of $100$ and a standard deviation of $10$), the question can be asked as the probability of a value being less than $-1$, written as $P(Z<-1)$ (given that $Z$ is described by the standard normal distribution).

# The `cdf` function from the stats module in the scipy package can perform the calculation. The `cdf` function takes a value and returns the probability that a random value is less than or equal to that value. The value of $-1$ is passed to the `cdf` function and the result is $0.1587$.

# In[13]:

# Calculate the probability of a value less than -1 for the standard normal distribution
print('Probability of a value less than -1:', stats.norm.cdf(-1))
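# As a side note, the standardization in (4) can also be left to scipy: the `cdf` function accepts `loc` (mean) and `scale` (standard deviation) arguments, so the same probability can be computed directly from the original distribution. A quick sketch checking the equivalence:

# In[ ]:

# P(X < 90) for N(100, 10), computed without manual standardization
print(stats.norm.cdf(90, loc=100, scale=10))

# The same probability after converting to a Z value, as in (5)
print(stats.norm.cdf(-1))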
# The probability that a random value taken from the distribution (plot above) is at most $90$ is $0.159$, or about $15.9\%$. This is the same as saying that the area under the curve to the left of $90$ is $0.159$, or about $15.9\%$.

# Probability distributions can be used to consider population parameters such as $\mu$ and $\sigma$ by selecting a sample from a population. The mean and standard deviation for a continuous variable, calculated from a sample of individuals taken from a population, are written as $\bar{X}$ and $s$. Here $\bar{X}$ and $s$ are estimates of the population mean and standard deviation, $\mu$ and $\sigma$.

# ## Inference

# __Inference__ is the use of calculations, that is, statistics, based on a sample taken from a population to learn about population parameters. In the frequentist approach to statistics, there are two types of inference. These are __interval estimation__ (uncertainty or confidence intervals) and __hypothesis testing__.

# ### Uncertainty

# To understand the concept of uncertainty, a variable is created for a simulated population, with a population size of $1000$ individuals (considering perhaps a rare disease). The variable is created from a continuous uniform distribution (each value is as likely as any other value on the interval). The code below generates the values for the variable in the population using the `uniform` function from the numpy random module. The values are on the interval $[50,100]$.

# In[14]:

# Seed the numpy pseudo-random number generator
numpy.random.seed(28)

# Create an array of 1000 values using a continuous uniform distribution between
# 50 and 100 and assign it to the variable population
population = numpy.random.uniform(50, 100, 1000)

# Create a plotly create_distplot figure of the data in population
fig = figure_factory.create_distplot(
    [population],
    ['Population'],
    show_hist=True,
    show_rug=False,
    show_curve=False
)

fig.update_layout(
    title='Histogram of population values',
    xaxis_title='Variable value',
    yaxis_title='Frequency',
    width=700
)

fig.show()

# Even though the random variable, $X$, does not follow a normal distribution, the population mean and standard deviation can still be calculated.

# In[15]:

# Calculate the population mean of the population and assign it to the variable mu
mu = population.mean()
mu

# In[16]:

# Calculate the population standard deviation of the population and assign it to the variable sigma
sigma = population.std()
sigma

# The mean (green line) and standard deviation (red dashed lines) are added to the plot below.

# In[17]:

# Add a vertical line to the plot at the mean and dashed vertical lines one standard deviation on either side
fig.add_vline(x=mu, line_width=2, line_color='green')
fig.add_vline(x=mu - sigma, line_width=2, line_dash='dash', line_color='red')
fig.add_vline(x=mu + sigma, line_width=2, line_dash='dash', line_color='red')

fig.show()
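# As a sanity check, the theoretical mean and standard deviation of a continuous uniform distribution on $[a, b]$ are $(a+b)/2$ and $(b-a)/\sqrt{12}$, which should be close to the simulated values above. A quick sketch:

# In[ ]:

# Theoretical moments for a continuous uniform distribution on [50, 100]
a, b = 50, 100
print('Theoretical mean:', (a + b) / 2)              # 75.0, close to mu
print('Theoretical std:', (b - a) / numpy.sqrt(12))  # about 14.43, close to sigma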
# A random sample can be taken from this population. The sample mean can be calculated and serves as an estimate of the population mean. Below, a random sample of $30$ individuals is taken from the population.

# In[18]:

# Seed the pseudo-random number generator
numpy.random.seed(28)

# Select a random sample of 30 values from the population and assign it to the variable sample
sample = numpy.random.choice(population, 30)

# Calculate the sample mean and assign it to the variable x_bar
x_bar = sample.mean()
x_bar

# Whereas the population mean is $\mu = 74.87$, the sample mean is $\bar{X} = 75.39$. The sample mean is an estimate of the population mean and is not the same as the population mean. The sample mean is itself a random variable: if another sample of $30$ individuals is taken from the population, the sample mean will be different, because it is the result of a random process (taking a sample from a population).

# Consider then repeating this experiment of taking a random sample of $30$ individuals from the population over and over again. The sample mean will be different each time. The distribution of the sample means is called the __sampling distribution of the sample mean__. The sampling distribution of the sample mean is a theoretical distribution, because it is not possible to take an infinite number of samples from a population.

# The code below simulates taking a random set of $30$ individuals, calculating the mean, and repeating this random experiment $1000$ times.

# In[19]:

# Use a list comprehension to calculate the sample mean of 1000 samples of size 30
# and assign the result to the variable x_bar_vals
x_bar_vals = [numpy.random.choice(population, 30).mean() for _ in range(1000)]

# For later reference, the sorted sample means can be used to find the cutoff values for the lowest and highest $2.5\%$ of the $1000$ simulated means (the 25th and 975th values in the sorted list).

# In[50]:

# Sort the values in x_bar_vals and assign the result to the variable x_bar_vals_sorted
x_bar_vals_sorted = numpy.sort(x_bar_vals)

# In[52]:

# Display the 25th value (the cutoff for the lowest 2.5% of sample means)
x_bar_vals_sorted[24]

# In[53]:

# Display the 975th value (the cutoff for the highest 2.5% of sample means)
x_bar_vals_sorted[974]

# A histogram of the $1000$ sample means is shown below.

# In[20]:

# Create a histogram of the sample means
fig = graph_objects.Figure(
    graph_objects.Histogram(
        x=x_bar_vals,
        histnorm='probability'
    )
)

fig.update_layout(
    title='Histogram of sample means',
    xaxis_title='Sample means',
    yaxis_title='Probability',
    width=700
)

fig.show()

# Although the variable does not follow a normal distribution in the population, by repeatedly selecting at least $30$ individuals from the population, the sampling distribution of the sample mean is approximately normal. This is called the __central limit theorem__.

# Since it is not possible to repeat a study many times over, there is uncertainty about how well the sample mean (of a single experiment) estimates the population mean. The code below simulates a single study, taking a random $30$ individuals from the initial population and calculating the sample mean.

# In[21]:

# Seed the pseudo-random number generator
numpy.random.seed(28)

# Select a random sample of 30 values from the population and assign it to the variable sample
sample = numpy.random.choice(population, 30)

# Calculate the sample mean and assign it to the variable x_bar
x_bar = sample.mean()
x_bar

# By the method of __bootstrapping__, repeated samples of $30$ values can be taken from the sample of $30$ values. This is done by sampling with replacement (which means that after each selection of an individual random value, it is returned to the pool for possible random reselection).
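# To see what sampling with replacement means in code, the short sketch below (with illustrative values) draws five values from a pool of only three, which is only possible because values can be selected repeatedly.

# In[ ]:

# Seed for repeatability
numpy.random.seed(28)

# With replace=True a value can be selected more than once
print(numpy.random.choice([1, 2, 3], size=5, replace=True))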
# The code below takes a sample of $30$ values from the sample of $30$ values and calculates the sample mean. This is repeated $1000$ times.

# In[22]:

# Seed the pseudo-random number generator
numpy.random.seed(28)

# Randomly select 30 values from sample with replacement 1000 times, calculate the mean
# of each resample, and assign the result to the variable x_bar_bootstrap_vals
x_bar_bootstrap_vals = [numpy.random.choice(sample, 30, replace=True).mean() for _ in range(1000)]

# A histogram of the bootstrapped sampling distribution is shown below.

# In[23]:

# Create a histogram of the bootstrap sample means
fig = graph_objects.Figure(
    graph_objects.Histogram(
        x=x_bar_bootstrap_vals,
        histnorm='probability'
    )
)

fig.update_layout(
    title='Histogram of bootstrap sample means',
    xaxis_title='Bootstrap sample means',
    yaxis_title='Probability',
    width=700
)

fig.show()

# The mean of the bootstrapped sample means is calculated below.

# In[24]:

# Calculate the mean of the bootstrap sample means and assign it to the variable x_bar_bootstrap
x_bar_bootstrap = numpy.mean(x_bar_bootstrap_vals)
x_bar_bootstrap

# Since there is uncertainty in the sample mean, this uncertainty must be expressed. The code below finds the bootstrapped value that represents the cutoff for the lowest $2.5\%$ of all the values. There are $1000$ values in total, therefore the $2.5\%$ cutoff is the $25^{\text{th}}$ value in the sorted list of values.

# In[25]:

# Sort the values in x_bar_bootstrap_vals in place
x_bar_bootstrap_vals.sort()

# In[26]:

# Display the 25th value
x_bar_bootstrap_vals[24]

# The cutoff value for the upper $2.5\%$ of bootstrapped values is calculated below.

# In[27]:

# Display the 975th value
x_bar_bootstrap_vals[974]

# With cutoff values for the bottom and the top $2.5\%$ of bootstrapped means, the middle $95\%$ remains. The $95\%$ is referred to as a __confidence level__ and the bounds are referred to as the __confidence interval__. With the single experiment mean of $\bar{X}=75.39$, a confidence interval can now be stated as $69.91$ to $80.37$.

# The theoretical sampling distribution of the sample mean can also be used to express this uncertainty in the single sample mean that stands as an estimate of the population mean. The theoretical sampling distribution of the sample mean is a normal distribution (given the assumptions for the use of the central limit theorem, which are that the sample size is at least $30$ or that the random variable is indeed normally distributed in the population). As with any distribution, it has a standard deviation. The standard deviation of the sampling distribution of the sample mean is scaled by the sample size, shown in (6), and is termed the __standard error__ (SE).

# $$
# \text{SE} = \frac{s}{\sqrt{n}}\tag{6}
# $$

# In a biostatistics course, the SE is shown to be used in the calculation of the confidence interval for a given confidence level. The calculation is shown in (7), where $c$ is the confidence coefficient.

# $$
# \bar{X} \pm c \times \text{SE}\tag{7}
# $$

# For a sample (given that the population standard deviation is not known), the mean in this example is described by a _t_ distribution with $n-1$ degrees of freedom. The confidence coefficient is calculated below using the `ppf` function.

# In[28]:

# Calculate the confidence coefficient for a confidence level of 95% using the percentile
# method for the t distribution, given a sample size of 30, and assign it to the variable c
c = stats.t.ppf(0.975, 29)
c
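# The `ppf` function is the inverse of the `cdf` function: it returns the value below which a given proportion of the distribution lies. As a point of comparison, the confidence coefficient from the _t_ distribution with $29$ degrees of freedom is slightly larger than the familiar $1.96$ from the standard normal distribution, because the _t_ distribution has heavier tails.

# In[ ]:

# Compare the 97.5th percentiles of the t distribution (df=29) and the standard normal distribution
print('t distribution (df=29):', stats.t.ppf(0.975, 29))
print('Standard normal:', stats.norm.ppf(0.975))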
# The standard deviation of the sample is calculated below.

# In[29]:

# Calculate the standard deviation of the sample and assign it to the variable sample_std
sample_std = sample.std(ddof=1)
sample_std

# From this, the SE is calculated.

# In[30]:

# Calculate the standard error of the sample mean and assign it to the variable standard_error
standard_error = sample_std / numpy.sqrt(30)
standard_error

# The confidence interval is calculated below.

# In[31]:

# Calculate the lower bound of the confidence interval and assign it to the variable lower_bound
lower_bound = x_bar - c * standard_error
lower_bound

# In[32]:

# Calculate the upper bound of the confidence interval and assign it to the variable upper_bound
upper_bound = x_bar + c * standard_error
upper_bound

# Notice how close these theoretical values are to the bootstrapped values.
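# As an aside, scipy can combine (6) and (7) into a single call with the `interval` method of the _t_ distribution. A minimal sketch (the positional argument order shown here is one common form of the signature and may vary slightly across scipy versions):

# In[ ]:

# 95% confidence interval from the t distribution in a single call
print(stats.t.interval(0.95, 29, loc=x_bar, scale=standard_error))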
# Understanding distributions, the proportions of areas under the probability density curve, and the uncertainty in how well a sample statistic estimates a population parameter together form the basis of confidence intervals, making inferences about a population from a sample. With the understanding of confidence intervals developed above, the rest of this chapter examines hypothesis testing.

# ### Hypothesis testing

# Another set of example data is used in this section to develop an intuition for hypothesis testing. Below, values for two identical populations of $3000$ individuals are generated from a continuous uniform distribution on the interval $[50,100]$. The two sets of values are assigned to the variables `townA` and `townB`, to simulate individuals from two towns.

# In[33]:

# Seed the pseudo-random number generator
numpy.random.seed(28)

# Create two numpy arrays, selecting from a continuous uniform distribution with a range
# of 50 to 100, both with 3000 values, and assign them to the variables townA and townB
townA = numpy.random.uniform(50, 100, 3000)
townB = numpy.random.uniform(50, 100, 3000)

# The difference in the mean of the simulated variable is calculated for the two towns below. Note that since there is no natural order for the two towns, two differences are possible.

# In[34]:

# Calculate the difference in means between the two towns (town A - town B)
diffAB = townA.mean() - townB.mean()
diffAB

# In[35]:

# Calculate the difference in means between the two towns (town B - town A)
diffBA = townB.mean() - townA.mean()
diffBA

# The two values represent the true difference in means for this simulated continuous numerical variable for the two towns. In _real life_, researchers would not be aware of this true difference. Instead, they can take a random sample of individuals from each town and calculate the sample mean for each town. The difference in the sample means is an estimate of the true difference in means. The code below takes a random sample of $30$ individuals from each town.

# In[36]:

# Seed the pseudo-random number generator
numpy.random.seed(28)

# Create a numpy array of 30 random values from townA and assign it to the variable sampleA
sampleA = numpy.random.choice(townA, 30)

# Create a numpy array of 30 random values from townB and assign it to the variable sampleB
sampleB = numpy.random.choice(townB, 30)

# The two sample means and their differences are calculated below.

# In[37]:

# Calculate the difference in sample means between the two samples (sampleA - sampleB)
# and assign the difference to the variable sample_diffAB
sample_diffAB = sampleA.mean() - sampleB.mean()
sample_diffAB

# In[38]:

# Calculate the difference in sample means between the two samples (sampleB - sampleA)
# and assign the difference to the variable sample_diffBA
sample_diffBA = sampleB.mean() - sampleA.mean()
sample_diffBA

# These sample difference estimates are quite different from the actual differences. How can researchers investigate this difference? The answers are _interval estimation_ (confidence intervals) and _hypothesis testing_.

# As a review of interval estimation, the code below uses bootstrap resampling from the sample data. This is repeated $1000$ times. The difference in the sample means is calculated for each bootstrap sample. The differences are ordered in ascending order and the $25^{\text{th}}$ and $975^{\text{th}}$ values are selected. These values represent the cutoffs for the lower and upper $2.5\%$ of the distribution of the differences in the sample means. The middle $95\%$ of the differences in the sample means is the confidence interval for the difference in the sample means.

# In[39]:

# Seed the pseudo-random number generator
numpy.random.seed(28)

# Generate 1000 differences in means using resampling with replacement and a list
# comprehension, and assign the result to the variable bootstrap_diffAB
bootstrap_diffAB = [numpy.random.choice(sampleA, 30, replace=True).mean() - numpy.random.choice(sampleB, 30, replace=True).mean() for _ in range(1000)]

# Sort the values in bootstrap_diffAB
bootstrap_diffAB.sort()

# Display the 25th value
bootstrap_diffAB[24]

# In[40]:

# Display the 975th value
bootstrap_diffAB[974]

# Using bootstrap resampling, the confidence interval for the difference in the sample means (with the order, town A minus town B) is between $-1.77$ and $9.04$. Note that the true difference is in this interval. Note also that $0$ (indicating no difference) is in this interval.

# In hypothesis testing, researchers consider whether there is a difference in the mean of the variable between the two towns. This is a __research question__.

# In hypothesis testing there are two views of the research question. The two views are termed __hypotheses__: the __null hypothesis__ and the __alternative hypothesis__.

# The null hypothesis takes a conservative approach. In the case of the example above, the null hypothesis is that there is no difference in the mean of the variable between the two towns. The alternative hypothesis is that there is a difference in the mean of the variable between the two towns. The null, $H_{0}$, and alternative, $H_{\alpha}$, hypotheses are written in (8), where $\mu_{1}$ is the mean of the variable in town A and $\mu_{2}$ is the mean of the variable in town B. Note that these refer to the population parameters and not the sample statistics.

# $$
# \begin{align*}
# &H_{0}: \, \mu_{1} - \mu_{2} = 0 \\ \\
# &H_{\alpha}: \, \mu_{1} - \mu_{2} \neq 0
# \end{align*}
# \tag{8}
# $$

# Until data values are collected and analyzed from random samples, the null hypothesis (about the distribution of the data in the populations from which the samples were taken) stands. To _choose between_ the two hypotheses, a statistic is required, termed the __test statistic__. In the example above, it is the difference in means.

# The null hypothesis could then be stated as: _There is no difference in the means of the variable between the two towns_.
# The alternative hypothesis would then be: _There is a difference in the means of the variable between the two towns_. This is referred to as a __two-tailed alternative hypothesis__.

# If the null hypothesis is true, then it does not matter whether a value is from town A or town B. This can be simulated by combining the two sets of sample values into a single set of values. The code below combines the two sets of values into a single array.

# In[41]:

# Merge the two samples into a single numpy array and assign it to the variable samples
samples = numpy.concatenate((sampleA, sampleB))

# The `shuffle` function randomly reorders the values in the combined array.

# In[42]:

# Seed the pseudo-random number generator
numpy.random.seed(28)

# Randomly reorder the 60 values in the samples array
numpy.random.shuffle(samples)

# This reordering can be done repeatedly. The code below shuffles the combined array $1000$ times and calculates the difference in the means between the first $30$ and last $30$ values at each repeat, simulating the null hypothesis.

# In[43]:

# Seed the pseudo-random number generator
numpy.random.seed(28)

# Create an empty numpy array to hold the differences in means and assign it to the variable shuffle_diffAB
shuffle_diffAB = numpy.empty(1000)

# Reshuffle the values in samples 1000 times and calculate the difference in the means
# of the first 30 values and the last 30 values each time
for i in range(1000):
    numpy.random.shuffle(samples)
    shuffle_diffAB[i] = samples[:30].mean() - samples[30:].mean()

# A histogram of the resampled differences in the means is shown below.

# In[44]:

# Create a histogram of the shuffle_diffAB values
fig = graph_objects.Figure(
    graph_objects.Histogram(
        x=shuffle_diffAB,
        histnorm='probability'
    )
)

fig.update_layout(
    title='Histogram of shuffle differences in means',
    xaxis_title='Shuffle differences in means',
    yaxis_title='Probability',
    width=700
)

fig.show()

# The proportion of values less than $-1.7692322524601138$ and more than $1.7692322524601138$ (the magnitude of the observed difference in sample means) is calculated below.

# In[45]:

# Calculate the proportion of values in shuffle_diffAB that are less than
# -1.7692322524601138 or more than 1.7692322524601138
(shuffle_diffAB < -1.7692322524601138).sum() / 1000 + (shuffle_diffAB > 1.7692322524601138).sum() / 1000

# This is the probability of finding the observed difference in means from the two samples, or a more extreme difference, given that the null hypothesis is true. Note that the sampling distribution of the difference in the means was created under the assumption that the null hypothesis is true. An arbitrary value can be set indicating the willingness of the researchers to falsely reject the null hypothesis. Rejecting the null hypothesis requires accepting the alternative hypothesis. This value, termed the $\alpha$ value, is typically set to $0.05$. For a two-tailed alternative hypothesis, this indicates a value in the lower $2.5\%$ or the upper $2.5\%$ of the resampled values.

# In this case, the _p_ value was $0.613$. This is larger than the $\alpha$ value, leading to a failure to reject the null hypothesis.
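# For reference, recent versions of scipy (1.8 and later) package this shuffling procedure as the `permutation_test` function in the stats module. A minimal sketch; the configuration below is one reasonable choice, not the only one.

# In[ ]:

# The difference in means serves as the test statistic
def mean_difference(x, y):
    return numpy.mean(x) - numpy.mean(y)

# Permutation test of the two samples under the null hypothesis of no difference
result = stats.permutation_test(
    (sampleA, sampleB),
    mean_difference,
    permutation_type='independent',
    vectorized=False,
    n_resamples=1000
)

print(result.pvalue)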
# As with interval estimates, a theoretical distribution of the sampling differences in means can also be used. One such test, known as the equal-variance _t_ test (or Student's _t_ test), determines the _p_ value. The `ttest_ind` function from the stats module of the scipy package can calculate the _p_ value.

# In[46]:

# Use the ttest_ind function to calculate the t statistic and p value for the two samples
stats.ttest_ind(sampleA, sampleB)

# The _p_ value is very close to the _p_ value calculated from the resampled (shuffled) sampling distribution of the difference in the means. In both cases, the value is larger than the chosen $\alpha$ value, leading to a failure to reject the null hypothesis that there is no difference in the means of the variable between the two towns. The same conclusion could have been drawn from the confidence interval, which contained the value $0$, indicating no difference in the means of the variable between the two towns.

# # Quiz questions

# ### Questions

# Consider the array below, representing the diastolic blood pressure of a simulated population of $100000$ people.

# In[47]:

# Seed the pseudo-random number generator
numpy.random.seed(28)

# Create an array of 100000 random values from a normal distribution with a mean of 85
# and a standard deviation of 10 and assign it to the variable dBP
dBP = numpy.random.normal(85, 10, 100000)

# The population mean is calculated below.

# In[48]:

# Calculate the population mean
dBP.mean()

# Use the random seed value of $12$ throughout to answer the following questions or perform the indicated tasks.

# 1. Select a random sample of $40$ individuals from the population and assign the created numpy array to the variable `sample`.
#
# 2. Calculate the sample mean and assign the value to the variable `sample_mean`.
#
# 3. Calculate the standard deviation of the sample and assign the value to the variable `sample_std`.
#
# 4. Calculate the standard error of the sample and assign the value to the variable `standard_error`.
#
# 5. Calculate the confidence coefficient for a $95\%$ confidence level, given the sample size and degrees of freedom, and assign the value to the variable `c`.
#
# 6. Calculate the lower bound of the confidence interval for the sample mean and assign the result to the variable `lower_bound`.
#
# 7. Calculate the upper bound of the confidence interval for the sample mean and assign the result to the variable `upper_bound`.
#
# 8. Create an array of $1000$ bootstrapped sample means (using the sample size of $40$), assign the array to the variable `bootstrapped_means`, and sort the values in the array in ascending order.
#
# 9. Show the bootstrapped value representing the $2.5^{\text{th}}$ percentile.
#
# 10. Show the bootstrapped value representing the $97.5^{\text{th}}$ percentile.

# In[ ]: