!pip install --upgrade scipy
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (1.4.1)
Collecting scipy
Downloading scipy-1.7.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (38.1 MB)
Requirement already satisfied: numpy<1.23.0,>=1.16.5 in /usr/local/lib/python3.7/dist-packages (from scipy) (1.21.6)
Installing collected packages: scipy
Attempting uninstall: scipy
Found existing installation: scipy 1.4.1
Uninstalling scipy-1.4.1:
Successfully uninstalled scipy-1.4.1
Successfully installed scipy-1.7.3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import math
from scipy import stats
from scipy.stats import (norm, chi2, t, f, bernoulli, binom, nbinom, geom,
                         poisson, uniform, randint, expon, gamma, beta,
                         weibull_min, hypergeom, shapiro, pearsonr, normaltest,
                         anderson, spearmanr, kendalltau, chi2_contingency,
                         ttest_ind, ttest_rel, mannwhitneyu, wilcoxon, kruskal,
                         friedmanchisquare)
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import kpss
from statsmodels.stats.weightstats import ztest
from scipy.integrate import quad
from IPython.display import display, Latex
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)
Tests whether a data sample has a Normal (Gaussian) distribution.
$H_0$ : The sample has a Normal (Gaussian) distribution.
$H_1$ : The sample does not have a Normal (Gaussian) distribution.
Assumptions:
Observations in the sample are independent and identically distributed (iid).
N = 100
alpha = 0.05
np.random.seed(1)
data = np.random.normal(0, 1, N)
Test_statistic, p_value = shapiro(data)
print(f'Test_statistic_shapiro = {Test_statistic}, p_value = {p_value}', '\n')
if p_value < alpha:
    print(f'Since p_value < {alpha}, reject the null hypothesis. Therefore, the data is probably not normal.')
else:
    print(f'Since p_value > {alpha}, the null hypothesis cannot be rejected. Therefore, the data is probably normal.')
Test_statistic_shapiro = 0.9920045137405396, p_value = 0.8215526342391968
Since p_value > 0.05, the null hypothesis cannot be rejected. Therefore, the data is probably normal.
Tests whether a data sample has a Normal (Gaussian) distribution.
$H_0$ : The sample has a Normal (Gaussian) distribution.
$H_1$ : The sample does not have a Normal (Gaussian) distribution.
Assumptions:
Observations in the sample are independent and identically distributed (iid).
N = 100
alpha = 0.05
np.random.seed(1)
data = np.random.normal(0, 1, N)
Test_statistic, p_value = normaltest(data)
print(f"Test_statistic_D'Agostino's K-squared = {Test_statistic}, p_value = {p_value}", "\n")
if p_value < alpha:
    print(f'Since p_value < {alpha}, reject the null hypothesis. Therefore, the data is probably not normal.')
else:
    print(f'Since p_value > {alpha}, the null hypothesis cannot be rejected. Therefore, the data is probably normal.')
Test_statistic_D'Agostino's K-squared = 0.10202388832581702, p_value = 0.9502673203169621
Since p_value > 0.05, the null hypothesis cannot be rejected. Therefore, the data is probably normal.
Tests whether a data sample has a Normal (Gaussian) distribution.
$H_0$ : The sample has a Normal (Gaussian) distribution.
$H_1$ : The sample does not have a Normal (Gaussian) distribution.
Assumptions:
Observations in the sample are independent and identically distributed (iid).
Critical values provided are for the following significance levels:
normal/exponential:
$15\%, 10\%, 5\%, 2.5\%, 1\%$
logistic:
$25\%, 10\%, 5\%, 2.5\%, 1\%, 0.5\%$
Gumbel:
$25\%, 10\%, 5\%, 2.5\%, 1\%$
If the test statistic is larger than these critical values then for the corresponding significance level, the null hypothesis that the data come from the chosen distribution can be rejected.
N = 100
np.random.seed(1)
data = np.random.normal(0, 1, N)
Test_statistic, critical_values, significance_level = anderson(data, dist='norm')
print(f'Test_statistic_anderson = {Test_statistic}', '\n')
for i in range(len(critical_values)):
    sl, cv = significance_level[i], critical_values[i]
    if Test_statistic < cv:
        print(f'At the {sl}% significance level, the test statistic ({Test_statistic}) is below the critical value, so the null hypothesis cannot be rejected.')
    else:
        print(f'At the {sl}% significance level, the test statistic ({Test_statistic}) exceeds the critical value, so the null hypothesis is rejected.')
Test_statistic_anderson = 0.2196508855594459
At the 15.0% significance level, the test statistic (0.2196508855594459) is below the critical value, so the null hypothesis cannot be rejected.
At the 10.0% significance level, the test statistic (0.2196508855594459) is below the critical value, so the null hypothesis cannot be rejected.
At the 5.0% significance level, the test statistic (0.2196508855594459) is below the critical value, so the null hypothesis cannot be rejected.
At the 2.5% significance level, the test statistic (0.2196508855594459) is below the critical value, so the null hypothesis cannot be rejected.
At the 1.0% significance level, the test statistic (0.2196508855594459) is below the critical value, so the null hypothesis cannot be rejected.
Note that you can use the Anderson-Darling test for other distributions.
The valid values are: {'norm', 'expon', 'logistic', 'gumbel', 'gumbel_l', 'gumbel_r', 'extreme1'}
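As a minimal sketch (not from the original notebook), the same test can be run against the exponential distribution by passing dist='expon'; the data here is generated to actually be exponential, so the null should not be rejected at any level:

```python
import numpy as np
from scipy.stats import anderson

# Generate data that really is exponential, then test against 'expon'.
np.random.seed(1)
data = np.random.exponential(scale=2.0, size=100)

result = anderson(data, dist='expon')
print(f'Test_statistic_anderson = {result.statistic}')
for sl, cv in zip(result.significance_level, result.critical_values):
    if result.statistic < cv:
        print(f'At the {sl}% significance level, the null hypothesis cannot be rejected.')
    else:
        print(f'At the {sl}% significance level, the null hypothesis is rejected.')
```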
Tests whether two samples have a linear relationship.
$H_0$: The two samples are independent.
$H_1$: There is a dependency between the two samples.
Assumptions:
Observations in each sample are independent and identically distributed (iid).
Observations in each sample are normally distributed.
N = 10
alpha = 0.05
np.random.seed(1)
data1 = np.random.normal(0, 1, N)
data2 = np.random.normal(0, 1, N) + 2
Test_statistic, p_value = pearsonr(data1, data2)
print(f"Test_statistic_Pearson's Correlation = {Test_statistic}, p_value = {p_value}", "\n")
if p_value < alpha:
    print(f'Since p_value < {alpha}, reject the null hypothesis. Therefore, the two samples are probably dependent.')
else:
    print(f'Since p_value > {alpha}, the null hypothesis cannot be rejected. Therefore, the two samples are probably independent.')
Test_statistic_Pearson's Correlation = 0.6556177144470315, p_value = 0.03957633895447448
Since p_value < 0.05, reject the null hypothesis. Therefore, the two samples are probably dependent.
This test is parametric.
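Because Pearson's test is parametric and measures only linear association, a small sketch (not from the original notebook) of how it compares with the rank-based Spearman test on a monotone but nonlinear relationship:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Monotone but strongly nonlinear relationship: the ranks agree perfectly,
# so Spearman's rho is exactly 1 while Pearson's r is noticeably smaller.
x = np.linspace(1, 10, 50)
y = np.exp(x)

r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)
print(f"Pearson's r = {r:.3f}, Spearman's rho = {rho:.3f}")
```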
Tests whether two samples have a monotonic relationship.
$H_0$: The two samples are independent.
$H_1$: There is a dependency between the two samples.
Assumptions:
Observations in each sample are independent and identically distributed (iid).
Observations in each sample can be ranked.
N = 10
alpha = 0.05
np.random.seed(1)
data1 = np.random.normal(0, 1, N)
data2 = np.random.normal(0, 1, N) + 2
Test_statistic, p_value = spearmanr(data1, data2, alternative = 'two-sided')
print(f"Test_statistic_Spearman's Rank Correlation = {Test_statistic}, p_value = {p_value}", "\n")
if p_value < alpha:
    print(f'Since p_value < {alpha}, reject the null hypothesis. Therefore, the two samples are probably dependent.')
else:
    print(f'Since p_value > {alpha}, the null hypothesis cannot be rejected. Therefore, the two samples are probably independent.')
Test_statistic_Spearman's Rank Correlation = 0.7818181818181817, p_value = 0.007547007781067878
Since p_value < 0.05, reject the null hypothesis. Therefore, the two samples are probably dependent.
Alternative hypothesis can be {‘two-sided’, ‘less’, ‘greater’}.
'two-sided': the correlation is non-zero
'less': the correlation is negative (less than zero)
'greater': the correlation is positive (greater than zero)
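A quick sketch (not from the original notebook) of a one-sided alternative: on data constructed to be positively related, `alternative='greater'` halves the two-sided p-value because the observed correlation lies in the hypothesized direction.

```python
import numpy as np
from scipy.stats import spearmanr

# Construct positively related samples (y is x plus small noise).
np.random.seed(1)
x = np.random.normal(0, 1, 20)
y = x + np.random.normal(0, 0.5, 20)

stat, p_greater = spearmanr(x, y, alternative='greater')
_, p_two = spearmanr(x, y, alternative='two-sided')
print(f'rho = {stat:.3f}, one-sided p = {p_greater:.6f}, two-sided p = {p_two:.6f}')
```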
Tests whether two samples have a monotonic relationship.
$H_0$: The two samples are independent.
$H_1$: There is a dependency between the two samples.
Assumptions:
Observations in each sample are independent and identically distributed (iid).
Observations in each sample can be ranked.
N = 10
alpha = 0.05
np.random.seed(1)
data1 = np.random.normal(0, 1, N)
data2 = np.random.normal(0, 1, N) + 2
Test_statistic, p_value = kendalltau(data1, data2)
print(f"Test_statistic_Kendall's Rank Correlation = {Test_statistic}, p_value = {p_value}", "\n")
if p_value < alpha:
    print(f'Since p_value < {alpha}, reject the null hypothesis. Therefore, the two samples are probably dependent.')
else:
    print(f'Since p_value > {alpha}, the null hypothesis cannot be rejected. Therefore, the two samples are probably independent.')
Test_statistic_Kendall's Rank Correlation = 0.6, p_value = 0.016666115520282188
Since p_value < 0.05, reject the null hypothesis. Therefore, the two samples are probably dependent.
Tests whether two categorical variables are related or independent.
$H_0$: The two variables are independent.
$H_1$: There is a dependency between the two variables.
Assumptions:
Observations used in the calculation of the contingency table are independent.
There are 25 or more examples in each cell of the contingency table.
Degrees of freedom: $(rows - 1) * (cols - 1)$
alpha = 0.05
table = [[10, 20, 30],
[6, 9, 17]]
Test_statistic, p_value, dof, expected = chi2_contingency(table)
print(f"Test_statistic_Chi-Squared = {Test_statistic}, p_value = {p_value}, df = {dof}, \n", f"Expected = {expected}","\n")
if p_value < alpha:
    print(f'Since p_value < {alpha}, reject the null hypothesis. Therefore, the two variables are probably dependent.')
else:
    print(f'Since p_value > {alpha}, the null hypothesis cannot be rejected. Therefore, the two variables are probably independent.')
Test_statistic_Chi-Squared = 0.27157465150403504, p_value = 0.873028283380073, df = 2,
Expected = [[10.43478261 18.91304348 30.65217391] [ 5.56521739 10.08695652 16.34782609]]
Since p_value > 0.05, the null hypothesis cannot be rejected. Therefore, the two variables are probably independent.
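The same decision can be cross-checked against the critical value of the chi-squared distribution at the given degrees of freedom; a sketch for the table above:

```python
from scipy.stats import chi2, chi2_contingency

table = [[10, 20, 30],
         [6, 9, 17]]
alpha = 0.05
stat, p_value, dof, expected = chi2_contingency(table)

# Equivalent critical-value decision rule: dof = (rows - 1) * (cols - 1) = 2.
critical = chi2.ppf(1 - alpha, dof)
print(f'statistic = {stat:.4f}, critical value = {critical:.4f}, dof = {dof}')
if stat > critical:
    print('Reject the null hypothesis: the variables are probably dependent.')
else:
    print('The null hypothesis cannot be rejected: the variables are probably independent.')
```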
Tests whether a time series has a unit root, e.g. has a trend or, more generally, is autoregressive.
$H_0$: A unit root is present (the series is non-stationary).
$H_1$: A unit root is not present (the series is stationary).
Assumptions:
Observations are temporally ordered.
alpha = 0.05
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Test_statistic, p_value, lags, obs, crit, icbest = adfuller(data)
print(f"Test_statistic_ADF = {Test_statistic}, p_value = {p_value}", "\n")
if p_value < alpha:
    print(f'Since p_value < {alpha}, reject the null hypothesis. Therefore, the series is probably stationary.')
else:
    print(f'Since p_value > {alpha}, the null hypothesis cannot be rejected. Therefore, the series is probably non-stationary.')
Test_statistic_ADF = 0.5171974540944098, p_value = 0.9853865316323872
Since p_value > 0.05, the null hypothesis cannot be rejected. Therefore, the series is probably non-stationary.
Tests whether a time series is trend-stationary or not.
$H_0$: The time series is trend-stationary.
$H_1$: The time series is not trend-stationary.
Assumptions:
Observations are temporally ordered.
alpha = 0.05
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Test_statistic, p_value, lags, crit = kpss(data)
print(f"Test_statistic_Kwiatkowski = {Test_statistic}, p_value = {p_value}", "\n")
if p_value < alpha:
    print(f'Since p_value < {alpha}, reject the null hypothesis. Therefore, the series is probably not trend-stationary.')
else:
    print(f'Since p_value > {alpha}, the null hypothesis cannot be rejected. Therefore, the series is probably trend-stationary.')
Test_statistic_Kwiatkowski = 0.4099630996309963, p_value = 0.072860732917674
Since p_value > 0.05, the null hypothesis cannot be rejected. Therefore, the series is probably trend-stationary.
Tests whether the distributions of two independent samples are equal or not.
$H_0$: The distributions of both samples are equal.
$H_1$: The distributions of both samples are not equal.
Assumptions:
Observations in each sample are independent and identically distributed (iid).
Observations in each sample can be ranked.
N = 10
alpha = 0.05
data1 = np.random.normal(0, 1, N)
data2 = np.random.normal(0, 1, N)
Test_statistic, p_value = mannwhitneyu(data1, data2, alternative='two-sided')
print(f"Test_statistic_Mann-Whitney = {Test_statistic}, p_value = {p_value}", "\n")
if p_value < alpha:
    print(f'Since p_value < {alpha}, reject the null hypothesis. Therefore, the two distributions are probably not equal.')
else:
    print(f'Since p_value > {alpha}, the null hypothesis cannot be rejected. Therefore, the two distributions are probably equal.')
Test_statistic_Mann-Whitney = 61.0, p_value = 0.4273553138978077
Since p_value > 0.05, the null hypothesis cannot be rejected. Therefore, the two distributions are probably equal.
Tests whether the distributions of two paired samples are equal or not.
$H_0$: The distributions of both samples are equal.
$H_1$: The distributions of both samples are not equal.
Assumptions:
Observations in each sample are independent and identically distributed (iid).
Observations in each sample can be ranked.
The observations across the two samples are paired.
N = 10
alpha = 0.05
data1 = np.random.normal(0, 1, N)
data2 = np.random.normal(0, 1, N)
Test_statistic, p_value = wilcoxon(data1, data2, alternative='two-sided')
print(f"Test_statistic_Wilcoxon = {Test_statistic}, p_value = {p_value}", "\n")
if p_value < alpha:
    print(f'Since p_value < {alpha}, reject the null hypothesis. Therefore, the two distributions are probably not equal.')
else:
    print(f'Since p_value > {alpha}, the null hypothesis cannot be rejected. Therefore, the two distributions are probably equal.')
Test_statistic_Wilcoxon = 24.0, p_value = 0.76953125
Since p_value > 0.05, the null hypothesis cannot be rejected. Therefore, the two distributions are probably equal.
Tests whether the distributions of two or more independent samples are equal or not.
$H_0$: The distributions of all samples are equal.
$H_1$: The distributions of one or more samples are not equal.
Assumptions:
Observations in each sample are independent and identically distributed (iid).
Observations in each sample can be ranked.
N = 10
alpha = 0.05
data1 = np.random.normal(0, 1, N)
data2 = np.random.normal(0, 1, N)
Test_statistic, p_value = kruskal(data1, data2)
print(f"Test_statistic_Kruskal-Wallis = {Test_statistic}, p_value = {p_value}", "\n")
if p_value < alpha:
    print(f'Since p_value < {alpha}, reject the null hypothesis. Therefore, the distributions are probably not equal.')
else:
    print(f'Since p_value > {alpha}, the null hypothesis cannot be rejected. Therefore, the distributions are probably equal.')
Test_statistic_Kruskal-Wallis = 1.462857142857132, p_value = 0.22647606604348455
Since p_value > 0.05, the null hypothesis cannot be rejected. Therefore, the distributions are probably equal.
Tests whether the distributions of two or more paired samples are equal or not.
$H_0$: The distributions of all samples are equal.
$H_1$: The distribution of at least one sample differs from the others.
Assumptions:
Observations in each sample are independent and identically distributed (iid).
Observations in each sample can be ranked.
The observations across the samples are paired.
alpha = 0.05
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]
Test_statistic, p_value = friedmanchisquare(data1, data2, data3)
print(f"Test_statistic_Friedman = {Test_statistic}, p_value = {p_value}", "\n")
if p_value < alpha:
    print(f'Since p_value < {alpha}, reject the null hypothesis. Therefore, the distributions are probably not equal.')
else:
    print(f'Since p_value > {alpha}, the null hypothesis cannot be rejected. Therefore, the distributions are probably equal.')
Test_statistic_Friedman = 0.8000000000000114, p_value = 0.6703200460356356
Since p_value > 0.05, the null hypothesis cannot be rejected. Therefore, the distributions are probably equal.