#!/usr/bin/env python
# coding: utf-8
# ---
#
#
#
# Department of Data Science
# Course: Tools and Techniques for Data Science
#
# ---
# Instructor: Muhammad Arif Butt, Ph.D.
# Lecture 4.1 (Descriptive Statistics)
#
#
#
#
# In[ ]:
# Unlike the other modules we have worked with so far, the following libraries may need to be downloaded and installed first.
# To install these libraries from inside a Jupyter notebook:
import sys
get_ipython().system('{sys.executable} -m pip install -q --upgrade pip')
get_ipython().system('{sys.executable} -m pip install -q statistics statsmodels scipy')
# In[ ]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statistics
import scipy.stats as st
import statsmodels.api as sm
# ## Learning agenda of this notebook
# 1. Data and its Types
# 2. Collection of Data (Population vs Sample)
# 3. Overview of Statistics and its Types
# 4. Descriptive Statistics: Measure of Central Tendency
# - Mean
# - Median
# - Mode
# 5. Descriptive Statistics: Measure of Spread
# - Range
# - IQR
# - Variance
# - Standard Deviation
# 6. Descriptive Statistics: Measure of Asymmetry
# - Skewness
# - Kurtosis
# 7. Analysis Strategy
# 8. Variance, Covariance and Correlation
# 9. Example Datasets
# 10. Univariate Analysis and Data Visualization
# 11. Bivariate Analysis and Data Visualization
# In[ ]:
# ## 1. Data and its Types
#
#
# In[ ]:
# ## 2. Collection of Data (Population vs Sample)
#
# - Data can be sourced directly or indirectly.
# - Direct methods of data collection involve collecting new data for a specific study. This type of data is known as primary data.
# - Indirect methods of data collection involve sourcing and accessing existing data that were not originally collected for the purpose of the study. This type of data is known as secondary data.
# - A survey involves collecting information from every unit in the population (a census), or from a subset of units (a sample) from the population.
# - **`Population:`** The entire pool of data from which a statistical sample is extracted. It can be visualized as a complete data set of items that are similar in nature.
#
# - **`Sample:`** A subset of the population, i.e. a part of the population that has been collected for analysis.
#
# - **`Variable:`** A characteristic whose value can be measured or counted; it may also be referred to as a data point or a data item.
#
#
#
#
# In[ ]:
# - **`Types of Sampling:`**
#
#
#
#
#
#
#
# In[ ]:
# ## 3. Overview of Statistics and its Types
#
#
# In[ ]:
# ### Descriptive vs Inferential Statistics
#
#
# >- **Descriptive statistics is used to describe, summarize and present different datasets through numerical calculations, tables and graphs.**
#
# >- **Inferential statistics is used to make inference and predictions for an entire population, based on sample data of that population.**
# | Descriptive Statistics | Inferential Statistics |
# | --- | --- |
# |Describe the features of populations and/or samples | Use samples to make generalizations about larger populations |
# |Draw conclusions based on known data | Draw conclusions that go beyond the available data |
# |Helps in organizing, analyzing, and presenting data in a meaningful manner | Allows us to compare data and make hypotheses and predictions |
# |Present final results visually, using tables, charts, or graphs| Present final results in the form of probabilities |
# |Use measures like central tendency, distribution, and variance| Use techniques like hypothesis testing, confidence intervals, and regression and correlation analysis |
# In[ ]:
# ## 4. Descriptive Statistics: Measures of Center
#
# ### a. Mean
# - `Arithmetic Mean` is the sum of the values of all observations in a dataset divided by the number of observations. The mean cannot be calculated for categorical data, as the values cannot be summed. Moreover, since the mean includes every value in the distribution, it is influenced by outliers and skewed distributions.
# ```
# Arithmetic Mean = (x1 + x2 + … + xN) / N
# ```
# - `Geometric Mean` is calculated as the N-th root of the product of all values, where N is the number of values. The geometric mean is used if the data comprises values with different units of measure, e.g. some measures are heights, some are dollars, some are miles, etc. In ML, the geometric mean is used to calculate the G-Mean measure, which is a model evaluation metric. The geometric mean does not accept negative or zero values.
# ```
# Geometric Mean = N-root(x1 * x2 * … * xN)
# ```
# - `Harmonic Mean` is calculated as the number of values N divided by the sum of the reciprocals of the values. The harmonic mean is used if the data comprises rates, i.e., ratios between two quantities with different measures, e.g. speed, acceleration, frequency, etc. In ML, the harmonic mean is used to calculate the F1 measure, which is a model evaluation metric. The harmonic mean does not accept negative or zero values.
# ```
# Harmonic Mean = N / (1/x1 + 1/x2 + … + 1/xN)
# ```
# - Note:
# - If values have the same units: Use the arithmetic mean.
# - If values have differing units: Use the geometric mean.
# - If values are rates: Use the harmonic mean.
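# - For instance, a small hedged illustration: the average speed over two equal-distance legs driven at 30 and 60 km/h is their harmonic mean, not their arithmetic mean.
# In[ ]:
import statistics
# Two equal-distance legs at 30 km/h and 60 km/h average 40 km/h, not 45
print(statistics.harmonic_mean([30, 60]))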
#
# ### b. Median
# - `Median` is a statistical measure that determines the middle value of a dataset listed in ascending order.
# - It splits the data in half, so it is also called the 50th percentile.
# - To compute median:
# - Arrange the data in ascending order (from the lowest to the largest value).
# - If the dataset contains an odd number of values, the median is a central value that will split the dataset into halves.
# - If the dataset contains an even number of values, find the two central values that split the dataset into halves. Then, calculate the mean of the two central values, which is the median of the dataset.
# - The median is used where strong outliers may skew the representation of the data. If you have one person who earns 1 billion a year and nine other people who earn under 100,000 a year, the mean income for the group would be around 100 million, a gross distortion.
#
#
# ### c. Mode
# - Mode is the most frequently occurring value in the data.
# - If a value occurs the highest number of times, it is the mode of that data. If no value in the data is repeated, then there is no mode for that data. There can be more than one mode in a dataset if two values share the same highest frequency.
# - Outliers don't influence the mode.
# - The mode can be calculated for both quantitative and qualitative data (a small multimode example follows).
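# - As a small hedged illustration of a multimodal dataset, statistics.multimode (Python 3.8+) returns every value that shares the highest frequency:
# In[ ]:
import statistics
# Both 1 and 2 occur twice, so the data is bimodal
print(statistics.multimode([1, 1, 2, 2, 3]))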
# In[ ]:
# **Example 1:**
# In[ ]:
import statistics
print(dir(statistics))
# In[ ]:
data1 = [14000, 19000, 16000, 20000, 15000, 16000, 15000, 15000, 22000]
print("mean(data1): %.2f" % statistics.mean(data1))
print("geometric_mean(data1): %.2f" % statistics.geometric_mean(data1))
print("harmonic_mean(data1): %.2f" % statistics.harmonic_mean(data1))
print("median(data1): ", statistics.median(data1))
print("mode(data1): ", statistics.mode(data1))
# In[ ]:
# **Example 2:**
# In[ ]:
data2 = [14000, 19000, 16000, 20000, 15000, 16000, 15000, 15000, 22000, 900000]
# A single outlier (900000) drags the mean far above the typical value, while the median barely moves
print("mean(data2): %.2f" % statistics.mean(data2))
print("median(data2): ", statistics.median(data2))
# In[ ]:
# In[ ]:
# ## 5. Descriptive Statistics: Measures of Spread
#
# ### a. Range
# - `Range` is the spread of your data from the lowest to the highest value in the distribution.
# - The range is calculated by subtracting the lowest value from the highest value.
# - A large range means high variability; a small range means low variability in a distribution.
#
# **Example 1:**
# In[ ]:
import numpy as np
data1 = [25, 10, 9, 6, 12, 11, 15]
range1 = np.max(data1) - np.min(data1)
range1
# **Example 2:**
# In[ ]:
import numpy as np
data2 = [10, 9, 8, 11, 10, 9, 8, 11]
range2 = max(data2) - min(data2)
range2
# In[ ]:
# ### b. Inter Quartile Range
#
#
#
# - Quantiles divide a distribution, and the most common are `quartiles`, `percentiles`, and `deciles`.
# - The median, which divides a distribution in two at its midpoint, is the most well-known example of a quantile.
# - **Quartiles**, as their name suggests, are quantiles that divide a distribution into quarters by splitting a rank-ordered dataset into four equal parts.
#     - 1st Quartile Q1 is the same as the 25th percentile.
#     - 2nd Quartile Q2 is the same as the 50th percentile.
#     - 3rd Quartile Q3 is the same as the 75th percentile.
# - The Inter Quartile Range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1).
# - The range gives us a measurement of how spread out the entirety of our dataset is. The interquartile range, which tells us how far apart the first and third quartiles are, indicates how spread out the middle 50% of our data is.
# - The primary advantage of using the interquartile range rather than the range as a measure of spread is that the interquartile range is not sensitive to outliers.
# - Due to its resistance to outliers, the interquartile range is useful in identifying when a value is an outlier.
#
# - **Percentiles** divide the distribution at any point out of one hundred. For example, if we'd like to identify the threshold for the top 5% of a distribution, we'd cut it at the 95th percentile. Or, for the top 1%, we'd cut at the 99th percentile.
# - **Deciles** (from Latin *decimus*, meaning "tenth") divide a distribution into ten evenly-sized segments (a short code sketch of percentiles and deciles follows).
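# - A minimal hedged sketch of percentiles and deciles in code (reusing the quartile example dataset shown below):
# In[ ]:
import numpy as np
import statistics
data = [11,13,16,19,20,21,23,25,26,29,33,34,36,38,39,46,52,55,58]
# Threshold for the top 5% of the distribution
print("95th percentile: ", np.percentile(data, 95))
# n=10 returns the nine cut points that split the data into ten deciles
print("Deciles: ", statistics.quantiles(data, n=10))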
# In[ ]:
# #### Quartiles
# **Example 1:**
# In[ ]:
import statistics
data = [11,13,16,19,20,21,23,25,26,29,33,34,36,38,39,46,52,55,58]
q1,q2,q3 = statistics.quantiles(data, n=4)
print("Q1: ", q1)
print("Q2: ", q2)
print("Q3: ", q3)
print("IQR = Q3 - Q1: ", q3-q1)
# In[ ]:
# In[ ]:
# **Example 2:**
# - We can use the IQR method of identifying outliers to set up a “fence” outside of Q1 and Q3. Any values that fall outside of this fence are considered outliers.
# In[ ]:
data = [0, 0, 2, 5, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 12, 14, 15, 20, 25]
q1,q2,q3 = statistics.quantiles(data, n=4)
print("Q1: ", q1)
print("Q2: ", q2)
print("Q3: ", q3)
iqr = q3 - q1
print("IQR = Q3 - Q1: ", iqr)
# In[ ]:
# Any observations that are more than 1.5 times IQR above Q3 are considered outliers.
upperfence = q3 + iqr*1.5
upperfence
# In[ ]:
# Any observations that are less than 1.5 times IQR below Q1 are considered outliers.
lowerfence = q1 - iqr*1.5
lowerfence
# >There are 4 outliers: 0, 0, 20, and 25.
# In[ ]:
# ### c. Variance
# - The `Variance` is a very simple statistic that gives you an extremely rough idea of how spread out a data set is. As a measure of spread, it's actually pretty weak. A large variance of 22,000, for example, doesn't tell you much about the spread of the data other than that it's big! The most important reason the variance exists is to give you a way to find the standard deviation, which is the square root of the variance.
# - The reason for dividing by `n-1` in the sample variance is that the formula uses the sample mean in the numerator and, as a consequence, may underestimate the true variance of the population. Dividing by n-1 instead of n corrects for that bias (this is known as Bessel's correction).
#
#
# ### d. Standard Deviation
# - Simply put, the `Standard Deviation` is a measure of how spread out data is around the center of the distribution (the mean). It also gives you an idea of where, percentage-wise, a certain value falls. For example, let's say you took a test and the scores were normally distributed (shaped like a bell). If you score one standard deviation above the mean, your score is higher than about 84% of test takers, i.e., you are in the top 16%.
# - Low standard deviation implies that most values are close to the mean. High standard deviation suggests that the values are more broadly spread out.
#
#
#
# In[ ]:
# **Example 1:**
# In[ ]:
import statistics
data = [2, 4, 6]
print("Sample Mean: ", statistics.mean(data))
print("Sample Variance: ", statistics.variance(data))
print("Population Variance: ", statistics.pvariance(data))
print("Sample Standard Deviation: ", statistics.stdev(data))
print("Population Standard Deviation: ", statistics.pstdev(data))
# In[ ]:
data = [2, 4, 6]
np.std(data)   # NumPy defaults to the population standard deviation (ddof=0)
# In[ ]:
statistics.stdev(data)   # statistics.stdev computes the sample standard deviation (divides by n-1)
# In[ ]:
# In[ ]:
# **Example 2:** The `Machine Learning` teacher has compiled the results and wants to know whether most students are performing at the same level, or if there is a high standard deviation.
# In[ ]:
marks_ML = [75, 69, 80, 70, 60, 63, 64, 69, 71]
print("Mean of ML Marks: ", statistics.mean(marks_ML))
print("Standard Deviation of ML Marks: %.2f" % statistics.stdev(marks_ML))
# >**Low standard deviation implies that most values are close to the mean.**
# In[ ]:
# In[ ]:
# **Example 3:** The `Artificial Intelligence` teacher has compiled the results and wants to know whether most students are performing at the same level, or if there is a high standard deviation.
# In[ ]:
marks_AI = [44, 95, 25, 60, 76, 81, 93, 84, 71, 33, 85, 81]
print("Mean of AI Marks: ", statistics.mean(marks_AI))
print("Standard Deviation of AI Marks: %.2f" % statistics.stdev(marks_AI))
# >**Note: High standard deviation implies that the values are more broadly spread out.**
# In[ ]:
# **Example 4:** Suppose there are 10000 students in my Data Science class. You get 85% marks in your Data Science Exam. The mean of the overall result is 60% with a standard deviation of 10%. How many students are above you?
# In[ ]:
# Let us first generate the random marks of ten thousand students, with a mean of 60 and standard deviation of 10
mu = 60
sigma = 10
np.random.seed(54)
x = np.random.normal(mu, sigma, 10000)
x
# In[ ]:
# Let us verify, whether the mean and std dev of above distribution `x` is 60 and 10 respectively
print("np.mean(x): ", np.mean(x))
print("np.std(x): ", np.std(x))
# >- **Let us calculate the number of students above you out of ten thousand and visualize this by drawing a graph.**
# In[ ]:
a = len(np.where(x > 85)[0])
a
# In[ ]:
sns.displot(x, color='green')
plt.axvline(mu, color='orange')
# Mark the mean, +/- 1, 2, 3 standard deviations, and your score of 85
for i in [-3, -2, -1, 1, 2, 3]:
    plt.axvline(mu+i*sigma, color='red')
plt.axvline(85, color='purple')
plt.show()
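# >- **As a hedged cross-check of the simulated count, the theoretical fraction of students above 85 under a Normal(60, 10) curve can be computed with the normal survival function in scipy.stats:**
# In[ ]:
from scipy import stats
# Survival function sf = 1 - CDF; scale the fraction up to the class size of 10000
p = stats.norm.sf(85, loc=60, scale=10)
print("Expected count above 85 out of 10000: %.0f" % (p * 10000))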
# In[ ]:
# In[ ]:
# **Example 5:** This is a continuation of the above example. Suppose your marks in Data Science are still 85%, but this time the mean of the overall result has increased to 90% with a standard deviation of 2. How many students are above you?
# In[ ]:
# Let us first generate the random marks of ten thousand students, with a mean of 90 and standard deviation of 2
mu = 90
sigma = 2
np.random.seed(54)
x = np.random.normal(mu, sigma, 10000)
x
# In[ ]:
# Let us verify, whether the mean and std dev of above distribution `x` is 90 and 2 respectively
print("np.mean(x): ", np.mean(x))
print("np.std(x): ", np.std(x))
# >- **Let us calculate the number of students above you out of ten thousand and visualize this by drawing a graph.**
# In[ ]:
a = len(np.where(x > 85)[0])
a
# In[ ]:
sns.displot(x, color='grey')
plt.axvline(mu, color='orange')
for i in [-3, -2, -1, 1, 2, 3]:
    plt.axvline(mu+i*sigma, color='red')
plt.axvline(85, color='purple')
plt.show()
# In[ ]:
# In[ ]:
# ## 6. Measures of Asymmetry
#
#
# In[ ]:
# In[ ]:
# ### a. Skewness
#
#
#
#
#
# - **`Skewness`** is a measure of how much the probability distribution of a random variable deviates from the normal distribution. The skewness of a normal distribution is zero.
# - **`Positive Skewness / Right-Skewed Distribution (median < mean):`** If the bulk of the distribution sits on the left with its tail stretched out to the right, it is a positively skewed distribution. In this type, the majority of the observations are concentrated on the left, and the value of skewness is positive.
# - **`Negative Skewness / Left-Skewed Distribution (median > mean):`** If the bulk of the distribution sits on the right with its tail stretched out to the left, it is a negatively skewed distribution. In this type, the majority of the observations are concentrated on the right, and the value of skewness is negative.
# - **`Measuring Skewness`:** Skewness can be measured using several methods; `Pearson mode skewness` and `Pearson median skewness` are the two most frequently used.
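# - For reference, the standard Pearson formulas (stated here for convenience):
# ```
# Pearson mode skewness   = (mean - mode) / standard deviation
# Pearson median skewness = 3 * (mean - median) / standard deviation
# ```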
#
#
#
#
# In[ ]:
# In[ ]:
# **Example 1:**
#
#
#
# In[ ]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statistics
import scipy.stats as st
import statsmodels.api as sm
# In[6]:
from scipy import stats
import statistics
import seaborn as sns
# Draw 10000 samples from a positively skewed distribution (skewness parameter = 20)
s = stats.skewnorm.rvs(20, size=10000)
print("Mean: ", statistics.mean(s))
print("Median: ", statistics.median(s))
print("Mode: ", statistics.mode(s))  # the mode of a continuous sample is not very meaningful
print("Skew: ", stats.skew(s))
sns.displot(s, kde=True);
# In[ ]:
from scipy import stats
import statistics
data1 = [30, 10, 15, 27, 15, 45, 31, 19, 54, 60, 95]
print("Mean: ", statistics.mean(data1))
print("Median: ", statistics.median(data1))
print("Mode: ", statistics.mode(data1))
print("Skew: ", stats.skew(data1))
# In[ ]:
# In[ ]:
# **Example 2:**
#
#
# In[ ]:
from scipy import stats
import statistics
data2 = [20, 33, 88, 91, 85, 89, 91, 95]
print("Mean: ", statistics.mean(data2))
print("Median: ", statistics.median(data2))
print("Mode: ", statistics.mode(data2))
print("Skew: ", stats.skew(data2))
# In[ ]:
# Draw 10000 samples from a negatively skewed distribution (skewness parameter = -20)
s = stats.skewnorm.rvs(-20, size=10000)
print("Mean: ", statistics.mean(s))
print("Median: ", statistics.median(s))
print("Mode: ", statistics.mode(s))
print("Skew: ", stats.skew(s))
sns.displot(s, kde=True);
# In[ ]:
# In[ ]:
# **Example 3:**
#
#
# In[ ]:
import numpy as np
from scipy import stats
import statistics
from matplotlib import pyplot as plt
mean = 25
stdev = 5
data3 = np.random.normal(loc=mean, scale=stdev,size=1000)
print("Mean: ", statistics.mean(data3))
print("Median: ", statistics.median(data3))
print("Mode: ", statistics.mode(data3))
print("Skew: ", stats.skew(data3))
plt.hist(data3, 50);
# In[ ]:
# Draw 10000 samples with skewness parameter 0 (no skew)
s = stats.skewnorm.rvs(0, size=10000)
print("Mean: ", statistics.mean(s))
print("Median: ", statistics.median(s))
print("Mode: ", statistics.mode(s))
print("Skew: ", stats.skew(s))
sns.displot(s, kde=True);
# In[ ]:
from scipy import stats
from matplotlib import pyplot as plt
import numpy as np
import statistics
import seaborn as sns
data = stats.skewnorm.rvs(0, size=1000) # first argument is "skewness"; 0 has no skew
print("Mean: ", statistics.mean(data))
print("Median: ", statistics.median(data))
print("Mode: ", statistics.mode(data))
print("Skew: ", stats.skew(data))
sns.displot(data, kde=True)
# Mark the mean and median; for a symmetric sample they nearly coincide
plt.axvline(np.mean(data), color='black')
plt.axvline(np.median(data), color='black')
plt.show();
# In[ ]:
# ### b. Kurtosis
#
#
# - `Kurtosis` is a statistical measure that determines whether the data is heavy-tailed or light-tailed relative to a normal distribution.
#
# - **Leptokurtic (kurtosis>3)**: If a given distribution has a kurtosis greater than 3, it is said to be leptokurtic, which means it tends to produce more outliers than the normal distribution.
# - **Mesokurtic (kurtosis==3)**: This distribution looks similar to a normal distribution.
# - **Platykurtic (kurtosis<3)**: If a given distribution has a kurtosis less than 3, it is said to be platykurtic, which means it tends to produce fewer and less extreme outliers than the normal distribution.
#
# - The main difference between skewness and kurtosis is that skewness refers to the degree of asymmetry of the distribution, whereas kurtosis refers to the degree to which outliers are present in the distribution.
# In[ ]:
# **Example 1:**
# In[12]:
# Mesokurtic: kurtosis == 3
# Note: stats.kurtosis returns excess kurtosis (normal == 0) by default; fisher=False gives the raw value, where normal == 3
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
mean = 25
stdev = 5
data = np.random.normal(loc=mean, scale=stdev,size=1000)
print('Kurtosis:', stats.kurtosis(data, fisher=False))
print("Skew: ", stats.skew(data))
#plt.hist(data, 50);
sns.displot(data);
# In[ ]:
# **Example 2:**
# In[9]:
# Leptokurtic: kurtosis > 3 (heavy tails, more outliers)
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# The Laplace distribution has raw kurtosis 6, well above the normal's 3
data = np.random.laplace(1.45, 15, 1000)
print('Kurtosis:', stats.kurtosis(data, fisher=False))
print("Skew: ", stats.skew(data))
plt.hist(data, 50);
# In[ ]:
# **Example 3:**
# In[10]:
# Platykurtic: kurtosis < 3 (fewer and less extreme outliers)
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# The uniform distribution has raw kurtosis 1.8, below the normal's 3
data = np.random.rand(1000)
print('Kurtosis:', stats.kurtosis(data, fisher=False))
print("Skew: ", stats.skew(data))
plt.hist(data, 50);
# In[ ]:
# In[ ]:
# ## 7. Analysis Strategy
#
#
#
# ### a. Univariate Analysis
# - In a dataset, univariate analysis is performed on each variable separately. It is possible on numerical as well as categorical variables. It takes the data, summarizes it, and finds patterns in it.
# - Suppose the heights of the students of a class are recorded; there is only one variable, height, and we are not dealing with any cause or relationship.
#
# ### b. Bivariate Analysis
#
#
#
# - Bivariate analysis is performed to determine the relationship between two variables. The analysis is concerned with causes and the relationship between the two variables; one of the variables is dependent and the other independent. The data types of the two variables can be:
# - (Numerical-Numerical)
# - (Numerical-Categorical)
# - (Categorical-Categorical)
#
# ### c. Visual Analysis
# - If you want to compare values and analyze trends use:
# - Line Graph
# - Bar Graph
# - Column Graph
# - If you want to show how one variable relates to one or more different variables use:
# - Scatter Plot
# - Bubble Plot
# - Line Plot
# - If you want to understand the distribution of your data, including outliers, central tendency, and the range of values, use:
# - Histogram
# - Density Plot
# - Box Plot
# - Violin Plot
# - If you want to show composition of something use:
# - Pie Chart
# - Stacked Bar Graph
# - Stacked Column Graph
# - Area Graph
# In[ ]:
# ## 8. Variance, Covariance and Correlation
# - `Variance` tells us how much a (single) quantity varies w.r.t. its mean. It's the spread of data around the mean value. You only know the magnitude here, i.e., how much the data is spread.
# - `Covariance` tells us the direction in which two quantities vary with each other.
# - `Correlation` shows us both, the direction and magnitude of how two quantities vary with each other.
# In[ ]:
# In[ ]:
# ### a. Covariance
# - Covariance is used to measure how two random variables move together relative to their means, for example the height and weight of a person in a population. The formula to calculate sample covariance is:
# ```
# cov(x, y) = [(x1 - mean(x)) * (y1 - mean(y)) + … + (xn - mean(x)) * (yn - mean(y))] / (n - 1)
# ```
# - The formula to calculate population covariance is the same, but with the population size N in the denominator:
# ```
# cov(x, y) = [(x1 - mean(x)) * (y1 - mean(y)) + … + (xN - mean(x)) * (yN - mean(y))] / N
# ```
#
# - Covariance measures the direction of the relationship between two variables.
# - `Positive covariance`: Indicates that two variables tend to move in the same direction.
# - `Negative covariance`: Reveals that two variables tend to move in inverse directions.
# - `Zero covariance`: Indicates that the two variables have no linear relationship with each other.
# - **Covariance Matrix:** For multi-dimensional data, there is a generalization of covariance in terms of a covariance matrix. The covariance matrix is also known as the variance-covariance matrix, as the diagonal values of the covariance matrix are variances and the off-diagonal values are covariances. The covariance matrix for two variables is a square matrix which can be written as follows:
# ```
# | var(x)     cov(x, y) |
# | cov(y, x)  var(y)    |
# ```
#
# ### b. Correlation
# - Correlation is a measure that tells us the direction as well as the magnitude of how two quantities vary with each other (e.g., height and weight).
# - The Pearson Correlation Coefficient (r) is used to quantify the strength and direction of the linear relationship between two quantitative variables:
# ```
# r = cov(x, y) / (stdev(x) * stdev(y))
# ```
#
# - The value of the correlation coefficient ranges from -1 to +1 and shows the strength and direction of the correlation:
# - 1 indicates a perfect positive correlation.
# - -1 indicates a perfect negative correlation.
# - 0 indicates no linear relationship between the variables.
#
#
#
#
# - **Correlation Matrix:** is a table showing correlation coefficients between variables. The rows and columns correspond to the variables, and each cell shows the correlation coefficient between the pair.
#
# In[ ]:
# In[ ]:
# ### Example 1: Strong Positive Correlation
# >**Let us compute the mean, variance, covariance and correlation for the two variables `weight` and `height` of six persons**
# In[ ]:
from matplotlib import pyplot as plt
import statistics
# Note: statistics.covariance() and statistics.correlation() require Python 3.10+
weight = [61,62,73,74,82,86]
height = [157,168,170,181,191,185]
print("mean(weight): ", statistics.mean(weight))
print("mean(height): ", statistics.mean(height))
print("var(weight): ", statistics.variance(weight))
print("var(height): ", statistics.variance(height))
print("cov(weight, height): ", statistics.covariance(weight,height))
print("Pearson Correlation Coefficient (r): ",statistics.correlation(weight, height))
plt.scatter(x=weight, y=height);
# #### Covariance Matrix
#
# In[ ]:
print("var(weight): ", statistics.variance(weight))
print("var(height): ", statistics.variance(height))
print("cov(weight, height): ", statistics.covariance(weight,height))
print("Covariance Matrix: \n", np.cov(weight,height))
# #### Correlation Matrix
#
# In[ ]:
print("Pearson Correlation Coefficient 'r': ",statistics.correlation(weight, height))
print("Correlation Matrix: \n", np.corrcoef(weight,height))
# In[ ]:
# ### Example 2: Perfect Negative Correlation
# >**Let us compute the mean, variance, covariance and correlation for the two variables `time spent on yoga` and `stress level`**
# In[ ]:
from matplotlib import pyplot as plt
import statistics
yoga = [1,2,3,4,5,6,7,8,9]
stress = [90,80,70,60,50,40,30,20,10]
print("mean(yoga): ", statistics.mean(yoga))
print("mean(stress): ", statistics.mean(stress))
print("var(yoga): ", statistics.variance(yoga))
print("var(stress): ", statistics.variance(stress))
print("cov(yoga, stress): ", statistics.covariance(yoga,stress))
print("Pearson Correlation Coefficient 'r': ",statistics.correlation(yoga, stress))
plt.scatter(x=yoga, y=stress);
# #### Covariance Matrix
#
# In[ ]:
print("var(yoga): ", statistics.variance(yoga))
print("var(stress): ", statistics.variance(stress))
print("cov(yoga, stress): ", statistics.covariance(yoga,stress))
print("Covariance Matrix: \n", np.cov(yoga,stress))
# #### Correlation Matrix
#
# In[ ]:
print("Pearson Correlation Coefficient 'r': ",statistics.correlation(yoga, stress))
print("Correlation Matrix: \n", np.corrcoef(yoga,stress))
# In[ ]:
# ### Example 3: Standardized Data
#
# In[ ]:
# Correlation matrix of original data
weight = [86,82,74,73,62,61]
height = [157,168,170,181,191,185]
print("Correlation Matrix: \n", np.corrcoef(weight,height))
# In[ ]:
# Covariance matrix of standardized data
weight = [1.28, 0.89, 0.10, 0, -1.08, -1.18]
height = [-1.46, -0.58, -0.42, 0.45, 1.25, 0.77]
print("Covariance Matrix: \n", np.cov(weight,height))
# >**The covariance and correlation matrices will be (nearly) identical in this case; small differences come from rounding the hand-computed z-scores to two decimals.**
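# >**The z-scores above were rounded by hand; a hedged sketch of standardizing programmatically with numpy (sample standard deviation, ddof=1, to match np.cov) reproduces the correlation matrix exactly:**
# In[ ]:
w = np.array([86, 82, 74, 73, 62, 61])
h = np.array([157, 168, 170, 181, 191, 185])
w_std = (w - w.mean()) / w.std(ddof=1)
h_std = (h - h.mean()) / h.std(ddof=1)
# The covariance matrix of the z-scores equals the correlation matrix of the raw data
print("Covariance Matrix of z-scores: \n", np.cov(w_std, h_std))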
# In[ ]:
# In[ ]:
# ## 9. Example Datasets
# ### a. TITANIC Dataset
#
# In[ ]:
import pandas as pd
df_titanic = pd.read_csv('datasets/titanic3.csv')
df_titanic
# In[ ]:
# ### b. IRIS Dataset
#
# In[ ]:
df_iris = pd.read_csv('datasets/iris.csv')
df_iris
# In[ ]:
# ### c. TIPS Dataset
#
# In[ ]:
df_tips = pd.read_csv('datasets/tips.csv')
df_tips
# In[ ]:
# ## 10. Univariate Analysis and Data Visualization
# In[ ]:
import seaborn as sns
import statistics
from scipy import stats
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
# **Example 1:** Check the spread of data of `sepal_width` column in `IRIS` dataset using **Scatter Plot**
# In[ ]:
print("mean(sepal_width): %.2f" % statistics.mean(df_iris.loc[:, 'sepal_width']))
print("median(sepal_width): %.2f" % statistics.median(df_iris.loc[:, 'sepal_width']))
print("mode(sepal_width): %.2f" % statistics.mode(df_iris.loc[:, 'sepal_width']))
print("variance(sepal_width): %.2f" % statistics.variance(df_iris.loc[:, 'sepal_width']))
sns.relplot(x = df_iris.index, y = 'sepal_width', data = df_iris, kind='scatter');
# In[ ]:
# In[ ]:
# **Example 2:** Check the measures of center for the `age` column in `TITANIC` dataset using **Box Plot**
# In[ ]:
# There are 264 NaN values under the age column, which may disturb our calculations, so replace them with the mean
df_titanic.loc[(df_titanic.age.isna()),'age'] = df_titanic.age.mean()
print("min(age): ", df_titanic.loc[:, 'age'].min())
print("max(age): ", df_titanic.loc[:, 'age'].max())
print("median(age): ", df_titanic.loc[:, 'age'].median())
print("q1, q2, q3: ", statistics.quantiles(df_titanic.loc[:, 'age'], n=4))
sns.catplot(y='age', kind='box', data = df_titanic);
# In[ ]:
# In[ ]:
# **Example 3:** Check the distribution of data of `sepal_width` column in `IRIS` dataset using **Histogram**
# In[ ]:
# sepal_width Column of IRIS Dataset
print("skew(sepal_width): %.2f" % stats.skew(df_iris.loc[:, 'sepal_width']))
print('Kurtosis(sepal_width): %.2f' % stats.kurtosis(df_iris.loc[:, 'sepal_width'], fisher=False))
sns.displot(x= 'sepal_width', data=df_iris, kind='hist', kde=True);
# In[ ]:
# In[ ]:
# **Example 4:** Check the distribution of data of `age` column in `TITANIC` dataset using **Histogram**
# In[ ]:
# age Column of TITANIC Dataset
print("skew(age): %.2f" % stats.skew(df_titanic.loc[:, 'age']))
print('Kurtosis(age): %.2f' % stats.kurtosis(df_titanic.loc[:, 'age'], fisher=False))
sns.displot(x= 'age', data=df_titanic, kind='hist', kde=True);
# In[ ]:
# **Example 5:** Check the distribution of data of `tip` column in `TIPS` dataset using **Histogram**
# In[ ]:
# tip Column of TIPS Dataset
print("skew(tip): %.2f" % stats.skew(df_tips.loc[:, 'tip']))
print('Kurtosis(tip): %.2f' % stats.kurtosis(df_tips.loc[:, 'tip'], fisher=False))
sns.displot(x= 'tip', data=df_tips, kind='hist', kde=True);
# In[ ]:
# In[ ]:
# ## 11. Bivariate Analysis and Data Visualization
# ### a. Two Categorical Variables
# - Data concerning two categorical (i.e., nominal- or ordinal-level) variables can be displayed using a bar chart.
# **Example 1:** Compare the `sex` and `survived` columns of `TITANIC` dataset using **Bar Plot**
# In[ ]:
sns.catplot(x ='sex', y ='survived',kind='bar', data = df_titanic);
# In[ ]:
# **Example 2:** Check the survival count of passengers based on the `sex` column of the `TITANIC` dataset using **Count Plot** and adding the hue argument
# In[ ]:
sns.catplot(x ='sex',kind='count', data = df_titanic, hue='survived');
# In[ ]:
# **Example 3:** Check the survival count of passengers based on `passenger class` column of `TITANIC` dataset using **Count Plot** and adding the col argument
# In[ ]:
sns.catplot(x ='pclass',kind='count', data = df_titanic, col='survived');
# In[ ]:
# **Example 4:** Check the survival rate of passengers based on `passenger class` column of `TITANIC` dataset using **Bar Plot**
# In[ ]:
sns.catplot(x ='pclass', y ='survived',kind='bar', data = df_titanic);
# In[ ]:
# ### b. One Quantitative and One Categorical Variable
# - Oftentimes we want to compare groups in terms of a quantitative variable. For example, we may want to compare the age of males and females. In this case age is a quantitative variable while biological sex is a categorical variable. Graphs with groups can be used to compare the distributions of ages in these two groups.
# - This is an example of constructing side-by-side boxplots with groups. The side-by-side boxplots allow us to easily compare the median, IQR, and range of the two groups. Histograms with groups allow us to compare the shape, central tendency, and variability of the two groups.
# **Example 1:** Compare the `sex` and `age` columns of `TITANIC` dataset using **Box Plot**
# In[ ]:
sns.catplot(x ='sex', y='age', kind='box', data = df_titanic);
# In[ ]:
# >- **Draw separate boxplots for survived vs. not survived using the `col` argument**
# In[ ]:
sns.catplot(x ='sex', y='age', kind='box', data = df_titanic, col='survived');
# In[ ]:
# **Example 2:** Compare the `pclass` and `age` columns of `TITANIC` dataset using **Box Plot**
# In[ ]:
sns.catplot(x ='pclass', y='age', kind='box', data = df_titanic);
# In[ ]:
# >- **Draw separate boxplots for survived vs. not survived using the `col` argument**
# In[ ]:
sns.catplot(x ='pclass', y='age', kind='box', data = df_titanic, col='survived');
# In[ ]:
# ### c. Two Quantitative Variables
# - This is done using scatterplots, correlation, and simple linear regression.
# - A scatterplot is a graph used to display data concerning two quantitative variables.
# - Correlation is a measure of the direction and strength of the relationship between two quantitative variables.
# - Simple linear regression uses one quantitative variable to predict a second quantitative variable (a short sketch follows this list).
# - **Scatter Plot:**
# - A graphical representation of two quantitative variables in which the explanatory variable is on the x-axis and the response variable is on the y-axis. When examining a scatterplot, we need to consider the following:
# - Direction (positive or negative)
# - Form (linear or non-linear)
# - Strength (weak, moderate, strong)
# - Bivariate outliers
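# >- **Simple linear regression is mentioned above but not demonstrated; here is a minimal hedged sketch using scipy.stats.linregress, assuming the df_tips frame loaded in Section 9 is still in scope:**
# In[ ]:
from scipy import stats
# Fit tip as a linear function of total_bill; rvalue is the Pearson correlation coefficient
result = stats.linregress(df_tips['total_bill'], df_tips['tip'])
print("slope: %.3f  intercept: %.3f  r: %.2f" % (result.slope, result.intercept, result.rvalue))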
# **Example 1:** Compare the `sepal_length` and `sepal_width` columns of `IRIS` dataset using **Scatter Plot**
# In[ ]:
print("corr(sepal_length,sepal_width): %.2f" %
statistics.correlation(df_iris['sepal_length'], df_iris['sepal_width']))
sns.relplot(x='sepal_length', y='sepal_width', data=df_iris, kind='scatter');
# In[ ]:
# **Example 2:** Compare the `petal_length` and `petal_width` columns of `IRIS` dataset using **Scatter Plot**
# In[ ]:
print("corr(petal_length,petal_width): %.2f" %
statistics.correlation(df_iris['petal_length'], df_iris['petal_width']))
sns.relplot(x='petal_length', y='petal_width', data=df_iris, kind='scatter');
# In[ ]:
# **Example 3:** Compare the `total_bill` and `tip` columns of `TIPS` dataset using **Scatter Plot**
# In[ ]:
print("Pearson Correlation Coefficient (r): %.2f" %
statistics.correlation(df_tips.loc[:, 'total_bill'],df_tips.loc[:, 'tip']))
sns.relplot(x='total_bill', y='tip', data=df_tips, kind='scatter');
# In[ ]:
# **Example 4:** Check relationships using **Correlation Matrix** and **Heat Map** for the numeric columns of the `TIPS` dataset
# In[ ]:
# numeric_only=True skips non-numeric columns (required on newer pandas versions)
df_tips.corr(numeric_only=True)
# In[ ]:
sns.heatmap(df_tips.corr(numeric_only=True), annot=True);
# In[ ]:
# In[ ]:
# **Example 5:** Check relationships using **Correlation Matrix** and **Heat Map** for the numeric columns of the `IRIS` dataset
# In[ ]:
df_iris.corr(numeric_only=True)
# In[ ]:
sns.heatmap(df_iris.corr(numeric_only=True), annot=True);
# In[ ]:
# **Example 6:** Check relationships using **Correlation Matrix** and **Heat Map** for the numeric columns of `TITANIC` dataset
# In[ ]:
df_titanic.corr(numeric_only=True)
# In[ ]:
sns.heatmap(df_titanic.corr(numeric_only=True), annot=True);
# In[ ]: