This chapter of an Introduction to Health Data Science by Dr JH Klopper is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International
This chapter serves as an introduction to the fundamental concepts and techniques of exploratory data analysis (EDA), a critical step in any data analysis process.
EDA is an approach and a set of techniques that allows an initial understanding and appreciation of the information in complex datasets. It is an essential component of data science, machine learning, and statistical modeling. It is the first step in the data analysis process, exploring the data to understand its main characteristics before making any assumptions, conducting statistical tests, or building predictive models.
The primary goal of EDA is to maximize the data scientist's insight into a dataset and into the underlying structure of the data, and to extract the specific items of information that a data scientist needs from a dataset.
EDA is not merely a set of techniques, but a mindset. It encourages the data scientist to remain open-minded, to explore the data from multiple angles, to question assumptions, and to be ready to modify initial hypotheses based on the insights gained from the data.
This chapter explores the summarization of data, termed descriptive statistics.
import pandas as pd
from scipy import stats
The data for this chapter is in the heart_failure.csv file. It is imported below and metadata about the data is returned using the attributes and methods used in Chapter 8.
# Import the heart_failure.csv file and assign it to the variable df
df = pd.read_csv('https://raw.githubusercontent.com/juanklopper/TutorialData/main/heart_failure.csv')
# View the first 5 rows of the dataframe
df.head()
 | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | hypertension | platelets | serum_creatinine | serum_sodium | sex | smoking | time | death |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 75.0 | No | 582 | No | 20 | Yes | 265000.00 | 1.9 | 130 | Male | No | 4 | Yes |
1 | 55.0 | No | 7861 | No | 38 | No | 263358.03 | 1.1 | 136 | Male | No | 6 | Yes |
2 | 65.0 | No | 146 | No | 20 | No | 162000.00 | 1.3 | 129 | Male | Yes | 7 | Yes |
3 | 50.0 | Yes | 111 | No | 20 | No | 210000.00 | 1.9 | 137 | Male | No | 7 | Yes |
4 | 65.0 | Yes | 160 | Yes | 20 | No | 327000.00 | 2.7 | 116 | Female | No | 8 | Yes |
# Return the shape of the dataframe
df.shape # Returns a tuple with number of rows and columns
(299, 13)
# Return info about the dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    object
 2   creatinine_phosphokinase  299 non-null    int64
 3   diabetes                  299 non-null    object
 4   ejection_fraction         299 non-null    int64
 5   hypertension              299 non-null    object
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64
 9   sex                       299 non-null    object
 10  smoking                   299 non-null    object
 11  time                      299 non-null    int64
 12  death                     299 non-null    object
dtypes: float64(3), int64(4), object(6)
memory usage: 30.5+ KB
The info method used above shows that several columns (statistical variables) are object types. This indicates categorical variables that were not encoded as numbers when the data file was generated.
Exploratory data analysis of categorical variables starts by determining all the possible values that the variable can take, called the sample space of the variable. The different elements in the sample space are called classes or levels. The number or frequency of occurrence of each class can be determined. Finally, the relative frequency or proportion of each class can also be calculated by dividing the frequency of the class by the total number of observations.
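As a minimal sketch (using a short synthetic series, not the study data), these three quantities can be obtained with the unique and value_counts methods:

```python
import pandas as pd

# Synthetic categorical variable with two classes
s = pd.Series(['No', 'Yes', 'No', 'No', 'Yes'])

sample_space = s.unique()                   # the classes: 'No' and 'Yes'
freq = s.value_counts()                     # frequency of each class
rel_freq = s.value_counts(normalize=True)   # frequency divided by the total

print(list(sample_space))   # ['No', 'Yes']
print(freq['No'])           # 3
print(rel_freq['No'])       # 0.6
```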
The anaemia column indicates whether a subject has anemia. The sample space (all the values or elements or classes) that the anaemia variable can take can be determined using the unique method.
# Determine the unique classes for the anaemia column
df['anaemia'].unique()
array(['No', 'Yes'], dtype=object)
The sample space elements are No (no anemia present) and Yes (anemia present). The value_counts method returns the frequency of each of the classes.
# Determine the value counts of the classes in the anaemia column
df.anaemia.value_counts()
anaemia
No     170
Yes    129
Name: count, dtype: int64
A total of $170$ subjects did not have anemia and $129$ did.
Passing the value True to the normalize argument returns the relative frequency (or proportion) of each class.
# Determine the relative frequency of the classes in the anaemia column
df.anaemia.value_counts(normalize=True)
anaemia
No     0.568562
Yes    0.431438
Name: proportion, dtype: float64
About $56.9\%$ of the subjects did not have anemia and about $43.1\%$ of subjects did. The mode of this variable is therefore No.
Task
Determine the sample space elements, the frequency, and the relative frequencies of the diabetes variable.
Solution
df.diabetes.unique()
array(['No', 'Yes'], dtype=object)
df.diabetes.value_counts()
diabetes
No     174
Yes    125
Name: count, dtype: int64
df.diabetes.value_counts(normalize=True)
diabetes
No     0.58194
Yes    0.41806
Name: proportion, dtype: float64
Comparative descriptive statistics for categorical variables express the frequency (or proportion) of the classes in the data, after grouping the values by the sample space elements of another categorical variable. Below, the frequency of the anaemia variable is determined for each of the classes in the diabetes variable, by making use of the groupby method.
# Determine the frequency of the classes in the anaemia column after grouping by the classes in the diabetes column
df.groupby('diabetes').anaemia.value_counts()
diabetes  anaemia
No        No          98
          Yes         76
Yes       No          72
          Yes         53
Name: count, dtype: int64
The normalize argument can be added to the value_counts method to show the proportion of each class. By multiplying by $100$, the percentages are obtained.
# Express the proportions as percentages
df.groupby('diabetes').anaemia.value_counts(normalize=True) * 100
diabetes  anaemia
No        No         56.321839
          Yes        43.678161
Yes       No         57.600000
          Yes        42.400000
Name: proportion, dtype: float64
The pandas crosstab function can create a contingency table of the comparative frequencies. The variable that is passed first populates the rows of the table.
# Create a cross tabulation of the anaemia and diabetes columns
pd.crosstab(df.diabetes, df.anaemia)
anaemia | No | Yes |
---|---|---|
diabetes | ||
No | 98 | 76 |
Yes | 72 | 53 |
The joint frequencies from the table above are expressed as percentages below.
# Express the cross tabulation as percentages
pd.crosstab(df.diabetes, df.anaemia, normalize=True) * 100
anaemia | No | Yes |
---|---|---|
diabetes | ||
No | 32.775920 | 25.418060 |
Yes | 24.080268 | 17.725753 |
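As an aside, the crosstab function also accepts a margins argument that appends row and column totals. A small sketch with hypothetical data (not the heart failure dataset):

```python
import pandas as pd

# Hypothetical observations for two binary categorical variables
df_demo = pd.DataFrame({
    'diabetes': ['No', 'No', 'Yes', 'Yes', 'No'],
    'anaemia':  ['No', 'Yes', 'No', 'No', 'No'],
})

# margins=True adds an 'All' row and column holding the totals
table = pd.crosstab(df_demo.diabetes, df_demo.anaemia, margins=True)
print(table)
```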
Task
Generate a contingency table, indicating the joint percentages of the anaemia variable for each of the classes in the death column.
Solution
pd.crosstab(df.death, df.anaemia, normalize=True) * 100
anaemia | No | Yes |
---|---|---|
death | ||
No | 40.133779 | 27.759197 |
Yes | 16.722408 | 15.384615 |
Various statistics can be calculated for continuous numerical variables. These include measures of central tendency and measures of dispersion.
Measures of central tendency are statistical indicators that represent the center point or typical value of a dataset. These measures indicate where most values in a dataset fall and are also referred to as the central location of a distribution. The three main measures of central tendency are as follows.
Mean: The mean, often called the average, is calculated by adding all data points in the dataset and then dividing by the number of data points. The mean is sensitive to outliers, meaning that extremely high or low values can skew the mean.
Median: The median is the middle value in a dataset when the data points are arranged in ascending or descending order. If the dataset has an even number of observations, the median is the average of the two middle numbers. The median is not affected by outliers or skewed data.
Mode: The mode is the value that appears most frequently in a dataset. A dataset may have one mode, more than one mode, or no mode at all. The mode was used to explore categorical variables in the previous section.
These measures help provide a summary of the dataset and can give a general sense of the typical value that might be expected.
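A short sketch with made-up values shows all three measures side by side:

```python
import pandas as pd

# Five synthetic data points
values = pd.Series([2, 3, 3, 5, 7])

mean_value = values.mean()      # (2 + 3 + 3 + 5 + 7) / 5 = 4.0
median_value = values.median()  # middle value of the sorted data: 3.0
mode_value = values.mode()[0]   # most frequent value: 3

print(mean_value, median_value, mode_value)
```

Note how the single large value ($7$) pulls the mean above the median, while the median and mode are unaffected.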
The mean for a numerical variable can be calculated using the mean method for a pandas series object. The mean of the age variable is calculated below.
# Calculate the mean of the age variable
df.age.mean()
60.83389297658862
The mean for a variable can be calculated for different groups, created from the classes of a categorical variable. The mean of the age variable is calculated for each of the classes in the death variable, using the groupby method.
# Calculate the mean age by the groups in the death column
df.groupby('death').age.mean()
death
No     58.761906
Yes    65.215281
Name: age, dtype: float64
The median method calculates the median of a pandas series object that contains numerical data. The median of the age variable is calculated below for each class in the death variable.
# Calculate the median age by the groups in the death column
df.groupby('death').age.median()
death
No     60.0
Yes    65.0
Name: age, dtype: float64
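When several statistics are needed per group, the agg method accepts a list of statistic names, which avoids separate groupby calls. A sketch with hypothetical values (not the study data):

```python
import pandas as pd

# Hypothetical ages grouped by a binary outcome
df_demo = pd.DataFrame({
    'death': ['No', 'No', 'No', 'Yes', 'Yes'],
    'age':   [50.0, 60.0, 70.0, 65.0, 75.0],
})

# One call returns both the mean and the median per group
summary = df_demo.groupby('death').age.agg(['mean', 'median'])
print(summary)
```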
Task
Determine the median ejection_fraction for each of the classes in the anaemia variable.
Solution
df.groupby('anaemia')['ejection_fraction'].median()
anaemia
No     38.0
Yes    38.0
Name: ejection_fraction, dtype: float64
Measures of dispersion, also known as measures of variability, provide insights into the spread or variability of a dataset. They indicate how spread out the values in a dataset are around the center (mean or median). The main measures of dispersion are as follows.
Range: The range is the simplest measure of dispersion and is calculated as the difference between the highest and the lowest value in the dataset.
Variance: Variance measures how far each number in the set is from the mean (average) and thus from every other number in the set. It's often used in statistical and probability theory.
Standard Deviation: The standard deviation is the square root of the variance and provides a measure of dispersion in the same units as the original data. It's the most commonly used measure of spread.
Interquartile Range (IQR): The IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile) of the data. It is used to measure statistical dispersion and data variability by dividing a data set into quartiles.
These measures help to understand the variability within a dataset, the reliability of statistical estimations, and the level of uncertainty. They are crucial in many statistical analyses as they provide context to the measures of central tendency.
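The four measures can be sketched on a small synthetic series (not the study data):

```python
import pandas as pd
from scipy import stats

# Five equally spaced synthetic values
values = pd.Series([4.0, 6.0, 8.0, 10.0, 12.0])

value_range = values.max() - values.min()  # 12 - 4 = 8.0
sample_var = values.var(ddof=1)            # sum of squared deviations / (n - 1) = 10.0
sample_std = values.std(ddof=1)            # square root of the variance
iqr_value = stats.iqr(values)              # third quartile minus first quartile

print(value_range, sample_var, sample_std, iqr_value)
```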
The minimum and maximum values of a numerical variable (expressed as a pandas series object) are calculated using the min and max methods.
# Determine the minimum value of the age column
df.age.min()
40.0
# Determine the maximum value of the age column
df.age.max()
95.0
# Determine the range of the age column
df.age.max() - df.age.min()
55.0
The sample variance of a numerical variable (expressed as a pandas series object) is determined using the var method. The ddof argument is the delta degrees of freedom. In pandas its default value is $1$, which calculates the sample variance; setting it to $0$ calculates the population variance instead. The difference between the two equations is shown in (1) for the sake of interest. In (1), $s^{2}$ is the sample variance, $\sigma^{2}$ is the population variance, $x_{i}$ is the value of the numerical variable for each subject, $\bar{x}$ is the mean of the variable, $n$ is the sample size, and $N$ is the population size.

$$s^{2} = \frac{\sum_{i=1}^{n} \left( x_{i} - \bar{x} \right)^{2}}{n - 1}, \qquad \sigma^{2} = \frac{\sum_{i=1}^{N} \left( x_{i} - \bar{x} \right)^{2}}{N} \tag{1}$$
# Determine the sample variance of the age column
df.age.var(ddof=1)
141.48648290797084
The std method determines the standard deviation. The sample standard deviation is calculated below.
# Determine the sample standard deviation of the age column
df.age.std(ddof=1)
11.894809074044478
The stats module of the scipy package contains many functions for statistical analysis. The iqr function in this module determines the interquartile range. A quantile takes a fraction (of $1$) as argument. If the fraction is $0.25$, the value that is returned is the first quartile (the $25^{\text{th}}$ percentile). This is a value from the data such that a quarter of the values are less than it. If the fraction is $0.75$, the value that is returned is the third quartile (the $75^{\text{th}}$ percentile). This is the value in the data set for which three-quarters of the values are less than it. The interquartile range of the age values is calculated below.
# Determine the interquartile range of the age column
stats.iqr(df.age)
19.0
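The quartiles themselves can also be obtained with the pandas quantile method, and the interquartile range then follows as their difference. A sketch with synthetic values:

```python
import pandas as pd

# Nine synthetic values
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

q1 = values.quantile(0.25)  # first quartile (25th percentile)
q3 = values.quantile(0.75)  # third quartile (75th percentile)
iqr_value = q3 - q1

print(q1, q3, iqr_value)  # 3.0 7.0 4.0
```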
Task
Determine the variance in ejection_fraction for each of the classes in the anaemia variable.
Solution
df.groupby('anaemia')['ejection_fraction'].var(ddof=1)
The correlation between two continuous numerical variables indicates how the values in one variable change as the values of the other change. The correlation requires a value for each variable from each subject in a sample. The correlation between two variables is a number between $-1$ and $1$. A value of $-1$ indicates a perfect negative correlation, a value of $0$ indicates no correlation, and a value of $1$ indicates a perfect positive correlation. A positive value indicates a positive correlation (as the values of one variable increase, so do those of the other). A negative value indicates a negative correlation (as the values of one variable increase, those of the other decrease).
As a rule of thumb, values of the correlation coefficient close to $0$ indicate a weak correlation and values close to $1$ indicate a strong correlation. The same rule of thumb is used for negative values.
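The two extremes of the coefficient can be sketched with synthetic series:

```python
import pandas as pd

# x increases steadily; y_up moves with it, y_down moves against it
x = pd.Series([1.0, 2.0, 3.0, 4.0])
y_up = x * 2        # perfectly linear increase with x
y_down = 10 - x     # perfectly linear decrease as x increases

print(x.corr(y_up))    # 1.0 (perfect positive correlation)
print(x.corr(y_down))  # -1.0 (perfect negative correlation)
```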
The correlation between two variables can be calculated using the corr method. The correlation between the age and ejection_fraction variables is calculated below.
# Calculate the correlation between the age and ejection-fraction variables
df.age.corr(df['ejection_fraction'])
0.060098363232912864
There is a weak positive correlation between the two variables.
The describe method calculates various statistics for a pandas series object that contains numerical data. The method is used for the age variable below.
# Describe the age column
df.age.describe()
count    299.000000
mean      60.833893
std       11.894809
min       40.000000
25%       51.000000
50%       60.000000
75%       70.000000
max       95.000000
Name: age, dtype: float64
Note that the value for $50\%$ is the fiftieth percentile value. Half of the ages are less than this value, i.e. it is the median.
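This equivalence is easy to confirm on a small synthetic series:

```python
import pandas as pd

# Five synthetic values; the 50% entry of describe equals the median
values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

summary = values.describe()
print(summary['50%'], values.median())  # 30.0 30.0
```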
How do you import the pandas package in Python?
What function do you use to read a CSV file in pandas?
How do you display the first 5 rows of a DataFrame in pandas?
How do you calculate the mean of a column named Age in a DataFrame named df?
How do you calculate the median of a column named Salary in a DataFrame named df?
How do you calculate the standard deviation of a column named Score in a DataFrame named df?
How do you find the number of missing values in each column of a DataFrame named df?
How do you calculate the correlation between two columns, Age and Salary, in a DataFrame named df?
How do you select a subset of a DataFrame df where the column Age is greater than $30$?
How do you calculate the range (maximum - minimum) of a column named Score in a DataFrame named df?
How do you group a DataFrame df by a column named Department and calculate the mean of Salary within each group?
How do you group a DataFrame df by two columns, Department and Job Title, and count the number of rows within each group?
How do you use the groupby method to find the maximum Age in each Department in a DataFrame df?
How do you create a cross-tabulation table that shows the frequency count of Department (rows) and Job Title (columns) in a DataFrame df?
How do you create a cross-tabulation table that shows the mean Salary for each combination of Department (rows) and Job Title (columns) in a DataFrame df?