This chapter of an Introduction to Health Data Science by Dr JH Klopper is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International
This chapter serves as an introduction to the fundamental concepts and techniques of exploratory data analysis (EDA), a critical step in any data analysis process.
EDA is an approach and a set of techniques that allows an initial understanding and appreciation of the information in complex datasets. It is an essential component of data science, machine learning, and statistical modeling. It is the first step in the data analysis process, exploring the data to understand its main characteristics before making any assumptions, conducting statistical tests, or building predictive models.
The primary goal of EDA is to maximize the data scientist's insight into a dataset and into the underlying structure of the data, and to extract the specific items of information that a data scientist needs from a dataset.
EDA is not merely a set of techniques, but a mindset. It encourages the data scientist to remain open-minded, to explore the data from multiple angles, to question assumptions, and to be ready to modify initial hypotheses based on the insights gained from the data.
This chapter explores the summarization of data, termed descriptive statistics.
import pandas as pd
from scipy import stats
The data for this chapter is in the heart_failure.csv file. It is imported below and metadata about the data is returned using the attributes and methods used in Chapter 8.
# Import the heart_failure.csv file and assign it to the variable df
df = pd.read_csv('https://raw.githubusercontent.com/juanklopper/TutorialData/main/heart_failure.csv')
# View the first 5 rows of the dataframe
df.head()
 | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | hypertension | platelets | serum_creatinine | serum_sodium | sex | smoking | time | death |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 75.0 | No | 582 | No | 20 | Yes | 265000.00 | 1.9 | 130 | Male | No | 4 | Yes |
1 | 55.0 | No | 7861 | No | 38 | No | 263358.03 | 1.1 | 136 | Male | No | 6 | Yes |
2 | 65.0 | No | 146 | No | 20 | No | 162000.00 | 1.3 | 129 | Male | Yes | 7 | Yes |
3 | 50.0 | Yes | 111 | No | 20 | No | 210000.00 | 1.9 | 137 | Male | No | 7 | Yes |
4 | 65.0 | Yes | 160 | Yes | 20 | No | 327000.00 | 2.7 | 116 | Female | No | 8 | Yes |
# Return the shape of the dataframe
df.shape # Returns a tuple with number of rows and columns
(299, 13)
# Return info about the dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    object
 2   creatinine_phosphokinase  299 non-null    int64
 3   diabetes                  299 non-null    object
 4   ejection_fraction         299 non-null    int64
 5   hypertension              299 non-null    object
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64
 9   sex                       299 non-null    object
 10  smoking                   299 non-null    object
 11  time                      299 non-null    int64
 12  death                     299 non-null    object
dtypes: float64(3), int64(4), object(6)
memory usage: 30.5+ KB
The info method used above shows that several columns (statistical variables) are object types. This indicates categorical variables that were not encoded as numbers when the data file was generated.
Exploratory data analysis of categorical variables starts by determining all the possible values that the variable can take, called the sample space of the variable. The different elements in the sample space are called classes or levels. The number or frequency of occurrence of each class can be determined. Finally, the relative frequency or proportion of each class can also be calculated by dividing the frequency of the class by the total number of observations.
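As a minimal sketch (using a short synthetic series, not the study data), these three quantities can be obtained with the unique and value_counts methods:

```python
import pandas as pd

# Synthetic categorical variable with two classes
s = pd.Series(['No', 'Yes', 'No', 'No', 'Yes'])

sample_space = s.unique()                   # the classes: 'No' and 'Yes'
freq = s.value_counts()                     # frequency of each class
rel_freq = s.value_counts(normalize=True)   # frequency divided by the total

print(list(sample_space))   # ['No', 'Yes']
print(freq['No'])           # 3
print(rel_freq['No'])       # 0.6
```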
The anaemia column indicates whether a subject has anemia. The sample space (all the values or elements or classes) that the anaemia variable can take can be determined using the unique method.
# Determine the unique classes for the anaemia column
df['anaemia'].unique()
array(['No', 'Yes'], dtype=object)
The sample space elements are No (no anemia present) and Yes (anemia present). The value_counts method returns the frequency of each of the classes.
# Determine the value counts of the classes in the anaemia column
df.anaemia.value_counts()
anaemia
No     170
Yes    129
Name: count, dtype: int64
A total of $170$ subjects did not have anemia and $129$ did.
Passing the value True to the normalize argument returns the relative frequency (or proportion) of each class.
# Determine the relative frequency of the classes in the anaemia column
df.anaemia.value_counts(normalize=True)
anaemia
No     0.568562
Yes    0.431438
Name: proportion, dtype: float64
About $56.9\%$ of the subjects did not have anemia and about $43.1\%$ of subjects did. The mode of this variable is therefore No.
Task
Determine the sample space elements, the frequency, and the relative frequencies of the diabetes variable.
Solution
df.diabetes.unique()
array(['No', 'Yes'], dtype=object)
df.diabetes.value_counts()
diabetes
No     174
Yes    125
Name: count, dtype: int64
df.diabetes.value_counts(normalize=True)
diabetes
No     0.58194
Yes    0.41806
Name: proportion, dtype: float64
Comparative descriptive statistics for categorical variables express the frequency (or proportion) of the classes in the data, after grouping the values by the sample space elements of another categorical variable. Below, the frequency of the anaemia variable is determined for each of the classes in the diabetes variable, by making use of the groupby method.
# Determine the frequency of the classes in the anaemia column after grouping by the classes in the diabetes column
df.groupby('diabetes').anaemia.value_counts()
diabetes  anaemia
No        No          98
          Yes         76
Yes       No          72
          Yes         53
Name: count, dtype: int64
The normalize argument can be added to the value_counts method to show the proportion of each class. By multiplying by $100$, the percentages are obtained.
# Express the proportions as percentages
df.groupby('diabetes').anaemia.value_counts(normalize=True) * 100
diabetes  anaemia
No        No         56.321839
          Yes        43.678161
Yes       No         57.600000
          Yes        42.400000
Name: proportion, dtype: float64
The pandas crosstab function can create a contingency table of the comparative frequencies. The variable that is passed first populates the rows of the table.
# Create a cross tabulation of the anaemia and diabetes columns
pd.crosstab(df.diabetes, df.anaemia)
anaemia | No | Yes |
---|---|---|
diabetes | ||
No | 98 | 76 |
Yes | 72 | 53 |
The joint frequencies from the table above are expressed as percentages below.
# Express the cross tabulation as percentages
pd.crosstab(df.diabetes, df.anaemia, normalize=True) * 100
anaemia | No | Yes |
---|---|---|
diabetes | ||
No | 32.775920 | 25.418060 |
Yes | 24.080268 | 17.725753 |
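As an aside, the crosstab function also accepts a margins argument that appends row and column totals. A small sketch with hypothetical data (not the heart failure dataset):

```python
import pandas as pd

# Hypothetical observations for two binary categorical variables
df_demo = pd.DataFrame({
    'diabetes': ['No', 'No', 'Yes', 'Yes', 'No'],
    'anaemia':  ['No', 'Yes', 'No', 'No', 'No'],
})

# margins=True adds an 'All' row and column holding the totals
table = pd.crosstab(df_demo.diabetes, df_demo.anaemia, margins=True)
print(table)
```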
Task
Generate a contingency table, indicating the joint percentages of the anaemia variable for each of the classes in the death column.
Solution
pd.crosstab(df.death, df.anaemia, normalize=True) * 100
anaemia | No | Yes |
---|---|---|
death | ||
No | 40.133779 | 27.759197 |
Yes | 16.722408 | 15.384615 |
Various statistics can be calculated for continuous numerical variables. These include measures of central tendency and measures of dispersion.
Measures of central tendency are statistical indicators that represent the center point or typical value of a dataset. These measures indicate where most values in a dataset fall and are also referred to as the central location of a distribution. The three main measures of central tendency are as follows.
Mean: The mean, often called the average, is calculated by adding all data points in the dataset and then dividing by the number of data points. The mean is sensitive to outliers, meaning that extremely high or low values can skew the mean.
Median: The median is the middle value in a dataset when the data points are arranged in ascending or descending order. If the dataset has an even number of observations, the median is the average of the two middle numbers. The median is not affected by outliers or skewed data.
Mode: The mode is the value that appears most frequently in a dataset. A dataset may have one mode, more than one mode, or no mode at all. The mode was used to explore categorical variables in the previous section.
These measures help provide a summary of the dataset and can give a general sense of the typical value that might be expected.
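A short sketch with made-up values shows all three measures side by side:

```python
import pandas as pd

# Five synthetic data points
values = pd.Series([2, 3, 3, 5, 7])

mean_value = values.mean()      # (2 + 3 + 3 + 5 + 7) / 5 = 4.0
median_value = values.median()  # middle value of the sorted data: 3.0
mode_value = values.mode()[0]   # most frequent value: 3

print(mean_value, median_value, mode_value)
```

Note how the single large value ($7$) pulls the mean above the median, while the median and mode are unaffected.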
The mean for a numerical variable can be calculated using the mean method for a pandas series object. The mean of the age variable is calculated below.
# Calculate the mean of the age variable
df.age.mean()
60.83389297658862
The mean for a variable can be calculated for different groups, created from the classes of a categorical variable. The mean of the age variable is calculated for each of the classes in the death variable, using the groupby method.
# Calculate the mean age by the groups in the death column
df.groupby('death').age.mean()
death
No     58.761906
Yes    65.215281
Name: age, dtype: float64
The median method calculates the median of a pandas series object that contains numerical data. The median of the age variable is calculated below for each class in the death variable.
# Calculate the median age by the groups in the death column
df.groupby('death').age.median()
death
No     60.0
Yes    65.0
Name: age, dtype: float64
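When several statistics are needed per group, the agg method accepts a list of statistic names, which avoids separate groupby calls. A sketch with hypothetical values (not the study data):

```python
import pandas as pd

# Hypothetical ages grouped by a binary outcome
df_demo = pd.DataFrame({
    'death': ['No', 'No', 'No', 'Yes', 'Yes'],
    'age':   [50.0, 60.0, 70.0, 65.0, 75.0],
})

# One call returns both the mean and the median per group
summary = df_demo.groupby('death').age.agg(['mean', 'median'])
print(summary)
```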
Task
Determine the median ejection_fraction for each of the classes in the anaemia variable.
Solution
df.groupby('anaemia')['ejection_fraction'].median()
anaemia
No     38.0
Yes    38.0
Name: ejection_fraction, dtype: float64
Measures of dispersion, also known as measures of variability, provide insights into the spread or variability of a dataset. They indicate how spread out the values in a dataset are around the center (mean or median). The main measures of dispersion are as follows.
Range: The range is the simplest measure of dispersion and is calculated as the difference between the highest and the lowest value in the dataset.
Variance: Variance measures how far each number in the set is from the mean (average) and thus from every other number in the set. It's often used in statistical and probability theory.
Standard Deviation: The standard deviation is the square root of the variance and provides a measure of dispersion in the same units as the original data. It's the most commonly used measure of spread.
Interquartile Range (IQR): The IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile) of the data. It is used to measure statistical dispersion and data variability by dividing a data set into quartiles.
These measures help to understand the variability within a dataset, the reliability of statistical estimations, and the level of uncertainty. They are crucial in many statistical analyses as they provide context to the measures of central tendency.
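The four measures can be sketched on a small synthetic series (not the study data):

```python
import pandas as pd
from scipy import stats

# Five equally spaced synthetic values
values = pd.Series([4.0, 6.0, 8.0, 10.0, 12.0])

value_range = values.max() - values.min()  # 12 - 4 = 8.0
sample_var = values.var(ddof=1)            # sum of squared deviations / (n - 1) = 10.0
sample_std = values.std(ddof=1)            # square root of the variance
iqr_value = stats.iqr(values)              # third quartile minus first quartile

print(value_range, sample_var, sample_std, iqr_value)
```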
The minimum and maximum values of a numerical variable (expressed as a pandas series object) are calculated using the min and max methods.
# Determine the minimum value of the age column
df.age.min()
40.0
# Determine the maximum value of the age column
df.age.max()
95.0
# Determine the range of the age column
df.age.max() - df.age.min()
55.0
The sample variance of a numerical variable (expressed as a pandas series object) is determined using the var method. The ddof argument is the delta degrees of freedom. In pandas its default value is $1$, which calculates the sample variance; setting it to $0$ calculates the population variance instead. The difference between the two equations is shown in (1) for the sake of interest. In (1), $s^{2}$ is the sample variance, $\sigma^{2}$ is the population variance, $x_{i}$ is the value of the numerical variable for each subject, $\bar{x}$ is the mean of the variable, $n$ is the sample size, and $N$ is the population size.

$$s^{2} = \frac{\sum_{i=1}^{n} \left( x_{i} - \bar{x} \right)^{2}}{n - 1}, \qquad \sigma^{2} = \frac{\sum_{i=1}^{N} \left( x_{i} - \bar{x} \right)^{2}}{N} \tag{1}$$
# Determine the sample variance of the age column
df.age.var(ddof=1)
141.48648290797084
The std method determines the standard deviation. The sample standard deviation is calculated below.
# Determine the sample standard deviation of the age column
df.age.std(ddof=1)
11.894809074044478
The stats module of the scipy package contains many functions for statistical analysis. The iqr function in this module determines the interquartile range. A quantile takes a fraction (of $1$) as argument. If the fraction is $0.25$, the value that is returned is the first quartile (the $25^{\text{th}}$ percentile). This is a value from the data such that a quarter of the values are less than it. If the fraction is $0.75$, the value that is returned is the third quartile (the $75^{\text{th}}$ percentile). This is the value in the data set for which three-quarters of the values are less than it. The interquartile range of the age values is calculated below.
# Determine the interquartile range of the age column
stats.iqr(df.age)
19.0
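The quartiles themselves can also be obtained with the pandas quantile method, and the interquartile range then follows as their difference. A sketch with synthetic values:

```python
import pandas as pd

# Nine synthetic values
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

q1 = values.quantile(0.25)  # first quartile (25th percentile)
q3 = values.quantile(0.75)  # third quartile (75th percentile)
iqr_value = q3 - q1

print(q1, q3, iqr_value)  # 3.0 7.0 4.0
```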
Task
Determine the variance in ejection_fraction for each of the classes in the anaemia variable.
Solution
df.groupby('anaemia')['ejection_fraction'].var(ddof=1)
The correlation between two continuous numerical variables indicates how the values in one variable change as the values of the other change. The correlation requires a value for each variable from each subject in a sample. The correlation between two variables is a number between $-1$ and $1$. A value of $-1$ indicates a perfect negative correlation, a value of $0$ indicates no correlation, and a value of $1$ indicates a perfect positive correlation. A positive value indicates a positive correlation (as the values of one variable increase, so do those of the other). A negative value indicates a negative correlation (as the values of one variable increase, those of the other decrease).
As a rule of thumb, values of the correlation coefficient close to $0$ indicate a weak correlation and values close to $1$ indicate a strong correlation. The same rule of thumb is used for negative values.
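The two extremes of the coefficient can be sketched with synthetic series:

```python
import pandas as pd

# x increases steadily; y_up moves with it, y_down moves against it
x = pd.Series([1.0, 2.0, 3.0, 4.0])
y_up = x * 2        # perfectly linear increase with x
y_down = 10 - x     # perfectly linear decrease as x increases

print(x.corr(y_up))    # 1.0 (perfect positive correlation)
print(x.corr(y_down))  # -1.0 (perfect negative correlation)
```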
The correlation between two variables can be calculated using the corr method. The correlation between the age and ejection_fraction variables is calculated below.
# Calculate the correlation between the age and ejection-fraction variables
df.age.corr(df['ejection_fraction'])
0.060098363232912864
There is a weak positive correlation between the two variables.
The describe method calculates various statistics for a pandas series object that contains numerical data. The method is used for the age variable below.
# Describe the age column
df.age.describe()
count    299.000000
mean      60.833893
std       11.894809
min       40.000000
25%       51.000000
50%       60.000000
75%       70.000000
max       95.000000
Name: age, dtype: float64
Note that the value for $50\%$ is the fiftieth percentile value. Half of the ages are less than this value, i.e. it is the median.
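This equivalence is easy to confirm on a small synthetic series:

```python
import pandas as pd

# Five synthetic values; the 50% entry of describe equals the median
values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

summary = values.describe()
print(summary['50%'], values.median())  # 30.0 30.0
```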
How do you import the pandas package in Python?
What function do you use to read a CSV file in pandas?
How do you display the first 5 rows of a DataFrame in pandas?
How do you calculate the mean of a column named Age in a DataFrame named df?
How do you calculate the median of a column named Salary in a DataFrame named df?
How do you calculate the standard deviation of a column named Score in a DataFrame named df?
How do you find the number of missing values in each column of a DataFrame named df?
How do you calculate the correlation between two columns, Age and Salary, in a DataFrame named df?
How do you select a subset of a DataFrame df where the column Age is greater than $30$?
How do you calculate the range (maximum - minimum) of a column named Score in a DataFrame named df?
How do you group a DataFrame df by a column named Department and calculate the mean of Salary within each group?
How do you group a DataFrame df by two columns, Department and Job Title, and count the number of rows within each group?
How do you use the groupby method to find the maximum Age in each Department in a DataFrame df?
How do you create a cross-tabulation table that shows the frequency count of Department (rows) and Job Title (columns) in a DataFrame df?
How do you create a cross-tabulation table that shows the mean Salary for each combination of Department (rows) and Job Title (columns) in a DataFrame df?