This chapter of an Introduction to Health Data Science by Dr JH Klopper is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International
Data science, by its name and nature, requires access to data. Images, sound files, text, and much more, can be represented as data. This lecture explores the use of Python for tabular data.
Tabular data exists in rows and columns, whether extracted from an image, a database, or a similar structure, and is represented as an array. An array is a set of values arranged in rows and columns (and along even more dimensions, as in the case of color images, where arrays of rows and columns are stacked to represent the red, green, and blue channels). There is a fantastic Python package for importing such tabular data.
The pandas package has much to do with the success of Python as a programming language for statistical analysis and data science. It is an enormous package and is used to import data, to manipulate data, to do calculations with data, and even to create graphs and plots with the data.
This notebook provides a glimpse into the usefulness of the pandas package by importing data captured in a spreadsheet file and extracting information from it.
It is useful to import all packages at the start of a notebook. This allows for keeping track of what extensions to the Python language are used in the notebook. Namespace abbreviations are used to make it easier to refer to the packages later in the notebook.
import pandas as pd # Package to work with data
import numpy as np # Numerical analysis package
The data used in this notebook is a spreadsheet file in comma separated values (CSV) format called data.csv
. The file is imported using the pandas read_csv
function. Since it is not a built-in Python function, the package in which the function is hosted has to be referenced (depending on the way the package was imported). This is done by preceding the function with the pandas namespace abbreviation, pd
.
The data.csv
file is in the same folder (directory) as the notebook. If it was in a different folder, the path to the file would have to be specified.
# Import the data.csv file and assign it to the variable df
df = pd.read_csv('data.csv')
The file is also available on GitHub as PUBH1142Chap0801.csv
and is imported below.
# Import the file https://raw.githubusercontent.com/juanklopper/TutorialData/main/PUBH1142Chap0801.csv from GitHub and assign it to the variable df
df = pd.read_csv('https://raw.githubusercontent.com/juanklopper/TutorialData/main/PUBH1142Chap0801.csv')
The type
function used below shows that the object assigned to the df
computer variable is a dataframe object.
# Type of df
type(df)
pandas.core.frame.DataFrame
The methods and attributes of the df
object can be printed using the dir
function. The methods are the functions that can be applied to the object. The attributes are the properties of the object.
# Print the first 20 methods and attributes of df using the dir function
dir(df)[:20]
['Age', 'CholesterolAfter', 'CholesterolBefore', 'DOB', 'Delta', 'Group', 'HR', 'Name', 'Smoke', 'Survey', 'T', 'TAG', 'Vocation', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__']
There are many methods and attributes. The first few listed are the column headers (the statistical variable names) in the spreadsheet file.
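As a brief illustration (using made-up values, not the data.csv file), a dataframe object can also be created directly from a Python dictionary. This makes it clear why the column headers appear among the attributes: each dictionary key becomes a column label.

```python
import pandas as pd

# A dictionary of column labels (keys) and values (lists) - made-up data
toy = pd.DataFrame({
    'Age': [43, 53, 33],
    'Smoke': [0, 0, 1]
})

# Each column label is now an attribute of the dataframe object
print(toy.Age.tolist())  # [43, 53, 33]
print('Smoke' in dir(toy))  # True
```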
Task
Read about methods and attributes of a pandas dataframe object.
One such method is the head
method. By default it returns the first five rows of a dataframe object. An integer value can be passed as argument for a different number of rows.
# Use the head method to print the first 5 rows of df
df.head()
Name | DOB | Age | Vocation | Smoke | HR | sBP | CholesterolBefore | TAG | Survey | CholesterolAfter | Delta | Group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Dylan Patton | 1981-10-07 | 43 | Energy manager | 0 | 47 | 145 | 1.2 | 1.2 | 1 | 0.7 | 0.5 | Active |
1 | Sandra Howard | 1993-01-27 | 53 | Tax adviser | 0 | 51 | 115 | 1.2 | 0.6 | 3 | 1.0 | 0.2 | Active |
2 | Samantha Williams | 1973-12-21 | 33 | IT consultant | 0 | 54 | 120 | 2.0 | 1.3 | 3 | 1.7 | 0.3 | Active |
3 | Ashley Hensley | 1981-12-01 | 43 | Nurse, children's | 0 | 54 | 103 | 2.1 | 1.6 | 4 | 2.1 | 0.0 | Active |
4 | Robert Wilson | 1964-06-23 | 46 | Clinical embryologist | 0 | 61 | 138 | 2.8 | 2.1 | 5 | 2.8 | 0.0 | Active |
The shape
attribute shows us the number of rows and columns, returned as a tuple. Note that unlike a method (which is like a Python function), an attribute takes no parentheses.
# Print the shape of df
df.shape
(200, 13)
There are $200$ observations (rows) and $13$ statistical variables (columns) in this tidy data set.
The columns
attribute lists all the column header names, called labels.
# Print the columns of df
df.columns
Index(['Name', 'DOB', 'Age', 'Vocation', 'Smoke', 'HR', 'sBP', 'CholesterolBefore', 'TAG', 'Survey', 'CholesterolAfter', 'Delta', 'Group'], dtype='object')
The majority of dataframe objects will have two axes (rows and columns). We can verify this using the ndim
attribute.
# Print the dimensions of df
df.ndim
2
The size
attribute gives us the total number of data point values (the product of the number of rows and columns).
# Print the size of df
df.size
2600
The last attribute discussed here is the dtype
attribute. It returns the Python data type of the values in each of the columns. This is a very important step. Pandas does its best to interpret the data type. Depending on how the spreadsheet was created and how data was entered, it is not always possible to correctly interpret the type. In this case, the data type may have to be changed manually. Remember that the analysis of data is based on the data type of the variable.
# Print the type of df
df.dtypes
Name object DOB object Age int64 Vocation object Smoke int64 HR int64 sBP int64 CholesterolBefore float64 TAG float64 Survey int64 CholesterolAfter float64 Delta float64 Group object dtype: object
Categorical variables are denoted as an object
type. Numerical variables can be either integer or floating point numbers (numbers with decimal places). These are int64
and float64
(denoting 64-bit precision) respectively.
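When pandas does misinterpret a type, the astype method can be used to convert a column. The sketch below uses made-up values rather than the data in this notebook.

```python
import pandas as pd

# A made-up column imported as strings instead of integers
s = pd.Series(['43', '53', '33'])
print(s.dtype)  # object

# Convert the values to 64-bit integers
s = s.astype('int64')
print(s.dtype)  # int64
```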
To analyse data, it is often necessary to extract only certain values. This is a very useful skill.
Pandas refers to a single column in a dataframe object as a series object. The Age
column is extracted below and saved as a series object, assigned to the variable age_column
. The notation uses square brackets, with the column name represented as a string.
# Assign the Age column in df to the variable age_column
age_column = df['Age']
The age_column
variable now holds a pandas series object.
# Print the type of age_column
type(age_column)
pandas.core.series.Series
Since there are no illegal characters in the column name, such as spaces, the dot notation can be used. The age_column computer
variable is overwritten below.
# Assign the Age column in df to the variable age_column and use dot notation
age_column = df.Age
The first five rows of the series object are displayed below using the head method.
# Print the first 5 rows of age_column
age_column.head()
0 43 1 53 2 33 3 43 4 46 Name: Age, dtype: int64
The index column above indicates that the values are not a simple Python list object.
Task
Select the Vocation
column and assign it to the variable vocation
. Confirm that the object assigned to vocation is a pandas series object. The tail
method (with no arguments) prints the last five observations to the screen. Use the tail
method to print the last five observations in vocation
to the screen.
Solution
vocation = df.Vocation
type(vocation)
pandas.core.series.Series
vocation.tail()
195 Sales professional, IT 196 Paramedic 197 Chartered certified accountant 198 Dancer 199 Theme park manager Name: Vocation, dtype: object
At times it may be more useful to work with a numpy array, rather than a pandas series object. The to_numpy
method is used to extract the age values as a numpy array. It is used below and the numpy array is assigned to the variable age
.
# Convert the age_column to a numpy array and assign it to the variable age
age = age_column.to_numpy()
The type
function shows that age
contains a numpy array.
# Print the type of age
type(age)
numpy.ndarray
Pandas series objects and numpy arrays have many methods and attributes. The mean
, min
, and max
methods are used below to calculate the average, minimum, and maximum age values respectively.
# Calculate the average age in age_column
age_column.mean()
53.07
# Calculate the minimum age in age_column
age_column.min()
30
# Calculate the maximum age in age_column
age_column.max()
75
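These summary methods work on any series, not only on columns of this dataframe. The sketch below uses made-up values and also shows the describe method, which bundles several summary statistics into a single series.

```python
import pandas as pd

ages = pd.Series([30, 40, 50, 60])  # made-up values

print(ages.mean())  # 45.0
print(ages.min())   # 30
print(ages.max())   # 60

# The describe method returns the count, mean, standard deviation,
# minimum, quartiles, and maximum in one series
summary = ages.describe()
print(summary['mean'])  # 45.0
```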
Individual rows (subjects) can be specified by making use of the iloc
attribute (or property, which is the term used by pandas) for a dataframe object. The iloc
property stands for integer location. Therefore, integers are used to specify the row and column numbers. The first row of data, at index $0$, is extracted below.
# Display the first row of data in df using the iloc method
df.iloc[0] # Note the use of square brackets
Name Dylan Patton DOB 1981-10-07 Age 43 Vocation Energy manager Smoke 0 HR 47 sBP 145 CholesterolBefore 1.2 TAG 1.2 Survey 1 CholesterolAfter 0.7 Delta 0.5 Group Active Name: 0, dtype: object
A list of values can be passed to select multiple rows. Rows $2$, $5$, and $7$ are selected below. Note that these rows correspond to index values $1$, $4$, and $6$.
# Select rows 2, 5, and 7 using the iloc method
df.iloc[[1, 4, 6]]
Name | DOB | Age | Vocation | Smoke | HR | sBP | CholesterolBefore | TAG | Survey | CholesterolAfter | Delta | Group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Sandra Howard | 1993-01-27 | 53 | Tax adviser | 0 | 51 | 115 | 1.2 | 0.6 | 3 | 1.0 | 0.2 | Active |
4 | Robert Wilson | 1964-06-23 | 46 | Clinical embryologist | 0 | 61 | 138 | 2.8 | 2.1 | 5 | 2.8 | 0.0 | Active |
6 | Frank Zimmerman | 1981-03-04 | 54 | Police officer | 0 | 60 | 129 | 2.9 | 2.4 | 1 | 2.6 | 0.3 | Active |
A range object can be used to select a contiguous number of rows.
# Select the first two rows using the iloc method
df.iloc[0:2]
Name | DOB | Age | Vocation | Smoke | HR | sBP | CholesterolBefore | TAG | Survey | CholesterolAfter | Delta | Group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Dylan Patton | 1981-10-07 | 43 | Energy manager | 0 | 47 | 145 | 1.2 | 1.2 | 1 | 0.7 | 0.5 | Active |
1 | Sandra Howard | 1993-01-27 | 53 | Tax adviser | 0 | 51 | 115 | 1.2 | 0.6 | 3 | 1.0 | 0.2 | Active |
As with Python list objects, shorthand notation can be used for contiguous rows. The first two rows are selected below.
# Select the first two rows using the iloc method
df.iloc[:2]
Name | DOB | Age | Vocation | Smoke | HR | sBP | CholesterolBefore | TAG | Survey | CholesterolAfter | Delta | Group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Dylan Patton | 1981-10-07 | 43 | Energy manager | 0 | 47 | 145 | 1.2 | 1.2 | 1 | 0.7 | 0.5 | Active |
1 | Sandra Howard | 1993-01-27 | 53 | Tax adviser | 0 | 51 | 115 | 1.2 | 0.6 | 3 | 1.0 | 0.2 | Active |
Negative index notation similarly follows the Python list convention. The last three rows are selected below.
# Select the last three rows
df.iloc[-3:]
Name | DOB | Age | Vocation | Smoke | HR | sBP | CholesterolBefore | TAG | Survey | CholesterolAfter | Delta | Group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
197 | Charles Smith | 1959-01-30 | 61 | Chartered certified accountant | 0 | 99 | 212 | 10.1 | 5.6 | 4 | 9.7 | 0.4 | Control |
198 | Barry Porter | 1979-05-30 | 65 | Dancer | 1 | 98 | 200 | 10.1 | 5.3 | 3 | 10.0 | 0.1 | Control |
199 | Julie Barrett | 1972-07-27 | 66 | Theme park manager | 1 | 102 | 208 | 11.1 | 5.7 | 2 | 10.7 | 0.4 | Active |
Columns can also be indexed. The iloc
property can take two arguments in the form row,column
. The code below extracts the first five rows and columns $1$ and $3$. Note that the index values ($0$ and $2$) have to be used.
# Extract the first 5 rows for the first and third columns using the iloc method
df.iloc[:5, [0, 2]]
Name | Age | |
---|---|---|
0 | Dylan Patton | 43 |
1 | Sandra Howard | 53 |
2 | Samantha Williams | 33 |
3 | Ashley Hensley | 43 |
4 | Robert Wilson | 46 |
The loc
property can be used in a similar fashion. The column names (as a list or a slice) are specified. If the index values were not integers, but strings, those names could also be used. Remember that the row and column names are referred to as labels. Below, the same labels are extracted as above. Note, though, that the row range is now set to 0:4
, since, unlike iloc, slicing with loc includes the end label
. When extracting rows and columns, ALWAYS use the row, column
notation. Since two columns are required, they are passed as a Python list object (in square brackets) after the comma. Each column name is passed as a string.
# Extract the first five rows for the Name and Age columns using the loc method
df.loc[:4, ['Name', 'Age']] # Note the difference in row index compared to the iloc method
Name | Age | |
---|---|---|
0 | Dylan Patton | 43 |
1 | Sandra Howard | 53 |
2 | Samantha Williams | 33 |
3 | Ashley Hensley | 43 |
4 | Robert Wilson | 46 |
Task
Print the TAG
and the Smoke
variables to the screen, but only for observations (rows) $10$ through $15$.
Solution
df.loc[10:15, ['TAG', 'Smoke']]
TAG | Smoke | |
---|---|---|
10 | 1.7 | 0 |
11 | 2.0 | 0 |
12 | 2.3 | 0 |
13 | 2.1 | 0 |
14 | 2.4 | 0 |
15 | 2.6 | 2 |
The iat
indexing extracts a single cell by using its row and column index.
# Extract the value in row 2, column 1 using the iat method
df.iat[1, 0]
'Sandra Howard'
Filtering data is one of the most useful tasks in data science. This section explores filtering by extracting numpy array objects based on criteria (conditions). Conditional operators include >
, <
, <=
, >=
, ==
, and !=
. These operators can be used to compare values in a numpy array to a single value or to another numpy array.
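Applying a conditional operator to a whole series returns a series of True and False values, called a Boolean mask. The mask drives all the filtering in this section. A minimal sketch with made-up heart rate values:

```python
import pandas as pd

hr = pd.Series([47, 51, 80, 95])  # made-up heart rates

mask = hr > 72  # element-wise comparison
print(mask.tolist())  # [False, False, True, True]

# The mask can then be used to index the series itself
print(hr[mask].tolist())  # [80, 95]
```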
Remember that the sample space of a variable is the set of all possible values that the variable can take. This is particularly useful when looking at categorical variables. The unique
method is used to find all the sample space elements in a column. Below, the unique values in the Smoke
column are returned.
# Print the unique values in the Smoke column
df.Smoke.unique()
array([0, 2, 1])
There are three unique elements in the sample space of this column. This method is great for uncovering surprises that might be hidden in a dataframe, such as one or more strings in a numerical data column. A common example would be an Age
column that has one or two strings such as thirty-two
in it, instead of the number $32$. Strings in a numerical data column will prevent calculations on that column and such errors in the data must be corrected.
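One way to locate such rogue strings is the pandas to_numeric function with its errors='coerce' argument, which turns anything that cannot be parsed as a number into a missing value. The sketch below uses made-up values:

```python
import pandas as pd

# A made-up Age column with a rogue string entry
age = pd.Series([32, 'thirty-two', 45, 51])

# Non-numeric entries become NaN, so their positions can be found
numeric = pd.to_numeric(age, errors='coerce')
bad_rows = age[numeric.isna()]
print(bad_rows.tolist())  # ['thirty-two']
```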
The Smoke
column contains information about the smoking habits of the respondents in the data set. The sample space contains three integers: $0$ for not smoking, $1$ for smoking, and $2$ for previous smoking.
An array containing the ages of only the patients who do not smoke is created below. To do this, indexing is used directly. A conditional is used to include only subjects with a $0$ value (df.Smoke == 0
). The column is then referenced, which is Age
, followed by the to_numpy
method.
# Select the ages of those who do not smoke and assign it to the variable age_nonsmokers (as a numpy array)
age_nonsmokers = df.loc[df.Smoke == 0]['Age'].to_numpy()
age_nonsmokers # Print the numpy array to the screen
array([43, 53, 33, 43, 46, 48, 54, 58, 44, 31, 45, 35, 49, 56, 57, 35, 50, 49, 63, 45, 51, 40, 47, 41, 47, 38, 54, 30, 46, 64, 40, 45, 65, 55, 53, 54, 72, 32, 38, 59, 53, 42, 38, 51, 37, 36, 48, 49, 62, 39, 74, 42, 72, 61, 33, 30, 44, 71, 49, 75, 43, 55, 38, 36, 46, 60, 57, 69, 56, 66, 60, 42, 32, 31, 56, 35, 63, 54, 68, 72, 40, 54, 62, 74, 62, 41, 61, 61])
The code above may seem a bit difficult at first. It does read rather like an English language sentence, though. Take the dataframe object. Filter the rows in column Smoke
that are $0$. For all of these rows return the Age
values as a numpy array.
As an alternative, the loc
indexing can be used, passing a row and a column specification as arguments. The row interrogates the Smoke
column and filters only those with a $0$ entry. The column is then specified to be the Age
column.
# Use the loc method to select the ages of those who do not smoke and assign it to the variable age_nonsmokers (as a numpy array)
age_nonsmokers = df.loc[df.Smoke == 0, 'Age'].to_numpy()
age_nonsmokers
array([43, 53, 33, 43, 46, 48, 54, 58, 44, 31, 45, 35, 49, 56, 57, 35, 50, 49, 63, 45, 51, 40, 47, 41, 47, 38, 54, 30, 46, 64, 40, 45, 65, 55, 53, 54, 72, 32, 38, 59, 53, 42, 38, 51, 37, 36, 48, 49, 62, 39, 74, 42, 72, 61, 33, 30, 44, 71, 49, 75, 43, 55, 38, 36, 46, 60, 57, 69, 56, 66, 60, 42, 32, 31, 56, 35, 63, 54, 68, 72, 40, 54, 62, 74, 62, 41, 61, 61])
The different ways to interact with pandas add to its power. Without the to_numpy
method, the code above returns a pandas series object. Every data scientist has their own preferences and certain tasks require specific approaches.
# Use the loc method to select the ages of those who do not smoke and assign it to the variable age_nonsmokers
age_nonsmokers = df.loc[df.Smoke == 0, 'Age']
age_nonsmokers
0 43 1 53 2 33 3 43 4 46 .. 180 74 183 62 188 41 193 61 197 61 Name: Age, Length: 88, dtype: int64
Statistical analysis can be done on either the series or numpy objects.
# Calculate the average age of those who do not smoke
age_nonsmokers.mean()
50.09090909090909
Task
The to_list
method creates a Python list object. Use the to_list
method to create a Python list object of the Age
values for those that have a value in the HR
(heart rate) column that is more than $72$. Print only the first $10$ values.
Solution
df.loc[df.HR > 72, 'Age'].to_list()[:10]
[57, 52, 49, 70, 45, 53, 50, 33, 30, 44]
Filtering by two criteria (two columns), Age
and Survey
is shown below. The filtering can combine criteria with either and or or. With and, all the criteria are required to be met; with or, only one of the criteria need be met (return a True
value).
The symbol for and is &
and for or is |
. Below, &
is used since both criteria must be met. Each filter is created in a set of parentheses. The code uses the row, column
notation.
# Select the non-smokers who choice 3 in the Survey column and return their ages (assigned to the variable non_smokers_choice_3)
non_smokers_choice_3 = df.loc[(df.Smoke == 0) & (df.Survey == 3), 'Age']
non_smokers_choice_3
1 53 2 33 11 35 14 57 27 47 32 54 38 45 45 54 50 53 112 36 130 68 183 62 Name: Age, dtype: int64
In English the code reads: Take the df
dataframe object and look down the rows of the Smoke
and Survey
columns. Return only the rows where Smoke
is $0$ AND Survey
is $3$. Then return the Age
column values for all these rows fulfilling both criteria.
Now filtering by or is shown. The symbol for or is |
. Below, |
is used since only one of the criteria must be met. Each filter is created in a set of parentheses. The code uses the row, column
notation.
It may be useful to create a variable that holds the criteria first.
# Filter by Smoke == 0 or Survey > 3 and assign the filter to the variable crit
crit = (df.Smoke == 0) | (df.Survey > 3)
The crit
variable can now be used in the filtering.
# Use the crit criterion and assign to the variable nonsmoker_or_survey_gt_3
nonsmoker_or_survey_gt_3 = df.loc[crit, 'Age']
nonsmoker_or_survey_gt_3[:5] # Print the first 5 rows
0 43 1 53 2 33 3 43 4 46 Name: Age, dtype: int64
Task
Create a variable named selection that selects all the observations who never smoked, have a heart rate, HR
, less than $80$, and a systolic blood pressure, sBP
, of less than $120$. Use this selection to create a numpy $n$-dimensional array of the Age
and triacylglycerol, TAG
, variable values.
Solution
selection = (df.Smoke == 0) & (df.HR < 80) & (df.sBP < 120)
df.loc[selection, ['Age', 'TAG']].to_numpy() # Note the parentheses needed to call the method
array([[53. , 0.6], [43. , 1.6], [44. , 2.4], [35. , 2. ], [45. , 2.1], [55. , 3.1]])
Instead of just an array of values, a new dataframe object can also be created using filtering. (Because it is part of an existing dataframe object, some data scientists refer to it as a sub-dataframe object.) It includes all the columns (variables), but only for patients up to and including $49$ years of age. This is very simple to achieve.
# Create a new dataframe object that only contains participants younger than 50 and assign it to the variable df_younger_than_50
df_younger_than_50 = df.loc[df.Age < 50]
df_younger_than_50[:5] # Display the first five rows
Name | DOB | Age | Vocation | Smoke | HR | sBP | CholesterolBefore | TAG | Survey | CholesterolAfter | Delta | Group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Dylan Patton | 1981-10-07 | 43 | Energy manager | 0 | 47 | 145 | 1.2 | 1.2 | 1 | 0.7 | 0.5 | Active |
2 | Samantha Williams | 1973-12-21 | 33 | IT consultant | 0 | 54 | 120 | 2.0 | 1.3 | 3 | 1.7 | 0.3 | Active |
3 | Ashley Hensley | 1981-12-01 | 43 | Nurse, children's | 0 | 54 | 103 | 2.1 | 1.6 | 4 | 2.1 | 0.0 | Active |
4 | Robert Wilson | 1964-06-23 | 46 | Clinical embryologist | 0 | 61 | 138 | 2.8 | 2.1 | 5 | 2.8 | 0.0 | Active |
5 | Leslie Diaz | 1994-08-25 | 48 | Politician's assistant | 0 | 59 | 122 | 2.8 | 1.4 | 4 | 2.6 | 0.2 | Active |
Since this new dataframe object contains only data filtered by a criterion, it is easier to calculate any required statistics.
# Mean HR
df_younger_than_50.HR.mean()
69.45348837209302
By making use of the isin
method, certain values in the sample space of a variable can be specified. The code below filters for IT consultant
, Energy manager
, and Clinical embryologist
in the Vocation
column.
To achieve this, a Python list object with the sample space elements is created first.
# Create a Python list object with the string element 'IT consultant', 'Energy manager', 'Clinical embryologist' and assign it to the variable jobs
jobs = ['IT consultant', 'Energy manager', 'Clinical embryologist']
A criterion is created using the jobs
list object.
# Create a criterion
crit = df.Vocation.isin(jobs)
The criterion is used to filter the dataframe object.
# Filter for job in the Vocation column and assign it to the variable df_jobs
df_jobs = df.loc[crit]
df_jobs
Name | DOB | Age | Vocation | Smoke | HR | sBP | CholesterolBefore | TAG | Survey | CholesterolAfter | Delta | Group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Dylan Patton | 1981-10-07 | 43 | Energy manager | 0 | 47 | 145 | 1.2 | 1.2 | 1 | 0.7 | 0.5 | Active |
2 | Samantha Williams | 1973-12-21 | 33 | IT consultant | 0 | 54 | 120 | 2.0 | 1.3 | 3 | 1.7 | 0.3 | Active |
4 | Robert Wilson | 1964-06-23 | 46 | Clinical embryologist | 0 | 61 | 138 | 2.8 | 2.1 | 5 | 2.8 | 0.0 | Active |
188 | Joan Chavez | 1999-10-07 | 41 | Energy manager | 0 | 93 | 182 | 9.1 | 5.0 | 2 | 8.8 | 0.3 | Control |
The str.contains
method can be used to filter for a string in a column. The code below filters for all the participants who have manager
in their job title.
# Build a criterion with the str.contains method for the word manager
crit = df.Vocation.str.contains('manager', na=False) # Flag for missing values (see later)
# Create a new dataframe object with the criterium and assign it to the variable df_manager
df_manager = df.loc[crit]
df_manager[:5]
Name | DOB | Age | Vocation | Smoke | HR | sBP | CholesterolBefore | TAG | Survey | CholesterolAfter | Delta | Group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Dylan Patton | 1981-10-07 | 43 | Energy manager | 0 | 47 | 145 | 1.2 | 1.2 | 1 | 0.7 | 0.5 | Active |
34 | Mr. Tyler Strickland DDS | 1940-08-27 | 46 | Tourist information centre manager | 0 | 62 | 136 | 4.1 | 2.3 | 2 | 3.8 | 0.3 | Control |
54 | Stephanie Jacobs | 1977-06-18 | 38 | Estate manager/land agent | 0 | 69 | 139 | 4.4 | 2.7 | 2 | 4.0 | 0.4 | Active |
55 | Juan Johnson | 1956-12-09 | 51 | Logistics and distribution manager | 0 | 65 | 141 | 4.5 | 2.9 | 5 | 4.0 | 0.5 | Control |
71 | Tina Martinez | 1941-05-31 | 74 | Passenger transport manager | 1 | 69 | 146 | 4.8 | 3.1 | 4 | 4.5 | 0.3 | Active |
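The str.contains method also accepts a case argument; setting case=False makes the match case-insensitive, so that 'manager' and 'Manager' are both found. A minimal sketch with made-up job titles:

```python
import pandas as pd

# Made-up job titles
vocation = pd.Series(['Energy manager', 'Nurse', 'Theme park Manager'])

# case=False matches 'manager' regardless of capitalization
crit = vocation.str.contains('manager', case=False, na=False)
print(vocation[crit].tolist())  # ['Energy manager', 'Theme park Manager']
```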
Another valuable skill is to be able to change the actual data in a dataframe object. Fortunately, dataframe objects can be manipulated in many ways.
The names of individual columns can be replaced with the rename
method using a dictionary, where the keys are the original names and the values are the new names. Below, the Name
column is renamed to Patient
. To make the changes permanent, the inplace
argument value is set to True
.
# Replace the Name column with the Patient
df.rename(columns={'Name': 'Patient'}, inplace=True)
df.columns # Call the column names
Index(['Patient', 'DOB', 'Age', 'Vocation', 'Smoke', 'HR', 'sBP', 'CholesterolBefore', 'TAG', 'Survey', 'CholesterolAfter', 'Delta', 'Group'], dtype='object')
The drop
method is used to delete a column. Either the axis
argument is set to $1$ to indicate that a column (rather than a row) is to be deleted, or the columns
argument is used directly, as in the code below. The inplace
argument value is set to True
to make the changes permanent.
The Delta
column is deleted from the dataframe object.
# Delete the Delta column
df.drop(columns='Delta', inplace=True)
In specific types of research, personal data are obfuscated to protect the privacy of the people in the dataset. In a simple case, $2$ is added to the age of every participant when capturing the data. In reality, they are all $2$ years younger.
# Add 2 to each value in the Age column
df.Age = df.Age + 2 # Note that the Age column is reassigned
df.Age[:5] # Display the first five rows
0 45 1 55 2 35 3 45 4 48 Name: Age, dtype: int64
There are more than one way to achieve a similar goal. One way is to create a function and then use the apply
method to apply the function to the values in a column.
User-defined functions are created using the def
and return
keywords. The former tells Python to create a new function. After a space follows the name given to the function, and then a set of parentheses that contains a placeholder for the argument (here called a parameter). In its simplest form, the return
keyword follows. As the name indicates, it returns the value of the function. Below, it is clear from the code that $x$ is the parameter and will take an argument value. The function then adds $2$ to the argument value.
# Create a user-defined function named add2 that adds 2 to a value
def add2(x):
return x + 2
The apply
method is now used and the add2
function is used. The value in each row is now increased by $2$.
# Apply the add2 function to the Age column
df.Age = df.Age.apply(add2)
df.Age[:5]
0 47 1 57 2 37 3 47 4 50 Name: Age, dtype: int64
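For a short, one-off transformation such as this, an anonymous (lambda) function can be passed to the apply method instead of a named function. A minimal sketch with made-up values:

```python
import pandas as pd

age = pd.Series([45, 55, 35])  # made-up values

# A lambda expression avoids defining a separate named function
age_plus_2 = age.apply(lambda x: x + 2)
print(age_plus_2.tolist())  # [47, 57, 37]
```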
Task
Create a new variable named AgeHR
that is the product of the values in the variables Age
and HR
.
Solution
df['AgeHR'] = df.Age * df.HR
Dictionaries can be used to change the values in a column. The keys are the original values and the values are the new values. The replace
method is used to change the values.
The code below returns the sample space (unique) elements in the Group
column.
# Return the unique values in the Group column
df.Group.unique()
array(['Active', 'Control'], dtype=object)
The values are Active
and Control
. These are replaced with 1
and 0
respectively.
# Use the replace method to replace the values in the Group column with 1 for Active and 0 for Control
df.Group = df.Group.replace({'Active': 1, 'Control': 0}) # Reassignment avoids the deprecated chained inplace pattern
The first five observations in the Group
column are printed to the screen to confirm the changes.
# Print the first 5 rows of the Group column
df.Group[:5]
0 1 1 1 2 1 3 1 4 1 Name: Group, dtype: int64
Adding columns is an often-used technique. It is as simple as stating the new name in square brackets as a string and then assigning a list of values. The list of values that are added must be the same length (number of rows) as the dataframe.
The code below creates two new columns, FirstName
and LastName
. The split
method is used to split the values in the Patient
column. The expand
argument is set to True
to create two new columns. The str
accessor indicates that the split
method is to be applied to the string values in the column, splitting on the space character passed as the argument.
# Split the values in the Patient column on the empty space between the first and last names and assign it to the variable split_participant
split_participant = df.Patient.str.split(' ', expand=True)
Indexing is used to extract the two lists of values.
# Create new columns in df named FirstName and LastName
df['FirstName'] = split_participant[0]
df['LastName'] = split_participant[1]
The first five rows of the FirstName
, LastName
, and the original Patient
columns are printed to the screen.
# Print first 5 rows of FirstName and LastName and Patient
df[['FirstName', 'LastName', 'Patient']][:5]
FirstName | LastName | Patient | |
---|---|---|---|
0 | Dylan | Patton | Dylan Patton |
1 | Sandra | Howard | Sandra Howard |
2 | Samantha | Williams | Samantha Williams |
3 | Ashley | Hensley | Ashley Hensley |
4 | Robert | Wilson | Robert Wilson |
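The split-and-expand pattern can be seen in isolation on a tiny made-up series of names:

```python
import pandas as pd

names = pd.Series(['Dylan Patton', 'Sandra Howard'])  # made-up names

# expand=True returns a dataframe with one column per split part
parts = names.str.split(' ', expand=True)
print(parts[0].tolist())  # ['Dylan', 'Sandra']
print(parts[1].tolist())  # ['Patton', 'Howard']
```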
Intervals, called bins, can be created to change a continuous numerical variable into a categorical variable. The pandas cut
function is used to create the bins. The bins
argument is used to specify the intervals. The labels
argument is used to specify the labels for each bin. The right
argument is used to specify whether the right or left side of the interval is closed. The include_lowest
argument is used to specify whether the lowest value is included in the first bin.
The code below creates three bins for the CholesterolBefore
column. The minimum and maximum values are calculated first.
# Minimum CholesterolBefore value
df.CholesterolBefore.min()
1.2
# Maximum CholesterolBefore value
df.CholesterolBefore.max()
11.1
Three bins are created with four values used in the cut
function to create the bins $\text{low}=[0,5)$, $\text{normal}=[5,10)$, and $\text{high}=[10,20)$. The intervals safely contain the range of values.
# Create a new column called CholesterolBeforeLevel using the cut function with bins=[0,5,10,20], which is right False and has the labels ['low', 'normal', 'high']
df['CholesterolBeforeLevel'] = pd.cut(df.CholesterolBefore, bins=[0, 5, 10, 20], right=False, labels=['low', 'normal', 'high'])
df.CholesterolBeforeLevel[:5]
0 low 1 low 2 low 3 low 4 low Name: CholesterolBeforeLevel, dtype: category Categories (3, object): ['low' < 'normal' < 'high']
Task
Create a new variable in the dataframe called HRLevel
from the HR
column with intervals $(0,65], (65,78], (78,1000)$ with labels low
, normal
, and high
respectively.
Solution
df['HRLevel'] = pd.cut(df.HR, bins = [0, 65, 78, 1000], right=True, labels=['low', 'normal', 'high'])
df.HRLevel[:5]
0 low 1 low 2 low 3 low 4 low Name: HRLevel, dtype: category Categories (3, object): ['low' < 'normal' < 'high']
Datasets very often contain missing data. The numpy library has a specific entity called a nan
value. This stands for not a number. The nan
value is used to indicate that a value is missing.
# Print the numpy nan value
np.nan
nan
A list object or an array can contain a nan
value. Computation is not possible since the missing value is unknown.
# Create a list with three integers and a nan value and assign it to the variable my_list
my_list = [1, 2, 3, np.nan]
my_list
[1, 2, 3, nan]
# Calculate the mean of my_list
np.mean(my_list)
nan
The result is nan
since it is impossible to know what the mean of the four values is.
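When the missing value should simply be ignored, the numpy nanmean function computes the mean over the values that are present:

```python
import numpy as np

my_list = [1, 2, 3, np.nan]

print(np.mean(my_list))     # nan
print(np.nanmean(my_list))  # 2.0 (the mean of 1, 2, and 3)
```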
The MissingData.csv
file contains missing data.
# Read the MissingData.csv file and assign it to the variable df_missing
df_missing = pd.read_csv('https://raw.githubusercontent.com/juanklopper/TutorialData/main/MissingData.csv')
Printing the first $10$ rows to the screen shows some of the missing values.
# Print the first 10 rows of df_missing
df_missing[:10]
| | age | salary | previouscompany |
|---|---|---|---|
| 0 | 57.0 | NaN | 1.0 |
| 1 | 56.0 | 50927.0 | NaN |
| 2 | 46.0 | 75500.0 | 3.0 |
| 3 | NaN | 84417.0 | NaN |
| 4 | 60.0 | 63002.0 | 1.0 |
| 5 | 54.0 | 54652.0 | NaN |
| 6 | NaN | 65739.0 | 1.0 |
| 7 | 64.0 | 89397.0 | 3.0 |
| 8 | 60.0 | 77797.0 | 4.0 |
| 9 | 61.0 | NaN | 1.0 |
The `dropna` method deletes all rows that contain missing data in any of the columns. All the data for such a row (observation) are lost. The removal is not permanent. Below, the dataframe with the dropped rows is assigned to a new variable, leaving the original intact.
# Delete rows with missing values and assign to a new variable complete_data_df
complete_data_df = df_missing.dropna()
The `axis` argument is set to $0$ to indicate that rows are to be dropped. This is the default. If the value is set to $1$, then any column with a missing value is dropped instead.
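A small, made-up dataframe (not part of the tutorial data) illustrates the difference between the two `axis` values:

```python
import pandas as pd
import numpy as np

# Hypothetical dataframe with one missing value in column a
demo = pd.DataFrame({'a': [1, np.nan, 3], 'b': [4, 5, 6]})

# axis=0 (the default) drops the row containing the missing value
print(demo.dropna(axis=0).shape)  # (2, 2)

# axis=1 drops the whole column containing the missing value
print(demo.dropna(axis=1).shape)  # (3, 1)
```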
The `isna` method can be used to return `True` and `False` values along a column. This is done for the `age` column below, where `True` indicates that a value is missing.
# Use .isna() for the age column
df_missing.age.isna()
0     False
1     False
2     False
3      True
4     False
5     False
6      True
7     False
8     False
9     False
10    False
11     True
12    False
13    False
14    False
15     True
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
Name: age, dtype: bool
Since `True` is stored as $1$, the `sum` method can be used to count all the rows (observations) with missing data.
# Sum over all the missing data in the age column
df_missing.age.isna().sum()
4
There are $4$ missing values in the `age` column.
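Calling `isna().sum()` on a whole dataframe, rather than on a single column, returns the count of missing values per column in one step. The frame below is a small hypothetical stand-in for `df_missing`:

```python
import pandas as pd
import numpy as np

# Small hypothetical dataframe standing in for df_missing
demo = pd.DataFrame({
    'age': [57, np.nan, 46, np.nan],
    'salary': [np.nan, 50927, 75500, 84417]
})

# One count of missing values per column
print(demo.isna().sum())
```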
The process of creating values to fill in missing data is called data imputation and is a separate and complicated subject. The pandas library provides a `fillna` method for filling in the missing data with simple calculations.
The `fillna` method takes a `method` argument that can be set to `'ffill'` or `'bfill'`. As the names imply, the value in the previous or subsequent row is used to fill in the missing value. Note that this argument is deprecated in newer versions of pandas, which provide the dedicated `ffill` and `bfill` methods instead.
# Use forward fill to impute the missing values in the age column
df_missing.age.ffill()[:5]
0    57.0
1    56.0
2    46.0
3    46.0
4    60.0
Name: age, dtype: float64
The missing value for index $3$ has been replaced by the $46.0$ value at index $2$.
Calculated values can also be used to fill in missing data. Below, the median of the `age` column replaces the missing values. Note that the `median` method of a pandas series object ignores the missing data, unlike a Python list object or a numpy array.
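This skipping behaviour can be confirmed directly on a small, illustrative series. Pandas aggregation methods ignore `nan` by default (`skipna=True`), and passing `skipna=False` reproduces the propagating behaviour seen with the plain numpy mean earlier:

```python
import pandas as pd
import numpy as np

# Illustrative series with one missing value
s = pd.Series([1.0, 2.0, np.nan, 3.0])

# The pandas method skips the missing value: (1 + 2 + 3) / 3
print(s.mean())              # 2.0

# Passing skipna=False propagates the nan instead
print(s.mean(skipna=False))  # nan
```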
# Fill the missing age values with the median age
df_missing.age.fillna(df_missing.age.median())[:5]
0    57.0
1    56.0
2    46.0
3    57.0
4    60.0
Name: age, dtype: float64
It is common to use default values when data is not available at the time of capture. The `DefaultMissingData.csv` file uses the values $999$, `Nil`, and `Missing` to indicate missing data.
# Import the DefaultMissingData.csv file and assign it to the variable default_missing_df
default_missing_df = pd.read_csv('https://raw.githubusercontent.com/juanklopper/TutorialData/main/DefaultMissingData.csv')
default_missing_df[:5]
| | age | salary | previous_company |
|---|---|---|---|
| 0 | 57 | Nil | 1 |
| 1 | 56 | 50927 | Missing |
| 2 | 46 | 75500 | 3 |
| 3 | 999 | 84417 | Missing |
| 4 | 60 | 63002 | 1 |
The `na_values` argument can take a Python list object that contains all the default values that indicate missing data.
# Reimport the csv file and use the na_values argument to specify the missing values 999, Nil, and Missing
default_missing_df = pd.read_csv('https://raw.githubusercontent.com/juanklopper/TutorialData/main/DefaultMissingData.csv', na_values=[999, 'Nil', 'Missing'])
default_missing_df[:5]
| | age | salary | previous_company |
|---|---|---|---|
| 0 | 57.0 | NaN | 1.0 |
| 1 | 56.0 | 50927.0 | NaN |
| 2 | 46.0 | 75500.0 | 3.0 |
| 3 | NaN | 84417.0 | NaN |
| 4 | 60.0 | 63002.0 | 1.0 |
The missing values are now correctly indicated as `nan` values.
1. How do you import the pandas library in Python?
2. How do you read a CSV file named `data.csv` into a pandas DataFrame?
3. How do you display the first $5$ rows of a DataFrame named `df`?
4. How do you find the number of rows and columns in a DataFrame `df`?
5. How do you print the names of all the columns in a DataFrame `df`?
6. How do you get the summary information of the DataFrame `df`, including data types and non-null values?
7. How do you select only the rows in DataFrame `df` where the column `A` is greater than $5$?
8. How do you select the columns `A` and `B` from the DataFrame `df`?
9. How do you rename the column `A` to `Alpha` in DataFrame `df`?
10. How do you replace all occurrences of the value $999$ in DataFrame `df` with `nan`?
11. How do you drop all rows in DataFrame `df` that contain missing values?
12. How do you fill all missing values in DataFrame `df` with $0$?
13. How do you calculate the mean of column `A` in DataFrame `df`?
14. How do you create a new column `C` in DataFrame `df` that is the sum of columns `A` and `B`?
15. How do you save the DataFrame `df` to a CSV file named `output.csv`? (Find the solution by searching the pandas documentation.)