This chapter of an Introduction to Health Data Science by Dr JH Klopper is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International
Data science, by its name and nature, requires access to data. Images, sound files, text, and much more, can be represented as data. This lecture explores the use of Python for tabular data.
Tabular data exists in rows and columns, whether extracted from an image, a database, or a similar structure, and is represented as an array. An array is a set of values arranged in rows and columns (and along even more dimensions, as in the case of color images, where arrays of rows and columns are stacked to represent the red, green, and blue channels). There is a fantastic Python package for importing such tabular data.
The pandas package has much to do with the success of Python as a programming language for statistical analysis and data science. It is an enormous package and is used to import data, to manipulate data, to do calculations with data, and even to create graphs and plots with the data.
This notebook provides a glimpse into the usefulness of the pandas package by importing data captured in a spreadsheet file and extracting information from it.
It is useful to import all packages at the start of a notebook. This allows for keeping track of what extensions to the Python language are used in the notebook. Namespace abbreviations are used to make it easier to refer to the packages later in the notebook.
import pandas as pd # Package to work with data
import numpy as np # Numerical analysis package
The data used in this notebook is a spreadsheet file in comma separated values (CSV) format called data.csv
. The file is imported using the pandas read_csv
function. Since it is not a built-in Python function, the package in which the function is hosted has to be referenced (depending on the way the package was imported). This is done by preceding the function with the pandas namespace abbreviation, pd
.
The data.csv
file is in the same folder (directory) as the notebook. If it was in a different folder, the path to the file would have to be specified.
# Import the data.csv file and assign it to the variable df
df = pd.read_csv('data.csv')
The file is also available on GitHub as PUBH1142Chap0801.csv
and is imported below.
# Import the file https://raw.githubusercontent.com/juanklopper/TutorialData/main/PUBH1142Chap0801.csv from GitHub and assign it to the variable df
df = pd.read_csv('https://raw.githubusercontent.com/juanklopper/TutorialData/main/PUBH1142Chap0801.csv')
The type
function used below shows that the object assigned to the df
computer variable is a dataframe object.
# Type of df
type(df)
pandas.core.frame.DataFrame
The methods and attributes of the df
object can be printed using the dir
function. The methods are the functions that can be applied to the object. The attributes are the properties of the object.
# Print the first 20 methods and attributes of df using the dir function
dir(df)[:20]
['Age', 'CholesterolAfter', 'CholesterolBefore', 'DOB', 'Delta', 'Group', 'HR', 'Name', 'Smoke', 'Survey', 'T', 'TAG', 'Vocation', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__']
There are many methods and attributes. The first few listed are the column headers (the statistical variable names) in the spreadsheet file.
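As a brief illustration (using made-up values, not the data.csv file), a dataframe object can also be created directly from a Python dictionary. This makes it clear why the column headers appear among the attributes: each dictionary key becomes a column label.

```python
import pandas as pd

# A dictionary of column labels (keys) and values (lists) - made-up data
toy = pd.DataFrame({
    'Age': [43, 53, 33],
    'Smoke': [0, 0, 1]
})

# Each column label is now an attribute of the dataframe object
print(toy.Age.tolist())  # [43, 53, 33]
print('Smoke' in dir(toy))  # True
```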
Task
Read about methods and attributes of a pandas dataframe object.
One such method is the head
method. By default it returns the first five rows of a dataframe object. An integer value can be passed as argument for a different number of rows.
# Use the head method to print the first 5 rows of df
df.head()
Name | DOB | Age | Vocation | Smoke | HR | sBP | CholesterolBefore | TAG | Survey | CholesterolAfter | Delta | Group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Dylan Patton | 1981-10-07 | 43 | Energy manager | 0 | 47 | 145 | 1.2 | 1.2 | 1 | 0.7 | 0.5 | Active |
1 | Sandra Howard | 1993-01-27 | 53 | Tax adviser | 0 | 51 | 115 | 1.2 | 0.6 | 3 | 1.0 | 0.2 | Active |
2 | Samantha Williams | 1973-12-21 | 33 | IT consultant | 0 | 54 | 120 | 2.0 | 1.3 | 3 | 1.7 | 0.3 | Active |
3 | Ashley Hensley | 1981-12-01 | 43 | Nurse, children's | 0 | 54 | 103 | 2.1 | 1.6 | 4 | 2.1 | 0.0 | Active |
4 | Robert Wilson | 1964-06-23 | 46 | Clinical embryologist | 0 | 61 | 138 | 2.8 | 2.1 | 5 | 2.8 | 0.0 | Active |
The shape
attribute shows us the number of rows and columns, returned as a tuple. Note that unlike a method (which is like a Python function), an attribute takes no parentheses.
# Print the shape of df
df.shape
(200, 13)
There are $200$ observations (rows) and $13$ statistical variables (columns) in this tidy data set.
The columns
attribute lists all the column header names, called labels.
# Print the columns of df
df.columns
Index(['Name', 'DOB', 'Age', 'Vocation', 'Smoke', 'HR', 'sBP', 'CholesterolBefore', 'TAG', 'Survey', 'CholesterolAfter', 'Delta', 'Group'], dtype='object')
The majority of dataframe objects will have two axes (rows and columns). We can verify this using the ndim
attribute.
# Print the dimensions of df
df.ndim
2
The size
attribute gives us the total number of data point values (the product of the number of rows and columns).
# Print the size of df
df.size
2600
The last attribute discussed here is the dtype
attribute. It returns the Python data type of the values in each of the columns. This is a very important step. Pandas does its best to interpret the data type. Depending on how the spreadsheet was created and how data was entered, it is not always possible to correctly interpret the type. In this case, the data type may have to be changed manually. Remember that the analysis of data is based on the data type of the variable.
# Print the type of df
df.dtypes
Name object DOB object Age int64 Vocation object Smoke int64 HR int64 sBP int64 CholesterolBefore float64 TAG float64 Survey int64 CholesterolAfter float64 Delta float64 Group object dtype: object
Categorical variables are denoted as an object
type. Numerical variables can be either integer or floating point numbers (numbers with decimal places). These are int64
and float64
(denoting 64-bit precision) respectively.
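When pandas does misinterpret a type, the astype method can be used to convert a column. The sketch below uses made-up values rather than the data in this notebook.

```python
import pandas as pd

# A made-up column imported as strings instead of integers
s = pd.Series(['43', '53', '33'])
print(s.dtype)  # object

# Convert the values to 64-bit integers
s = s.astype('int64')
print(s.dtype)  # int64
```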
To analyse data, it is often necessary to extract only certain values. This is a very useful skill.
Pandas refers to a single column in a dataframe object as a series object. The Age
column is extracted below and saved as a series object, assigned to the variable age_column
. The notation uses square brackets, with the column name represented as a string.
# Assign the Age column in df to the variable age_column
age_column = df['Age']
The age_column
variable now holds a pandas series object.
# Print the type of age_column
type(age_column)
pandas.core.series.Series
Since there are no illegal characters in the column name, such as spaces, the dot notation can be used. The age_column computer
variable is overwritten below.
# Assign the Age column in df to the variable age_column and use dot notation
age_column = df.Age
The first five rows of the series object are displayed below using the head method.
# Print the first 5 rows of age_column
age_column.head()
0 43 1 53 2 33 3 43 4 46 Name: Age, dtype: int64
The index column above indicates that the values are not a simple Python list object.
Task
Select the Vocation
column and assign it to the variable vocation
. Confirm that the object assigned to vocation is a pandas series object. The tail
method (with no arguments) prints the last five observations to the screen. Use the tail
method to print the last five observations in vocation
to the screen.
Solution
vocation = df.Vocation
type(vocation)
pandas.core.series.Series
vocation.tail()
195 Sales professional, IT 196 Paramedic 197 Chartered certified accountant 198 Dancer 199 Theme park manager Name: Vocation, dtype: object
At times it may be more useful to work with a numpy array, rather than a pandas series object. The to_numpy
method is used to extract the age values as a numpy array. It is used below and the numpy array is assigned to the variable age
.
# Convert the age_column to a numpy array and assign it to the variable age
age = age_column.to_numpy()
The type
function shows that age
contains a numpy array.
# Print the type of age
type(age)
numpy.ndarray
Pandas series objects and numpy arrays have many methods and attributes. The mean
, min
, and max
methods are used below to calculate the average, minimum, and maximum age values respectively.
# Calculate the average age in age_column
age_column.mean()
53.07
# Calculate the minimum age in age_column
age_column.min()
30
# Calculate the maximum age in age_column
age_column.max()
75
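These summary methods work on any series, not only on columns of this dataframe. The sketch below uses made-up values and also shows the describe method, which bundles several summary statistics into a single series.

```python
import pandas as pd

ages = pd.Series([30, 40, 50, 60])  # made-up values

print(ages.mean())  # 45.0
print(ages.min())   # 30
print(ages.max())   # 60

# The describe method returns the count, mean, standard deviation,
# minimum, quartiles, and maximum in one series
summary = ages.describe()
print(summary['mean'])  # 45.0
```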
Individual rows (subjects) can be specified by making use of the iloc
attribute (or property, which is the term used by pandas) for a dataframe object. The iloc
property stands for integer location. Therefore, integers are used to specify the row and column numbers. The first row of data, at index $0$, is extracted below.
# Display the first row of data in df using the iloc method
df.iloc[0] # Note the use of square brackets
Name Dylan Patton DOB 1981-10-07 Age 43 Vocation Energy manager Smoke 0 HR 47 sBP 145 CholesterolBefore 1.2 TAG 1.2 Survey 1 CholesterolAfter 0.7 Delta 0.5 Group Active Name: 0, dtype: object
A list of values can be passed to select multiple rows. Rows $2$, $5$, and $7$ are selected below. Note that these rows correspond to index values $1$, $4$, and $6$.
# Select rows 2, 5, and 7 using the iloc method
df.iloc[[1, 4, 6]]
Name | DOB | Age | Vocation | Smoke | HR | sBP | CholesterolBefore | TAG | Survey | CholesterolAfter | Delta | Group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Sandra Howard | 1993-01-27 | 53 | Tax adviser | 0 | 51 | 115 | 1.2 | 0.6 | 3 | 1.0 | 0.2 | Active |
4 | Robert Wilson | 1964-06-23 | 46 | Clinical embryologist | 0 | 61 | 138 | 2.8 | 2.1 | 5 | 2.8 | 0.0 | Active |
6 | Frank Zimmerman | 1981-03-04 | 54 | Police officer | 0 | 60 | 129 | 2.9 | 2.4 | 1 | 2.6 | 0.3 | Active |
A range object can be used to select a contiguous number of rows.
# Select the first two rows using the iloc method
df.iloc[0:2]
Name | DOB | Age | Vocation | Smoke | HR | sBP | CholesterolBefore | TAG | Survey | CholesterolAfter | Delta | Group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Dylan Patton | 1981-10-07 | 43 | Energy manager | 0 | 47 | 145 | 1.2 | 1.2 | 1 | 0.7 | 0.5 | Active |
1 | Sandra Howard | 1993-01-27 | 53 | Tax adviser | 0 | 51 | 115 | 1.2 | 0.6 | 3 | 1.0 | 0.2 | Active |
As with Python list objects, shorthand notation can be used for contiguous rows. The first two rows are selected below.
# Select the first two rows using the iloc method
df.iloc[:2]
Name | DOB | Age | Vocation | Smoke | HR | sBP | CholesterolBefore | TAG | Survey | CholesterolAfter | Delta | Group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Dylan Patton | 1981-10-07 | 43 | Energy manager | 0 | 47 | 145 | 1.2 | 1.2 | 1 | 0.7 | 0.5 | Active |
1 | Sandra Howard | 1993-01-27 | 53 | Tax adviser | 0 | 51 | 115 | 1.2 | 0.6 | 3 | 1.0 | 0.2 | Active |
Negative index notation similarly follows the Python list convention. The last three rows are selected below.
# Select the last three rows
df.iloc[-3:]
Name | DOB | Age | Vocation | Smoke | HR | sBP | CholesterolBefore | TAG | Survey | CholesterolAfter | Delta | Group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
197 | Charles Smith | 1959-01-30 | 61 | Chartered certified accountant | 0 | 99 | 212 | 10.1 | 5.6 | 4 | 9.7 | 0.4 | Control |
198 | Barry Porter | 1979-05-30 | 65 | Dancer | 1 | 98 | 200 | 10.1 | 5.3 | 3 | 10.0 | 0.1 | Control |
199 | Julie Barrett | 1972-07-27 | 66 | Theme park manager | 1 | 102 | 208 | 11.1 | 5.7 | 2 | 10.7 | 0.4 | Active |
Columns can also be indexed. The iloc
property can take two arguments in the form row,column
. The code below extracts the first five rows and columns $1$ and $3$. Note that the index values ($0$ and $2$) have to be used.
# Extract the first 5 rows for the first and third columns using the iloc method
df.iloc[:5, [0, 2]]
Name | Age | |
---|---|---|
0 | Dylan Patton | 43 |
1 | Sandra Howard | 53 |
2 | Samantha Williams | 33 |
3 | Ashley Hensley | 43 |
4 | Robert Wilson | 46 |
The loc
property can be used in a similar fashion. The column names (as a list or a slice) are specified. If the index values were not integers, but strings, those names could also be used. Remember that the row and column names are referred to as labels. Below, the same labels are extracted as above. Note, though, that the row range is now set to 0:4
, since, unlike iloc, slicing with loc includes the end label
. When extracting rows and columns, ALWAYS use the row, column
notation. Since two columns are required, they are passed as a Python list object (in square brackets) after the comma. Each column name is passed as a string.
# Extract the first five rows for the Name and Age columns using the loc method
df.loc[:4, ['Name', 'Age']] # Note the difference in row index compared to the iloc method
Name | Age | |
---|---|---|
0 | Dylan Patton | 43 |
1 | Sandra Howard | 53 |
2 | Samantha Williams | 33 |
3 | Ashley Hensley | 43 |
4 | Robert Wilson | 46 |
Task
Print the TAG
and the Smoke
variables to the screen, but only for observations (rows) $10$ through $15$.
Solution
df.loc[10:15, ['TAG', 'Smoke']]
TAG | Smoke | |
---|---|---|
10 | 1.7 | 0 |
11 | 2.0 | 0 |
12 | 2.3 | 0 |
13 | 2.1 | 0 |
14 | 2.4 | 0 |
15 | 2.6 | 2 |
The iat
indexing extracts a single cell by using its row and column index.
# Extract the value in row 2, column 1 using the iat method
df.iat[1, 0]
'Sandra Howard'
Filtering data is one of the most useful tasks in data science. This section explores filtering by extracting numpy array objects based on criteria (conditions). Conditional operators include >
, <
, <=
, >=
, ==
, and !=
. These operators can be used to compare values in a numpy array to a single value or to another numpy array.
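Applying a conditional operator to a whole series returns a series of True and False values, called a Boolean mask. The mask drives all the filtering in this section. A minimal sketch with made-up heart rate values:

```python
import pandas as pd

hr = pd.Series([47, 51, 80, 95])  # made-up heart rates

mask = hr > 72  # element-wise comparison
print(mask.tolist())  # [False, False, True, True]

# The mask can then be used to index the series itself
print(hr[mask].tolist())  # [80, 95]
```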
Remember that the sample space of a variable is the set of all possible values that the variable can take. This is particularly useful when looking at categorical variables. The unique
method is used to find all the sample space elements in a column. Below, the unique values in the Smoke
column are returned.
# Print the unique values in the Smoke column
df.Smoke.unique()
array([0, 2, 1])
There are three unique elements in the sample space of this column. This method is great for uncovering surprises that might be hidden in a dataframe, such as one or more strings in a numerical data column. A common example would be an Age
column that has one or two strings such as thirty-two
in it, instead of the number $32$. Strings in a numerical data column will prevent calculations on that column and such errors in the data must be corrected.
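One way to locate such rogue strings is the pandas to_numeric function with its errors='coerce' argument, which turns anything that cannot be parsed as a number into a missing value. The sketch below uses made-up values:

```python
import pandas as pd

# A made-up Age column with a rogue string entry
age = pd.Series([32, 'thirty-two', 45, 51])

# Non-numeric entries become NaN, so their positions can be found
numeric = pd.to_numeric(age, errors='coerce')
bad_rows = age[numeric.isna()]
print(bad_rows.tolist())  # ['thirty-two']
```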
The Smoke
column contains information about the smoking habits of the respondents in the data set. The sample space contains three integers: $0$ for not smoking, $1$ for smoking, and $2$ for previous smoking.
An array containing the ages of only the patients who do not smoke is created below. To do this, indexing is used directly. A conditional is used to include only subjects with a $0$ value (df.Smoke == 0
). The column is then referenced, which is Age
, followed by the to_numpy
method.
# Select the ages of those who do not smoke and assign it to the variable age_nonsmokers (as a numpy array)
age_nonsmokers = df.loc[df.Smoke == 0]['Age'].to_numpy()
age_nonsmokers # Print the numpy array to the screen
array([43, 53, 33, 43, 46, 48, 54, 58, 44, 31, 45, 35, 49, 56, 57, 35, 50, 49, 63, 45, 51, 40, 47, 41, 47, 38, 54, 30, 46, 64, 40, 45, 65, 55, 53, 54, 72, 32, 38, 59, 53, 42, 38, 51, 37, 36, 48, 49, 62, 39, 74, 42, 72, 61, 33, 30, 44, 71, 49, 75, 43, 55, 38, 36, 46, 60, 57, 69, 56, 66, 60, 42, 32, 31, 56, 35, 63, 54, 68, 72, 40, 54, 62, 74, 62, 41, 61, 61])
The code above may seem a bit difficult at first. It does read rather like an English language sentence, though. Take the dataframe object. Filter the rows in column Smoke
that are $0$. For all of these rows return the Age
values as a numpy array.
As an alternative, the loc
indexing can be used, passing a row and a column specification as arguments. The row interrogates the Smoke
column and filters only those with a $0$ entry. The column is then specified to be the Age
column.
# Use the loc method to select the ages of those who do not smoke and assign it to the variable age_nonsmokers (as a numpy array)
age_nonsmokers = df.loc[df.Smoke == 0, 'Age'].to_numpy()
age_nonsmokers
array([43, 53, 33, 43, 46, 48, 54, 58, 44, 31, 45, 35, 49, 56, 57, 35, 50, 49, 63, 45, 51, 40, 47, 41, 47, 38, 54, 30, 46, 64, 40, 45, 65, 55, 53, 54, 72, 32, 38, 59, 53, 42, 38, 51, 37, 36, 48, 49, 62, 39, 74, 42, 72, 61, 33, 30, 44, 71, 49, 75, 43, 55, 38, 36, 46, 60, 57, 69, 56, 66, 60, 42, 32, 31, 56, 35, 63, 54, 68, 72, 40, 54, 62, 74, 62, 41, 61, 61])
The different ways to interact with pandas add to its power. Without the to_numpy
method, the code above returns a pandas series object. Every data scientist has their own preferences and certain tasks require specific approaches.
# Use the loc method to select the ages of those who do not smoke and assign it to the variable age_nonsmokers
age_nonsmokers = df.loc[df.Smoke == 0, 'Age']
age_nonsmokers
0 43 1 53 2 33 3 43 4 46 .. 180 74 183 62 188 41 193 61 197 61 Name: Age, Length: 88, dtype: int64
Statistical analysis can be done on either the series or numpy objects.
# Calculate the average age of those who do not smoke
age_nonsmokers.mean()
50.09090909090909
Task
The to_list
method creates a Python list object. Use the to_list
method to create a Python list object of the Age
values for those that have a value in the HR
(heart rate) column that is more than $72$. Print only the first $10$ values.
Solution
df.loc[df.HR > 72, 'Age'].to_list()[:10]
[57, 52, 49, 70, 45, 53, 50, 33, 30, 44]
Filtering by two criteria (two columns), Age
and Survey
is shown below. The filtering can combine criteria with either and or or. With and, all the criteria are required to be met; with or, only one of the criteria need be met (return a True
value).
The symbol for and is &
and for or is |
. Below, &
is used since both criteria must be met. Each filter is created in a set of parentheses. The code uses the row, column
notation.
# Select the non-smokers who choice 3 in the Survey column and return their ages (assigned to the variable non_smokers_choice_3)
non_smokers_choice_3 = df.loc[(df.Smoke == 0) & (df.Survey == 3), 'Age']
non_smokers_choice_3
1 53 2 33 11 35 14 57 27 47 32 54 38 45 45 54 50 53 112 36 130 68 183 62 Name: Age, dtype: int64
In English the code reads: Take the df
dataframe object and look down the rows of the Smoke
and Survey
columns. Return only the rows where Smoke
is $0$ AND Survey
is $3$. Then return the Age
column values for all these rows fulfilling both criteria.
Now filtering by or is shown. The symbol for or is |
. Below, |
is used since only one of the criteria must be met. Each filter is created in a set of parentheses. The code uses the row, column
notation.
It may be useful to create a variable that holds the criteria first.
# Filter by Smoke == 0 or Survey > 3 and assign the filter to the variable crit
crit = (df.Smoke == 0) | (df.Survey > 3)
The crit
variable can now be used in the filtering.
# Use the crit criterion and assign to the variable nonsmoker_or_survey_gt_3
nonsmoker_or_survey_gt_3 = df.loc[crit, 'Age']
nonsmoker_or_survey_gt_3[:5] # Print the first 5 rows
0 43 1 53 2 33 3 43 4 46 Name: Age, dtype: int64
Task
Create a variable named selection that selects all the observations who never smoked, have a heart rate, HR
, less than $80$, and a systolic blood pressure, sBP
, of less than $120$. Use this selection to create a numpy $n$-dimensional array of the Age
and triacylglycerol, TAG
, variable values.
Solution
selection = (df.Smoke == 0) & (df.HR < 80) & (df.sBP < 120)
df.loc[selection, ['Age', 'TAG']].to_numpy() # Note the parentheses needed to call the method
array([[53. , 0.6], [43. , 1.6], [44. , 2.4], [35. , 2. ], [45. , 2.1], [55. , 3.1]])
Instead of just an array of values, a new dataframe object can also be created using filtering. (Because it is part of an existing dataframe object, some data scientists refer to it as a sub-dataframe object.) It includes all the columns (variables), but only for patients up to and including $49$ years of age. This is very simple to achieve.
# Create a new dataframe object that only contains participants younger than 50 and assign it to the variable df_younger_than_50
df_younger_than_50 = df.loc[df.Age < 50]
df_younger_than_50[:5] # Display the first five rows
Name | DOB | Age | Vocation | Smoke | HR | sBP | CholesterolBefore | TAG | Survey | CholesterolAfter | Delta | Group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Dylan Patton | 1981-10-07 | 43 | Energy manager | 0 | 47 | 145 | 1.2 | 1.2 | 1 | 0.7 | 0.5 | Active |
2 | Samantha Williams | 1973-12-21 | 33 | IT consultant | 0 | 54 | 120 | 2.0 | 1.3 | 3 | 1.7 | 0.3 | Active |
3 | Ashley Hensley | 1981-12-01 | 43 | Nurse, children's | 0 | 54 | 103 | 2.1 | 1.6 | 4 | 2.1 | 0.0 | Active |
4 | Robert Wilson | 1964-06-23 | 46 | Clinical embryologist | 0 | 61 | 138 | 2.8 | 2.1 | 5 | 2.8 | 0.0 | Active |
5 | Leslie Diaz | 1994-08-25 | 48 | Politician's assistant | 0 | 59 | 122 | 2.8 | 1.4 | 4 | 2.6 | 0.2 | Active |
Since this new dataframe object contains only data filtered by a criterion, it is easier to calculate any required statistics.
# Mean HR
df_younger_than_50.HR.mean()
69.45348837209302
By making use of the isin
method, certain values in the sample space of a variable can be specified. The code below filters for IT consultant
, Energy manager
, and Clinical embryologist
in the Vocation
column.
To achieve this, a Python list object with the sample space elements is created first.
# Create a Python list object with the string element 'IT consultant', 'Energy manager', 'Clinical embryologist' and assign it to the variable jobs
jobs = ['IT consultant', 'Energy manager', 'Clinical embryologist']
A criterion is created using the jobs
list object.
# Create a criterion
crit = df.Vocation.isin(jobs)
The criterion is used to filter the dataframe object.
# Filter for job in the Vocation column and assign it to the variable df_jobs
df_jobs = df.loc[crit]
df_jobs
Name | DOB | Age | Vocation | Smoke | HR | sBP | CholesterolBefore | TAG | Survey | CholesterolAfter | Delta | Group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Dylan Patton | 1981-10-07 | 43 | Energy manager | 0 | 47 | 145 | 1.2 | 1.2 | 1 | 0.7 | 0.5 | Active |
2 | Samantha Williams | 1973-12-21 | 33 | IT consultant | 0 | 54 | 120 | 2.0 | 1.3 | 3 | 1.7 | 0.3 | Active |
4 | Robert Wilson | 1964-06-23 | 46 | Clinical embryologist | 0 | 61 | 138 | 2.8 | 2.1 | 5 | 2.8 | 0.0 | Active |
188 | Joan Chavez | 1999-10-07 | 41 | Energy manager | 0 | 93 | 182 | 9.1 | 5.0 | 2 | 8.8 | 0.3 | Control |
The str.contains
method can be used to filter for a string in a column. The code below filters for all the participants who have manager
in their job title.
# Build a criterion with the str.contains method for the word manager
crit = df.Vocation.str.contains('manager', na=False) # Flag for missing values (see later)
# Create a new dataframe object with the criterium and assign it to the variable df_manager
df_manager = df.loc[crit]
df_manager[:5]
Name | DOB | Age | Vocation | Smoke | HR | sBP | CholesterolBefore | TAG | Survey | CholesterolAfter | Delta | Group | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Dylan Patton | 1981-10-07 | 43 | Energy manager | 0 | 47 | 145 | 1.2 | 1.2 | 1 | 0.7 | 0.5 | Active |
34 | Mr. Tyler Strickland DDS | 1940-08-27 | 46 | Tourist information centre manager | 0 | 62 | 136 | 4.1 | 2.3 | 2 | 3.8 | 0.3 | Control |
54 | Stephanie Jacobs | 1977-06-18 | 38 | Estate manager/land agent | 0 | 69 | 139 | 4.4 | 2.7 | 2 | 4.0 | 0.4 | Active |
55 | Juan Johnson | 1956-12-09 | 51 | Logistics and distribution manager | 0 | 65 | 141 | 4.5 | 2.9 | 5 | 4.0 | 0.5 | Control |
71 | Tina Martinez | 1941-05-31 | 74 | Passenger transport manager | 1 | 69 | 146 | 4.8 | 3.1 | 4 | 4.5 | 0.3 | Active |
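The str.contains method also accepts a case argument; setting case=False makes the match case-insensitive, so that 'manager' and 'Manager' are both found. A minimal sketch with made-up job titles:

```python
import pandas as pd

# Made-up job titles
vocation = pd.Series(['Energy manager', 'Nurse', 'Theme park Manager'])

# case=False matches 'manager' regardless of capitalization
crit = vocation.str.contains('manager', case=False, na=False)
print(vocation[crit].tolist())  # ['Energy manager', 'Theme park Manager']
```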
Another valuable skill is to be able to change the actual data in a dataframe object. Fortunately, dataframe objects can be manipulated in many ways.
The names of individual columns can be replaced with the rename
method using a dictionary, where the keys are the original names and the values are the new names. Below, the Name
column is renamed to Patient
. To make the changes permanent, the inplace
argument value is set to True
.
# Replace the Name column with the Patient
df.rename(columns={'Name': 'Patient'}, inplace=True)
df.columns # Call the column names
Index(['Patient', 'DOB', 'Age', 'Vocation', 'Smoke', 'HR', 'sBP', 'CholesterolBefore', 'TAG', 'Survey', 'CholesterolAfter', 'Delta', 'Group'], dtype='object')
The drop
method is used to delete a column. Either the axis
argument is set to $1$ to indicate that a column (rather than a row) is to be deleted, or the columns
argument is used directly, as in the code below. The inplace
argument value is set to True
to make the changes permanent.
The Delta
column is deleted from the dataframe object.
# Delete the Delta column
df.drop(columns='Delta', inplace=True)
In specific types of research, personal data are obfuscated to protect the privacy of the people in the dataset. In a simple case, $2$ is added to the age of every participant when capturing the data. In reality, they are all $2$ years younger.
# Add 2 to each value in the Age column
df.Age = df.Age + 2 # Note that the Age column is reassigned
df.Age[:5] # Display the first five rows
0 45 1 55 2 35 3 45 4 48 Name: Age, dtype: int64
There are more than one way to achieve a similar goal. One way is to create a function and then use the apply
method to apply the function to the values in a column.
User-defined functions are created using the def
and return
keywords. The former tells Python to create a new function. After a space follows the name given to the function, and then a set of parentheses that contains a placeholder for the argument (here called a parameter). In its simplest form, the return
keyword follows. As the name indicates, it returns the value of the function. Below, it is clear from the code that $x$ is the parameter and will take an argument value. The function then adds $2$ to the argument value.
# Create a user-defined function named add2 that adds 2 to a value
def add2(x):
return x + 2
The apply
method is now used and the add2
function is used. The value in each row is now increased by $2$.
# Apply the add2 function to the Age column
df.Age = df.Age.apply(add2)
df.Age[:5]
0 47 1 57 2 37 3 47 4 50 Name: Age, dtype: int64
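For a short, one-off transformation such as this, an anonymous (lambda) function can be passed to the apply method instead of a named function. A minimal sketch with made-up values:

```python
import pandas as pd

age = pd.Series([45, 55, 35])  # made-up values

# A lambda expression avoids defining a separate named function
age_plus_2 = age.apply(lambda x: x + 2)
print(age_plus_2.tolist())  # [47, 57, 37]
```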
Task
Create a new variable named AgeHR
that is the product of the values in the variables Age
and HR
.
Solution
df['AgeHR'] = df.Age * df.HR
Dictionaries can be used to change the values in a column. The keys are the original values and the values are the new values. The replace
method is used to change the values.
The code below returns the sample space (unique) elements in the Group
column.
# Return the unique values in the Group column
df.Group.unique()
array(['Active', 'Control'], dtype=object)
The values are Active
and Control
. These are replaced with 1
and 0
respectively.
# Use the replace method to replace the values in the Group column with 1 for Active and 0 for Control
df.Group = df.Group.replace({'Active': 1, 'Control': 0}) # Reassignment avoids the deprecated chained inplace pattern
The first five observations in the Group
column are printed to the screen to confirm the changes.
# Print the first 5 rows of the Group column
df.Group[:5]
0 1 1 1 2 1 3 1 4 1 Name: Group, dtype: int64
Adding columns is an often-used technique. It is as simple as stating the new name in square brackets as a string and then assigning a list of values. The list of values that are added must be the same length (number of rows) as the dataframe.
The code below creates two new columns, FirstName
and LastName
. The split
method is used to split the values in the Patient
column. The expand
argument is set to True
to create two new columns. The str
accessor indicates that the split
method is to be applied to the string values in the column, splitting on the space character passed as the argument.
# Split the values in the Patient column on the empty space between the first and last names and assign it to the variable split_participant
split_participant = df.Patient.str.split(' ', expand=True)
Indexing is used to extract the two lists of values.
# Create new columns in df named FirstName and LastName
df['FirstName'] = split_participant[0]
df['LastName'] = split_participant[1]
The first five rows of the FirstName
, LastName
, and the original Patient
columns are printed to the screen.
# Print first 5 rows of FirstName and LastName and Patient
df[['FirstName', 'LastName', 'Patient']][:5]
FirstName | LastName | Patient | |
---|---|---|---|
0 | Dylan | Patton | Dylan Patton |
1 | Sandra | Howard | Sandra Howard |
2 | Samantha | Williams | Samantha Williams |
3 | Ashley | Hensley | Ashley Hensley |
4 | Robert | Wilson | Robert Wilson |
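The split-and-expand pattern can be seen in isolation on a tiny made-up series of names:

```python
import pandas as pd

names = pd.Series(['Dylan Patton', 'Sandra Howard'])  # made-up names

# expand=True returns a dataframe with one column per split part
parts = names.str.split(' ', expand=True)
print(parts[0].tolist())  # ['Dylan', 'Sandra']
print(parts[1].tolist())  # ['Patton', 'Howard']
```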
Intervals, called bins, can be created to change a continuous numerical variable into a categorical variable. The pandas cut
function is used to create the bins. The bins
argument is used to specify the intervals. The labels
argument is used to specify the labels for each bin. The right
argument is used to specify whether the right or left side of the interval is closed. The include_lowest
argument is used to specify whether the lowest value is included in the first bin.
The code below creates three bins for the CholesterolBefore
column. The minimum and maximum values are calculated first.
# Minimum CholesterolBefore value
df.CholesterolBefore.min()
1.2
# Maximum CholesterolBefore value
df.CholesterolBefore.max()
11.1
Three bins are created with four values used in the cut
function to create the bins $\text{low}=[0,5)$, $\text{normal}=[5,10)$, and $\text{high}=[10,20)$. The intervals safely contain the range of values.
# Create a new column called CholesterolBeforeLevel using the cut function with bins=[0,5,10,20], which is right False and has the labels ['low', 'normal', 'high']
df['CholesterolBeforeLevel'] = pd.cut(df.CholesterolBefore, bins=[0, 5, 10, 20], right=False, labels=['low', 'normal', 'high'])
df.CholesterolBeforeLevel[:5]
0 low 1 low 2 low 3 low 4 low Name: CholesterolBeforeLevel, dtype: category Categories (3, object): ['low' < 'normal' < 'high']
Task
Create a new variable in the dataframe called HRLevel
from the HR
column with intervals $(0,65], (65,78], (78,1000)$ with labels low
, normal
, and high
respectively.
Solution
df['HRLevel'] = pd.cut(df.HR, bins = [0, 65, 78, 1000], right=True, labels=['low', 'normal', 'high'])
df.HRLevel[:5]
0 low 1 low 2 low 3 low 4 low Name: HRLevel, dtype: category Categories (3, object): ['low' < 'normal' < 'high']
Datasets very often contain missing data. The numpy library has a specific entity called a nan
value. This stands for not a number. The nan
value is used to indicate that a value is missing.
# Print the numpy nan value
np.nan
nan
A list object or an array can contain a nan
value. Computation is not possible since the missing value is unknown.
# Create a list with three integers and a nan value and assign it to the variable my_list
my_list = [1, 2, 3, np.nan]
my_list
[1, 2, 3, nan]
# Calculate the mean of my_list
np.mean(my_list)
nan
The result is nan
since it is impossible to know what the mean of the four values is.
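When the missing value should simply be ignored, the numpy nanmean function computes the mean over the values that are present:

```python
import numpy as np

my_list = [1, 2, 3, np.nan]

print(np.mean(my_list))     # nan
print(np.nanmean(my_list))  # 2.0 (the mean of 1, 2, and 3)
```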
The MissingData.csv
file contains missing data.
# Read the MissingData.csv file and assign it to the variable df_missing
df_missing = pd.read_csv('https://raw.githubusercontent.com/juanklopper/TutorialData/main/MissingData.csv')
Printing the first $10$ rows to the screen shows some of the missing values.
# Print the first 10 rows of df_missing
df_missing[:10]
| | age | salary | previouscompany |
|---|---|---|---|
| 0 | 57.0 | NaN | 1.0 |
| 1 | 56.0 | 50927.0 | NaN |
| 2 | 46.0 | 75500.0 | 3.0 |
| 3 | NaN | 84417.0 | NaN |
| 4 | 60.0 | 63002.0 | 1.0 |
| 5 | 54.0 | 54652.0 | NaN |
| 6 | NaN | 65739.0 | 1.0 |
| 7 | 64.0 | 89397.0 | 3.0 |
| 8 | 60.0 | 77797.0 | 4.0 |
| 9 | 61.0 | NaN | 1.0 |
The `dropna` method deletes all rows that contain missing data in any of the columns. All the data for such a row (observation) are lost. The removal is not permanent. Below, the dataframe with the dropped rows is assigned to a new variable, leaving the original intact.
# Delete rows with missing values and assign to a new variable complete_data_df
complete_data_df = df_missing.dropna()
The `axis` argument is set to $0$ to indicate that rows are to be dropped. This is the default. If the value is set to $1$, then any column with a missing value is dropped instead.
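A small, made-up dataframe (not part of the tutorial data) illustrates the difference between the two `axis` values:

```python
import pandas as pd
import numpy as np

# Hypothetical dataframe with one missing value in column a
demo = pd.DataFrame({'a': [1, np.nan, 3], 'b': [4, 5, 6]})

# axis=0 (the default) drops the row containing the missing value
print(demo.dropna(axis=0).shape)  # (2, 2)

# axis=1 drops the whole column containing the missing value
print(demo.dropna(axis=1).shape)  # (3, 1)
```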
The `isna` method can be used to return `True` and `False` values along a column. This is done for the `age` column below, where `True` indicates that a value is missing.
# Use .isna() for the age column
df_missing.age.isna()
0     False
1     False
2     False
3      True
4     False
5     False
6      True
7     False
8     False
9     False
10    False
11     True
12    False
13    False
14    False
15     True
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
Name: age, dtype: bool
Since `True` is stored as $1$, the `sum` method can be used to count all the rows (observations) with missing data.
# Sum over all the missing data in the age column
df_missing.age.isna().sum()
4
There are $4$ missing values in the `age` column.
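Calling `isna().sum()` on a whole dataframe, rather than on a single column, returns the count of missing values per column in one step. The frame below is a small hypothetical stand-in for `df_missing`:

```python
import pandas as pd
import numpy as np

# Small hypothetical dataframe standing in for df_missing
demo = pd.DataFrame({
    'age': [57, np.nan, 46, np.nan],
    'salary': [np.nan, 50927, 75500, 84417]
})

# One count of missing values per column
print(demo.isna().sum())
```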
The process of creating values to fill in missing data is called data imputation and is a separate and complicated subject. The pandas library provides a `fillna` method for filling in the missing data with simple calculations.
The `fillna` method takes a `method` argument that can be set to `'ffill'` or `'bfill'`. As the names imply, the value in the previous or subsequent row is used to fill in the missing value. Note that this argument is deprecated in newer versions of pandas, which provide the dedicated `ffill` and `bfill` methods instead.
# Use forward fill to impute the missing values in the age column
df_missing.age.ffill()[:5]
0    57.0
1    56.0
2    46.0
3    46.0
4    60.0
Name: age, dtype: float64
The missing value for index $3$ has been replaced by the $46.0$ value at index $2$.
Calculated values can also be used to fill in missing data. Below, the median of the `age` column replaces the missing values. Note that the `median` method of a pandas series object ignores the missing data, unlike a Python list object or a numpy array.
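This skipping behaviour can be confirmed directly on a small, illustrative series. Pandas aggregation methods ignore `nan` by default (`skipna=True`), and passing `skipna=False` reproduces the propagating behaviour seen with the plain numpy mean earlier:

```python
import pandas as pd
import numpy as np

# Illustrative series with one missing value
s = pd.Series([1.0, 2.0, np.nan, 3.0])

# The pandas method skips the missing value: (1 + 2 + 3) / 3
print(s.mean())              # 2.0

# Passing skipna=False propagates the nan instead
print(s.mean(skipna=False))  # nan
```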
# Fill the missing age values with the median age
df_missing.age.fillna(df_missing.age.median())[:5]
0    57.0
1    56.0
2    46.0
3    57.0
4    60.0
Name: age, dtype: float64
It is common to use default values when data is not available at the time of capture. The `DefaultMissingData.csv` file uses the values $999$, `Nil`, and `Missing` to indicate missing data.
# Import the DefaultMissingData.csv file and assign it to the variable default_missing_df
default_missing_df = pd.read_csv('https://raw.githubusercontent.com/juanklopper/TutorialData/main/DefaultMissingData.csv')
default_missing_df[:5]
| | age | salary | previous_company |
|---|---|---|---|
| 0 | 57 | Nil | 1 |
| 1 | 56 | 50927 | Missing |
| 2 | 46 | 75500 | 3 |
| 3 | 999 | 84417 | Missing |
| 4 | 60 | 63002 | 1 |
The `na_values` argument can take a Python list object that contains all the default values that indicate missing data.
# Reimport the csv file and use the na_values argument to specify the missing values 999, Nil, and Missing
default_missing_df = pd.read_csv('https://raw.githubusercontent.com/juanklopper/TutorialData/main/DefaultMissingData.csv', na_values=[999, 'Nil', 'Missing'])
default_missing_df[:5]
| | age | salary | previous_company |
|---|---|---|---|
| 0 | 57.0 | NaN | 1.0 |
| 1 | 56.0 | 50927.0 | NaN |
| 2 | 46.0 | 75500.0 | 3.0 |
| 3 | NaN | 84417.0 | NaN |
| 4 | 60.0 | 63002.0 | 1.0 |
The missing values are now correctly indicated as `nan` values.
1. How do you import the pandas library in Python?
2. How do you read a CSV file named `data.csv` into a pandas DataFrame?
3. How do you display the first $5$ rows of a DataFrame named `df`?
4. How do you find the number of rows and columns in a DataFrame `df`?
5. How do you print the names of all the columns in a DataFrame `df`?
6. How do you get the summary information of the DataFrame `df`, including data types and non-null values?
7. How do you select only the rows in DataFrame `df` where the column `A` is greater than $5$?
8. How do you select the columns `A` and `B` from the DataFrame `df`?
9. How do you rename the column `A` to `Alpha` in DataFrame `df`?
10. How do you replace all occurrences of the value $999$ in DataFrame `df` with `nan`?
11. How do you drop all rows in DataFrame `df` that contain missing values?
12. How do you fill all missing values in DataFrame `df` with $0$?
13. How do you calculate the mean of column `A` in DataFrame `df`?
14. How do you create a new column `C` in DataFrame `df` that is the sum of columns `A` and `B`?
15. How do you save the DataFrame `df` to a CSV file named `output.csv`? (Find the solution by searching the pandas documentation.)