Department of Data Science

Course: Tools and Techniques for Data Science

Instructor: Muhammad Arif Butt, Ph.D.

Lecture 3.14 (Pandas-06)

Modifying Dataframes Part-I¶

In [ ]:

# To install this library in Jupyter notebook
#import sys
#!{sys.executable} -m pip install pandas

In [ ]:

import pandas as pd
pd.__version__ , pd.__path__

Learning agenda of this notebook¶

Modifying Column labels of Dataframe
Modifying Row indices of Dataframe
Modifying Row(s) Data (Records) of a Dataframe
- Modifying a single Row
- Modifying multiple Rows
  - map() Method
  - df.replace() Method
  - df.apply() Method
  - df.applymap() Method

In [ ]:

Read a Sample Dataframe¶

In [ ]:

import pandas as pd
df = pd.read_csv('datasets/groupdata.csv')
df.head()

In [ ]:

# The `shape` attribute of a dataframe returns a two-value tuple: (number of rows, number of columns)
# Note that the row count does not include the column labels, and the column count does not include the row index
df.shape

In [ ]:

# The `index` attribute of a dataframe returns its row indices and their datatype
df.index

In [ ]:

# The `columns` attribute of a dataframe returns its column labels and their datatype
df.columns

In [ ]:

# The `dtypes` attribute of a dataframe returns the data type of each column in the dataframe
df.dtypes

In [ ]:

1. Modifying Column Names of a Dataframe¶

Every dataframe has column labels associated with its columns.
By default, these are integer values 0, 1, 2, 3, ...
However, while creating a dataframe from scratch or reading one from a file, you can set them to more meaningful string values.
While reading from a CSV file, the first row of the file is taken as the column labels by default.
We can change the column labels whenever we want.
Let us see this in practice for better understanding.
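
The points above can be sketched without the course dataset. The example below is only an illustration: the two tiny CSV strings and their values are made up, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV contents, used only for illustration (not the course dataset)
csv_with_header = "name,age\nAli,30\nSara,25\n"
csv_without_header = "Ali,30\nSara,25\n"

# By default the first row of the file is taken as the column labels
df_a = pd.read_csv(io.StringIO(csv_with_header))
print(list(df_a.columns))

# With header=None, pandas assigns integer column labels 0, 1, ...
df_b = pd.read_csv(io.StringIO(csv_without_header), header=None)
print(list(df_b.columns))
```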

In [ ]:

! cat datasets/groupdatawithoutcollables.csv

In [ ]:

a. While Reading a Dataset in a Dataframe¶

Pass a list of column names to the `names` argument of the pd.read_csv() function

In [ ]:

import pandas as pd
df = pd.read_csv('datasets/groupdatawithoutcollables.csv', names = ['roll no', 'name', 'age', 'address', 'session', 
                                                                'group', 'gender','subj1', 'subj2', 'scholarship'])

df.head(3)

In [ ]:

b. After Dataframe is Loaded (Use `columns` attribute of dataframe)¶

In [ ]:

df = pd.read_csv('datasets/groupdatawithoutcollables.csv', header = None)
df.head(3)

In [ ]:

df.columns = ['roll no', 'name', 'age', 'address', 'session', 'group', 'gender', 'subj1', 'subj2', 'scholarship']
df.head(3)

In [ ]:

Suppose we have a dataframe in which certain column labels contain spaces.

We want to rename all such columns by replacing the space character with an underscore.

One way to do this is to call the replace() method of the `str` accessor on the column labels, which applies the string replacement to every label.

In [ ]:

df.columns

In [ ]:

df.columns.str.replace(' ', '_')

In [ ]:

df.columns = df.columns.str.replace(' ', '_')

In [ ]:

df.columns

In [ ]:

df.head()

In [ ]:

Suppose we have a dataframe in which there are column labels having names in different cases.

We want to rename all such columns such that the names are all lower or all upper case.

One way to do this is to generate a new list as per the requirement using List comprehension.

In [ ]:

list1 = [x.upper() for x in df.columns]
list1

In [ ]:

df.columns = list1
df.head(3)
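
As an alternative to the list comprehension, the `str` accessor of the column index can change case and replace spaces in one chained expression. This is a minimal sketch on a made-up one-row dataframe (`df_demo` and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical dataframe with mixed-case labels and a space in one label
df_demo = pd.DataFrame({'Roll No': ['MS01'], 'Name': ['Ali'], 'AGE': [30]})

# Lower-case every label, then replace spaces with underscores
df_demo.columns = df_demo.columns.str.lower().str.replace(' ', '_')
print(list(df_demo.columns))
```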

In [ ]:

c. After Dataframe is Loaded (Use `df.rename()` method)¶

What if your dataframe has many columns with appropriate names, and you want to change just one or two of them rather than all?
Use the df.rename() method to change one or more column names.

df.rename(mapper, axis=None, inplace=False)

Where,
- mapper: a dictionary of key:value pairs, where each key is an old column name and its value is the new column name
- axis: to change column names, use axis=1 (or axis='columns'); by default the mapper is applied to the row index labels
- inplace: set this argument to True to modify the dataframe in place, in which case the method returns None
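
The rename() method also accepts a `columns` keyword, which is equivalent to passing `mapper` with axis=1. The sketch below uses a small made-up dataframe (`df_demo` is an assumption for illustration) and shows the difference between the default behaviour and inplace=True.

```python
import pandas as pd

# Hypothetical dataframe, used only for illustration
df_demo = pd.DataFrame({'roll no': ['MS01', 'MS02'], 'name': ['Ali', 'Sara']})

# Default (inplace=False): a new dataframe is returned, df_demo is unchanged
renamed = df_demo.rename(columns={'roll no': 'rollno', 'name': 'fname'})
labels_before = list(df_demo.columns)

# inplace=True: df_demo itself is modified and the method returns None
result = df_demo.rename(columns={'roll no': 'rollno'}, inplace=True)
labels_after = list(df_demo.columns)

print(list(renamed.columns), labels_before, labels_after, result)
```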

In [ ]:

df = pd.read_csv('datasets/groupdata.csv')
df.head(3)

In [ ]:

# Since the inplace argument is False by default, rename() returns a new dataframe and leaves `df` unchanged
df.rename(mapper={'roll no': 'rollno', 'name':'fname'}, axis=1)

In [ ]:

df.columns

In [ ]:

# Since the inplace argument is now set to True, rename() returns None;
# however, `df` itself is changed
df.rename(mapper={ 'roll no': 'rollno'}, axis=1, inplace=True)

In [ ]:

df.columns

In [ ]:

2. Modifying Row Indices of a Dataframe¶

Every dataframe has a row index associated with every row; by default these are integer values 0, 1, 2, 3, ...
After you have filtered a dataframe on a condition or sorted it, these row indices are no longer sequential, because each row keeps its original index label.
We saw in detail in our previous session the two methods, df.set_index() and df.reset_index(), that handle this issue.
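
As a brief reminder of how these two methods behave, here is a minimal sketch on a made-up dataframe (`df_demo` and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical dataframe, used only for illustration
df_demo = pd.DataFrame({'name': ['Ali', 'Sara', 'Omar'], 'age': [30, 25, 41]})

# After sorting, each row keeps its original index label
sorted_df = df_demo.sort_values('age')
print(list(sorted_df.index))

# reset_index(drop=True) discards the old labels and renumbers from 0
clean_df = sorted_df.reset_index(drop=True)
print(list(clean_df.index))

# set_index() promotes a column to be the row index
indexed_df = df_demo.set_index('name')
print(list(indexed_df.index))
```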

In [ ]:

3. Modifying Data of a Single Row/Record of a Dataframe¶

In [ ]:

df = pd.read_csv('datasets/groupdata.csv')
df.head(3)

In [ ]:

a. Select the row/record you want to modify¶

Let us suppose we want to change the subj1 and subj2 marks of Shaista

In [ ]:

# Returns a Series object
df.loc[2,:]

In [ ]:

# Returns a Dataframe object
df.loc[df.name=='Shaista', :]

In [ ]:

b. Option 1:¶

One way is to pass a new list of values and assign it to the appropriate series (row)

In [ ]:

# Either of the following two lines of code will work
df.loc[2,:] = ['MS03', 'Shaista', 35, 'Karachi', 'AFTERNOON', 'group B', 'Female', 99, 99, 8500.0]
df.loc[df.name=='Shaista', :] = ['MS03', 'Shaista', 35, 'Karachi', 'AFTERNOON', 'group B', 'Female', 99, 99, 8500.0]
df.head(3)

In [ ]:

c. Option 2:¶

A better way is to assign only those two values that we want to change instead of assigning the complete list of values in that row

In [ ]:

# Returns a series
df.loc[2, ['subj1', 'subj2']] 

In [ ]:

# Returns a dataframe
df.loc[df.name=='Shaista', ['subj1', 'subj2']]

In [ ]:

df.loc[2, ['subj1', 'subj2']] = [100, 100]
df.loc[df.name=='Shaista', ['subj1', 'subj2']] = [100, 100]
df.head(3)

In [ ]:

Note: You can also use the df.iloc[] indexer instead of df.loc[] to change one or more values of a row. In addition, you may use the df.at[] accessor to change a single value of a row.

df.loc[filter, 'column(s)'] = 'value(s)'
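
The note above can be illustrated with a short sketch. The dataframe below is made up for the example (`df_demo` is an assumption); df.iloc[] addresses rows and columns by integer position, while df.at[] addresses a single cell by labels.

```python
import pandas as pd

# Hypothetical dataframe, used only for illustration
df_demo = pd.DataFrame({'name': ['Ali', 'Sara', 'Shaista'],
                        'subj1': [70, 80, 60],
                        'subj2': [65, 85, 55]})

# iloc: row at position 2, columns at positions 1 and 2
df_demo.iloc[2, [1, 2]] = [100, 100]

# at: a single cell, addressed by row label and column label
df_demo.at[0, 'subj1'] = 75

print(df_demo)
```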

In [ ]:

4. Modify Data of Multiple Rows¶

Until now we have learnt to modify one, several, or all values of a single row in a dataframe.
What if we want to modify multiple rows at a time?
The following methods will come to your rescue:
- map()
- df.replace()
- df.apply()
- df.applymap()

In [ ]:

a. The Python Built-in `map()` Method¶

The map(aFunction, *iterables) function returns a map object (a lazy iterator) that applies aFunction() to each element of the iterable(s).
You can later convert the map object to an appropriate data structure, such as a list or a Series.
The original iterable(s) remain unchanged.
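
Before using map() with pandas, here is a minimal sketch of the built-in function on plain Python lists. The lists and their values are made up for illustration.

```python
# Hypothetical lists, used only for illustration
names = ['Ali', 'Sara', 'Shaista']
subj1 = [70, 80]
subj2 = [65, 85]

# One iterable: len() is applied to each name
name_lengths = list(map(len, names))
print(name_lengths)

# Two iterables: the function receives one element from each
totals = list(map(lambda a, b: a + b, subj1, subj2))
print(totals)

# The original list is unchanged
print(names)
```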

In [ ]:

import pandas as pd
df = pd.read_csv('datasets/groupdata.csv')
df.head(3)

Example: Using built-in function with map()

In [ ]:

# Passing a Series object (a column of the dataframe) to map() as an argument
# The Python built-in `len()` function is applied to every value of the name column, and a map object is returned
map(len, df['name'])

In [ ]:

# Type cast the map object to Series
pd.Series(map(len, df['name']))

In [ ]:

# Another way is to call the map() method by a Series object using dot notation
df['name'].map(len)

In [ ]:

# Third way is to access the column name as well using dot notation
df.name.map(len)

In [ ]:

Example: Using a user-defined function with map()

In [ ]:

df = pd.read_csv('datasets/groupdata.csv')
df.head(3)

In [ ]:

# Let us pass a user-defined function
def myfunc(x):
    if (x <= 50):
        return "Young"
    else:
        return "Old"

df['age'].map(myfunc)

In [ ]:

# If you want to save this as a new column in the dataframe you can do that
df['newcol'] = df['age'].map(myfunc)

In [ ]:

df.head()

In [ ]:

Example: Using a Lambda function with map()

In [ ]:

df['age'].map(lambda x: "Young" if x<=50 else "Old")

In [ ]:

Example: Applying a String Method with map()

In [ ]:

# You cannot pass the bare name upper to map() as we passed len to map(),
# because upper() is not a built-in function; it is a method of the str class (use str.upper or a lambda instead)
#df['name'].map(upper)

In [ ]:

df['name'].map(lambda x: x.upper())

In [ ]:

Example: Passing a Dictionary {oldval:newval} to map() for changing selected values of a categorical column

In [ ]:

df = pd.read_csv('datasets/groupdata.csv')
df.head()

In [ ]:

df['session'].map({'MORNING':'M', 'AFTERNOON':'A'})

Limitations of map() Method

If there are values for which the dictionary has no match, those values become NaN in the result. The solution is to use the replace() method.

You can use it on an iterable or a Series object, but not on an entire dataframe. The solution is to use df.apply() and df.applymap().
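
The first limitation is easy to see side by side. In this sketch the Series and the 'EVENING' value are made up for illustration; map() turns the unmatched value into NaN, while replace() leaves it as it was.

```python
import pandas as pd

# Hypothetical Series; 'EVENING' has no entry in the dictionary
session = pd.Series(['MORNING', 'AFTERNOON', 'EVENING'])
codes = {'MORNING': 'M', 'AFTERNOON': 'A'}

mapped = session.map(codes)        # unmatched value becomes NaN
replaced = session.replace(codes)  # unmatched value is kept

print(mapped)
print(replaced)
```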

b. The `df.replace()` Method¶

The df.replace() method replaces the values given in to_replace with value.
Matching values are replaced wherever they occur in the dataframe (or Series), and values without a match are left unchanged.
This differs from updating with .loc or .iloc, which requires you to specify the location to update.

df.replace(to_replace, value, inplace=False)

In [ ]:

df = pd.read_csv('datasets/groupdata.csv')
df.head()

In [ ]:

df['session'].replace({'MORNING':'M', 'AFTERNOON':'A'})

Note that there are now no NaN values; the values that do not have a match remain as they were.

Another important point is that the replace() method works equally well on an entire dataframe.

In [ ]:

# Calling replace on entire dataframe
df.replace({'MORNING':'M', 'AFTERNOON':'A', 'group A':'GROUP-A'})

In [ ]:

# Above operation is not inplace
df
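
Since replace() on a whole dataframe touches every column, it can help to restrict it. One option, sketched here on a made-up dataframe (`df_demo` and the `note` column are assumptions for illustration), is a nested dictionary of the form {column: {old: new}}. The result is assigned back because the operation is not in place.

```python
import pandas as pd

# Hypothetical dataframe; 'MORNING' appears in two different columns
df_demo = pd.DataFrame({'session': ['MORNING', 'AFTERNOON'],
                        'note': ['MORNING', 'other']})

# Replace values only in the 'session' column and keep the result
df_demo = df_demo.replace({'session': {'MORNING': 'M', 'AFTERNOON': 'A'}})

print(df_demo)
```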

In [ ]:

c. The `df.apply()` Method¶

The df.apply() method is used to run a function along the specified axis of the dataframe.
In simple words, when called on a dataframe, apply() passes each column (or row) to the function as a Series; when called on a single Series, it runs the function on each element of that Series.

df.apply(func, axis=0, args=())

Where,
- func: A built-in, user-defined, or lambda function that is applied to every Series of the dataframe as per the axis argument. (The objects passed to func are Series objects.)
- axis: The default value is 0, so func is applied to each column. To apply func to each row, set axis=1.
- args: To pass additional positional arguments to func besides the Series (or element), pass them as a tuple.
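
The three arguments can be sketched on a small made-up dataframe. The names `df_demo` and `add_bonus` and all values are assumptions for illustration, not part of the course dataset.

```python
import pandas as pd

# Hypothetical marks dataframe, used only for illustration
df_demo = pd.DataFrame({'subj1': [70, 80, 60], 'subj2': [65, 85, 55]})

# axis=0 (default): the function receives each column as a Series
col_range = df_demo.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_total = df_demo.apply(lambda row: row['subj1'] + row['subj2'], axis=1)

# args: extra positional arguments are forwarded to the function
def add_bonus(col, bonus):
    return col + bonus

bumped = df_demo.apply(add_bonus, args=(5,))

print(col_range, row_total, bumped, sep='\n')
```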

In [ ]:

import pandas as pd
df = pd.read_csv('datasets/groupdata.csv')
df.head(3)

In [ ]:

# Let us pass the built-in function `len()` and compute the length of each name under the name column of df
# The len() function is applied to every value of a single column, and a Series object is returned
df['name'].apply(len)

In [ ]:

# Let us pass a user-defined function, with an additional argument as well. This was not possible with map() method
def myfunc(x, age):
    if (x <= age):
        return "Young"
    else:
        return "Old"

df['age'].apply(myfunc, args = (50,))

In [ ]:

# Let us use Lambda function to convert each name under the name column of df to upper case
df['name'].apply(lambda x : x.upper())

In [ ]:

# If you are satisfied with the result, you may assign it to the specific column
df['name'] = df['name'].apply(lambda x : x.upper())

In [ ]:

# Verify
df.head(3)

In [ ]:

# Can anyone guess what this LOC will do?
df['subj1'] = df['subj1'].apply(lambda x : x+5)

In [ ]:

df.head(3)

In [ ]:

Until now we have applied the apply() method to a specific column of a dataframe. Let us apply it to a row of a dataframe.

In [ ]:

# Since each row contains different dtypes, let us create a dataframe having numeric columns only
df = pd.read_csv('datasets/groupdata.csv')
df_numeric = df.loc[:,['age','subj1','subj2','scholarship']]
df_string = df.loc[:,['roll no','name','address','session', 'group', 'gender']]

In [ ]:

df_numeric.head()

In [ ]:

# Although not much meaningful, let us add a number to each value of the row
df_numeric.loc[0].apply(lambda x : x+5)

In [ ]:

# If you want to commit this to the dataframe, you can do that

In [ ]:

df_numeric.loc[0] = df_numeric.loc[0].apply(lambda x : x+5)

In [ ]:

df_numeric.head()

In [ ]:

Let us use the df.apply() method on entire dataframe

In [ ]:

df_numeric.apply(lambda x: x+5).head()

In [ ]:

df.apply(min)

In [ ]:

min(df['subj1'])

The min() function has been applied to each column of the dataframe, computing the minimum value of each column, and df.apply() has returned a Series object.

In [ ]:

d. The `df.applymap()` Method¶

The df.applymap() method applies a function to a dataframe element-wise. (In pandas 2.1 and later, applymap() is deprecated in favor of the equivalent df.map() method.)

df.applymap(func)

Where,
- func: A function that is passed a single value and returns a single value.

Note: A Series object does not have an applymap() method, so you cannot call it on a Series object.
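
Because newer pandas releases provide the same element-wise behaviour under the name df.map(), a notebook that has to run on several pandas versions can pick whichever method is available. This is only a sketch on a made-up dataframe (`df_demo` and the `elementwise` name are assumptions for illustration).

```python
import pandas as pd

# Hypothetical string dataframe, used only for illustration
df_demo = pd.DataFrame({'name': ['ali', 'sara'], 'city': ['lahore', 'karachi']})

# Use DataFrame.map where it exists, otherwise fall back to applymap
elementwise = df_demo.map if hasattr(df_demo, 'map') else df_demo.applymap

# str.upper is called once for every element of the dataframe
upper_df = elementwise(str.upper)
print(upper_df)
```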

In [ ]:

df = pd.read_csv('datasets/groupdata.csv')
df_string = df.loc[:,['roll no','name','address','session', 'group', 'gender']]
df_numeric = df.loc[:,['age','subj1','subj2','scholarship']]

In [ ]:

df_string.head()

In [ ]:

df_numeric.head()

In [ ]:

df_string.head()

In [ ]:

df_string.applymap(str.upper).head()

In [ ]:

df_numeric.head(5)

In [ ]:

# The applymap() method will apply the lambda function (add 5) to each element of the dataframe
df_numeric.applymap(lambda x : x+5).head(5)

In [ ]:
