In this notebook we will cover:

- The df.isna() method
- Handling missing values in the math column using df.loc[mask, col] = value
- Handling missing values in the group column using df.loc[mask, col] = value
- The fillna() method
- Handling the session column
- The df.dropna() method
! cat datasets/group-marks.csv
# import the pandas library
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')
df.head()
df.shape
Whenever the `pd.read_csv()` method detects a missing value (nothing between two commas in a csv file, or an empty cell in Excel), it flags it with `NaN`. There can be many reasons for these `NaN` values; one can be that the data is gathered via a Google Form and this field was optional and skipped by some respondents.
The `df.isna()` method is recommended over `df.isnull()`. It returns a boolean same-sized object that indicates whether each element is an NA value: missing values get mapped to True, everything else gets mapped to False. Remember, values such as empty strings `''` or `numpy.inf` are not considered NA values.

Similarly, the `df.notna()` method is recommended over `df.notnull()`. It returns a boolean same-sized object in which non-missing values get mapped to True.
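A quick sketch of that gotcha on a toy Series (hypothetical data): only a real NaN counts as missing, while an empty string and numpy.inf do not.

```python
import numpy as np
import pandas as pd

# Toy Series mixing a real NaN with NA look-alikes (hypothetical data)
s = pd.Series([1.0, np.nan, '', np.inf])

print(s.isna().tolist())    # [False, True, False, False] -- only the real NaN is NA
print(s.notna().tolist())   # [True, False, True, True]
```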
df.isna().head()
df.notna().head()
# Now we can use sum() on this dataframe object of Boolean values (True is mapped to 1)
df.isna().sum()
# Similarly, we can use sum() on this dataframe object of Boolean values (True is mapped to 1)
df.notna().sum()
Handling the math Column

List the rows having Null/Missing values under the math column

The `df.isna()` method works equally well on Series objects:

mask = df.math.isna()
mask
df.loc[df.math.isna(), :]
# This will return only those rows of dataframe having null values under the math column
df[mask] # df[df.math.isna()]
df.loc[mask, :] # df.loc[df.math.isna(), :]
Computing the Mean of the math Column

df
# Compute the mean of math column
df.math.mean()
Looking at the error, it appears that the math column does not have the `int64` or `float64` dtype. Let us check this out.
# Check out the data type of math column
df['math'].dtypes
# We can also use the `df.info()` method to display the count of non-null values in each column,
# their datatypes, their names, and the memory usage of the dataframe.
df.info()
df['math']
# We can replace all such values using the `replace()` method
import numpy as np
# Preview the result first (no assignment, so df is unchanged)
df.math.replace('No Idea', np.nan).head()
df
# Note the marks of Saadia in math are changed from the string `No Idea` to `NaN`
# Since this seems to be working fine, let us set inplace=True to make these changes in the original dataframe
df.replace('No Idea', np.nan, inplace=True)
df.head()
# Let us check the data type of math column
df['math'].dtypes
# It is still Object, which is natural; however, we can change the datatype using the `astype()` method
df['math'] = df['math'].astype(float)
# Let us check the data type of math column
df['math'].dtypes
# Let us compute the average of math marks again
df.math.mean()
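As an aside, the replace-then-astype steps above can usually be collapsed into a single call with `pd.to_numeric` and `errors='coerce'`; a minimal sketch on toy data (not the course csv):

```python
import pandas as pd

# Toy marks column: numbers stored as strings plus a stray 'No Idea' entry
marks = pd.Series(['72', 'No Idea', '69'])

# errors='coerce' turns anything unparseable into NaN and yields a float64 Series
clean = pd.to_numeric(marks, errors='coerce')
print(clean.dtype)          # float64
print(clean.isna().sum())   # 1
```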
# List only those records under math column having Null values
mask = df.math.isna()
df.loc[mask, 'math']
df.math.mean()
# Let us replace these values with mean value of the math column
df.loc[(df.math.isna()),'math'] = df.math.mean()
# Confirm the result
df.isna().sum()
#df.info()
df.head()
Handling the group Column

The group column contains categorical values, i.e., values that can take on one of a limited, and usually fixed, number of possible values.

List the rows having Null/Missing values under the group column

df
mask = df.group.isna()
mask
df[mask] # df[df.group.isna()]
df.loc[mask, :] # df.loc[df.group.isna()]
Replacing the Missing Values in the group Column

# Use the value_counts() method, which returns a Series containing counts of unique values
# (in descending order, with the most frequently-occurring element first). It excludes NA values by default.
df.group.value_counts()
# Another way is to use the mode() method on the column
df.group.mode()
# List only those records under group column having Null values
mask = df.group.isna()
df.loc[mask, 'group'] # df.loc[(df.group.isna()), 'group']
# Let us replace these values with maximum occurring value in the `group` column
df.loc[(df.group.isna()),'group'] = 'group C'
# Confirm the result
df.isna().sum()
#df.info()
df.head()
Note that in the original dataframe Arifa's group information was missing; now it is `group C`.
Handling Missing Values using the fillna() Method

So far we have replaced missing values using the loc method. The fillna() method offers a more direct route:

object.fillna(value, method, inplace=True)

- value: the value with which we want to replace the missing values, OR the method to be used to replace them
- inplace=True: make the changes in the original object
# Let us read the dataset again with NA values under math column
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')
df.head()
- Before proceeding, let us this time handle the string value `No Idea` under the math column while reading the csv file, instead of doing it afterwards in the dataframe using the `replace()` method as we have done above.
- For this we will use the `na_values` argument of the `pd.read_csv()` method, to which you can pass a single value or a list of values to be replaced with NaN.
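A minimal sketch of `na_values` on an inline csv (hypothetical two-row data, not the course file):

```python
import io
import pandas as pd

# Tiny inline csv standing in for a file on disk (hypothetical data)
csv = io.StringIO("name,math\nSAADIA,No Idea\nJUMAIMA,69\n")

df_demo = pd.read_csv(csv, na_values='No Idea')
print(df_demo['math'].dtype)          # float64: the column now parses numerically
print(df_demo['math'].isna().sum())   # 1
```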
df = pd.read_csv('datasets/group-marks.csv', na_values='No Idea')
df.head()
df.isna().sum()
df.loc[df.math.isna()]
# This time, instead of loc, use the fillna() method with just two arguments
# The inplace=True parameter ensures that this happens in the original dataframe
df.math.fillna(value=df.math.mean(), inplace=True)
# Confirm the result
df.isna().sum()
#df.info()
df.head()
Handling Missing Values in the group Column using fillna()

# Let us read the dataset again with NA values
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv', na_values='No Idea')
df.head()
df.isna().sum()
# Once again, instead of loc, let us use the fillna() method with just two arguments
df.group.fillna('group C', inplace=True)
# Confirm the result
df.isna().sum()
#df.info()
# Let us fill the math, english and scholarship columns as well again
df.math.fillna(df.math.mean(), inplace=True)
df.english.fillna(df.english.mean(), inplace=True)
df.scholarship.fillna(df.scholarship.mean(), inplace=True)
# Confirm the result
df.isna().sum()
Handling Missing Values in the math and group Columns using the ffill and bfill Arguments

Recall the signature of the fillna() method:

object.fillna(value, method, inplace=True)

Instead of a value, we can pass ffill or bfill as the method argument to the fillna() method. This will replace the null values with other values from the DataFrame:

- ffill (forward fill): fills a NaN value with the previous value in that column
- bfill (back fill): fills a NaN value with the next/upcoming value

(In recent pandas versions the method argument of fillna() is deprecated; df.ffill() and df.bfill() do the same job.)

# Let us read the dataset again with NA values
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv', na_values='No Idea')
df.head(20)
df.isna().sum()
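On a toy Series, the two strategies look like this (a minimal sketch using the `ffill()`/`bfill()` convenience methods):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

print(s.ffill().tolist())   # [1.0, 1.0, 1.0, 4.0] -- carry the previous value forward
print(s.bfill().tolist())   # [1.0, 4.0, 4.0, 4.0] -- pull the next value back
```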
# Forward fill: using method='ffill', a NaN value is filled with the previous value in that column
df.fillna(method = 'ffill', inplace=True)
df.head()
Is it working fine? Not entirely: ffill cannot fill a NaN in the very first row, since there is no previous value to carry forward. Let us apply bfill as well to handle any remaining gaps.
df.fillna(method = 'bfill', inplace=True)
df.head()
# Confirm the result
df.isna().sum()
Handling the session Column

If you look at the session column, you can observe that it is a categorical column containing six different categories (as values):

- MORNING and MOR are the same
- AFTERNOON and AFT are the same
- EVENING and EVE are the same

So the session column has six different categories (as values) but should have only three.

import pandas as pd
df = pd.read_csv('datasets/group-marks.csv' )
df
df.session
# Let us check out the counts of unique values inside the session Column
df.session.value_counts()
df.session.unique()
Cleaning the session Column using map()

Let us convert the values of the session column to MOR, AFT and EVE using the map() method:

Series.map(mapping, na_action=None)

The map() method substitutes each value in a Series with another value, which may be derived from a dict. It returns a new Series after performing the mapping. You can pass 'ignore' as the na_action argument, which will propagate NaN values without passing them to the mapping correspondence.

# To do this, let us create a new mapping (dictionary)
dict1 = {
'MORNING' : 'MOR',
'MOR' : 'MOR',
'AFTERNOON' : 'AFT',
'AFT': 'AFT',
'EVENING' : 'EVE',
'EVE': 'EVE'
}
# map() returns a series with the same index as the caller; the original series remains unchanged
df.session.map(dict1)

# So we assign the resulting series back to the df.session column
df.session = df.session.map(dict1)
# Count of new categories in the column session
# Observe that the values inside the session column have been consolidated into three categories
df.session.value_counts()
# Let us verify the result
df.head()
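One gotcha worth remembering: map() turns any value that is missing from the dict into NaN, which is exactly why dict1 above also maps 'MOR', 'AFT' and 'EVE' to themselves. A minimal sketch:

```python
import pandas as pd

s = pd.Series(['MORNING', 'MOR', 'EVENING'])

# A dict that forgets the identity mapping for 'MOR'
partial = {'MORNING': 'MOR', 'EVENING': 'EVE'}
mapped = s.map(partial)
print(mapped.tolist())   # the unmapped 'MOR' becomes NaN
```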
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv' )
df.head()
# Preview: convert the scholarship column using apply() with a lambda (dividing by a conversion rate of 170)
df.scholarship.apply(lambda x: x/170)
df['Scholarship_in_$'] = df.scholarship.apply(lambda x: x/170)
df.head()
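apply() with a lambda works here, but for simple arithmetic plain vectorised division gives the same result (a sketch using the same hypothetical rate):

```python
import pandas as pd

scholarship = pd.Series([2562, 2800, 3500])

via_apply = scholarship.apply(lambda x: x / 170)   # element-wise Python call
via_vector = scholarship / 170                     # single vectorised operation

print(via_apply.equals(via_vector))   # True
```

Vectorised operations are generally preferred over apply() for arithmetic, since they avoid a Python-level function call per element.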
The df.dropna() Method

df.dropna(axis, how, subset, inplace)
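Before running it on the dataset, a minimal sketch of the axis/how/subset arguments on a toy frame:

```python
import numpy as np
import pandas as pd

df_toy = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                       'b': [np.nan, np.nan, 6.0]})

print(df_toy.dropna().shape)              # (1, 2): only the last row is complete
print(df_toy.dropna(how='all').shape)     # (2, 2): only the all-NA middle row goes
print(df_toy.dropna(subset=['a']).shape)  # (2, 2): judge rows by column 'a' only
```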
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')
df.head()
df.shape
# You can use the dropna() method to drop all the rows, if a row has any NA value
df1 = df.dropna()
df1.shape
df1.head()
# Default Arguments to dropna()
df2 = df.dropna(axis=0, how='any')
df2.shape
# If we set how='all', it means drop a row only if all of its values are NA
df2 = df.dropna(axis=0, how='all')
df2.shape
# Use of subset argument and pass it a list of columns based on whose values you want to drop a row
df2 = df.dropna(axis=0, how='any', subset=['math'])
df2.shape
# Use of subset argument
df2 = df.dropna(axis=0, how='any', subset=['session'])
df2.shape
# With `how='all'` and `subset=[list of column names]`, a row is dropped
# only if all of the listed columns have a NA value in that row
df2 = df.dropna(axis=0, how='all', subset=['math', 'session'])
df2.shape
# If we set axis=1 and how='all', it means drop a column if all the values under it are NA
df2 = df.dropna(axis=1, how='all')
df2.shape
df.info()
# If we set the axis=1 and how=any, it means drop a column if any value under it is na
df2 = df.dropna(axis=1, how='any')
df2.shape
df2.head()
Creating Dummy Variables using pd.get_dummies()

Categorical columns can be converted into numeric form using pd.get_dummies(), which will create a binary column for each of the categories:

pd.get_dummies(data, drop_first=False)

It receives the data argument, which can be a dataframe or a series.

Note: Making a dummy variable will take all the K distinct values in one column and make K columns out of them.
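A minimal sketch on a toy gender series (K = 2 categories, so K dummy columns, or K - 1 with drop_first=True):

```python
import pandas as pd

s = pd.Series(['male', 'female', 'female'], name='gender')

dummies = pd.get_dummies(s)
print(list(dummies.columns))                      # ['female', 'male'] -- one column per category
print(pd.get_dummies(s, drop_first=True).shape)   # (3, 1): K - 1 columns
```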
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')
df.head()
# currently we have 10 columns in the data
df.shape
# Convert all categorical variables into dummy/indicator variables
df = pd.get_dummies(df)
# Let us view the dataframe; keep a note of the number of columns
df.head()
# The number of columns has gone up to 1605 now
df.shape
df
Creating Dummy Variables for the gender Column Only

This technique is also called One Hot Encoding. You will learn more encoding techniques in the data pre-processing module.

import pandas as pd
df1 = pd.read_csv('datasets/group-marks.csv')
df1.head()
rollno | name | gender | group | session | age | scholarship | math | english | urdu | |
---|---|---|---|---|---|---|---|---|---|---|
0 | MS01 | SAADIA | female | group B | MORNING | 28 | 2562 | No Idea | 72.0 | 74 |
1 | MS02 | JUMAIMA | female | group C | AFTERNOON | 33 | 2800 | 69 | 90.0 | 88 |
2 | MS03 | ARIFA | female | NaN | EVENING | 34 | 3500 | NaN | 95.0 | 93 |
3 | MS04 | SAADIA | female | group A | MOR | 44 | 2000 | 47 | 57.0 | 44 |
4 | MS05 | DANISH | male | group C | AFTERNOON | 54 | 2100 | 76 | 78.0 | 55 |
# Convert only gender variable into dummy/indicator variables
df2 = pd.get_dummies(df1[['gender']])
df2.head()
# Since we do not need two separate columns, simply use the `drop_first` argument of get_dummies to handle this
df2 = pd.get_dummies(df1[['gender']], drop_first=True)
df2.head()
# We will talk about join in the next session in detail.
df3 = df1.join(df2['gender_male'])
df3.head()
import pandas as pd
df = pd.read_csv('datasets/group-marks.csv')
df.head()
rollno | name | gender | group | session | age | scholarship | math | english | urdu | |
---|---|---|---|---|---|---|---|---|---|---|
0 | MS01 | SAADIA | female | group B | MORNING | 28 | 2562 | No Idea | 72.0 | 74 |
1 | MS02 | JUMAIMA | female | group C | AFTERNOON | 33 | 2800 | 69 | 90.0 | 88 |
2 | MS03 | ARIFA | female | NaN | EVENING | 34 | 3500 | NaN | 95.0 | 93 |
3 | MS04 | SAADIA | female | group A | MOR | 44 | 2000 | 47 | 57.0 | 44 |
4 | MS05 | DANISH | male | group C | AFTERNOON | 54 | 2100 | 76 | 78.0 | 55 |
df.session.value_counts()
# Unlike map(), replace() leaves values that are not in the dict unchanged,
# so we only need to list the values we want to change
dict1 = {
'MORNING' : 'MOR',
'AFTERNOON' : 'AFT',
}
#df.session = df.session.map(dict1)
df.session = df.session.replace(dict1)
df
rollno | name | gender | group | session | age | scholarship | math | english | urdu | |
---|---|---|---|---|---|---|---|---|---|---|
0 | MS01 | SAADIA | female | group B | MOR | 28 | 2562 | No Idea | 72.0 | 74 |
1 | MS02 | JUMAIMA | female | group C | AFT | 33 | 2800 | 69 | 90.0 | 88 |
2 | MS03 | ARIFA | female | NaN | EVENING | 34 | 3500 | NaN | 95.0 | 93 |
3 | MS04 | SAADIA | female | group A | MOR | 44 | 2000 | 47 | 57.0 | 44 |
4 | MS05 | DANISH | male | group C | AFT | 54 | 2100 | 76 | 78.0 | 55 |
5 | MS06 | SAFIA | female | group B | AFT | 23 | 3800 | NaN | 83.0 | 78 |
6 | MS07 | SARA | female | group B | EVENING | 47 | 3000 | 88 | 95.0 | 92 |
7 | MS08 | ABDULLAH | male | group B | EVE | 33 | 2000 | 40 | 43.0 | 39 |
8 | MS09 | KHAN | male | group D | MOR | 27 | 2500 | 64 | NaN | 67 |
9 | MS10 | HASEENA | female | group B | AFT | 33 | 2800 | 38 | 60.0 | 50 |
10 | MS11 | MUSTJAB | male | group C | MOR | 46 | 3000 | 58 | 54.0 | 52 |
11 | MS12 | ABRAR | male | group D | MOR | 53 | 3312 | 40 | 52.0 | 43 |
12 | MS13 | MAHOOR | female | NaN | MOR | 25 | 2345 | 65 | 81.0 | 73 |
13 | MS14 | USAMA | male | group A | AFT | 26 | 2654 | 78 | 72.0 | 70 |
14 | MS15 | NAVAIRA | female | group A | AFT | 25 | 2137 | 50 | 53.0 | 58 |
15 | MS16 | SAWAIRA | female | group C | EVENING | 29 | 2567 | 69 | 75.0 | 78 |
16 | MS17 | NOFIL | male | group C | MOR | 22 | 3500 | 88 | 89.0 | 86 |
17 | MS18 | SHUMAILA | female | group B | AFT | 31 | 2500 | 18 | NaN | 28 |
18 | MS19 | ABUBAKAR | male | group C | AFT | 33 | 3000 | 46 | 42.0 | 46 |
19 | MS20 | NADRA | female | group C | MOR | 36 | 2200 | 54 | 58.0 | 45 |
20 | MS21 | HASEEB | male | group D | AFT | 34 | 3500 | 66 | 69.0 | 63 |
21 | MS22 | HOORIA | female | group B | EVENING | 32 | 2000 | 65 | 75.0 | 70 |
22 | MS23 | RAUF | male | group D | MOR | 31 | 2500 | 44 | 54.0 | 53 |
23 | MS24 | LAIBA | female | group C | AFT | 37 | 3000 | NaN | 73.0 | 73 |
24 | MS25 | FATIMA | male | group D | AFT | 33 | 2500 | 74 | 71.0 | 80 |
25 | MS26 | IBRAR | male | group A | EVENING | 39 | 3500 | 73 | 74.0 | 72 |
26 | MS27 | USAMA | male | group B | EVE | 32 | 2500 | 69 | 54.0 | 55 |
27 | MS28 | NOOR | female | group C | MOR | 47 | 3000 | 67 | 69.0 | 75 |
28 | MS29 | UMAR | male | group C | AFT | 44 | 4000 | 70 | 70.0 | 65 |
29 | MS30 | ROSHAN | female | group D | MOR | 42 | 3500 | 62 | 70.0 | 75 |
30 | MS31 | UNAIZA | female | group D | EVENING | 41 | 2500 | 69 | 74.0 | 74 |
31 | MS32 | FATIMA | female | group B | EVE | 19 | 3000 | 63 | NaN | 61 |
32 | MS33 | SHAISTA | female | NaN | MOR | 29 | 3500 | 56 | 72.0 | 65 |
33 | MS34 | ALI | male | group D | EVENING | 34 | 3000 | 40 | 42.0 | 38 |
34 | MS35 | ABDULLAH | male | group E | MOR | 45 | 2500 | 97 | 87.0 | 82 |
35 | MS36 | OSAMA | male | group E | AFT | 31 | 3500 | 81 | 81.0 | 79 |
36 | MS37 | AMBREEN | female | group D | EVENING | 28 | 3000 | 74 | 81.0 | 62 |
37 | MS38 | NASEEM | female | group D | MOR | 26 | 2500 | 50 | 64.0 | 59 |
38 | MS39 | SAFIA | female | group D | AFT | 36 | 3500 | 75 | 90.0 | 88 |
39 | MS40 | NABEEL | male | group B | AFT | 33 | 3000 | 57 | 56.0 | 57 |
40 | MS41 | SIKANDER | male | group C | EVENING | 35 | 2500 | 55 | 61.0 | 54 |
41 | MS42 | ZAIN | female | group C | EVE | 27 | 3500 | 58 | 73.0 | 68 |
42 | MS43 | AYESHA | female | group B | MOR | 38 | 3000 | 53 | 58.0 | 65 |
43 | MS44 | HASSAN | male | group B | AFT | 31 | 2500 | 59 | 65.0 | 66 |
44 | MS45 | ZAINAB | female | group E | MOR | 28 | 3500 | NaN | 56.0 | 54 |
45 | MS46 | TALHA | male | group B | EVENING | 41 | 3000 | 65 | 54.0 | 57 |
46 | MS47 | SABA | female | group A | EVE | 36 | 2500 | 55 | 65.0 | 62 |
47 | MS48 | RAFIA | female | group C | MOR | 30 | 3500 | 66 | 71.0 | 76 |
48 | MS49 | FATIMA | female | group D | MOR | 40 | 2500 | 57 | 74.0 | 76 |
49 | MS50 | KAKAMANNA | male | group C | AFT | 37 | 3000 | 66 | 78.0 | 81 |
df1 = pd.get_dummies(df.session)
df1.head()
AFT | AFTERNOON | EVE | EVENING | MOR | MORNING | |
---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 0 | 1 | 0 | 0 | 0 | 0 |
df1 = pd.get_dummies(df.gender, drop_first=True)
df3 = df.join(df1['male'])
df3.head()
rollno | name | gender | group | session | age | scholarship | math | english | urdu | male | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | MS01 | SAADIA | female | group B | MORNING | 28 | 2562 | No Idea | 72.0 | 74 | 0 |
1 | MS02 | JUMAIMA | female | group C | AFTERNOON | 33 | 2800 | 69 | 90.0 | 88 | 0 |
2 | MS03 | ARIFA | female | NaN | EVENING | 34 | 3500 | NaN | 95.0 | 93 | 0 |
3 | MS04 | SAADIA | female | group A | MOR | 44 | 2000 | 47 | 57.0 | 44 | 0 |
4 | MS05 | DANISH | male | group C | AFTERNOON | 54 | 2100 | 76 | 78.0 | 55 | 1 |
df1 = pd.get_dummies(df[['gender']], drop_first=True)
df3 = df.join(df1['gender_male'])
df3.head()
rollno | name | gender | group | session | age | scholarship | math | english | urdu | gender_male | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | MS01 | SAADIA | female | group B | MORNING | 28 | 2562 | No Idea | 72.0 | 74 | 0 |
1 | MS02 | JUMAIMA | female | group C | AFTERNOON | 33 | 2800 | 69 | 90.0 | 88 | 0 |
2 | MS03 | ARIFA | female | NaN | EVENING | 34 | 3500 | NaN | 95.0 | 93 | 0 |
3 | MS04 | SAADIA | female | group A | MOR | 44 | 2000 | 47 | 57.0 | 44 | 0 |
4 | MS05 | DANISH | male | group C | AFTERNOON | 54 | 2100 | 76 | 78.0 | 55 | 1 |