This chapter introduces the reality of messy and incomplete data. You will learn how to find where your data has missing values and explore several approaches for dealing with them. You will also use string manipulation techniques to remove unwanted characters from your dataset. This is a summary of the lecture "Feature Engineering for Machine Learning in Python" on DataCamp.
import pandas as pd
import numpy as np
Most data sets contain missing values, often represented as NaN (Not a Number). If you are working with Pandas you can easily check how many missing values exist in each column.
Let's find out how many of the developers taking the survey chose to enter their age (found in the Age column of so_survey_df) and their gender (Gender column of so_survey_df).
so_survey_df = pd.read_csv('./dataset/Combined_DS_v10.csv')
so_survey_df.head()
| | SurveyDate | FormalEducation | ConvertedSalary | Hobby | Country | StackOverflowJobsRecommend | VersionControl | Age | Years Experience | Gender | RawSalary |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2/28/18 20:20 | Bachelor's degree (BA. BS. B.Eng.. etc.) | NaN | Yes | South Africa | NaN | Git | 21 | 13 | Male | NaN |
1 | 6/28/18 13:26 | Bachelor's degree (BA. BS. B.Eng.. etc.) | 70841.0 | Yes | Sweeden | 7.0 | Git;Subversion | 38 | 9 | Male | 70,841.00 |
2 | 6/6/18 3:37 | Bachelor's degree (BA. BS. B.Eng.. etc.) | NaN | No | Sweeden | 8.0 | Git | 45 | 11 | NaN | NaN |
3 | 5/9/18 1:06 | Some college/university study without earning ... | 21426.0 | Yes | Sweeden | NaN | Zip file back-ups | 46 | 12 | Male | 21,426.00 |
4 | 4/12/18 22:41 | Bachelor's degree (BA. BS. B.Eng.. etc.) | 41671.0 | Yes | UK | 8.0 | Git | 39 | 7 | Male | £41,671.00 |
# Subset the DataFrame
sub_df = so_survey_df[['Age', 'Gender']]
# Print the number of non-missing values
print(sub_df.notnull().sum())
Age       999
Gender    693
dtype: int64
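The complementary view, how many values are missing in each column, can be sketched on a small made-up frame. The `toy_df` below is hypothetical illustration data, not the survey file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the survey file)
toy_df = pd.DataFrame({
    'Age': [21, 38, 45, 46],
    'Gender': ['Male', np.nan, 'Female', np.nan],
})

# Number of missing values per column
missing_counts = toy_df.isnull().sum()

# Share of missing values per column (the mean of a boolean mask)
missing_share = toy_df.isnull().mean()

print(missing_counts)
print(missing_share)
```

Looking at the share rather than the raw count makes columns of different lengths easier to compare.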
While having a summary of how much of your data is missing can be useful, often you will need to find the exact locations of these missing values. Using the same subset of the StackOverflow data from the last exercise (sub_df), you will show how a value can be flagged as missing.
# Print the top 10 entries of the DataFrame
sub_df.head(10)
| | Age | Gender |
---|---|---|
0 | 21 | Male |
1 | 38 | Male |
2 | 45 | NaN |
3 | 46 | Male |
4 | 39 | Male |
5 | 39 | Male |
6 | 34 | Male |
7 | 24 | Female |
8 | 23 | Male |
9 | 36 | NaN |
# Print the locations of the missing values
sub_df.head(10).isnull()
| | Age | Gender |
---|---|---|
0 | False | False |
1 | False | False |
2 | False | True |
3 | False | False |
4 | False | False |
5 | False | False |
6 | False | False |
7 | False | False |
8 | False | False |
9 | False | True |
# Print the locations of the non-missing values
sub_df.head(10).notnull()
| | Age | Gender |
---|---|---|
0 | True | True |
1 | True | True |
2 | True | False |
3 | True | True |
4 | True | True |
5 | True | True |
6 | True | True |
7 | True | True |
8 | True | True |
9 | True | False |
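The boolean masks shown above can also be used to pull out the affected rows directly. A minimal sketch, assuming a hypothetical `toy_df` rather than the survey data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the survey file)
toy_df = pd.DataFrame({
    'Age': [21, 38, 45, 36],
    'Gender': ['Male', 'Male', np.nan, np.nan],
})

# Boolean mask: True where Gender is missing
gender_missing = toy_df['Gender'].isnull()

# Keep only the rows flagged by the mask
rows_missing_gender = toy_df[gender_missing]

# Rows with at least one missing value in any column
rows_any_missing = toy_df[toy_df.isnull().any(axis=1)]

print(rows_missing_gender.index.tolist())
```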
The simplest way to deal with missing values in your dataset when they are occurring entirely at random is to remove those rows, also called 'listwise deletion'.
Depending on the use case, you will sometimes want to remove every row that contains a missing value, while at other times you may want to remove only a particular column if too many of its values are missing.
# Print the number of rows and columns
print(so_survey_df.shape)
(999, 11)
# Create a new DataFrame dropping all incomplete rows
no_missing_values_rows = so_survey_df.dropna()
# Print the shape of the new DataFrame
print(no_missing_values_rows.shape)
(264, 11)
# Create a new DataFrame dropping all columns that contain missing values
no_missing_values_cols = so_survey_df.dropna(axis=1)
# Print the shape of the new DataFrame
print(no_missing_values_cols.shape)
(999, 7)
# Drop all rows where Gender is missing
no_gender = so_survey_df.dropna(subset=['Gender'])
# Print the shape of the new DataFrame
print(no_gender.shape)
(693, 11)
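For the case where a column should go only when too many of its values are missing, one option is the `thresh` argument of `dropna`, which sets the minimum number of non-missing values a column needs in order to be kept. The sketch below uses hypothetical column names and an arbitrary threshold of 3.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the survey file)
toy_df = pd.DataFrame({
    'mostly_full': [1.0, 2.0, np.nan, 4.0],
    'mostly_empty': [np.nan, np.nan, np.nan, 1.0],
    'full': ['a', 'b', 'c', 'd'],
})

# Keep only columns with at least this many non-missing values
# (the value 3 is an arbitrary choice for illustration)
min_non_missing = 3
trimmed = toy_df.dropna(axis=1, thresh=min_non_missing)

print(trimmed.columns.tolist())
```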
While removing missing data entirely may be a correct approach in many situations, it can result in a lot of information being omitted from your models.
You may find categorical columns where the missing value is a valid piece of information in itself, such as someone refusing to answer a question in a survey. In these cases, you can fill all missing values with a new category entirely, for example 'No response given'.
# Print the count of occurrence
print(so_survey_df['Gender'].value_counts())
Male                                                                         632
Female                                                                        53
Transgender                                                                    2
Female;Male                                                                    2
Non-binary. genderqueer. or gender non-conforming                              1
Male;Non-binary. genderqueer. or gender non-conforming                         1
Female;Male;Transgender;Non-binary. genderqueer. or gender non-conforming      1
Female;Transgender                                                             1
Name: Gender, dtype: int64
# Replace missing values (assign the result back instead of using inplace=True on a column selection)
so_survey_df['Gender'] = so_survey_df['Gender'].fillna('Not Given')
# Print the count of each value
print(so_survey_df['Gender'].value_counts())
Male                                                                         632
Not Given                                                                    306
Female                                                                        53
Transgender                                                                    2
Female;Male                                                                    2
Non-binary. genderqueer. or gender non-conforming                              1
Male;Non-binary. genderqueer. or gender non-conforming                         1
Female;Male;Transgender;Non-binary. genderqueer. or gender non-conforming      1
Female;Transgender                                                             1
Name: Gender, dtype: int64
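By default `value_counts()` leaves missing entries out of the tally, which is why they only show up above after being filled. A small sketch on a hypothetical `toy_gender` series shows how `dropna=False` makes them visible beforehand:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the survey file)
toy_gender = pd.Series(['Male', np.nan, 'Female', np.nan, 'Male'], name='Gender')

# dropna=False includes missing entries in the tally
before = toy_gender.value_counts(dropna=False)

# Fill the gaps with an explicit category and count again
filled = toy_gender.fillna('Not Given')
after = filled.value_counts()

print(before)
print(after)
```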
In the last lesson, you dealt with different methods of removing missing values and filling them in with a fixed string. These approaches are valid in many cases, particularly when dealing with categorical columns, but have limited use when working with continuous values. In these cases, it may be most valid to fill the missing values in the column with a value calculated from the entries present in the column.
# Print the first five rows of StackOverflowJobsRecommend column
so_survey_df['StackOverflowJobsRecommend'].head()
0    NaN
1    7.0
2    8.0
3    NaN
4    8.0
Name: StackOverflowJobsRecommend, dtype: float64
# Fill missing values with the mean (assign the result back instead of using inplace=True on a column selection)
so_survey_df['StackOverflowJobsRecommend'] = so_survey_df['StackOverflowJobsRecommend'].fillna(
    so_survey_df['StackOverflowJobsRecommend'].mean())
# Round the StackOverflowJobsRecommend values
so_survey_df['StackOverflowJobsRecommend'] = so_survey_df['StackOverflowJobsRecommend'].round()
# Print the first five rows of StackOverflowJobsRecommend column
so_survey_df['StackOverflowJobsRecommend'].head()
0    7.0
1    7.0
2    8.0
3    7.0
4    8.0
Name: StackOverflowJobsRecommend, dtype: float64
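The mean is only one possible "value calculated from the entries present". As a rough sketch on hypothetical scores, the median can be swapped in the same way, and it is less affected by an extreme entry such as the 0.0 below:

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores (not the survey file)
toy_scores = pd.Series([7.0, np.nan, 8.0, np.nan, 0.0])

# Mean of the observed values is (7 + 8 + 0) / 3 = 5.0
mean_filled = toy_scores.fillna(toy_scores.mean()).round()

# Median of the observed values is 7.0
median_filled = toy_scores.fillna(toy_scores.median())

print(mean_filled.tolist())
print(median_filled.tolist())
```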
In this exercise, you will work with the RawSalary column of so_survey_df, which contains the wages of the respondents along with currency symbols and commas, such as $42,000. When importing data from Microsoft Excel, more often than not you will come across data in this form.
# Remove the commas in the column
so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace(',', '')
# Remove the dollar signs in the column (regex=False treats '$' as a literal character, not a regex anchor)
so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace('$', '', regex=False)
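In a regular expression `$` is an end-of-string anchor, so it helps to state explicitly that the pattern is a literal character. A minimal sketch on a hypothetical `toy_salary` series:

```python
import pandas as pd

# Hypothetical toy data (not the survey file)
toy_salary = pd.Series(['$42,000', '70,841.00', None])

# regex=False makes both patterns plain substrings
cleaned = (
    toy_salary
    .str.replace(',', '', regex=False)
    .str.replace('$', '', regex=False)
)

print(cleaned.tolist())
```

Missing entries pass through the `.str` methods untouched, so they remain missing after the cleanup.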
In the last exercise, you could tell quickly from the df.head() call which characters were causing an issue. In many cases this will not be so apparent. There will often be values deep within a column that prevent you from casting the column as a numeric type so that it can be used in a model or in further feature engineering.
One approach to finding these values is to force the column to the desired data type using pd.to_numeric(), coercing any values causing issues to NaN, and then filtering the DataFrame to just the rows containing the NaN values.
Try to cast the RawSalary column as a float and it will fail, as an additional character can now be found in it. Find the character and remove it so the column can be cast as a float.
# Attempt to convert the column to numeric values
numeric_vals = pd.to_numeric(so_survey_df['RawSalary'], errors='coerce')
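# Find the rows that failed to convert; the mask comes from the coerced values, not the raw column
idx = numeric_vals.isna()
# Print the relevant rows
print(so_survey_df['RawSalary'][idx])
The listing below comes from a run that masked on so_survey_df['RawSalary'].isna(), so it contains only the 334 entries that were already missing. Masking on numeric_vals also returns the entries that still hold a stray character, such as the £ visible in row 4 of the first table.
0      NaN
2      NaN
6      NaN
8      NaN
11     NaN
      ...
989    NaN
990    NaN
992    NaN
994    NaN
997    NaN
Name: RawSalary, Length: 334, dtype: object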
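To separate values that were already missing from values that failed to convert, the coerced result can be compared with the raw column. This is a sketch under the assumption of a hypothetical `toy_salary` series, not the survey column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the survey file)
toy_salary = pd.Series(['70841.00', np.nan, '£41671.00', '21426.00'])

# Values that cannot be parsed become NaN instead of raising an error
coerced = pd.to_numeric(toy_salary, errors='coerce')

# A failed conversion is NaN after coercion but was not NaN before
failed = coerced.isna() & toy_salary.notna()

print(toy_salary[failed])
```

Printing the raw values for the failed rows shows exactly which characters still need to be removed.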
# Replace the offending characters
so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace('£', '')
# Convert the column to float
so_survey_df['RawSalary'] = so_survey_df['RawSalary'].astype(float)
# Print the column
so_survey_df['RawSalary']
0            NaN
1        70841.0
2            NaN
3        21426.0
4        41671.0
         ...
994          NaN
995      58746.0
996      55000.0
997          NaN
998    1000000.0
Name: RawSalary, Length: 999, dtype: float64
When applying multiple operations on the same column (like in the previous exercises), you made the changes in several steps, assigning the results back in each step. However, when applying multiple successive operations on the same column, you can "chain" these operations together for clarity and ease of management. This can be achieved by calling multiple methods sequentially:
# Method chaining
df['column'] = df['column'].method1().method2().method3()
# Same as
df['column'] = df['column'].method1()
df['column'] = df['column'].method2()
df['column'] = df['column'].method3()
In this exercise you will repeat the steps you performed in the last two exercises, but do so using method chaining.
so_survey_df = pd.read_csv('./dataset/Combined_DS_v10.csv')
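# Use method chaining (parentheses avoid backslash continuations; regex=False keeps '$' literal)
so_survey_df['RawSalary'] = (
    so_survey_df['RawSalary']
    .str.replace(',', '', regex=False)
    .str.replace('$', '', regex=False)
    .str.replace('£', '', regex=False)
    .astype(float)
)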
# Print the RawSalary column
print(so_survey_df['RawSalary'])
0            NaN
1        70841.0
2            NaN
3        21426.0
4        41671.0
         ...
994          NaN
995      58746.0
996      55000.0
997          NaN
998    1000000.0
Name: RawSalary, Length: 999, dtype: float64
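As a possible variant of the chain above, the three replacements could be folded into a single regular-expression character class. This is a sketch on a hypothetical `toy_salary` series and is not the approach used in the exercise:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the survey file)
toy_salary = pd.Series(['$42,000', '£41,671.00', np.nan])

# [,$£] matches any one of the three unwanted characters;
# inside a character class '$' is a literal, not an anchor
clean = (
    toy_salary
    .str.replace(r'[,$£]', '', regex=True)
    .astype(float)
)

print(clean.tolist())
```

A single pattern keeps the chain short, at the cost of being slightly less explicit about which characters are removed at each step.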