In this chapter you'll learn exactly what it means to preprocess data. You'll take the first steps of any preprocessing journey: exploring data types and dealing with missing data. This is a summary of the lecture "Preprocessing for Machine Learning in Python" on DataCamp.
import pandas as pd
We have a dataset of volunteer information from New York City. The dataset has a number of features, but we want to drop the columns that contain almost no data: `dropna(axis=1, thresh=3)` keeps only the columns that have at least 3 non-missing values.
volunteer = pd.read_csv('./dataset/volunteer_opportunities.csv')
volunteer.head()
| | opportunity_id | content_id | vol_requests | event_time | title | hits | summary | is_priority | category_id | category_desc | ... | end_date_date | status | Latitude | Longitude | Community Board | Community Council | Census Tract | BIN | BBL | NTA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4996 | 37004 | 50 | 0 | Volunteers Needed For Rise Up & Stay Put! Home... | 737 | Building on successful events last summer and ... | NaN | NaN | NaN | ... | July 30 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 5008 | 37036 | 2 | 0 | Web designer | 22 | Build a website for an Afghan business | NaN | 1.0 | Strengthening Communities | ... | February 01 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 5016 | 37143 | 20 | 0 | Urban Adventures - Ice Skating at Lasker Rink | 62 | Please join us and the students from Mott Hall... | NaN | 1.0 | Strengthening Communities | ... | January 29 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 5022 | 37237 | 500 | 0 | Fight global hunger and support women farmers ... | 14 | The Oxfam Action Corps is a group of dedicated... | NaN | 1.0 | Strengthening Communities | ... | March 31 2012 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 5055 | 37425 | 15 | 0 | Stop 'N' Swap | 31 | Stop 'N' Swap reduces NYC's waste by finding n... | NaN | 4.0 | Environment | ... | February 05 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 35 columns
volunteer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 665 entries, 0 to 664
Data columns (total 35 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   opportunity_id      665 non-null    int64
 1   content_id          665 non-null    int64
 2   vol_requests        665 non-null    int64
 3   event_time          665 non-null    int64
 4   title               665 non-null    object
 5   hits                665 non-null    int64
 6   summary             665 non-null    object
 7   is_priority         62 non-null     object
 8   category_id         617 non-null    float64
 9   category_desc       617 non-null    object
 10  amsl                0 non-null      float64
 11  amsl_unit           0 non-null      float64
 12  org_title           665 non-null    object
 13  org_content_id      665 non-null    int64
 14  addresses_count     665 non-null    int64
 15  locality            595 non-null    object
 16  region              665 non-null    object
 17  postalcode          659 non-null    float64
 18  primary_loc         0 non-null      float64
 19  display_url         665 non-null    object
 20  recurrence_type     665 non-null    object
 21  hours               665 non-null    int64
 22  created_date        665 non-null    object
 23  last_modified_date  665 non-null    object
 24  start_date_date     665 non-null    object
 25  end_date_date       665 non-null    object
 26  status              665 non-null    object
 27  Latitude            0 non-null      float64
 28  Longitude           0 non-null      float64
 29  Community Board     0 non-null      float64
 30  Community Council   0 non-null      float64
 31  Census Tract        0 non-null      float64
 32  BIN                 0 non-null      float64
 33  BBL                 0 non-null      float64
 34  NTA                 0 non-null      float64
dtypes: float64(13), int64(8), object(14)
memory usage: 182.0+ KB
volunteer.dropna(axis=1, thresh=3).shape
(665, 24)
volunteer.shape
(665, 35)
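The `thresh` argument is easy to misread: `thresh=3` keeps only the columns with at least 3 *non-missing* values; it does not drop columns that have 3 missing values. A minimal sketch on a toy DataFrame (the column names here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: 'a' has 4 non-null values, 'b' has 2, 'c' has 0
df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1, np.nan, np.nan, 4],
    'c': [np.nan, np.nan, np.nan, np.nan],
})

# Keep columns with at least 3 non-missing values: only 'a' survives
kept = df.dropna(axis=1, thresh=3)
print(list(kept.columns))  # ['a']
```

On the volunteer data, with 665 rows, this only removes the 11 columns that are essentially empty.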
Taking a look at the `volunteer` dataset again, we want to drop rows where the `category_desc` column values are missing. We'll do this with boolean indexing: check which values are null, then filter the dataset so that we keep only the rows where `category_desc` is not null.
# Check how many values are missing in the category_desc column
print(volunteer['category_desc'].isnull().sum())
# Subset the volunteer dataset
volunteer_subset = volunteer[volunteer['category_desc'].notnull()]
# Print out the shape of the subset
print(volunteer_subset.shape)
48
(617, 35)
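The same row filter can also be written with `dropna(subset=...)`, which is equivalent to the boolean mask above. A small sketch on a toy frame (the values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'category_desc': ['Education', np.nan, 'Health'],
                   'hits': [10, 20, 30]})

# Boolean-mask version: keep rows where category_desc is not null
mask_version = df[df['category_desc'].notnull()]

# dropna version: drop rows with NaN in the category_desc column only
dropna_version = df.dropna(subset=['category_desc'])

# Both keep the same two rows
print(mask_version.equals(dropna_version))  # True
```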
Taking another look at the dataset comprised of volunteer information from New York City, we want to know what types we'll be working with as we start to do more preprocessing.
volunteer.dtypes
opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
Census Tract          float64
BIN                   float64
BBL                   float64
NTA                   float64
dtype: object
If you take a look at the `volunteer` dataset types, the `hits` column may be read in as type `object` (as in the original exercise), even though it actually consists of integers. Let's make sure that column is type `int`.
# Print the head of the hits column
print(volunteer['hits'].head())
# Convert the hits column to type int
volunteer['hits'] = volunteer['hits'].astype(int)
# Look at the dtypes of the dataset
print(volunteer.dtypes)
0    737
1     22
2     62
3     14
4     31
Name: hits, dtype: int64
opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
Census Tract          float64
BIN                   float64
BBL                   float64
NTA                   float64
dtype: object
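Note that `astype(int)` raises a `ValueError` if an `object` column contains any non-numeric strings. A more forgiving alternative is `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries into `NaN`; a sketch with made-up values:

```python
import pandas as pd

# Toy object column with one non-numeric entry
hits = pd.Series(['737', '22', 'n/a', '14'], dtype=object)

# astype(int) would raise on 'n/a';
# to_numeric with errors='coerce' maps it to NaN instead
converted = pd.to_numeric(hits, errors='coerce')
print(converted.isnull().sum())  # 1
```

The result is a `float64` column (because of the `NaN`), which you can then fill or drop before casting to `int`.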
In the `volunteer` dataset, we're thinking about trying to predict the `category_desc` variable using the other features in the dataset. First, though, we need to know the class distribution (and imbalance) for that label.
volunteer['category_desc'].value_counts()
Strengthening Communities    307
Helping Neighbors in Need    119
Education                     92
Health                        52
Environment                   32
Emergency Preparedness        15
Name: category_desc, dtype: int64
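To see the imbalance as proportions rather than raw counts, `value_counts(normalize=True)` is handy; a minimal sketch on a toy label column:

```python
import pandas as pd

# Toy labels: class 'A' makes up half of the data
labels = pd.Series(['A', 'A', 'A', 'B', 'B', 'C'])

# normalize=True returns fractions instead of counts
proportions = labels.value_counts(normalize=True)
print(proportions['A'])  # 0.5
```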
We know that the distribution of classes in the `category_desc` column of the volunteer dataset is uneven. If we wanted to train a model to predict `category_desc`, we would want to train the model on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this.
from sklearn.model_selection import train_test_split
# Create a DataFrame with all columns except category_desc
volunteer_X = volunteer.dropna(subset=['category_desc'], axis=0).drop('category_desc', axis=1)
# Create a category_desc labels dataset
volunteer_y = volunteer.dropna(subset=['category_desc'], axis=0)[['category_desc']]
# Use stratified sampling to split up the dataset according to the volunteer_y dataset
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y)
# Print out the category_desc counts on the training y labels
print(y_train['category_desc'].value_counts())
Strengthening Communities    230
Helping Neighbors in Need     89
Education                     69
Health                        39
Environment                   24
Emergency Preparedness        11
Name: category_desc, dtype: int64
Warning: stratified sampling with `train_test_split` cannot handle `NaN` values in the labels, so you need to drop `NaN` values before sampling.