In this chapter you'll learn exactly what it means to preprocess data. You'll take the first steps of any preprocessing journey: exploring data types and dealing with missing data. This is a summary of the lecture "Preprocessing for Machine Learning in Python" on DataCamp.
import pandas as pd
We have a dataset of volunteer information from New York City. The dataset has a number of features, but we want to drop the columns that contain almost no data: `dropna(axis=1, thresh=3)` keeps only the columns that have at least 3 non-missing values.
volunteer = pd.read_csv('./dataset/volunteer_opportunities.csv')
volunteer.head()
| | opportunity_id | content_id | vol_requests | event_time | title | hits | summary | is_priority | category_id | category_desc | ... | end_date_date | status | Latitude | Longitude | Community Board | Community Council | Census Tract | BIN | BBL | NTA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4996 | 37004 | 50 | 0 | Volunteers Needed For Rise Up & Stay Put! Home... | 737 | Building on successful events last summer and ... | NaN | NaN | NaN | ... | July 30 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 5008 | 37036 | 2 | 0 | Web designer | 22 | Build a website for an Afghan business | NaN | 1.0 | Strengthening Communities | ... | February 01 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 5016 | 37143 | 20 | 0 | Urban Adventures - Ice Skating at Lasker Rink | 62 | Please join us and the students from Mott Hall... | NaN | 1.0 | Strengthening Communities | ... | January 29 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 5022 | 37237 | 500 | 0 | Fight global hunger and support women farmers ... | 14 | The Oxfam Action Corps is a group of dedicated... | NaN | 1.0 | Strengthening Communities | ... | March 31 2012 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 5055 | 37425 | 15 | 0 | Stop 'N' Swap | 31 | Stop 'N' Swap reduces NYC's waste by finding n... | NaN | 4.0 | Environment | ... | February 05 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 35 columns
volunteer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 665 entries, 0 to 664
Data columns (total 35 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   opportunity_id      665 non-null    int64
 1   content_id          665 non-null    int64
 2   vol_requests        665 non-null    int64
 3   event_time          665 non-null    int64
 4   title               665 non-null    object
 5   hits                665 non-null    int64
 6   summary             665 non-null    object
 7   is_priority         62 non-null     object
 8   category_id         617 non-null    float64
 9   category_desc       617 non-null    object
 10  amsl                0 non-null      float64
 11  amsl_unit           0 non-null      float64
 12  org_title           665 non-null    object
 13  org_content_id      665 non-null    int64
 14  addresses_count     665 non-null    int64
 15  locality            595 non-null    object
 16  region              665 non-null    object
 17  postalcode          659 non-null    float64
 18  primary_loc         0 non-null      float64
 19  display_url         665 non-null    object
 20  recurrence_type     665 non-null    object
 21  hours               665 non-null    int64
 22  created_date        665 non-null    object
 23  last_modified_date  665 non-null    object
 24  start_date_date     665 non-null    object
 25  end_date_date       665 non-null    object
 26  status              665 non-null    object
 27  Latitude            0 non-null      float64
 28  Longitude           0 non-null      float64
 29  Community Board     0 non-null      float64
 30  Community Council   0 non-null      float64
 31  Census Tract        0 non-null      float64
 32  BIN                 0 non-null      float64
 33  BBL                 0 non-null      float64
 34  NTA                 0 non-null      float64
dtypes: float64(13), int64(8), object(14)
memory usage: 182.0+ KB
volunteer.dropna(axis=1, thresh=3).shape
(665, 24)
volunteer.shape
(665, 35)
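The `thresh` argument is easy to misread: `thresh=3` keeps only the columns with at least 3 *non-missing* values; it does not drop columns that have 3 missing values. A minimal sketch on a toy DataFrame (the column names here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: 'a' has 4 non-null values, 'b' has 2, 'c' has 0
df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1, np.nan, np.nan, 4],
    'c': [np.nan, np.nan, np.nan, np.nan],
})

# Keep columns with at least 3 non-missing values: only 'a' survives
kept = df.dropna(axis=1, thresh=3)
print(list(kept.columns))  # ['a']
```

On the volunteer data, with 665 rows, this only removes the 11 columns that are essentially empty.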
Taking a look at the `volunteer` dataset again, we want to drop rows where the `category_desc` column values are missing. We'll do this with boolean indexing: check which values are null, then filter the dataset so that we keep only the rows where `category_desc` is not null.
# Check how many values are missing in the category_desc column
print(volunteer['category_desc'].isnull().sum())
# Subset the volunteer dataset
volunteer_subset = volunteer[volunteer['category_desc'].notnull()]
# Print out the shape of the subset
print(volunteer_subset.shape)
48
(617, 35)
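The same row filter can also be written with `dropna(subset=...)`, which is equivalent to the boolean mask above. A small sketch on a toy frame (the values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'category_desc': ['Education', np.nan, 'Health'],
                   'hits': [10, 20, 30]})

# Boolean-mask version: keep rows where category_desc is not null
mask_version = df[df['category_desc'].notnull()]

# dropna version: drop rows with NaN in the category_desc column only
dropna_version = df.dropna(subset=['category_desc'])

# Both keep the same two rows
print(mask_version.equals(dropna_version))  # True
```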
Taking another look at the dataset comprised of volunteer information from New York City, we want to know what types we'll be working with as we start to do more preprocessing.
volunteer.dtypes
opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
Census Tract          float64
BIN                   float64
BBL                   float64
NTA                   float64
dtype: object
If you take a look at the `volunteer` dataset types, the `hits` column may be read in as type `object` (as in the original exercise), even though it actually consists of integers. Let's make sure that column is type `int`.
# Print the head of the hits column
print(volunteer['hits'].head())
# Convert the hits column to type int
volunteer['hits'] = volunteer['hits'].astype(int)
# Look at the dtypes of the dataset
print(volunteer.dtypes)
0    737
1     22
2     62
3     14
4     31
Name: hits, dtype: int64
opportunity_id          int64
content_id              int64
vol_requests            int64
event_time              int64
title                  object
hits                    int64
summary                object
is_priority            object
category_id           float64
category_desc          object
amsl                  float64
amsl_unit             float64
org_title              object
org_content_id          int64
addresses_count         int64
locality               object
region                 object
postalcode            float64
primary_loc           float64
display_url            object
recurrence_type        object
hours                   int64
created_date           object
last_modified_date     object
start_date_date        object
end_date_date          object
status                 object
Latitude              float64
Longitude             float64
Community Board       float64
Community Council     float64
Census Tract          float64
BIN                   float64
BBL                   float64
NTA                   float64
dtype: object
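Note that `astype(int)` raises a `ValueError` if an `object` column contains any non-numeric strings. A more forgiving alternative is `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries into `NaN`; a sketch with made-up values:

```python
import pandas as pd

# Toy object column with one non-numeric entry
hits = pd.Series(['737', '22', 'n/a', '14'], dtype=object)

# astype(int) would raise on 'n/a';
# to_numeric with errors='coerce' maps it to NaN instead
converted = pd.to_numeric(hits, errors='coerce')
print(converted.isnull().sum())  # 1
```

The result is a `float64` column (because of the `NaN`), which you can then fill or drop before casting to `int`.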
In the `volunteer` dataset, we're thinking about trying to predict the `category_desc` variable using the other features in the dataset. First, though, we need to know the class distribution (and imbalance) for that label.
volunteer['category_desc'].value_counts()
Strengthening Communities    307
Helping Neighbors in Need    119
Education                     92
Health                        52
Environment                   32
Emergency Preparedness        15
Name: category_desc, dtype: int64
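To see the imbalance as proportions rather than raw counts, `value_counts(normalize=True)` is handy; a minimal sketch on a toy label column:

```python
import pandas as pd

# Toy labels: class 'A' makes up half of the data
labels = pd.Series(['A', 'A', 'A', 'B', 'B', 'C'])

# normalize=True returns fractions instead of counts
proportions = labels.value_counts(normalize=True)
print(proportions['A'])  # 0.5
```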
We know that the distribution of classes in the `category_desc` column of the volunteer dataset is uneven. If we wanted to train a model to predict `category_desc`, we would want to train the model on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this.
from sklearn.model_selection import train_test_split
# Create a DataFrame with all columns except category_desc
volunteer_X = volunteer.dropna(subset=['category_desc'], axis=0).drop('category_desc', axis=1)
# Create a category_desc labels dataset
volunteer_y = volunteer.dropna(subset=['category_desc'], axis=0)[['category_desc']]
# Use stratified sampling to split up the dataset according to the volunteer_y dataset
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y)
# Print out the category_desc counts on the training y labels
print(y_train['category_desc'].value_counts())
Strengthening Communities    230
Helping Neighbors in Need     89
Education                     69
Health                        39
Environment                   24
Emergency Preparedness        11
Name: category_desc, dtype: int64
Warning: stratified sampling with `train_test_split` cannot handle `NaN` values in the labels, so you need to drop `NaN` values before sampling.