In this section you'll learn about feature engineering. You'll explore different ways to create new, more useful features from the ones already in your dataset. You'll see how to encode, aggregate, and extract information from both numerical and textual features. This is a summary of the lecture "Preprocessing for Machine Learning in Python" on DataCamp.
import pandas as pd
import numpy as np
Take an exploratory look at the `volunteer` dataset, using the variable of that name. Which of the following columns would you want to perform a feature engineering task on?
volunteer = pd.read_csv('./dataset/volunteer_opportunities.csv')
volunteer.head()
| | opportunity_id | content_id | vol_requests | event_time | title | hits | summary | is_priority | category_id | category_desc | ... | end_date_date | status | Latitude | Longitude | Community Board | Community Council | Census Tract | BIN | BBL | NTA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4996 | 37004 | 50 | 0 | Volunteers Needed For Rise Up & Stay Put! Home... | 737 | Building on successful events last summer and ... | NaN | NaN | NaN | ... | July 30 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 5008 | 37036 | 2 | 0 | Web designer | 22 | Build a website for an Afghan business | NaN | 1.0 | Strengthening Communities | ... | February 01 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 5016 | 37143 | 20 | 0 | Urban Adventures - Ice Skating at Lasker Rink | 62 | Please join us and the students from Mott Hall... | NaN | 1.0 | Strengthening Communities | ... | January 29 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 5022 | 37237 | 500 | 0 | Fight global hunger and support women farmers ... | 14 | The Oxfam Action Corps is a group of dedicated... | NaN | 1.0 | Strengthening Communities | ... | March 31 2012 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 5055 | 37425 | 15 | 0 | Stop 'N' Swap | 31 | Stop 'N' Swap reduces NYC's waste by finding n... | NaN | 4.0 | Environment | ... | February 05 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 35 columns
Take a look at the `hiking` dataset. Several columns here need encoding; one of them is the `Accessible` column, which must be encoded before it can be modeled. `Accessible` is a binary feature with two values, `Y` or `N`, so it needs to be encoded into 1s and 0s. Use scikit-learn's `LabelEncoder` to do that transformation.
hiking = pd.read_json('./dataset/hiking.json')
hiking.head()
| | Prop_ID | Name | Location | Park_Name | Length | Difficulty | Other_Details | Accessible | Limited_Access | lat | lon |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B057 | Salt Marsh Nature Trail | Enter behind the Salt Marsh Nature Center, loc... | Marine Park | 0.8 miles | None | The first half of this mile-long trail foll... | Y | N | NaN | NaN |
| 1 | B073 | Lullwater | Enter Park at Lincoln Road and Ocean Avenue en... | Prospect Park | 1.0 mile | Easy | Explore the Lullwater to see how nature thrive... | N | N | NaN | NaN |
| 2 | B073 | Midwood | Enter Park at Lincoln Road and Ocean Avenue en... | Prospect Park | 0.75 miles | Easy | Step back in time with a walk through Brooklyn... | N | N | NaN | NaN |
| 3 | B073 | Peninsula | Enter Park at Lincoln Road and Ocean Avenue en... | Prospect Park | 0.5 miles | Easy | Discover how the Peninsula has changed over th... | N | N | NaN | NaN |
| 4 | B073 | Waterfall | Enter Park at Lincoln Road and Ocean Avenue en... | Prospect Park | 0.5 miles | Easy | Trace the source of the Lake on the Waterfall ... | N | N | NaN | NaN |
from sklearn.preprocessing import LabelEncoder
# Set up the LabelEncoder object
enc = LabelEncoder()
# Apply the encoding to the "Accessible" column
hiking['Accessible_enc'] = enc.fit_transform(hiking['Accessible'])
# Compare the two columns
hiking[['Accessible', 'Accessible_enc']].head()
| | Accessible | Accessible_enc |
|---|---|---|
| 0 | Y | 1 |
| 1 | N | 0 |
| 2 | N | 0 |
| 3 | N | 0 |
| 4 | N | 0 |
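`LabelEncoder` assigns integers in sorted order of the unique values, which is why `N` maps to 0 and `Y` maps to 1 here. A quick sketch to double-check the mapping, reusing the `enc` object fitted above:
# classes_ lists the labels in encoded order: index 0 is 'N', index 1 is 'Y'
print(enc.classes_)
# inverse_transform recovers the original labels from encoded values
print(enc.inverse_transform([0, 1]))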
One of the columns in the `volunteer` dataset, `category_desc`, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, we need to use one-hot encoding to transform this column numerically. Use pandas' `get_dummies()` function to do so.
# Transform the category_desc column
category_enc = pd.get_dummies(volunteer['category_desc'])
# Take a look at the encoded columns
print(category_enc.head())
   Education  Emergency Preparedness  Environment  Health  \
0          0                        0            0       0
1          0                        0            0       0
2          0                        0            0       0
3          0                        0            0       0
4          0                        0            1       0

   Helping Neighbors in Need  Strengthening Communities
0                          0                          0
1                          0                          1
2                          0                          1
3                          0                          1
4                          0                          0
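`get_dummies()` returns a new DataFrame with one indicator column per category, so for modeling you'd typically join it back onto the original data. A minimal sketch (the `prefix` value here is an illustrative choice, not part of the exercise):
# Prefix the dummy columns and concatenate them onto the original DataFrame
category_enc = pd.get_dummies(volunteer['category_desc'], prefix='category')
volunteer_enc = pd.concat([volunteer, category_enc], axis=1)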
A good use case for creating a new feature from an aggregate statistic is taking the mean of several columns. Here, you have a DataFrame of running times named `running_times_5k`. For each `name` in the dataset, take the mean of their 5 run times.
running_times_5k = pd.read_csv('./dataset/running_times_5k.csv')
running_times_5k
| | name | run1 | run2 | run3 | run4 | run5 |
|---|---|---|---|---|---|---|
| 0 | Sue | 20.1 | 18.5 | 19.6 | 20.3 | 18.3 |
| 1 | Mark | 16.5 | 17.1 | 16.9 | 17.6 | 17.3 |
| 2 | Sean | 23.5 | 25.1 | 25.2 | 24.6 | 23.9 |
| 3 | Erin | 21.7 | 21.1 | 20.9 | 22.1 | 22.2 |
| 4 | Jenny | 25.8 | 27.1 | 26.1 | 26.7 | 26.9 |
| 5 | Russell | 30.9 | 29.6 | 31.4 | 30.4 | 29.9 |
# Create a list of the columns to average
run_columns = ['run1', 'run2', 'run3', 'run4', 'run5']
# Use apply to create a mean column
running_times_5k['mean'] = running_times_5k.apply(lambda row: row[run_columns].mean(), axis=1)
# Take a look at the results
print(running_times_5k)
      name  run1  run2  run3  run4  run5   mean
0      Sue  20.1  18.5  19.6  20.3  18.3  19.36
1     Mark  16.5  17.1  16.9  17.6  17.3  17.08
2     Sean  23.5  25.1  25.2  24.6  23.9  24.46
3     Erin  21.7  21.1  20.9  22.1  22.2  21.60
4    Jenny  25.8  27.1  26.1  26.7  26.9  26.52
5  Russell  30.9  29.6  31.4  30.4  29.9  30.44
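As an aside, the same column can be computed without `apply()` by taking the row-wise mean directly, which is the more idiomatic (and faster) pandas form:
# Equivalent vectorized version: mean across the run columns, row by row
running_times_5k['mean'] = running_times_5k[run_columns].mean(axis=1)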
Several columns in the `volunteer` dataset contain datetimes. Let's take a look at the `start_date_date` column and extract just the month to use as a feature for modeling.
# First, convert string column to date column
volunteer['start_date_converted'] = pd.to_datetime(volunteer['start_date_date'])
# Extract just the month from the converted column
volunteer['start_date_month'] = volunteer['start_date_converted'].apply(lambda row: row.month)
# Take a look at the converted and new month columns
volunteer[['start_date_converted', 'start_date_month']].head()
| | start_date_converted | start_date_month |
|---|---|---|
| 0 | 2011-07-30 | 7 |
| 1 | 2011-02-01 | 2 |
| 2 | 2011-01-29 | 1 |
| 3 | 2011-02-14 | 2 |
| 4 | 2011-02-05 | 2 |
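pandas also exposes datetime components through the `.dt` accessor, which avoids the per-row `apply()`:
# Equivalent vectorized extraction of the month via the .dt accessor
volunteer['start_date_month'] = volunteer['start_date_converted'].dt.month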
The `Length` column in the `hiking` dataset is a column of strings, but contained within those strings is the mileage for the hike. We're going to extract that mileage using regular expressions, and then use a lambda in pandas to apply the extraction to the DataFrame.
import re
# Function to extract numbers with decimals, e.g. "0.8" from "0.8 miles"
def return_mileage(length):
    # Guard against missing values
    if length is None:
        return None
    # Pattern matching a number with a decimal point
    pattern = re.compile(r'\d+\.\d+')
    # Search the start of the text for a match
    mile = pattern.match(length)
    # If a value is found, use group(0) to return it as a float
    if mile is not None:
        return float(mile.group(0))
# Apply the function to the Length column and take a look at both columns
hiking['Length_num'] = hiking['Length'].apply(return_mileage)
hiking[['Length', 'Length_num']].head()
| | Length | Length_num |
|---|---|---|
| 0 | 0.8 miles | 0.80 |
| 1 | 1.0 mile | 1.00 |
| 2 | 0.75 miles | 0.75 |
| 3 | 0.5 miles | 0.50 |
| 4 | 0.5 miles | 0.50 |
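For a simple pattern like this one, pandas can also do the extraction in a single vectorized call with `.str.extract()`, which returns NaN wherever the pattern doesn't match:
# Equivalent extraction using pandas' vectorized string methods
hiking['Length_num'] = hiking['Length'].str.extract(r'(\d+\.\d+)', expand=False).astype(float)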
Let's transform the `volunteer` dataset's `title` column into a text vector, to use in a prediction task in the next exercise.
from sklearn.feature_extraction.text import TfidfVectorizer
# Re-read the data and drop rows with a missing category_desc,
# since the stratified split later can't handle NaN labels
volunteer = pd.read_csv('./dataset/volunteer_opportunities.csv')
volunteer = volunteer.dropna(subset=['category_desc'], axis=0)
# Take the title text
title_text = volunteer['title']
# Create the vectorizer method
tfidf_vec = TfidfVectorizer()
# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)
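The result is a sparse matrix with one row per title and one column per vocabulary term. A quick way to sanity-check it (`get_feature_names_out()` is the accessor in scikit-learn 1.0+; older versions use `get_feature_names()`):
# One row per document, one column per term in the fitted vocabulary
print(text_tfidf.shape)
# Peek at a few of the learned vocabulary terms
print(tfidf_vec.get_feature_names_out()[:10])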
Now that we've encoded the `volunteer` dataset's `title` column into tf-idf vectors, let's use those vectors to try to predict the `category_desc` column.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
# Split the dataset according to the class distribution of category_desc
y = volunteer['category_desc']
# GaussianNB requires dense input, so convert the sparse tf-idf matrix
X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y)
# Fit the model to the training data
nb.fit(X_train, y_train)
# Print out the model's accuracy
print(nb.score(X_test, y_test))
0.5225806451612903
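The exact accuracy will vary from run to run because the split is random. As a side note, `MultinomialNB` is generally a better match for tf-idf features than `GaussianNB` and accepts the sparse matrix directly; a variant worth trying might look like this (the `random_state` value is arbitrary, chosen only for reproducibility):
from sklearn.naive_bayes import MultinomialNB

# MultinomialNB works on the sparse tf-idf matrix without densifying;
# fixing random_state makes the split (and the score) reproducible
X_train, X_test, y_train, y_test = train_test_split(text_tfidf, y, stratify=y, random_state=42)
mnb = MultinomialNB().fit(X_train, y_train)
print(mnb.score(X_test, y_test))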