Selecting features for modeling¶

This chapter goes over a few different techniques for selecting the most important features from your dataset. You'll learn how to drop redundant features, work with text vectors, and reduce the number of features in your dataset using principal component analysis (PCA). This is a summary of the lecture "Preprocessing for Machine Learning in Python" from DataCamp.

author: Chanseok Kang
categories: [Python, Datacamp, Machine_Learning]

In [1]:

import pandas as pd
import numpy as np

Feature selection¶

Selects the features to be used for modeling
Doesn't create new features
Improves the model's performance

Identifying areas for feature selection¶

Take an exploratory look at the post-feature engineering hiking dataset.

In [2]:

hiking = pd.read_json('./dataset/hiking.json')
hiking.head()

Out[2]:

	Prop_ID	Name	Location	Park_Name	Length	Difficulty	Other_Details	Accessible	Limited_Access	lat	lon
0	B057	Salt Marsh Nature Trail	Enter behind the Salt Marsh Nature Center, loc...	Marine Park	0.8 miles	None	<p>The first half of this mile-long trail foll...	Y	N	NaN	NaN
1	B073	Lullwater	Enter Park at Lincoln Road and Ocean Avenue en...	Prospect Park	1.0 mile	Easy	Explore the Lullwater to see how nature thrive...	N	N	NaN	NaN
2	B073	Midwood	Enter Park at Lincoln Road and Ocean Avenue en...	Prospect Park	0.75 miles	Easy	Step back in time with a walk through Brooklyn...	N	N	NaN	NaN
3	B073	Peninsula	Enter Park at Lincoln Road and Ocean Avenue en...	Prospect Park	0.5 miles	Easy	Discover how the Peninsula has changed over th...	N	N	NaN	NaN
4	B073	Waterfall	Enter Park at Lincoln Road and Ocean Avenue en...	Prospect Park	0.5 miles	Easy	Trace the source of the Lake on the Waterfall ...	N	N	NaN	NaN

Removing redundant features¶

Remove noisy features
Remove correlated features
- Statistically correlated: features move together directionally
- Linear models assume feature independence
- Pearson correlation coefficient
Remove duplicated features
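
As a rough illustration of the Pearson correlation coefficient listed above, the sketch below computes it by hand for two made-up arrays and compares the result with NumPy's `np.corrcoef`. The arrays `x` and `y` and the helper `pearson_r` are hypothetical and not part of the course material.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

# Made-up values: y rises almost linearly with x, so r should be close to +1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

A value near +1 or -1 means the two features move together (or in opposite directions) almost linearly, which is what makes one of them a candidate for removal.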

Selecting relevant features¶

Now let's identify the redundant columns in the volunteer dataset and perform feature selection on the dataset to return a DataFrame of the relevant features.

For example, if you explore the volunteer dataset in the console, you'll see three features which are related to location: locality, region, and postalcode. They contain repeated information, so it would make sense to keep only one of the features.

There are also features that have gone through the feature engineering process: columns like Education and Emergency Preparedness are a product of encoding the categorical variable category_desc, so category_desc itself is redundant now.

Take a moment to examine the features of volunteer in the console, and try to identify the redundant features.

In [3]:

volunteer = pd.read_csv('./dataset/volunteer_sample.csv')
volunteer.dropna(subset=['category_desc'], axis=0, inplace=True)
volunteer.head()

Out[3]:

	vol_requests	title	hits	category_desc	locality	region	postalcode	created_date	vol_requests_lognorm	created_month	Environment	Strengthening Communities
0	2	Web designer	22	Strengthening Communities	5 22nd St\nNew York, NY 10010\n(40.74053152272...	NY	10010.0	2011-01-14	0.693147	1	0	1
1	20	Urban Adventures - Ice Skating at Lasker Rink	62	Strengthening Communities	NaN	NY	10026.0	2011-01-19	2.995732	1	0	1
2	500	Fight global hunger and support women farmers ...	14	Strengthening Communities	NaN	NY	2114.0	2011-01-21	6.214608	1	0	1
3	15	Stop 'N' Swap	31	Environment	NaN	NY	10455.0	2011-01-28	2.708050	1	1	0
4	15	Queens Stop 'N' Swap	135	Environment	NaN	NY	11372.0	2011-01-28	2.708050	1	1	0

In [4]:

volunteer.columns

Out[4]:

Index(['vol_requests', 'title', 'hits', 'category_desc', 'locality', 'region',
       'postalcode', 'created_date', 'vol_requests_lognorm', 'created_month',
       'Education', 'Emergency Preparedness', 'Environment', 'Health',
       'Helping Neighbors in Need', 'Strengthening Communities'],
      dtype='object')

In [5]:

# Create a list of redundant column names to drop
to_drop = ["locality", "region", "category_desc", "created_date", "vol_requests"]

# Drop those columns from the dataset
volunteer_subset = volunteer.drop(to_drop, axis=1)

# Print out the head of the new dataset
volunteer_subset.head()

Out[5]:

	title	hits	postalcode	vol_requests_lognorm	created_month	Environment	Strengthening Communities
0	Web designer	22	10010.0	0.693147	1	0	1
1	Urban Adventures - Ice Skating at Lasker Rink	62	10026.0	2.995732	1	0	1
2	Fight global hunger and support women farmers ...	14	2114.0	6.214608	1	0	1
3	Stop 'N' Swap	31	10455.0	2.708050	1	1	0
4	Queens Stop 'N' Swap	135	11372.0	2.708050	1	1	0
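
Redundant columns were picked by eye above. As a complementary check, the following sketch flags columns that hold a single value and columns that exactly repeat an earlier column. It runs on a small made-up DataFrame; the `toy` data and the `postalcode_copy` column are hypothetical stand-ins, not the real volunteer dataset.

```python
import pandas as pd

# Hypothetical toy data loosely modeled on the volunteer columns shown above
toy = pd.DataFrame({
    "region": ["NY", "NY", "NY", "NY"],
    "postalcode": [10010, 10026, 10455, 11372],
    "postalcode_copy": [10010, 10026, 10455, 11372],
    "hits": [22, 62, 31, 135],
})

# Columns with a single distinct value carry no information for a model
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]

# Columns whose values exactly repeat an earlier column
duplicate_cols = [
    later for i, later in enumerate(toy.columns)
    if any(toy[later].equals(toy[earlier]) for earlier in toy.columns[:i])
]

print(constant_cols)
print(duplicate_cols)
```

Checks like these only catch exact redundancy. Columns that encode the same information in different forms, such as locality and postalcode, still need a manual look.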

Checking for correlated features¶

Let's take another look at the wine dataset, which is made up of continuous, numerical features. Compute Pearson's correlation coefficient on the dataset to determine which columns are good candidates for elimination. Then, remove those columns from the DataFrame.

In [6]:

wine = pd.read_csv('./dataset/wine_sample.csv')
wine.head()

Out[6]:

	Flavanoids	Total phenols	Malic acid	OD280/OD315 of diluted wines	Hue
0	3.06	2.80	1.71	3.92	1.04
1	2.76	2.65	1.78	3.40	1.05
2	3.24	2.80	2.36	3.17	1.03
3	3.49	3.85	1.95	3.45	0.86
4	2.69	2.80	2.59	2.93	1.04

In [7]:

# Print out the column correlations of the wine dataset
print(wine.corr())

                              Flavanoids  Total phenols  Malic acid  \
Flavanoids                      1.000000       0.864564   -0.411007   
Total phenols                   0.864564       1.000000   -0.335167   
Malic acid                     -0.411007      -0.335167    1.000000   
OD280/OD315 of diluted wines    0.787194       0.699949   -0.368710   
Hue                             0.543479       0.433681   -0.561296   

                              OD280/OD315 of diluted wines       Hue  
Flavanoids                                        0.787194  0.543479  
Total phenols                                     0.699949  0.433681  
Malic acid                                       -0.368710 -0.561296  
OD280/OD315 of diluted wines                      1.000000  0.565468  
Hue                                               0.565468  1.000000

In [8]:

# Flavanoids is the only column whose correlation with another column exceeds 0.75 at least twice
to_drop = "Flavanoids"

# Drop that column from the DataFrame
wine = wine.drop(to_drop, axis=1)
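
The column was chosen by reading the correlation matrix. The same rule can be sketched in code: count, for each column, how many other columns it correlates with above the threshold. The matrix below is rebuilt from the values printed by `wine.corr()` so the example runs on its own. Using the absolute value, so that strong negative correlations would also count, is an assumption that goes slightly beyond the exercise.

```python
import pandas as pd

cols = ["Flavanoids", "Total phenols", "Malic acid",
        "OD280/OD315 of diluted wines", "Hue"]

# Correlation values copied from the wine.corr() output above
corr = pd.DataFrame(
    [[1.000000, 0.864564, -0.411007, 0.787194, 0.543479],
     [0.864564, 1.000000, -0.335167, 0.699949, 0.433681],
     [-0.411007, -0.335167, 1.000000, -0.368710, -0.561296],
     [0.787194, 0.699949, -0.368710, 1.000000, 0.565468],
     [0.543479, 0.433681, -0.561296, 0.565468, 1.000000]],
    index=cols, columns=cols,
)

threshold = 0.75  # same cut-off as in the exercise

# Count, per column, how many other columns exceed the threshold
# (subtract 1 to ignore each column's perfect correlation with itself)
high_counts = (corr.abs() > threshold).sum() - 1

# Keep only the columns that are highly correlated with at least two others
candidates = high_counts[high_counts >= 2].index.tolist()
print(candidates)
```

With the real DataFrame, `corr` would simply be `wine.corr()` computed before the column is dropped.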

Selecting features using text vectors¶

Exploring text vectors, part 1¶

Let's expand on the text vector exploration method we just learned about, using the tf-idf vectors of the volunteer dataset's title column. In this first part, we'll extend the function introduced in the slides so that it returns a list of word indices. In the next exercise, we'll write another function to collect the top words across all documents, extract them, and then use that list to filter down our text_tfidf vector.

In [12]:

vocab_csv = pd.read_csv('./dataset/vocab_volunteer.csv', index_col=0).to_dict()
vocab = vocab_csv['0']

In [13]:

volunteer = volunteer[['category_desc', 'title']]
volunteer = volunteer.dropna(subset=['category_desc'], axis=0)

In [14]:

from sklearn.feature_extraction.text import TfidfVectorizer

# Take the title text
title_text = volunteer['title']

# Create the vectorizer method
tfidf_vec = TfidfVectorizer()

# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)

In [15]:

# Return the vocabulary indices of the top_n highest-weighted words in one document
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    # Map each non-zero column index in the document's sparse row to its tf-idf weight
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Let's transform that zipped dict into a series keyed by word
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Let's sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    # Look each word up in the vectorizer's vocabulary to get its column index
    return [original_vocab[i] for i in zipped_index]

# Print out the weighted words
print(return_weights(vocab, tfidf_vec.vocabulary_, text_tfidf, vector_index=8, top_n=3))

[189, 942, 466]
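
To make the moving parts of `return_weights` easier to follow, here is a minimal sketch on a three-title corpus. The titles are made up, and `index_to_word` is built by inverting `vocabulary_`. Treating that as equivalent to the `vocab` dictionary loaded from `vocab_volunteer.csv` is an assumption.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, loosely based on the titles shown earlier
titles = ["stop swap", "queens stop swap", "web designer"]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(titles)

# vocabulary_ maps word -> column index; invert it to get column index -> word.
# Assumption: the vocab loaded from the CSV plays this role in the notebook.
index_to_word = {idx: word for word, idx in vec.vocabulary_.items()}

# One row of the sparse matrix exposes its non-zero columns and their weights
row = tfidf[1]
weights = pd.Series(row.data, index=[index_to_word[i] for i in row.indices])

# Highest-weighted words first
top_words = weights.sort_values(ascending=False).index.tolist()
print(top_words)
```

In the second title, "queens" appears in only one document, so its inverse document frequency, and therefore its weight, is higher than that of "stop" or "swap", which each appear in two.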

Exploring text vectors, part 2¶

Using the function we wrote in the previous exercise, we're going to extract the top words from each document in the text vector, return a list of the word indices, and use that list to filter the text vector down to those top words.

In [16]:

def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
        # here we'll call the function from the previous exercise, 
        # and extend the list we're creating
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

# Call the function to get the list of word indices
filtered_words = words_to_filter(vocab, tfidf_vec.vocabulary_, text_tfidf, top_n=3)

# By converting filtered_words back to a list, 
# we can use it to filter the columns in the text vector
filtered_text = text_tfidf[:, list(filtered_words)]
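
The column filtering in the last line can be sketched on a tiny sparse matrix. The matrix and the `keep` set below are hypothetical stand-ins for `text_tfidf` and `filtered_words`. Sorting the indices is an optional extra step that makes the resulting column order explicit.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3-document x 5-word matrix standing in for text_tfidf
dense = np.array([
    [0.0, 0.7, 0.0, 0.7, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.0, 1.0, 0.0],
])
matrix = csr_matrix(dense)

# Pretend these are the word indices collected by words_to_filter
keep = {4, 1, 3}

# Sorting gives a reproducible column order
keep_cols = sorted(keep)

# Keep every row, but only the selected columns
filtered = matrix[:, keep_cols]

print(filtered.shape)
print(filtered.toarray())
```

The result keeps all documents but only the selected word columns, which is how the feature count shrinks.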

Training Naive Bayes with feature selection¶

Let's re-run the Naive Bayes text classification model we ran at the end of chapter 3, with our selection choices from the previous exercise, on the volunteer dataset's title and category_desc columns.

In [17]:

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
y = volunteer['category_desc']

# Split the dataset according to the class distribution of category_desc,
# using the filtered_text vector
X_train, X_test, y_train, y_test = train_test_split(filtered_text.toarray(), y, stratify=y)

# Fit the model to the training data
nb.fit(X_train, y_train)

# Print out the model's accuracy
print(nb.score(X_test, y_test))

0.5483870967741935

You can see that our accuracy score wasn't that different from the score at the end of chapter 3. That's okay; the title field is a very small text field, appropriate for demonstrating how filtering vectors works.
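
The `stratify=y` argument used in the split above keeps the class proportions roughly the same in the training and test sets. A minimal sketch with made-up labels shows the effect. The 80/20 class balance, `test_size`, and `random_state` are illustrative choices, not values from the course.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 of class "a", 20 of class "b"
y = np.array(["a"] * 80 + ["b"] * 20)
X = np.arange(len(y)).reshape(-1, 1)

# random_state is fixed here only so the sketch is repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Share of the minority class in each split
train_share_b = (y_train == "b").mean()
test_share_b = (y_test == "b").mean()
print(train_share_b, test_share_b)
```

Without stratification, a small or imbalanced dataset can end up with a test set that barely contains some classes, which makes the accuracy score less reliable.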

Dimensionality reduction¶

Unsupervised learning method
Combines/decomposes a feature space
Feature extraction
Principal component analysis
- Linear transformation to uncorrelated space
- Captures as much variance as possible in each component
PCA caveats
- Difficult to interpret components
- Best applied at the end of preprocessing, since the transformed components are hard to engineer further
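
The two PCA properties listed above (uncorrelated components, variance captured in decreasing order) can be checked on synthetic data. This is only a sketch; the three features below are randomly generated for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: the second feature is mostly a scaled copy of the first
f1 = rng.normal(size=500)
f2 = 2.0 * f1 + rng.normal(scale=0.3, size=500)
f3 = rng.normal(size=500)
X = np.column_stack([f1, f2, f3])

pca = PCA()
components = pca.fit_transform(X)

# Off-diagonal correlations between the components should be close to 0
component_corr = np.corrcoef(components, rowvar=False)
print(np.round(component_corr, 3))

# Ratios are ordered from largest to smallest and sum to 1
ratios = pca.explained_variance_ratio_
print(ratios)
```

Although `f1` and `f2` are strongly correlated in the input, the transformed columns are not, which is the sense in which PCA maps the data to an uncorrelated space.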

Using PCA¶

Let's apply PCA to the wine dataset, to see if we can get an increase in our model's accuracy.

In [18]:

wine = pd.read_csv('./dataset/wine_types.csv')
wine.head()

Out[18]:

	Type	Alcohol	Malic acid	Ash	Alcalinity of ash	Magnesium	Total phenols	Flavanoids	Nonflavanoid phenols	Proanthocyanins	Color intensity	Hue	OD280/OD315 of diluted wines	Proline
0	1	14.23	1.71	2.43	15.6	127	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065
1	1	13.20	1.78	2.14	11.2	100	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050
2	1	13.16	2.36	2.67	18.6	101	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185
3	1	14.37	1.95	2.50	16.8	113	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480
4	1	13.24	2.59	2.87	21.0	118	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735

In [19]:

from sklearn.decomposition import PCA

# Set up PCA and the X vector for dimensionality reduction
pca = PCA()
wine_X = wine.drop('Type', axis=1)

# Apply PCA to the wine dataset X vector
transformed_X = pca.fit_transform(wine_X)

# Look at the fraction of variance explained by each component
print(pca.explained_variance_ratio_)

[9.98091230e-01 1.73591562e-03 9.49589576e-05 5.02173562e-05
 1.23636847e-05 8.46213034e-06 2.80681456e-06 1.52308053e-06
 1.12783044e-06 7.21415811e-07 3.78060267e-07 2.12013755e-07
 8.25392788e-08]
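
The first component above accounts for about 99.8% of the variance. One plausible reason is that PCA is sensitive to feature scale, and a column such as Proline has far larger values than the others. The sketch below compares the ratios with and without standardization. It assumes scikit-learn's bundled `load_wine` data is an acceptable stand-in for `wine_types.csv`, and the 90% target is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for wine_types.csv
X, _ = load_wine(return_X_y=True)

# PCA on the raw features
raw_ratio = PCA().fit(X).explained_variance_ratio_

# Standardize each feature to zero mean and unit variance before PCA
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA().fit(X_scaled).explained_variance_ratio_

print(raw_ratio[:3])
print(scaled_ratio[:3])

# Number of components needed to reach 90% of the variance after scaling
n_keep = int(np.argmax(np.cumsum(scaled_ratio) >= 0.90)) + 1
print(n_keep)
```

After scaling, the variance is spread over several components, so the cumulative sum gives a more meaningful guide to how many components to keep.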

Training a model with PCA¶

Now that we have run PCA on the wine dataset, let's try training a model with it.

In [21]:

from sklearn.neighbors import KNeighborsClassifier

y = wine['Type']

knn = KNeighborsClassifier()

# Split the transformed X and the y labels into training and test sets
X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(transformed_X, y)

# Fit knn to the training data
knn.fit(X_wine_train, y_wine_train)

# Score knn on the test data and print it out
print(knn.score(X_wine_test, y_wine_test))

0.7555555555555555
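
In the cells above, PCA is fitted on the full dataset before the train/test split. One alternative, sketched here rather than taken from the course, is to wrap scaling, PCA, and the classifier in a scikit-learn pipeline so that every step is fitted on the training data only. The sketch again assumes `load_wine` as a stand-in for `wine_types.csv`; `n_components=5` and `random_state=42` are arbitrary illustrative values.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for wine_types.csv
X, y = load_wine(return_X_y=True)

# Split first, so the test rows play no part in fitting the scaler or PCA
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# n_components=5 is an arbitrary illustrative choice
model = make_pipeline(StandardScaler(), PCA(n_components=5), KNeighborsClassifier())
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print(score)
```

Because the split here is seeded and the features are standardized, the score is not directly comparable to the 0.7555 printed above.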