The New York City Taxi & Limousine Commission
In this project, we are the Data Analyst at the New York City Taxi & Limousine Commission (New York City TLC), and we have been requested to build a machine learning model to predict if a customer will not leave a tip. They want to use the model in an app that will alert taxi drivers to customers who are unlikely to tip, since drivers depend on tips.
We will employ the tree-based modeling techniques to predict on a binary target class.
The purpose of this model is to find ways to generate more revenue for taxi cab drivers.
The goal of this model is to predict whether or not a customer is a generous tipper.
Part 1: Ethical considerations
Consider the ethical implications of the request
Revisit the project objective
Part 2: Feature engineering
Part 3: Modeling
In this stage, let's consider the following questions:
What is the main purpose of the task?
What are the ethical implications of the model?
Do the benefits of such a model outweigh the potential problems?
Should we proceed with the request to build this model? Why or why not?
Can the objective be modified to make it less problematic?
generous
.Import packages and libraries needed to build and evaluate random forest and XGBoost classification models.
# Import packages and libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score,\
confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from xgboost import plot_importance
# RUN THIS CELL TO SEE ALL COLUMNS
# This lets us see all of the columns, preventing Juptyer from redacting them.
pd.set_option('display.max_columns', None)
# Load dataset into dataframe
df0 = pd.read_csv('2017_Yellow_Taxi_Trip_Data.csv')
# Import predicted fares and mean distance and duration
nyc_preds_means = pd.read_csv('nyc_preds_means.csv')
Inspect the first few rows of df0
.
# Inspect the first few rows of df0
df0.head()
Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 24870114 | 2 | 03/25/2017 8:55:43 AM | 03/25/2017 9:09:47 AM | 6 | 3.34 | 1 | N | 100 | 231 | 1 | 13.0 | 0.0 | 0.5 | 2.76 | 0.0 | 0.3 | 16.56 |
1 | 35634249 | 1 | 04/11/2017 2:53:28 PM | 04/11/2017 3:19:58 PM | 1 | 1.80 | 1 | N | 186 | 43 | 1 | 16.0 | 0.0 | 0.5 | 4.00 | 0.0 | 0.3 | 20.80 |
2 | 106203690 | 1 | 12/15/2017 7:26:56 AM | 12/15/2017 7:34:08 AM | 1 | 1.00 | 1 | N | 262 | 236 | 1 | 6.5 | 0.0 | 0.5 | 1.45 | 0.0 | 0.3 | 8.75 |
3 | 38942136 | 2 | 05/07/2017 1:17:59 PM | 05/07/2017 1:48:14 PM | 1 | 3.70 | 1 | N | 188 | 97 | 1 | 20.5 | 0.0 | 0.5 | 6.39 | 0.0 | 0.3 | 27.69 |
4 | 30841670 | 2 | 04/15/2017 11:32:20 PM | 04/15/2017 11:49:03 PM | 1 | 4.37 | 1 | N | 4 | 112 | 2 | 16.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 17.80 |
Inspect the first few rows of nyc_preds_means
.
# Inspect the first few rows of `nyc_preds_means`
nyc_preds_means.head()
mean_duration | mean_distance | predicted_fare | |
---|---|---|---|
0 | 22.847222 | 3.521667 | 16.434245 |
1 | 24.470370 | 3.108889 | 16.052218 |
2 | 7.250000 | 0.881429 | 7.053706 |
3 | 30.250000 | 3.700000 | 18.731650 |
4 | 14.616667 | 4.435000 | 15.845642 |
Join the two dataframes using a method of your choice.
# Merge datasets
df0 = df0.merge(nyc_preds_means, left_index=True, right_index=True)
df0.head()
Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | mean_duration | mean_distance | predicted_fare | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 24870114 | 2 | 03/25/2017 8:55:43 AM | 03/25/2017 9:09:47 AM | 6 | 3.34 | 1 | N | 100 | 231 | 1 | 13.0 | 0.0 | 0.5 | 2.76 | 0.0 | 0.3 | 16.56 | 22.847222 | 3.521667 | 16.434245 |
1 | 35634249 | 1 | 04/11/2017 2:53:28 PM | 04/11/2017 3:19:58 PM | 1 | 1.80 | 1 | N | 186 | 43 | 1 | 16.0 | 0.0 | 0.5 | 4.00 | 0.0 | 0.3 | 20.80 | 24.470370 | 3.108889 | 16.052218 |
2 | 106203690 | 1 | 12/15/2017 7:26:56 AM | 12/15/2017 7:34:08 AM | 1 | 1.00 | 1 | N | 262 | 236 | 1 | 6.5 | 0.0 | 0.5 | 1.45 | 0.0 | 0.3 | 8.75 | 7.250000 | 0.881429 | 7.053706 |
3 | 38942136 | 2 | 05/07/2017 1:17:59 PM | 05/07/2017 1:48:14 PM | 1 | 3.70 | 1 | N | 188 | 97 | 1 | 20.5 | 0.0 | 0.5 | 6.39 | 0.0 | 0.3 | 27.69 | 30.250000 | 3.700000 | 18.731650 |
4 | 30841670 | 2 | 04/15/2017 11:32:20 PM | 04/15/2017 11:49:03 PM | 1 | 4.37 | 1 | N | 4 | 112 | 2 | 16.5 | 0.5 | 0.5 | 0.00 | 0.0 | 0.3 | 17.80 | 14.616667 | 4.435000 | 15.845642 |
Upon completion of EDA, let's inspect the newly combined dataframe.
# See the info of df0
df0.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 22699 entries, 0 to 22698 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 22699 non-null int64 1 VendorID 22699 non-null int64 2 tpep_pickup_datetime 22699 non-null object 3 tpep_dropoff_datetime 22699 non-null object 4 passenger_count 22699 non-null int64 5 trip_distance 22699 non-null float64 6 RatecodeID 22699 non-null int64 7 store_and_fwd_flag 22699 non-null object 8 PULocationID 22699 non-null int64 9 DOLocationID 22699 non-null int64 10 payment_type 22699 non-null int64 11 fare_amount 22699 non-null float64 12 extra 22699 non-null float64 13 mta_tax 22699 non-null float64 14 tip_amount 22699 non-null float64 15 tolls_amount 22699 non-null float64 16 improvement_surcharge 22699 non-null float64 17 total_amount 22699 non-null float64 18 mean_duration 22699 non-null float64 19 mean_distance 22699 non-null float64 20 predicted_fare 22699 non-null float64 dtypes: float64(11), int64(7), object(3) memory usage: 3.6+ MB
We know from our EDA that customers who pay cash generally have a tip amount of $0. To meet the modeling objective, we'll need to sample the data to select only the customers who pay with credit card.
Let's copy df0
and assign the result to a variable called df1
. Then, use a Boolean mask to filter df1
so it contains only customers who paid with credit card.
# Subset the data to isolate only customers who paid by credit card
df1 = df0[df0['payment_type'] == 1]
df1.head()
Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | mean_duration | mean_distance | predicted_fare | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 24870114 | 2 | 03/25/2017 8:55:43 AM | 03/25/2017 9:09:47 AM | 6 | 3.34 | 1 | N | 100 | 231 | 1 | 13.0 | 0.0 | 0.5 | 2.76 | 0.0 | 0.3 | 16.56 | 22.847222 | 3.521667 | 16.434245 |
1 | 35634249 | 1 | 04/11/2017 2:53:28 PM | 04/11/2017 3:19:58 PM | 1 | 1.80 | 1 | N | 186 | 43 | 1 | 16.0 | 0.0 | 0.5 | 4.00 | 0.0 | 0.3 | 20.80 | 24.470370 | 3.108889 | 16.052218 |
2 | 106203690 | 1 | 12/15/2017 7:26:56 AM | 12/15/2017 7:34:08 AM | 1 | 1.00 | 1 | N | 262 | 236 | 1 | 6.5 | 0.0 | 0.5 | 1.45 | 0.0 | 0.3 | 8.75 | 7.250000 | 0.881429 | 7.053706 |
3 | 38942136 | 2 | 05/07/2017 1:17:59 PM | 05/07/2017 1:48:14 PM | 1 | 3.70 | 1 | N | 188 | 97 | 1 | 20.5 | 0.0 | 0.5 | 6.39 | 0.0 | 0.3 | 27.69 | 30.250000 | 3.700000 | 18.731650 |
5 | 23345809 | 2 | 03/25/2017 8:34:11 PM | 03/25/2017 8:42:11 PM | 6 | 2.30 | 1 | N | 161 | 236 | 1 | 9.0 | 0.5 | 0.5 | 2.06 | 0.0 | 0.3 | 12.36 | 11.855376 | 2.052258 | 10.441351 |
Let's add a tip_percent
column to the dataframe by performing the following calculation:
Purpose: the customers who tip ≥ 20% are considered generous
.
# Create tip % col
df1['tip_percent'] = round(df1['tip_amount'] / (df1['total_amount'] - df1['tip_amount']), 3)
Now that we have a tip_percent
column, we can add a binary indicator to identify if each customer is generous
(tipped more than 20% will be 1= yes, otherwise, 0 = no).
# Create 'generous' col (target)
df1['generous'] = (df1['tip_percent'] >= 0.2).astype(int)
df1.head()
Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | mean_duration | mean_distance | predicted_fare | tip_percent | generous | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 24870114 | 2 | 03/25/2017 8:55:43 AM | 03/25/2017 9:09:47 AM | 6 | 3.34 | 1 | N | 100 | 231 | 1 | 13.0 | 0.0 | 0.5 | 2.76 | 0.0 | 0.3 | 16.56 | 22.847222 | 3.521667 | 16.434245 | 0.200 | 1 |
1 | 35634249 | 1 | 04/11/2017 2:53:28 PM | 04/11/2017 3:19:58 PM | 1 | 1.80 | 1 | N | 186 | 43 | 1 | 16.0 | 0.0 | 0.5 | 4.00 | 0.0 | 0.3 | 20.80 | 24.470370 | 3.108889 | 16.052218 | 0.238 | 1 |
2 | 106203690 | 1 | 12/15/2017 7:26:56 AM | 12/15/2017 7:34:08 AM | 1 | 1.00 | 1 | N | 262 | 236 | 1 | 6.5 | 0.0 | 0.5 | 1.45 | 0.0 | 0.3 | 8.75 | 7.250000 | 0.881429 | 7.053706 | 0.199 | 0 |
3 | 38942136 | 2 | 05/07/2017 1:17:59 PM | 05/07/2017 1:48:14 PM | 1 | 3.70 | 1 | N | 188 | 97 | 1 | 20.5 | 0.0 | 0.5 | 6.39 | 0.0 | 0.3 | 27.69 | 30.250000 | 3.700000 | 18.731650 | 0.300 | 1 |
5 | 23345809 | 2 | 03/25/2017 8:34:11 PM | 03/25/2017 8:42:11 PM | 6 | 2.30 | 1 | N | 161 | 236 | 1 | 9.0 | 0.5 | 0.5 | 2.06 | 0.0 | 0.3 | 12.36 | 11.855376 | 2.052258 | 10.441351 | 0.200 | 1 |
Let's convert the tpep_pickup_datetime
and tpep_dropoff_datetime
columns to datetime.
# Convert pickup and dropoff cols to datetime
df1['tpep_pickup_datetime'] = pd.to_datetime(df1['tpep_pickup_datetime'], format='%m/%d/%Y %I:%M:%S %p')
df1['tpep_dropoff_datetime'] = pd.to_datetime(df1['tpep_dropoff_datetime'], format='%m/%d/%Y %I:%M:%S %p')
Now, create a day
column that contains only the day of the week when each passenger was picked up. Then, convert the values to lowercase.
# Create a 'day' col
df1['day'] = df1['tpep_pickup_datetime'].dt.day_name().str.lower()
Next, engineer four new columns that represent time of day bins. Each column should contain binary values (0=no, 1=yes) that indicate whether a trip began (picked up) during the following times:
am_rush
= [06:00–10:00)
daytime
= [10:00–16:00)
pm_rush
= [16:00–20:00)
nighttime
= [20:00–06:00)
To do this, first create the four columns. For now, each new column should be identical and contain the same information: the hour (only) from the tpep_pickup_datetime
column.
# Create 'am_rush' col
#==> ENTER YOUR CODE HERE
df1['am_rush'] = df1['tpep_pickup_datetime'].dt.hour.between(6, 10)
# Create 'daytime' col
#==> ENTER YOUR CODE HERE
df1['daytime'] = df1['tpep_pickup_datetime'].dt.hour.between(10, 16)
# Create 'pm_rush' col
#==> ENTER YOUR CODE HERE
df1['pm_rush'] = df1['tpep_pickup_datetime'].dt.hour.between(16, 20)
# Create 'nighttime' col
#==> ENTER YOUR CODE HERE
df1['nighttime'] = df1['tpep_pickup_datetime'].dt.hour.between(20, 6)
df1.head()
Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | mean_duration | mean_distance | predicted_fare | tip_percent | generous | day | am_rush | daytime | pm_rush | nighttime | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 24870114 | 2 | 2017-03-25 08:55:43 | 2017-03-25 09:09:47 | 6 | 3.34 | 1 | N | 100 | 231 | 1 | 13.0 | 0.0 | 0.5 | 2.76 | 0.0 | 0.3 | 16.56 | 22.847222 | 3.521667 | 16.434245 | 0.200 | 1 | saturday | True | False | False | False |
1 | 35634249 | 1 | 2017-04-11 14:53:28 | 2017-04-11 15:19:58 | 1 | 1.80 | 1 | N | 186 | 43 | 1 | 16.0 | 0.0 | 0.5 | 4.00 | 0.0 | 0.3 | 20.80 | 24.470370 | 3.108889 | 16.052218 | 0.238 | 1 | tuesday | False | True | False | False |
2 | 106203690 | 1 | 2017-12-15 07:26:56 | 2017-12-15 07:34:08 | 1 | 1.00 | 1 | N | 262 | 236 | 1 | 6.5 | 0.0 | 0.5 | 1.45 | 0.0 | 0.3 | 8.75 | 7.250000 | 0.881429 | 7.053706 | 0.199 | 0 | friday | True | False | False | False |
3 | 38942136 | 2 | 2017-05-07 13:17:59 | 2017-05-07 13:48:14 | 1 | 3.70 | 1 | N | 188 | 97 | 1 | 20.5 | 0.0 | 0.5 | 6.39 | 0.0 | 0.3 | 27.69 | 30.250000 | 3.700000 | 18.731650 | 0.300 | 1 | sunday | False | True | False | False |
5 | 23345809 | 2 | 2017-03-25 20:34:11 | 2017-03-25 20:42:11 | 6 | 2.30 | 1 | N | 161 | 236 | 1 | 9.0 | 0.5 | 0.5 | 2.06 | 0.0 | 0.3 | 12.36 | 11.855376 | 2.052258 | 10.441351 | 0.200 | 1 | saturday | False | False | True | False |
# transform and encode all the time columns into '1' & '0'
cols = ['am_rush', 'daytime', 'pm_rush', 'nighttime']
df1.loc[:, cols] = df1.loc[:, cols].replace({True: 1, False: 0})
df1.head()
Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | mean_duration | mean_distance | predicted_fare | tip_percent | generous | day | am_rush | daytime | pm_rush | nighttime | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 24870114 | 2 | 2017-03-25 08:55:43 | 2017-03-25 09:09:47 | 6 | 3.34 | 1 | N | 100 | 231 | 1 | 13.0 | 0.0 | 0.5 | 2.76 | 0.0 | 0.3 | 16.56 | 22.847222 | 3.521667 | 16.434245 | 0.200 | 1 | saturday | 1 | 0 | 0 | 0 |
1 | 35634249 | 1 | 2017-04-11 14:53:28 | 2017-04-11 15:19:58 | 1 | 1.80 | 1 | N | 186 | 43 | 1 | 16.0 | 0.0 | 0.5 | 4.00 | 0.0 | 0.3 | 20.80 | 24.470370 | 3.108889 | 16.052218 | 0.238 | 1 | tuesday | 0 | 1 | 0 | 0 |
2 | 106203690 | 1 | 2017-12-15 07:26:56 | 2017-12-15 07:34:08 | 1 | 1.00 | 1 | N | 262 | 236 | 1 | 6.5 | 0.0 | 0.5 | 1.45 | 0.0 | 0.3 | 8.75 | 7.250000 | 0.881429 | 7.053706 | 0.199 | 0 | friday | 1 | 0 | 0 | 0 |
3 | 38942136 | 2 | 2017-05-07 13:17:59 | 2017-05-07 13:48:14 | 1 | 3.70 | 1 | N | 188 | 97 | 1 | 20.5 | 0.0 | 0.5 | 6.39 | 0.0 | 0.3 | 27.69 | 30.250000 | 3.700000 | 18.731650 | 0.300 | 1 | sunday | 0 | 1 | 0 | 0 |
5 | 23345809 | 2 | 2017-03-25 20:34:11 | 2017-03-25 20:42:11 | 6 | 2.30 | 1 | N | 161 | 236 | 1 | 9.0 | 0.5 | 0.5 | 2.06 | 0.0 | 0.3 | 12.36 | 11.855376 | 2.052258 | 10.441351 | 0.200 | 1 | saturday | 0 | 0 | 1 | 0 |
month
column¶Now, create a month
column that contains only the abbreviated name of the month when each passenger was picked up, then convert the result to lowercase.
# Create 'month' col
df1['month'] = df1['tpep_pickup_datetime'].dt.strftime('%b').str.lower()
Examine the first five rows of your dataframe.
# Check the first few rows of df1
df1.head()
Unnamed: 0 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | mean_duration | mean_distance | predicted_fare | tip_percent | generous | day | am_rush | daytime | pm_rush | nighttime | month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 24870114 | 2 | 2017-03-25 08:55:43 | 2017-03-25 09:09:47 | 6 | 3.34 | 1 | N | 100 | 231 | 1 | 13.0 | 0.0 | 0.5 | 2.76 | 0.0 | 0.3 | 16.56 | 22.847222 | 3.521667 | 16.434245 | 0.200 | 1 | saturday | 1 | 0 | 0 | 0 | mar |
1 | 35634249 | 1 | 2017-04-11 14:53:28 | 2017-04-11 15:19:58 | 1 | 1.80 | 1 | N | 186 | 43 | 1 | 16.0 | 0.0 | 0.5 | 4.00 | 0.0 | 0.3 | 20.80 | 24.470370 | 3.108889 | 16.052218 | 0.238 | 1 | tuesday | 0 | 1 | 0 | 0 | apr |
2 | 106203690 | 1 | 2017-12-15 07:26:56 | 2017-12-15 07:34:08 | 1 | 1.00 | 1 | N | 262 | 236 | 1 | 6.5 | 0.0 | 0.5 | 1.45 | 0.0 | 0.3 | 8.75 | 7.250000 | 0.881429 | 7.053706 | 0.199 | 0 | friday | 1 | 0 | 0 | 0 | dec |
3 | 38942136 | 2 | 2017-05-07 13:17:59 | 2017-05-07 13:48:14 | 1 | 3.70 | 1 | N | 188 | 97 | 1 | 20.5 | 0.0 | 0.5 | 6.39 | 0.0 | 0.3 | 27.69 | 30.250000 | 3.700000 | 18.731650 | 0.300 | 1 | sunday | 0 | 1 | 0 | 0 | may |
5 | 23345809 | 2 | 2017-03-25 20:34:11 | 2017-03-25 20:42:11 | 6 | 2.30 | 1 | N | 161 | 236 | 1 | 9.0 | 0.5 | 0.5 | 2.06 | 0.0 | 0.3 | 12.36 | 11.855376 | 2.052258 | 10.441351 | 0.200 | 1 | saturday | 0 | 0 | 1 | 0 | mar |
Drop redundant and irrelevant columns as well as those that would not be available when the model is deployed. This includes information like payment type, trip distance, tip amount, tip percentage, total amount, toll amount, etc. The target variable (generous
) must remain in the data because it will get isolated as the y
data for modeling.
# Drop columns
df1.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 15265 entries, 0 to 22698 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 15265 non-null int64 1 VendorID 15265 non-null int64 2 tpep_pickup_datetime 15265 non-null datetime64[ns] 3 tpep_dropoff_datetime 15265 non-null datetime64[ns] 4 passenger_count 15265 non-null int64 5 trip_distance 15265 non-null float64 6 RatecodeID 15265 non-null int64 7 store_and_fwd_flag 15265 non-null object 8 PULocationID 15265 non-null int64 9 DOLocationID 15265 non-null int64 10 payment_type 15265 non-null int64 11 fare_amount 15265 non-null float64 12 extra 15265 non-null float64 13 mta_tax 15265 non-null float64 14 tip_amount 15265 non-null float64 15 tolls_amount 15265 non-null float64 16 improvement_surcharge 15265 non-null float64 17 total_amount 15265 non-null float64 18 mean_duration 15265 non-null float64 19 mean_distance 15265 non-null float64 20 predicted_fare 15265 non-null float64 21 tip_percent 15262 non-null float64 22 generous 15265 non-null int64 23 day 15265 non-null object 24 am_rush 15265 non-null bool 25 daytime 15265 non-null bool 26 pm_rush 15265 non-null bool 27 nighttime 15265 non-null bool 28 month 15265 non-null object dtypes: bool(4), datetime64[ns](2), float64(12), int64(8), object(3) memory usage: 3.1+ MB
drop_cols = ['Unnamed: 0', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'payment_type',
'trip_distance', 'store_and_fwd_flag', 'payment_type', 'fare_amount', 'extra',
'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount', 'tip_percent']
df1 = df1.drop(drop_cols, axis=1)
df1.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 15265 entries, 0 to 22698 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 VendorID 15265 non-null int64 1 passenger_count 15265 non-null int64 2 RatecodeID 15265 non-null int64 3 PULocationID 15265 non-null int64 4 DOLocationID 15265 non-null int64 5 mean_duration 15265 non-null float64 6 mean_distance 15265 non-null float64 7 predicted_fare 15265 non-null float64 8 generous 15265 non-null int64 9 day 15265 non-null object 10 am_rush 15265 non-null bool 11 daytime 15265 non-null bool 12 pm_rush 15265 non-null bool 13 nighttime 15265 non-null bool 14 month 15265 non-null object dtypes: bool(4), float64(3), int64(6), object(2) memory usage: 1.5+ MB
Many of the columns are categorical and will need to be dummied (converted to binary). Some of these columns are numeric, but they actually encode categorical information, such as RatecodeID
and the pickup and dropoff locations. To make these columns recognizable to the get_dummies()
function as categorical variables, we'll first need to convert them to type(str)
.
cols_to_str
, which is a list of the numeric columns that contain categorical information and must be converted to string: RatecodeID
, PULocationID
, DOLocationID
.cols_to_str
to string.# 1. Define list of cols to convert to string
cols_to_str = ['RatecodeID', 'PULocationID', 'DOLocationID','VendorID',]
# 2. Convert each column to string
for col in cols_to_str:
df1[col] = df1[col].astype(str)
To convert to string, use astype(str)
on the column.
Now convert all the categorical columns to binary.
get_dummies()
on the dataframe and assign the results back to a new dataframe called df2
.# Convert categoricals to binary
df2 = pd.get_dummies(df1, drop_first=True)
df2.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 15265 entries, 0 to 22698 Columns: 347 entries, passenger_count to month_sep dtypes: bool(4), float64(3), int64(2), uint8(338) memory usage: 5.7 MB
Before modeling, we must decide on an evaluation metric.
# Get class balance of 'generous' col
df2['generous'].value_counts(normalize=True)
1 0.526368 0 0.473632 Name: generous, dtype: float64
A little over half of the customers in this dataset were "generous" (tipped ≥ 20%). The dataset is very nearly balanced.
To determine a metric, consider the cost of both kinds of model error:
False positives are worse for cab drivers, because they would pick up a customer expecting a good tip and then not receive one, frustrating the driver.
False negatives are worse for customers, because a cab driver would likely pick up a different customer who was predicted to tip more—even when the original customer would have tipped generously.
**The stakes are relatively even. We want to help taxi drivers make more money, but we don't want this to anger customers. Our metric should weigh both precision and recall equally.
Now you're ready to model. The only remaining step is to split the data into features/target variable and training/testing data.
y
that isolates the target variable (generous
).X
that isolates the features.# Isolate target variable (y)
y = df2['generous']
# Isolate the features (X)
X = df2.drop('generous', axis=1)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2,random_state=42)
Begin with using GridSearchCV
to tune a random forest model.
Instantiate the random forest classifier rf
and set the random state.
Create a dictionary cv_params
of any of the following hyperparameters and their corresponding values to tune. The more you tune, the better your model will fit the data, but the longer it will take.
max_depth
max_features
max_samples
min_samples_leaf
min_samples_split
n_estimators
Define a set scoring
of scoring metrics for GridSearch to capture (precision, recall, F1 score, and accuracy).
Instantiate the GridSearchCV
object rf1
. Pass to it as arguments:
rf
cv_params
scoring
cv=_
)refit=_
)Note: refit
should be set to 'f1'
.
# 1. Instantiate the random forest classifier
rf = RandomForestClassifier(random_state=42)
# 2. Create a dictionary of hyperparameters to tune
cv_params = {'max_depth': [None],
'max_features': [1.0],
'max_samples': [0.7],
'min_samples_leaf': [1],
'min_samples_split': [2],
'n_estimators': [300]
}
# 3. Define a set of scoring metrics to capture
scoring = {'accuracy', 'precision', 'recall', 'f1'}
# 4. Instantiate the GridSearchCV object
rf1 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='f1')
Now fit the model to the training data.
# fit the model and time how long it takes
from datetime import datetime
start = datetime.now()
print(rf1.fit(X_train, y_train))
end = datetime.now()
print(end - start)
GridSearchCV(cv=4, error_score=nan, estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=42, verbose=0, warm_start=False), iid='deprecated', n_jobs=None, param_grid={'max_depth': [None], 'max_features': [1.0], 'max_samples': [0.7], 'min_samples_leaf': [1], 'min_samples_split': [2], 'n_estimators': [300]}, pre_dispatch='2*n_jobs', refit='f1', return_train_score=False, scoring={'accuracy', 'precision', 'recall', 'f1'}, verbose=0) 0:03:56.572835
Pickle
¶We can use pickle
to save the models and read them back in. This can be particularly helpful when performing a search over many possible hyperparameter values.
import pickle
# Define a path to the folder where you want to save the model
path = '/home/jovyan/work/'
def write_pickle(path, model_object, save_name:str):
'''
save_name is a string.
'''
with open(path + save_name + '.pickle', 'wb') as to_write:
pickle.dump(model_object, to_write)
def read_pickle(path, saved_model_name:str):
'''
saved_model_name is a string.
'''
with open(path + saved_model_name + '.pickle', 'rb') as to_read:
model = pickle.load(to_read)
return model
Examine the best average score across all the validation folds.
# Examine best score
rf1.best_score_
0.7158235030134816
Examine the best combination of hyperparameters.
rf1.best_params_
{'max_depth': None, 'max_features': 1.0, 'max_samples': 0.7, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}
Use the make_results()
function to output all of the scores of your model. Note that it accepts three arguments.
def make_results(model_name:str, model_object, metric:str):
'''
Arguments:
model_name (string): what you want the model to be called in the output table
model_object: a fit GridSearchCV object
metric (string): precision, recall, f1, or accuracy
Returns a pandas df with the F1, recall, precision, and accuracy scores
for the model with the best mean 'metric' score across all validation folds.
'''
# Create dictionary that maps input metric to actual metric name in GridSearchCV
metric_dict = {'precision': 'mean_test_precision',
'recall': 'mean_test_recall',
'f1': 'mean_test_f1',
'accuracy': 'mean_test_accuracy',
}
# Get all the results from the CV and put them in a df
cv_results = pd.DataFrame(model_object.cv_results_)
# Isolate the row of the df with the max(metric) score
best_estimator_results = cv_results.iloc[cv_results[metric_dict[metric]].idxmax(), :]
# Extract Accuracy, precision, recall, and f1 score from that row
f1 = best_estimator_results.mean_test_f1
recall = best_estimator_results.mean_test_recall
precision = best_estimator_results.mean_test_precision
accuracy = best_estimator_results.mean_test_accuracy
# Create table of results
table = pd.DataFrame({'model': [model_name],
'precision': [precision],
'recall': [recall],
'F1': [f1],
'accuracy': [accuracy],
},
)
return table
Call make_results()
on the GridSearch object.
# View the results
results = make_results('RF CV', rf1, 'f1')
results
model | precision | recall | F1 | accuracy | |
---|---|---|---|---|---|
0 | RF CV | 0.677001 | 0.759645 | 0.715824 | 0.682689 |
Use the model to predict on the test data. Assign the results to a variable called rf_preds
.
# Get scores on test data
rf_preds = rf1.best_estimator_.predict(X_test)
Use the below get_test_scores()
function we will use to output the scores of the model on the test data.
def get_test_scores(model_name:str, preds, y_test_data):
'''
Generate a table of test scores.
In:
model_name (string): Your choice: how the model will be named in the output table
preds: numpy array of test predictions
y_test_data: numpy array of y_test data
Out:
table: a pandas df of precision, recall, f1, and accuracy scores for your model
'''
accuracy = accuracy_score(y_test_data, preds)
precision = precision_score(y_test_data, preds)
recall = recall_score(y_test_data, preds)
f1 = f1_score(y_test_data, preds)
table = pd.DataFrame({'model': [model_name],
'precision': [precision],
'recall': [recall],
'F1': [f1],
'accuracy': [accuracy]
})
return table
get_test_scores()
function to generate the scores on the test data. Assign the results to rf_test_scores
.rf_test_scores
to output the results. # Get scores on test data
rf_test_scores = get_test_scores('RF test', rf_preds, y_test)
results = pd.concat([results, rf_test_scores], axis=0)
results
model | precision | recall | F1 | accuracy | |
---|---|---|---|---|---|
0 | RF CV | 0.677001 | 0.759645 | 0.715824 | 0.682689 |
0 | RF test | 0.670996 | 0.771624 | 0.717800 | 0.680642 |
Try to improve our scores using an XGBoost model.
Instantiate the XGBoost classifier xgb
and set objective='binary:logistic'
. Also set the random state.
Create a dictionary cv_params
of the following hyperparameters and their corresponding values to tune:
max_depth
min_child_weight
learning_rate
n_estimators
Define a set scoring
of scoring metrics for grid search to capture (precision, recall, F1 score, and accuracy).
Instantiate the GridSearchCV
object xgb1
. Pass to it as arguments:
xgb
cv_params
scoring
cv=_
)refit='f1'
)# 1. Instantiate the XGBoost classifier
xgb = XGBClassifier(objective='binary:logistic', random_state=0)
# 2. Create a dictionary of hyperparameters to tune
cv_params = {'learning_rate': [0.1],
'max_depth': [8],
'min_child_weight': [2],
'n_estimators': [500]
}
# 3. Define a set of scoring metrics to capture
scoring = {'accuracy', 'precision', 'recall', 'f1'}
# 4. Instantiate the GridSearchCV object
xgb1 = GridSearchCV(xgb, cv_params, scoring=scoring, cv=4, refit='f1')
Now fit the model to the X_train
and y_train
data.
# Fit the model and time how long it takes
start = datetime.now()
print(xgb1.fit(X_train, y_train))
end = datetime.now()
print(end - start)
GridSearchCV(cv=4, error_score=nan, estimator=XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max... n_estimators=100, n_jobs=None, num_parallel_tree=None, objective='binary:logistic', predictor=None, random_state=0, reg_alpha=None, ...), iid='deprecated', n_jobs=None, param_grid={'learning_rate': [0.1], 'max_depth': [8], 'min_child_weight': [2], 'n_estimators': [500]}, pre_dispatch='2*n_jobs', refit='f1', return_train_score=False, scoring={'accuracy', 'precision', 'recall', 'f1'}, verbose=0) 0:03:04.027885
Get the best score from this model.
# Examine best score
xgb1.best_score_
0.6923998236397069
And the best parameters.
# Examine best parameters
xgb1.best_params_
{'learning_rate': 0.1, 'max_depth': 8, 'min_child_weight': 2, 'n_estimators': 500}
Use the make_results()
function to output all of the scores of the model.
# Call 'make_results()' on the GridSearch object
xgb1_cv_results = make_results('XGB CV', xgb1, 'f1')
results = pd.concat([results, xgb1_cv_results], axis=0)
results
model | precision | recall | F1 | accuracy | |
---|---|---|---|---|---|
0 | RF CV | 0.677001 | 0.759645 | 0.715824 | 0.682689 |
0 | RF test | 0.670996 | 0.771624 | 0.717800 | 0.680642 |
0 | XGB CV | 0.669160 | 0.717486 | 0.692400 | 0.664510 |
# Get scores on test data
xgb_preds = xgb1.best_estimator_.predict(X_test)
get_test_scores()
function to generate the scores on the test data. Assign the results to xgb_test_scores
.xgb_test_scores
to output the results.# Get scores on test data
xgb_test_scores = get_test_scores('XGB test', xgb_preds, y_test)
results = pd.concat([results, xgb_test_scores], axis=0)
results
model | precision | recall | F1 | accuracy | |
---|---|---|---|---|---|
0 | RF CV | 0.677001 | 0.759645 | 0.715824 | 0.682689 |
0 | RF test | 0.670996 | 0.771624 | 0.717800 | 0.680642 |
0 | XGB CV | 0.669160 | 0.717486 | 0.692400 | 0.664510 |
0 | XGB test | 0.672326 | 0.739266 | 0.704209 | 0.673108 |
Plot a confusion matrix of the model's predictions on the test data.
# Generate array of values for confusion matrix
# Plot confusion matrix
cm = confusion_matrix(y_test, rf_preds, labels=rf1.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
display_labels=rf1.classes_,
)
disp.plot(values_format='');
Use the feature_importances_
attribute of the best estimator object to inspect the features of your final model. You can then sort them and plot the most important ones.
# Plot feature importances
importances = rf1.best_estimator_.feature_importances_
rf_importances = pd.Series(importances, index=X_test.columns)
rf_importances = rf_importances.sort_values(ascending=False)[:15]
fig, ax = plt.subplots(figsize=(8,5))
rf_importances.plot.bar(ax=ax)
ax.set_title('Feature importances')
ax.set_ylabel('Mean decrease in impurity')
fig.tight_layout();
In this step, use the results of the models above to formulate a conclusion. We shall consider the following questions:
Exemplar responses:
Yes, this is model performs acceptably. Its F1 score was 0.7235 and it had an overall accuracy of 0.6865. It correctly identified ~78% of the actual responders in the test set, which is 48% better than a random guess. It may be worthwhile to test the model with a select group of taxi drivers to get feedback.
Unfortunately, random forest is not the most transparent machine learning algorithm. We know that VendorID
, predicted_fare
, mean_duration
, and mean_distance
are the most important features, but we don't know how they influence tipping. This would require further exploration. It is interesting that VendorID
is the most predictive feature. This seems to indicate that one of the two vendors tends to attract more generous customers. It may be worth performing statistical tests on the different vendors to examine this further.
There are almost always additional features that can be engineered, but hopefully the most obvious ones were generated during the first round of modeling. In our case, we could try creating three new columns that indicate if the trip distance is short, medium, or far. We could also engineer a column that gives a ratio that represents (the amount of money from the fare amount to the nearest higher multiple of \$5) / fare amount. For example, if the fare were \$12, the value in this column would be 0.25, because \$12 to the nearest higher multiple of \$5 (\$15) is \$3, and \$3 divided by \$12 is 0.25. The intuition for this feature is that people might be likely to simply round up their tip, so journeys with fares with values just under a multiple of \$5 may have lower tip percentages than those with fare values just over a multiple of \$5. We could also do the same thing for fares to the nearest \$10.
It would probably be very helpful to have past tipping behavior for each customer. It would also be valuable to have accurate tip values for customers who pay with cash. It would be helpful to have a lot more data. With enough data, we could create a unique feature for each pickup/dropoff combination.