Notebook

Tree-based Machine Learning Model¶

The New York City Taxi & Limousine Commission

In this project, we are the Data Analyst at the New York City Taxi & Limousine Commission (New York City TLC), and we have been requested to build a machine learning model to predict if a customer will not leave a tip. They want to use the model in an app that will alert taxi drivers to customers who are unlikely to tip, since drivers depend on tips.

Objective¶

We will employ the tree-based modeling techniques to predict on a binary target class.
The purpose of this model is to find ways to generate more revenue for taxi cab drivers.
The goal of this model is to predict whether or not a customer is a generous tipper.

There are 3 parts in this project

Part 1: Ethical considerations

Consider the ethical implications of the request
Revisit the project objective

Part 2: Feature engineering

Perform feature selection, extraction, and transformation to prepare the data for modeling

Part 3: Modeling

Build the models, evaluate them, and advise on next steps

PACE stages: Plan, Analyze, Construct, and Execute¶

Plan¶

In this stage, let's consider the following questions:

What is the main purpose of the task?
What are the ethical implications of the model?
Do the benefits of such a model outweigh the potential problems?
Should we proceed with the request to build this model? Why or why not?
Can the objective be modified to make it less problematic?

Responses¶

If our model focus whether a customer will tip or not, then it might end up making it difficult for customers, who can't afford to tip, to find a taxi, i.e. even if the model is correct, it will limit the accessibility of taxi service.
If Drivers are told by the app that a customer will leave a tip but the customer doesn't, then the drivers will not trust or use the app.
To make this prediction, ideally, we'd have behavourial history for each customer so we can predict if they will tip based on their previous rides.
The target variable would be a binary variable (1 or 0) that indictates whether a customer is expected to tip or more than 20%.

Actions¶

Instead of predicting whether people will or will not tip, we can predict if people are generous tipper.
Those who tip 20% or more are considered generous.

Step 1. Imports and data loading¶

Import packages and libraries needed to build and evaluate random forest and XGBoost classification models.

In [44]:

# Import packages and libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score,\
    confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay

from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from xgboost import plot_importance

In [45]:

# RUN THIS CELL TO SEE ALL COLUMNS 
# This lets us see all of the columns, preventing Juptyer from redacting them.
pd.set_option('display.max_columns', None)

In [46]:

# Load dataset into dataframe
df0 = pd.read_csv('2017_Yellow_Taxi_Trip_Data.csv')

# Import predicted fares and mean distance and duration
nyc_preds_means = pd.read_csv('nyc_preds_means.csv')

Inspect the first few rows of df0.

In [47]:

# Inspect the first few rows of df0
df0.head()

Out[47]:

	Unnamed: 0	VendorID	tpep_pickup_datetime	tpep_dropoff_datetime	passenger_count	trip_distance	RatecodeID	store_and_fwd_flag	PULocationID	DOLocationID	payment_type	fare_amount	extra	mta_tax	tip_amount	improvement_surcharge	total_amount
0	24870114	2	03/25/2017 8:55:43 AM	03/25/2017 9:09:47 AM	6	3.34	1	N	100	231	1	13.0	0.0	0.5	2.76	0.3	16.56
1	35634249	1	04/11/2017 2:53:28 PM	04/11/2017 3:19:58 PM	1	1.80	1	N	186	43	1	16.0	0.0	0.5	4.00	0.3	20.80
2	106203690	1	12/15/2017 7:26:56 AM	12/15/2017 7:34:08 AM	1	1.00	1	N	262	236	1	6.5	0.0	0.5	1.45	0.3	8.75
3	38942136	2	05/07/2017 1:17:59 PM	05/07/2017 1:48:14 PM	1	3.70	1	N	188	97	1	20.5	0.0	0.5	6.39	0.3	27.69
4	30841670	2	04/15/2017 11:32:20 PM	04/15/2017 11:49:03 PM	1	4.37	1	N	4	112	2	16.5	0.5	0.5	0.00	0.3	17.80

Inspect the first few rows of nyc_preds_means.

In [48]:

# Inspect the first few rows of `nyc_preds_means`
nyc_preds_means.head()

Out[48]:

	mean_duration	mean_distance	predicted_fare
0	22.847222	3.521667	16.434245
1	24.470370	3.108889	16.052218
2	7.250000	0.881429	7.053706
3	30.250000	3.700000	18.731650
4	14.616667	4.435000	15.845642

Join the two dataframes¶

Join the two dataframes using a method of your choice.

In [49]:

# Merge datasets
df0 = df0.merge(nyc_preds_means, left_index=True, right_index=True)
df0.head()

Out[49]:

	Unnamed: 0	VendorID	tpep_pickup_datetime	tpep_dropoff_datetime	passenger_count	trip_distance	RatecodeID	store_and_fwd_flag	PULocationID	DOLocationID	payment_type	fare_amount	extra	mta_tax	tip_amount	improvement_surcharge	total_amount	mean_duration	mean_distance	predicted_fare
0	24870114	2	03/25/2017 8:55:43 AM	03/25/2017 9:09:47 AM	6	3.34	1	N	100	231	1	13.0	0.0	0.5	2.76	0.3	16.56	22.847222	3.521667	16.434245
1	35634249	1	04/11/2017 2:53:28 PM	04/11/2017 3:19:58 PM	1	1.80	1	N	186	43	1	16.0	0.0	0.5	4.00	0.3	20.80	24.470370	3.108889	16.052218
2	106203690	1	12/15/2017 7:26:56 AM	12/15/2017 7:34:08 AM	1	1.00	1	N	262	236	1	6.5	0.0	0.5	1.45	0.3	8.75	7.250000	0.881429	7.053706
3	38942136	2	05/07/2017 1:17:59 PM	05/07/2017 1:48:14 PM	1	3.70	1	N	188	97	1	20.5	0.0	0.5	6.39	0.3	27.69	30.250000	3.700000	18.731650
4	30841670	2	04/15/2017 11:32:20 PM	04/15/2017 11:49:03 PM	1	4.37	1	N	4	112	2	16.5	0.5	0.5	0.00	0.3	17.80	14.616667	4.435000	15.845642

Analyze¶

Step 2. Feature engineering¶

Upon completion of EDA, let's inspect the newly combined dataframe.

In [50]:

# See the info of df0
df0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22699 entries, 0 to 22698
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             22699 non-null  int64  
 1   VendorID               22699 non-null  int64  
 2   tpep_pickup_datetime   22699 non-null  object 
 3   tpep_dropoff_datetime  22699 non-null  object 
 4   passenger_count        22699 non-null  int64  
 5   trip_distance          22699 non-null  float64
 6   RatecodeID             22699 non-null  int64  
 7   store_and_fwd_flag     22699 non-null  object 
 8   PULocationID           22699 non-null  int64  
 9   DOLocationID           22699 non-null  int64  
 10  payment_type           22699 non-null  int64  
 11  fare_amount            22699 non-null  float64
 12  extra                  22699 non-null  float64
 13  mta_tax                22699 non-null  float64
 14  tip_amount             22699 non-null  float64
 15  tolls_amount           22699 non-null  float64
 16  improvement_surcharge  22699 non-null  float64
 17  total_amount           22699 non-null  float64
 18  mean_duration          22699 non-null  float64
 19  mean_distance          22699 non-null  float64
 20  predicted_fare         22699 non-null  float64
dtypes: float64(11), int64(7), object(3)
memory usage: 3.6+ MB

We know from our EDA that customers who pay cash generally have a tip amount of $0. To meet the modeling objective, we'll need to sample the data to select only the customers who pay with credit card.

Let's copy df0 and assign the result to a variable called df1. Then, use a Boolean mask to filter df1 so it contains only customers who paid with credit card.

In [51]:

# Subset the data to isolate only customers who paid by credit card
df1 = df0[df0['payment_type'] == 1]
df1.head()

Out[51]:

	Unnamed: 0	VendorID	tpep_pickup_datetime	tpep_dropoff_datetime	passenger_count	trip_distance	RatecodeID	store_and_fwd_flag	PULocationID	DOLocationID	payment_type	fare_amount	extra	mta_tax	tip_amount	improvement_surcharge	total_amount	mean_duration	mean_distance	predicted_fare
0	24870114	2	03/25/2017 8:55:43 AM	03/25/2017 9:09:47 AM	6	3.34	1	N	100	231	1	13.0	0.0	0.5	2.76	0.3	16.56	22.847222	3.521667	16.434245
1	35634249	1	04/11/2017 2:53:28 PM	04/11/2017 3:19:58 PM	1	1.80	1	N	186	43	1	16.0	0.0	0.5	4.00	0.3	20.80	24.470370	3.108889	16.052218
2	106203690	1	12/15/2017 7:26:56 AM	12/15/2017 7:34:08 AM	1	1.00	1	N	262	236	1	6.5	0.0	0.5	1.45	0.3	8.75	7.250000	0.881429	7.053706
3	38942136	2	05/07/2017 1:17:59 PM	05/07/2017 1:48:14 PM	1	3.70	1	N	188	97	1	20.5	0.0	0.5	6.39	0.3	27.69	30.250000	3.700000	18.731650
5	23345809	2	03/25/2017 8:34:11 PM	03/25/2017 8:42:11 PM	6	2.30	1	N	161	236	1	9.0	0.5	0.5	2.06	0.3	12.36	11.855376	2.052258	10.441351

Target¶

Let's add a tip_percent column to the dataframe by performing the following calculation:

$$tip\ percent = \frac{tip\ amount}{total\ amount - tip\ amount}$$

Purpose: the customers who tip ≥ 20% are considered generous.

In [53]:

# Create tip % col
df1['tip_percent'] = round(df1['tip_amount'] / (df1['total_amount'] - df1['tip_amount']), 3)

Now that we have a tip_percent column, we can add a binary indicator to identify if each customer is generous (tipped more than 20% will be 1= yes, otherwise, 0 = no).

In [54]:

# Create 'generous' col (target)
df1['generous'] = (df1['tip_percent'] >= 0.2).astype(int)
df1.head()

Out[54]:

	Unnamed: 0	VendorID	tpep_pickup_datetime	tpep_dropoff_datetime	passenger_count	trip_distance	RatecodeID	store_and_fwd_flag	PULocationID	DOLocationID	payment_type	fare_amount	extra	mta_tax	tip_amount	improvement_surcharge	total_amount	mean_duration	mean_distance	predicted_fare	tip_percent	generous
0	24870114	2	03/25/2017 8:55:43 AM	03/25/2017 9:09:47 AM	6	3.34	1	N	100	231	1	13.0	0.0	0.5	2.76	0.3	16.56	22.847222	3.521667	16.434245	0.200	1
1	35634249	1	04/11/2017 2:53:28 PM	04/11/2017 3:19:58 PM	1	1.80	1	N	186	43	1	16.0	0.0	0.5	4.00	0.3	20.80	24.470370	3.108889	16.052218	0.238	1
2	106203690	1	12/15/2017 7:26:56 AM	12/15/2017 7:34:08 AM	1	1.00	1	N	262	236	1	6.5	0.0	0.5	1.45	0.3	8.75	7.250000	0.881429	7.053706	0.199	0
3	38942136	2	05/07/2017 1:17:59 PM	05/07/2017 1:48:14 PM	1	3.70	1	N	188	97	1	20.5	0.0	0.5	6.39	0.3	27.69	30.250000	3.700000	18.731650	0.300	1
5	23345809	2	03/25/2017 8:34:11 PM	03/25/2017 8:42:11 PM	6	2.30	1	N	161	236	1	9.0	0.5	0.5	2.06	0.3	12.36	11.855376	2.052258	10.441351	0.200	1

Create day column¶

Let's convert the tpep_pickup_datetime and tpep_dropoff_datetime columns to datetime.

In [55]:

# Convert pickup and dropoff cols to datetime
df1['tpep_pickup_datetime'] = pd.to_datetime(df1['tpep_pickup_datetime'], format='%m/%d/%Y %I:%M:%S %p')
df1['tpep_dropoff_datetime'] = pd.to_datetime(df1['tpep_dropoff_datetime'], format='%m/%d/%Y %I:%M:%S %p')

Now, create a day column that contains only the day of the week when each passenger was picked up. Then, convert the values to lowercase.

In [56]:

# Create a 'day' col
df1['day'] = df1['tpep_pickup_datetime'].dt.day_name().str.lower()

Create time of day columns¶

Next, engineer four new columns that represent time of day bins. Each column should contain binary values (0=no, 1=yes) that indicate whether a trip began (picked up) during the following times:

am_rush = [06:00–10:00)
daytime = [10:00–16:00)
pm_rush = [16:00–20:00)
nighttime = [20:00–06:00)

To do this, first create the four columns. For now, each new column should be identical and contain the same information: the hour (only) from the tpep_pickup_datetime column.

In [57]:

# Create 'am_rush' col
#==> ENTER YOUR CODE HERE
df1['am_rush'] = df1['tpep_pickup_datetime'].dt.hour.between(6, 10)

# Create 'daytime' col
#==> ENTER YOUR CODE HERE
df1['daytime'] = df1['tpep_pickup_datetime'].dt.hour.between(10, 16)

# Create 'pm_rush' col
#==> ENTER YOUR CODE HERE
df1['pm_rush'] = df1['tpep_pickup_datetime'].dt.hour.between(16, 20)

# Create 'nighttime' col
#==> ENTER YOUR CODE HERE
df1['nighttime'] = df1['tpep_pickup_datetime'].dt.hour.between(20, 6)

In [16]:

df1.head()

Out[16]:

	Unnamed: 0	VendorID	tpep_pickup_datetime	tpep_dropoff_datetime	passenger_count	trip_distance	RatecodeID	store_and_fwd_flag	PULocationID	DOLocationID	payment_type	fare_amount	extra	mta_tax	tip_amount	improvement_surcharge	total_amount	mean_duration	mean_distance	predicted_fare	tip_percent	generous	day	am_rush	daytime	pm_rush	nighttime
0	24870114	2	2017-03-25 08:55:43	2017-03-25 09:09:47	6	3.34	1	N	100	231	1	13.0	0.0	0.5	2.76	0.3	16.56	22.847222	3.521667	16.434245	0.200	1	saturday	True	False	False	False
1	35634249	1	2017-04-11 14:53:28	2017-04-11 15:19:58	1	1.80	1	N	186	43	1	16.0	0.0	0.5	4.00	0.3	20.80	24.470370	3.108889	16.052218	0.238	1	tuesday	False	True	False	False
2	106203690	1	2017-12-15 07:26:56	2017-12-15 07:34:08	1	1.00	1	N	262	236	1	6.5	0.0	0.5	1.45	0.3	8.75	7.250000	0.881429	7.053706	0.199	0	friday	True	False	False	False
3	38942136	2	2017-05-07 13:17:59	2017-05-07 13:48:14	1	3.70	1	N	188	97	1	20.5	0.0	0.5	6.39	0.3	27.69	30.250000	3.700000	18.731650	0.300	1	sunday	False	True	False	False
5	23345809	2	2017-03-25 20:34:11	2017-03-25 20:42:11	6	2.30	1	N	161	236	1	9.0	0.5	0.5	2.06	0.3	12.36	11.855376	2.052258	10.441351	0.200	1	saturday	False	False	True	False

In [17]:

# transform and encode all the time columns into '1' & '0'
cols = ['am_rush', 'daytime', 'pm_rush', 'nighttime']
df1.loc[:, cols] = df1.loc[:, cols].replace({True: 1, False: 0})
df1.head()

Out[17]:

	Unnamed: 0	VendorID	tpep_pickup_datetime	tpep_dropoff_datetime	passenger_count	trip_distance	RatecodeID	store_and_fwd_flag	PULocationID	DOLocationID	payment_type	fare_amount	extra	mta_tax	tip_amount	improvement_surcharge	total_amount	mean_duration	mean_distance	predicted_fare	tip_percent	generous	day	am_rush	daytime	pm_rush
0	24870114	2	2017-03-25 08:55:43	2017-03-25 09:09:47	6	3.34	1	N	100	231	1	13.0	0.0	0.5	2.76	0.3	16.56	22.847222	3.521667	16.434245	0.200	1	saturday	1	0	0
1	35634249	1	2017-04-11 14:53:28	2017-04-11 15:19:58	1	1.80	1	N	186	43	1	16.0	0.0	0.5	4.00	0.3	20.80	24.470370	3.108889	16.052218	0.238	1	tuesday	0	1	0
2	106203690	1	2017-12-15 07:26:56	2017-12-15 07:34:08	1	1.00	1	N	262	236	1	6.5	0.0	0.5	1.45	0.3	8.75	7.250000	0.881429	7.053706	0.199	0	friday	1	0	0
3	38942136	2	2017-05-07 13:17:59	2017-05-07 13:48:14	1	3.70	1	N	188	97	1	20.5	0.0	0.5	6.39	0.3	27.69	30.250000	3.700000	18.731650	0.300	1	sunday	0	1	0
5	23345809	2	2017-03-25 20:34:11	2017-03-25 20:42:11	6	2.30	1	N	161	236	1	9.0	0.5	0.5	2.06	0.3	12.36	11.855376	2.052258	10.441351	0.200	1	saturday	0	0	1

Create `month` column¶

Now, create a month column that contains only the abbreviated name of the month when each passenger was picked up, then convert the result to lowercase.

In [58]:

# Create 'month' col
df1['month'] = df1['tpep_pickup_datetime'].dt.strftime('%b').str.lower()

Examine the first five rows of your dataframe.

In [49]:

# Check the first few rows of df1
df1.head()

Out[49]:

	Unnamed: 0	VendorID	tpep_pickup_datetime	tpep_dropoff_datetime	passenger_count	trip_distance	RatecodeID	store_and_fwd_flag	PULocationID	DOLocationID	payment_type	fare_amount	extra	mta_tax	tip_amount	improvement_surcharge	total_amount	mean_duration	mean_distance	predicted_fare	tip_percent	generous	day	am_rush	daytime	pm_rush	month
0	24870114	2	2017-03-25 08:55:43	2017-03-25 09:09:47	6	3.34	1	N	100	231	1	13.0	0.0	0.5	2.76	0.3	16.56	22.847222	3.521667	16.434245	0.200	1	saturday	1	0	0	mar
1	35634249	1	2017-04-11 14:53:28	2017-04-11 15:19:58	1	1.80	1	N	186	43	1	16.0	0.0	0.5	4.00	0.3	20.80	24.470370	3.108889	16.052218	0.238	1	tuesday	0	1	0	apr
2	106203690	1	2017-12-15 07:26:56	2017-12-15 07:34:08	1	1.00	1	N	262	236	1	6.5	0.0	0.5	1.45	0.3	8.75	7.250000	0.881429	7.053706	0.199	0	friday	1	0	0	dec
3	38942136	2	2017-05-07 13:17:59	2017-05-07 13:48:14	1	3.70	1	N	188	97	1	20.5	0.0	0.5	6.39	0.3	27.69	30.250000	3.700000	18.731650	0.300	1	sunday	0	1	0	may
5	23345809	2	2017-03-25 20:34:11	2017-03-25 20:42:11	6	2.30	1	N	161	236	1	9.0	0.5	0.5	2.06	0.3	12.36	11.855376	2.052258	10.441351	0.200	1	saturday	0	0	1	mar

Drop columns¶

Drop redundant and irrelevant columns as well as those that would not be available when the model is deployed. This includes information like payment type, trip distance, tip amount, tip percentage, total amount, toll amount, etc. The target variable (generous) must remain in the data because it will get isolated as the y data for modeling.

In [59]:

# Drop columns
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15265 entries, 0 to 22698
Data columns (total 29 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Unnamed: 0             15265 non-null  int64         
 1   VendorID               15265 non-null  int64         
 2   tpep_pickup_datetime   15265 non-null  datetime64[ns]
 3   tpep_dropoff_datetime  15265 non-null  datetime64[ns]
 4   passenger_count        15265 non-null  int64         
 5   trip_distance          15265 non-null  float64       
 6   RatecodeID             15265 non-null  int64         
 7   store_and_fwd_flag     15265 non-null  object        
 8   PULocationID           15265 non-null  int64         
 9   DOLocationID           15265 non-null  int64         
 10  payment_type           15265 non-null  int64         
 11  fare_amount            15265 non-null  float64       
 12  extra                  15265 non-null  float64       
 13  mta_tax                15265 non-null  float64       
 14  tip_amount             15265 non-null  float64       
 15  tolls_amount           15265 non-null  float64       
 16  improvement_surcharge  15265 non-null  float64       
 17  total_amount           15265 non-null  float64       
 18  mean_duration          15265 non-null  float64       
 19  mean_distance          15265 non-null  float64       
 20  predicted_fare         15265 non-null  float64       
 21  tip_percent            15262 non-null  float64       
 22  generous               15265 non-null  int64         
 23  day                    15265 non-null  object        
 24  am_rush                15265 non-null  bool          
 25  daytime                15265 non-null  bool          
 26  pm_rush                15265 non-null  bool          
 27  nighttime              15265 non-null  bool          
 28  month                  15265 non-null  object        
dtypes: bool(4), datetime64[ns](2), float64(12), int64(8), object(3)
memory usage: 3.1+ MB

In [60]:

drop_cols = ['Unnamed: 0', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'payment_type',
            'trip_distance', 'store_and_fwd_flag', 'payment_type', 'fare_amount', 'extra',
             'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount', 'tip_percent']

df1 = df1.drop(drop_cols, axis=1)
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15265 entries, 0 to 22698
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   VendorID         15265 non-null  int64  
 1   passenger_count  15265 non-null  int64  
 2   RatecodeID       15265 non-null  int64  
 3   PULocationID     15265 non-null  int64  
 4   DOLocationID     15265 non-null  int64  
 5   mean_duration    15265 non-null  float64
 6   mean_distance    15265 non-null  float64
 7   predicted_fare   15265 non-null  float64
 8   generous         15265 non-null  int64  
 9   day              15265 non-null  object 
 10  am_rush          15265 non-null  bool   
 11  daytime          15265 non-null  bool   
 12  pm_rush          15265 non-null  bool   
 13  nighttime        15265 non-null  bool   
 14  month            15265 non-null  object 
dtypes: bool(4), float64(3), int64(6), object(2)
memory usage: 1.5+ MB

Variable encoding¶

Many of the columns are categorical and will need to be dummied (converted to binary). Some of these columns are numeric, but they actually encode categorical information, such as RatecodeID and the pickup and dropoff locations. To make these columns recognizable to the get_dummies() function as categorical variables, we'll first need to convert them to type(str).

Define a variable called cols_to_str, which is a list of the numeric columns that contain categorical information and must be converted to string: RatecodeID, PULocationID, DOLocationID.
Write a for loop that converts each column in cols_to_str to string.

In [61]:

# 1. Define list of cols to convert to string
cols_to_str = ['RatecodeID', 'PULocationID', 'DOLocationID','VendorID',]

# 2. Convert each column to string
for col in cols_to_str:
    df1[col] = df1[col].astype(str)

HINT

To convert to string, use astype(str) on the column.

Now convert all the categorical columns to binary.

Call get_dummies() on the dataframe and assign the results back to a new dataframe called df2.

In [62]:

# Convert categoricals to binary
df2 = pd.get_dummies(df1, drop_first=True)
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15265 entries, 0 to 22698
Columns: 347 entries, passenger_count to month_sep
dtypes: bool(4), float64(3), int64(2), uint8(338)
memory usage: 5.7 MB

Evaluation metric¶

Before modeling, we must decide on an evaluation metric.

Examine the class balance of your target variable.

In [63]:

# Get class balance of 'generous' col
df2['generous'].value_counts(normalize=True)

Out[63]:

1    0.526368
0    0.473632
Name: generous, dtype: float64

A little over half of the customers in this dataset were "generous" (tipped ≥ 20%). The dataset is very nearly balanced.

To determine a metric, consider the cost of both kinds of model error:

False positives (the model predicts a tip ≥ 20%, but the customer does not give one)
False negatives (the model predicts a tip < 20%, but the customer gives more)

False positives are worse for cab drivers, because they would pick up a customer expecting a good tip and then not receive one, frustrating the driver.

False negatives are worse for customers, because a cab driver would likely pick up a different customer who was predicted to tip more—even when the original customer would have tipped generously.

**The stakes are relatively even. We want to help taxi drivers make more money, but we don't want this to anger customers. Our metric should weigh both precision and recall equally.

Construct¶

Step 3. Modeling¶

Split the data¶

Now you're ready to model. The only remaining step is to split the data into features/target variable and training/testing data.

Define a variable y that isolates the target variable (generous).
Define a variable X that isolates the features.
Split the data into training and testing sets. Put 20% of the samples into the test set, stratify the data, and set the random state.

In [64]:

# Isolate target variable (y)
y = df2['generous']

# Isolate the features (X)
X = df2.drop('generous', axis=1)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2,random_state=42)

Random forest¶

Begin with using GridSearchCV to tune a random forest model.

Instantiate the random forest classifier rf and set the random state.
Create a dictionary cv_params of any of the following hyperparameters and their corresponding values to tune. The more you tune, the better your model will fit the data, but the longer it will take.

max_depth
max_features
max_samples
min_samples_leaf
min_samples_split
n_estimators

Define a set scoring of scoring metrics for GridSearch to capture (precision, recall, F1 score, and accuracy).
Instantiate the GridSearchCV object rf1. Pass to it as arguments:

estimator=rf
param_grid=cv_params
scoring=scoring
cv: define the number of you cross-validation folds you want (cv=_)
refit: indicate which evaluation metric you want to use to select the model (refit=_)

Note: refit should be set to 'f1'.

In [65]:

# 1. Instantiate the random forest classifier
rf = RandomForestClassifier(random_state=42)

# 2. Create a dictionary of hyperparameters to tune 
cv_params = {'max_depth': [None],
            'max_features': [1.0],
             'max_samples': [0.7],
             'min_samples_leaf': [1],
             'min_samples_split': [2],
             'n_estimators': [300]
            
            }


# 3. Define a set of scoring metrics to capture
scoring = {'accuracy', 'precision', 'recall', 'f1'}

# 4. Instantiate the GridSearchCV object
rf1 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='f1')

Now fit the model to the training data.

In [66]:

# fit the model and time how long it takes
from datetime import datetime

start = datetime.now()
print(rf1.fit(X_train, y_train))
end = datetime.now()

print(end - start)

GridSearchCV(cv=4, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False, random_state=42,
                                              verbose=0, warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'max_depth': [None], 'max_features': [1.0],
                         'max_samples': [0.7], 'min_samples_leaf': [1],
                         'min_samples_split': [2], 'n_estimators': [300]},
             pre_dispatch='2*n_jobs', refit='f1', return_train_score=False,
             scoring={'accuracy', 'precision', 'recall', 'f1'}, verbose=0)
0:03:56.572835

Saving the model with `Pickle`¶

We can use pickle to save the models and read them back in. This can be particularly helpful when performing a search over many possible hyperparameter values.

In [67]:

import pickle 

# Define a path to the folder where you want to save the model
path = '/home/jovyan/work/'

In [68]:

def write_pickle(path, model_object, save_name:str):
    '''
    save_name is a string.
    '''
    with open(path + save_name + '.pickle', 'wb') as to_write:
        pickle.dump(model_object, to_write)

In [69]:

def read_pickle(path, saved_model_name:str):
    '''
    saved_model_name is a string.
    '''
    with open(path + saved_model_name + '.pickle', 'rb') as to_read:
        model = pickle.load(to_read)

        return model

Examine the best average score across all the validation folds.

In [70]:

# Examine best score
rf1.best_score_

Out[70]:

0.7158235030134816

Examine the best combination of hyperparameters.

In [71]:

rf1.best_params_

Out[71]:

{'max_depth': None,
 'max_features': 1.0,
 'max_samples': 0.7,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 300}

Use the make_results() function to output all of the scores of your model. Note that it accepts three arguments.

In [72]:

def make_results(model_name:str, model_object, metric:str):
    '''
    Arguments:
    model_name (string): what you want the model to be called in the output table
    model_object: a fit GridSearchCV object
    metric (string): precision, recall, f1, or accuracy

    Returns a pandas df with the F1, recall, precision, and accuracy scores
    for the model with the best mean 'metric' score across all validation folds.
    '''

    # Create dictionary that maps input metric to actual metric name in GridSearchCV
    metric_dict = {'precision': 'mean_test_precision',
                 'recall': 'mean_test_recall',
                 'f1': 'mean_test_f1',
                 'accuracy': 'mean_test_accuracy',
                 }

    # Get all the results from the CV and put them in a df
    cv_results = pd.DataFrame(model_object.cv_results_)

    # Isolate the row of the df with the max(metric) score
    best_estimator_results = cv_results.iloc[cv_results[metric_dict[metric]].idxmax(), :]

    # Extract Accuracy, precision, recall, and f1 score from that row
    f1 = best_estimator_results.mean_test_f1
    recall = best_estimator_results.mean_test_recall
    precision = best_estimator_results.mean_test_precision
    accuracy = best_estimator_results.mean_test_accuracy

    # Create table of results
    table = pd.DataFrame({'model': [model_name],
                        'precision': [precision],
                        'recall': [recall],
                        'F1': [f1],
                        'accuracy': [accuracy],
                        },
                       )

    return table

Call make_results() on the GridSearch object.

In [73]:

# View the results
results = make_results('RF CV', rf1, 'f1')
results

Out[73]:

	model	precision	recall	F1	accuracy
0	RF CV	0.677001	0.759645	0.715824	0.682689

Use the model to predict on the test data. Assign the results to a variable called rf_preds.

In [74]:

# Get scores on test data
rf_preds = rf1.best_estimator_.predict(X_test)

Use the below get_test_scores() function we will use to output the scores of the model on the test data.

In [75]:

def get_test_scores(model_name:str, preds, y_test_data):
    '''
    Generate a table of test scores.

    In:
    model_name (string): Your choice: how the model will be named in the output table
    preds: numpy array of test predictions
    y_test_data: numpy array of y_test data

    Out:
    table: a pandas df of precision, recall, f1, and accuracy scores for your model
    '''
    accuracy = accuracy_score(y_test_data, preds)
    precision = precision_score(y_test_data, preds)
    recall = recall_score(y_test_data, preds)
    f1 = f1_score(y_test_data, preds)

    table = pd.DataFrame({'model': [model_name],
                        'precision': [precision],
                        'recall': [recall],
                        'F1': [f1],
                        'accuracy': [accuracy]
                        })

    return table

Use the get_test_scores() function to generate the scores on the test data. Assign the results to rf_test_scores.
Call rf_test_scores to output the results.

RF test results¶

In [76]:

 # Get scores on test data
rf_test_scores = get_test_scores('RF test', rf_preds, y_test)
results = pd.concat([results, rf_test_scores], axis=0)
results

Out[76]:

	model	precision	recall	F1	accuracy
0	RF CV	0.677001	0.759645	0.715824	0.682689
0	RF test	0.670996	0.771624	0.717800	0.680642

XGBoost¶

Try to improve our scores using an XGBoost model.

Instantiate the XGBoost classifier xgb and set objective='binary:logistic'. Also set the random state.
Create a dictionary cv_params of the following hyperparameters and their corresponding values to tune:

max_depth
min_child_weight
learning_rate
n_estimators

Define a set scoring of scoring metrics for grid search to capture (precision, recall, F1 score, and accuracy).
Instantiate the GridSearchCV object xgb1. Pass to it as arguments:

estimator=xgb
param_grid=cv_params
scoring=scoring
cv: define the number of cross-validation folds you want (cv=_)
refit: indicate which evaluation metric you want to use to select the model (refit='f1')

In [77]:

# 1. Instantiate the XGBoost classifier
xgb = XGBClassifier(objective='binary:logistic', random_state=0)

# 2. Create a dictionary of hyperparameters to tune
cv_params = {'learning_rate': [0.1],
            'max_depth': [8],
             'min_child_weight': [2],
             'n_estimators': [500]
            
            }

# 3. Define a set of scoring metrics to capture
scoring = {'accuracy', 'precision', 'recall', 'f1'}

# 4. Instantiate the GridSearchCV object
xgb1 = GridSearchCV(xgb, cv_params, scoring=scoring, cv=4, refit='f1')

Now fit the model to the X_train and y_train data.

In [78]:

# Fit the model and time how long it takes
start = datetime.now()
print(xgb1.fit(X_train, y_train))
end = datetime.now()

print(end - start)

GridSearchCV(cv=4, error_score=nan,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     callbacks=None, colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None,
                                     early_stopping_rounds=None,
                                     enable_categorical=False, eval_metric=None,
                                     gamma=None, gpu_id=None, grow_policy=None,
                                     importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=None, max...
                                     n_estimators=100, n_jobs=None,
                                     num_parallel_tree=None,
                                     objective='binary:logistic',
                                     predictor=None, random_state=0,
                                     reg_alpha=None, ...),
             iid='deprecated', n_jobs=None,
             param_grid={'learning_rate': [0.1], 'max_depth': [8],
                         'min_child_weight': [2], 'n_estimators': [500]},
             pre_dispatch='2*n_jobs', refit='f1', return_train_score=False,
             scoring={'accuracy', 'precision', 'recall', 'f1'}, verbose=0)
0:03:04.027885

Get the best score from this model.

In [81]:

# Examine best score
xgb1.best_score_

Out[81]:

0.6923998236397069

And the best parameters.

In [82]:

# Examine best parameters
xgb1.best_params_

Out[82]:

{'learning_rate': 0.1,
 'max_depth': 8,
 'min_child_weight': 2,
 'n_estimators': 500}

XGB CV Results¶

Use the make_results() function to output all of the scores of the model.

In [83]:

# Call 'make_results()' on the GridSearch object

xgb1_cv_results = make_results('XGB CV', xgb1, 'f1')
results = pd.concat([results, xgb1_cv_results], axis=0)
results

Out[83]:

model	precision	recall	F1	accuracy
RF CV	0.677001	0.759645	0.715824	0.682689
RF test	0.670996	0.771624	0.717800	0.680642
XGB CV	0.669160	0.717486	0.692400	0.664510

In [84]:

# Get scores on test data
xgb_preds = xgb1.best_estimator_.predict(X_test)

XGB test results¶

Use the get_test_scores() function to generate the scores on the test data. Assign the results to xgb_test_scores.
Call xgb_test_scores to output the results.

In [85]:

# Get scores on test data
xgb_test_scores = get_test_scores('XGB test', xgb_preds, y_test)
results = pd.concat([results, xgb_test_scores], axis=0)
results

Out[85]:

model	precision	recall	F1	accuracy
RF CV	0.677001	0.759645	0.715824	0.682689
RF test	0.670996	0.771624	0.717800	0.680642
XGB CV	0.669160	0.717486	0.692400	0.664510
XGB test	0.672326	0.739266	0.704209	0.673108

Plot a confusion matrix of the model's predictions on the test data.

In [86]:

# Generate array of values for confusion matrix

# Plot confusion matrix

cm = confusion_matrix(y_test, rf_preds, labels=rf1.classes_)

disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                             display_labels=rf1.classes_, 
                             )
disp.plot(values_format='');

Feature importance¶

Use the feature_importances_ attribute of the best estimator object to inspect the features of your final model. You can then sort them and plot the most important ones.

In [87]:

# Plot feature importances
importances = rf1.best_estimator_.feature_importances_
rf_importances = pd.Series(importances, index=X_test.columns)
rf_importances = rf_importances.sort_values(ascending=False)[:15]

fig, ax = plt.subplots(figsize=(8,5))
rf_importances.plot.bar(ax=ax)
ax.set_title('Feature importances')
ax.set_ylabel('Mean decrease in impurity')
fig.tight_layout();

Execute¶

Conclusion¶

In this step, use the results of the models above to formulate a conclusion. We shall consider the following questions:

Exemplar responses:

Should we recommend using this model? Why or why not?

Yes, this is model performs acceptably. Its F₁ score was 0.7235 and it had an overall accuracy of 0.6865. It correctly identified ~78% of the actual responders in the test set, which is 48% better than a random guess. It may be worthwhile to test the model with a select group of taxi drivers to get feedback.

What was the highest scoring model doing? How was it making predictions?

Unfortunately, random forest is not the most transparent machine learning algorithm. We know that VendorID, predicted_fare, mean_duration, and mean_distance are the most important features, but we don't know how they influence tipping. This would require further exploration. It is interesting that VendorID is the most predictive feature. This seems to indicate that one of the two vendors tends to attract more generous customers. It may be worth performing statistical tests on the different vendors to examine this further.

Are there new features that can be engineered and might improve model performance?

There are almost always additional features that can be engineered, but hopefully the most obvious ones were generated during the first round of modeling. In our case, we could try creating three new columns that indicate if the trip distance is short, medium, or far. We could also engineer a column that gives a ratio that represents (the amount of money from the fare amount to the nearest higher multiple of \$5) / fare amount. For example, if the fare were \$12, the value in this column would be 0.25, because \$12 to the nearest higher multiple of \$5 (\$15) is \$3, and \$3 divided by \$12 is 0.25. The intuition for this feature is that people might be likely to simply round up their tip, so journeys with fares with values just under a multiple of \$5 may have lower tip percentages than those with fare values just over a multiple of \$5. We could also do the same thing for fares to the nearest \$10.

What features would be useful to have that would likely improve the performance of the model?

It would probably be very helpful to have past tipping behavior for each customer. It would also be valuable to have accurate tip values for customers who pay with cash. It would be helpful to have a lot more data. With enough data, we could create a unique feature for each pickup/dropoff combination.

In [ ]: