#!/usr/bin/env python
# coding: utf-8

# # **Tree-based Machine Learning Model**

# **The New York City Taxi & Limousine Commission**

# In this project, we are data analysts at the New York City Taxi & Limousine Commission (New York City TLC), and we have been asked to build a machine learning model to predict whether a customer will not leave a tip. The TLC wants to use the model in an app that alerts taxi drivers to customers who are unlikely to tip, since drivers depend on tips.

# # Objective
#
# We will employ tree-based modeling techniques to predict on a binary target class.
#
# **The purpose** of this model is to find ways to generate more revenue for taxi cab drivers.
#
# **The goal** of this model is to predict whether or not a customer is a generous tipper.
# There are three parts to this project:
#
# **Part 1:** Ethical considerations
# * Consider the ethical implications of the request
# * Revisit the project objective
#
# **Part 2:** Feature engineering
# * Perform feature selection, extraction, and transformation to prepare the data for modeling
#
# **Part 3:** Modeling
# * Build the models, evaluate them, and advise on next steps

# **PACE stages: Plan, Analyze, Construct, and Execute**

# ## Plan
#
# In this stage, let's consider the following questions:
#
# 1. What is the main purpose of the task?
# 2. What are the ethical implications of the model?
# 3. Do the benefits of such a model outweigh the potential problems?
# 4. Should we proceed with the request to build this model? Why or why not?
# 5. Can the objective be modified to make it less problematic?

# ### Responses
#
# - If our model focuses on whether a customer will tip at all, it might make it difficult for customers who can't afford to tip to find a taxi. In other words, even if the model is accurate, it would limit the accessibility of taxi service.
# - If drivers are told by the app that a customer will leave a tip and the customer doesn't, the drivers will stop trusting and using the app.
# - To make this prediction well, we would ideally have behavioral history for each customer, so we could predict tipping based on their previous rides.
# - The target variable would be a binary variable (1 or 0) that indicates whether a customer is expected to tip 20% or more.

# ### Actions
#
# - Instead of predicting whether people will or will not tip, we can predict whether people are *generous* tippers.
# - Those who tip 20% or more are considered `generous`.

# ### **Step 1. Imports and data loading**
#
# Import packages and libraries needed to build and evaluate random forest and XGBoost classification models.

# In[44]:

# Import packages and libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, \
    confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from xgboost import plot_importance

# In[45]:

# RUN THIS CELL TO SEE ALL COLUMNS
# This lets us see all of the columns, preventing Jupyter from truncating them.
pd.set_option('display.max_columns', None)

# In[46]:

# Load dataset into dataframe
df0 = pd.read_csv('2017_Yellow_Taxi_Trip_Data.csv')

# Import predicted fares and mean distance and duration
nyc_preds_means = pd.read_csv('nyc_preds_means.csv')

# Inspect the first few rows of `df0`.

# In[47]:

# Inspect the first few rows of df0
df0.head()

# Inspect the first few rows of `nyc_preds_means`.

# In[48]:

# Inspect the first few rows of nyc_preds_means
nyc_preds_means.head()

# #### Join the two dataframes
#
# Join the two dataframes using a method of your choice. Because the two files share the same row order, an index-on-index merge works here.

# In[49]:

# Merge datasets
df0 = df0.merge(nyc_preds_means, left_index=True, right_index=True)
df0.head()

# ## Analyze
#
# ### **Step 2. Feature engineering**
#
# Upon completion of EDA, let's inspect the newly combined dataframe.

# In[50]:

# See the info of df0
df0.info()

# We know from our EDA that customers who pay cash generally have a tip amount of $0. To meet the modeling objective, we'll need to subset the data to select only the customers who pay with credit card.
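# As a quick, optional check of that EDA finding before filtering (per the TLC data dictionary, `payment_type` 1 = credit card and 2 = cash), we can compare mean tip amounts across payment types. This is a sketch, not one of the required steps.

# In[ ]:

# Mean tip amount by payment type; cash trips (2) should be at or near $0
df0.groupby('payment_type')['tip_amount'].mean()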
# Let's use a Boolean mask to filter `df0` so it contains only customers who paid with credit card, and assign a copy of the result to a variable called `df1`. Taking a copy avoids pandas `SettingWithCopyWarning` issues when we add columns to `df1` later.

# In[51]:

# Subset the data to isolate only customers who paid by credit card
df1 = df0[df0['payment_type'] == 1].copy()
df1.head()

# ##### **Target**
#
# Let's add a `tip_percent` column to the dataframe by performing the following calculation:
# $$tip\ percent = \frac{tip\ amount}{total\ amount - tip\ amount}$$
#
# Purpose: the customers who tip ≥ 20% are considered `generous`.

# In[53]:

# Create tip % col
df1['tip_percent'] = round(df1['tip_amount'] / (df1['total_amount'] - df1['tip_amount']), 3)

# Now that we have a `tip_percent` column, we can add a binary indicator to identify whether each customer is `generous` (1 = tipped ≥ 20%, 0 = otherwise).

# In[54]:

# Create 'generous' col (target)
df1['generous'] = (df1['tip_percent'] >= 0.2).astype(int)
df1.head()

# #### Create day column
#
# Let's convert the `tpep_pickup_datetime` and `tpep_dropoff_datetime` columns to datetime.

# In[55]:

# Convert pickup and dropoff cols to datetime
df1['tpep_pickup_datetime'] = pd.to_datetime(df1['tpep_pickup_datetime'], format='%m/%d/%Y %I:%M:%S %p')
df1['tpep_dropoff_datetime'] = pd.to_datetime(df1['tpep_dropoff_datetime'], format='%m/%d/%Y %I:%M:%S %p')

# Now, create a `day` column that contains only the day of the week when each passenger was picked up. Then, convert the values to lowercase.

# In[56]:

# Create a 'day' col
df1['day'] = df1['tpep_pickup_datetime'].dt.day_name().str.lower()

# #### Create time of day columns
#
# Next, engineer four new columns that represent time of day bins. Each column should contain binary values (0=no, 1=yes) that indicate whether a trip began (picked up) during the following times:
#
# `am_rush` = [06:00–10:00)
# `daytime` = [10:00–16:00)
# `pm_rush` = [16:00–20:00)
# `nighttime` = [20:00–06:00)
#
# To do this, extract the pickup hour from `tpep_pickup_datetime` and test it against each bin. Note that the intervals are half-open (the end hour is excluded), and that `nighttime` wraps past midnight, so it needs a combined condition.

# In[57]:

# Extract the pickup hour once; each bin is a half-open interval [start, end)
hour = df1['tpep_pickup_datetime'].dt.hour

# Create 'am_rush' col
df1['am_rush'] = (hour >= 6) & (hour < 10)

# Create 'daytime' col
df1['daytime'] = (hour >= 10) & (hour < 16)

# Create 'pm_rush' col
df1['pm_rush'] = (hour >= 16) & (hour < 20)

# Create 'nighttime' col (wraps past midnight, so it needs two conditions)
df1['nighttime'] = (hour >= 20) | (hour < 6)

# In[16]:

df1.head()

# In[17]:

# Encode the four time-of-day columns as integers (1 = yes, 0 = no)
cols = ['am_rush', 'daytime', 'pm_rush', 'nighttime']
df1[cols] = df1[cols].astype(int)
df1.head()

# #### Create `month` column
#
# Now, create a `month` column that contains only the abbreviated name of the month when each passenger was picked up, then convert the result to lowercase.

# In[58]:

# Create 'month' col
df1['month'] = df1['tpep_pickup_datetime'].dt.strftime('%b').str.lower()

# Examine the first five rows of the dataframe.

# In[49]:

# Check the first few rows of df1
df1.head()
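# As a quick sanity check of the four new time-of-day columns: every pickup hour falls into exactly one bin, so the indicator columns should sum to 1 in every row.

# In[ ]:

# Each trip should belong to exactly one time-of-day bin
assert (df1[['am_rush', 'daytime', 'pm_rush', 'nighttime']].sum(axis=1) == 1).all()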
# #### Drop columns
#
# Drop redundant and irrelevant columns, as well as columns that would not be available when the model is deployed. This includes information like payment type, trip distance, tip amount, tip percentage, total amount, and tolls amount. The target variable (`generous`) must remain in the data, because it will be isolated as the `y` data for modeling.

# In[59]:

# Inspect the columns before dropping
df1.info()

# In[60]:

drop_cols = ['Unnamed: 0', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
             'payment_type', 'trip_distance', 'store_and_fwd_flag',
             'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
             'improvement_surcharge', 'total_amount', 'tip_percent']

df1 = df1.drop(drop_cols, axis=1)
df1.info()

# #### Variable encoding
#
# Many of the columns are categorical and will need to be dummied (converted to binary). Some of these columns are numeric, but they actually encode categorical information, such as `RatecodeID` and the pickup and dropoff locations. To make these columns recognizable to the `get_dummies()` function as categorical variables, we'll first need to convert them to `type(str)`.
#
# 1. Define a variable called `cols_to_str`, which is a list of the numeric columns that contain categorical information and must be converted to string: `RatecodeID`, `PULocationID`, `DOLocationID`, and `VendorID`.
# 2. Write a for loop that converts each column in `cols_to_str` to string.

# In[61]:

# 1. Define list of cols to convert to string
cols_to_str = ['RatecodeID', 'PULocationID', 'DOLocationID', 'VendorID']

# 2. Convert each column to string
for col in cols_to_str:
    df1[col] = df1[col].astype(str)
# **Hint:** To convert to string, use `astype(str)` on the column.
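# To see why this conversion matters, here is a small illustration on toy data (the `rate_code` column is made up for the example): by default, `get_dummies()` leaves numeric columns untouched and only dummies object/category columns.

# In[ ]:

# get_dummies() ignores numeric columns unless they are converted to string
toy = pd.DataFrame({'rate_code': [1, 2, 1]})
print(pd.get_dummies(toy).columns.tolist())              # ['rate_code'] -- unchanged
print(pd.get_dummies(toy.astype(str)).columns.tolist())  # ['rate_code_1', 'rate_code_2']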
# Now convert all the categorical columns to binary.
#
# 1. Call `get_dummies()` on the dataframe and assign the results back to a new dataframe called `df2`.

# In[62]:

# Convert categoricals to binary
df2 = pd.get_dummies(df1, drop_first=True)
df2.info()

# ##### Evaluation metric
#
# Before modeling, we must decide on an evaluation metric.
#
# 1. Examine the class balance of the target variable.

# In[63]:

# Get class balance of 'generous' col
df2['generous'].value_counts(normalize=True)

# A little over half of the customers in this dataset were "generous" (tipped ≥ 20%), so the dataset is very nearly balanced.
#
# To determine a metric, consider the cost of both kinds of model error:
#
# * False positives (the model predicts a tip ≥ 20%, but the customer tips less)
# * False negatives (the model predicts a tip < 20%, but the customer actually tips ≥ 20%)
#
# False positives are worse for cab drivers, because they would pick up a customer expecting a good tip and then not receive one, frustrating the driver.
#
# False negatives are worse for customers, because a cab driver would likely pick up a different customer who was predicted to tip more, even though the original customer would have tipped generously.
#
# **The stakes are relatively even.** We want to help taxi drivers make more money, but we don't want this to anger customers. Our metric should therefore weigh precision and recall equally. The F1 score does exactly this: it is the harmonic mean of the two.

# ## Construct
#
# ### **Step 3. Modeling**
#
# ##### **Split the data**
#
# Now we're ready to model. The only remaining step is to split the data into features/target variable and training/testing data.
#
# 1. Define a variable `y` that isolates the target variable (`generous`).
# 2. Define a variable `X` that isolates the features.
# 3. Split the data into training and testing sets. Put 20% of the samples into the test set, stratify the data, and set the random state.

# In[64]:

# Isolate target variable (y)
y = df2['generous']

# Isolate the features (X)
X = df2.drop('generous', axis=1)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    test_size=0.2, random_state=42)

# ##### **Random forest**
#
# Begin by using `GridSearchCV` to tune a random forest model.
#
# 1. Instantiate the random forest classifier `rf` and set the random state.
#
# 2. Create a dictionary `cv_params` of any of the following hyperparameters and their corresponding values to tune. The more you tune, the better the model will fit the data, but the longer the search will take:
#    - `max_depth`
#    - `max_features`
#    - `max_samples`
#    - `min_samples_leaf`
#    - `min_samples_split`
#    - `n_estimators`
#
# 3. Define a set `scoring` of scoring metrics for GridSearch to capture (precision, recall, F1 score, and accuracy).
#
# 4. Instantiate the `GridSearchCV` object `rf1`. Pass to it as arguments:
#    - estimator=`rf`
#    - param_grid=`cv_params`
#    - scoring=`scoring`
#    - cv: define the number of cross-validation folds you want (`cv=_`)
#    - refit: indicate which evaluation metric you want to use to select the model (`refit=_`)
#
# **Note:** `refit` should be set to `'f1'`.

# In[65]:

# 1. Instantiate the random forest classifier
rf = RandomForestClassifier(random_state=42)

# 2. Create a dictionary of hyperparameters to tune
cv_params = {'max_depth': [None],
             'max_features': [1.0],
             'max_samples': [0.7],
             'min_samples_leaf': [1],
             'min_samples_split': [2],
             'n_estimators': [300]
             }

# 3. Define a set of scoring metrics to capture
scoring = {'accuracy', 'precision', 'recall', 'f1'}

# 4. Instantiate the GridSearchCV object
rf1 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='f1')

# Now fit the model to the training data.

# In[66]:

# Fit the model and time how long it takes
from datetime import datetime

start = datetime.now()
print(rf1.fit(X_train, y_train))
end = datetime.now()
print(end - start)

# ### Saving the model with `pickle`
#
# We can use `pickle` to save the models and read them back in. This can be particularly helpful when performing a search over many possible hyperparameter values, because it saves us from having to refit.

# In[67]:

import pickle

# Define a path to the folder where you want to save the model
path = '/home/jovyan/work/'

# In[68]:

def write_pickle(path, model_object, save_name:str):
    '''
    Save a fit model object to the given path, using save_name (a string) as the filename.
    '''
    with open(path + save_name + '.pickle', 'wb') as to_write:
        pickle.dump(model_object, to_write)

# In[69]:

def read_pickle(path, saved_model_name:str):
    '''
    Read a saved model object back in from the given path. saved_model_name is a string.
    '''
    with open(path + saved_model_name + '.pickle', 'rb') as to_read:
        model = pickle.load(to_read)

    return model
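# For example, here is a hedged usage sketch of the two helpers above (the filename `'taxi_rf_cv1'` is illustrative, not from the original workflow): after the long grid search finishes, we could save the fit object and reload it later instead of refitting.

# In[ ]:

# Save the fit GridSearchCV object, then read it back in
# (the file name is arbitrary / illustrative)
write_pickle(path, rf1, 'taxi_rf_cv1')
rf1 = read_pickle(path, 'taxi_rf_cv1')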
# Examine the best average score across all the validation folds.

# In[70]:

# Examine best score
rf1.best_score_

# Examine the best combination of hyperparameters.

# In[71]:

rf1.best_params_

# Use the `make_results()` function to output all of the scores of the model. Note that it accepts three arguments.

# In[72]:

def make_results(model_name:str, model_object, metric:str):
    '''
    Arguments:
        model_name (string): what you want the model to be called in the output table
        model_object: a fit GridSearchCV object
        metric (string): precision, recall, f1, or accuracy

    Returns a pandas df with the F1, recall, precision, and accuracy scores
    for the model with the best mean 'metric' score across all validation folds.
    '''

    # Create dictionary that maps input metric to actual metric name in GridSearchCV
    metric_dict = {'precision': 'mean_test_precision',
                   'recall': 'mean_test_recall',
                   'f1': 'mean_test_f1',
                   'accuracy': 'mean_test_accuracy',
                   }

    # Get all the results from the CV and put them in a df
    cv_results = pd.DataFrame(model_object.cv_results_)

    # Isolate the row of the df with the max(metric) score
    best_estimator_results = cv_results.iloc[cv_results[metric_dict[metric]].idxmax(), :]

    # Extract accuracy, precision, recall, and f1 score from that row
    f1 = best_estimator_results.mean_test_f1
    recall = best_estimator_results.mean_test_recall
    precision = best_estimator_results.mean_test_precision
    accuracy = best_estimator_results.mean_test_accuracy

    # Create table of results
    table = pd.DataFrame({'model': [model_name],
                          'precision': [precision],
                          'recall': [recall],
                          'F1': [f1],
                          'accuracy': [accuracy],
                          },
                         )

    return table

# Call `make_results()` on the GridSearch object.

# In[73]:

# View the results
results = make_results('RF CV', rf1, 'f1')
results

# Use the model to predict on the test data. Assign the results to a variable called `rf_preds`.

# In[74]:

# Get predictions on test data
rf_preds = rf1.best_estimator_.predict(X_test)

# Use the `get_test_scores()` function below to output the scores of the model on the test data.

# In[75]:

def get_test_scores(model_name:str, preds, y_test_data):
    '''
    Generate a table of test scores.

    In:
        model_name (string): how the model will be named in the output table
        preds: numpy array of test predictions
        y_test_data: numpy array of y_test data

    Out:
        table: a pandas df of precision, recall, f1, and accuracy scores for the model
    '''
    accuracy = accuracy_score(y_test_data, preds)
    precision = precision_score(y_test_data, preds)
    recall = recall_score(y_test_data, preds)
    f1 = f1_score(y_test_data, preds)

    table = pd.DataFrame({'model': [model_name],
                          'precision': [precision],
                          'recall': [recall],
                          'F1': [f1],
                          'accuracy': [accuracy]
                          })

    return table

# ###### RF test results
#
# 1. Use the `get_test_scores()` function to generate the scores on the test data. Assign the results to `rf_test_scores`.
# 2. Call `rf_test_scores` to output the results.

# In[76]:

# Get scores on test data
rf_test_scores = get_test_scores('RF test', rf_preds, y_test)
results = pd.concat([results, rf_test_scores], axis=0)
results

# ##### **XGBoost**
#
# Let's try to improve our scores using an XGBoost model.
#
# 1. Instantiate the XGBoost classifier `xgb` and set `objective='binary:logistic'`. Also set the random state.
#
# 2. Create a dictionary `cv_params` of the following hyperparameters and their corresponding values to tune:
#    - `max_depth`
#    - `min_child_weight`
#    - `learning_rate`
#    - `n_estimators`
#
# 3. Define a set `scoring` of scoring metrics for grid search to capture (precision, recall, F1 score, and accuracy).
#
# 4. Instantiate the `GridSearchCV` object `xgb1`. Pass to it as arguments:
#    - estimator=`xgb`
#    - param_grid=`cv_params`
#    - scoring=`scoring`
#    - cv: define the number of cross-validation folds you want (`cv=_`)
#    - refit: indicate which evaluation metric you want to use to select the model (`refit='f1'`)

# In[77]:

# 1. Instantiate the XGBoost classifier
xgb = XGBClassifier(objective='binary:logistic', random_state=0)

# 2. Create a dictionary of hyperparameters to tune
cv_params = {'learning_rate': [0.1],
             'max_depth': [8],
             'min_child_weight': [2],
             'n_estimators': [500]
             }

# 3. Define a set of scoring metrics to capture
scoring = {'accuracy', 'precision', 'recall', 'f1'}

# 4. Instantiate the GridSearchCV object
xgb1 = GridSearchCV(xgb, cv_params, scoring=scoring, cv=4, refit='f1')

# Now fit the model to the `X_train` and `y_train` data.

# In[78]:

# Fit the model and time how long it takes
start = datetime.now()
print(xgb1.fit(X_train, y_train))
end = datetime.now()
print(end - start)

# Get the best score from this model.

# In[81]:

# Examine best score
xgb1.best_score_

# And the best parameters.

# In[82]:

# Examine best parameters
xgb1.best_params_

# ##### XGB CV results
#
# Use the `make_results()` function to output all of the scores of the model.

# In[83]:

# Call 'make_results()' on the GridSearch object
xgb1_cv_results = make_results('XGB CV', xgb1, 'f1')
results = pd.concat([results, xgb1_cv_results], axis=0)
results

# In[84]:

# Get predictions on test data
xgb_preds = xgb1.best_estimator_.predict(X_test)

# ###### XGB test results
#
# 1. Use the `get_test_scores()` function to generate the scores on the test data. Assign the results to `xgb_test_scores`.
# 2. Call `xgb_test_scores` to output the results.

# In[85]:

# Get scores on test data
xgb_test_scores = get_test_scores('XGB test', xgb_preds, y_test)
results = pd.concat([results, xgb_test_scores], axis=0)
results
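# We imported `roc_auc_score` and `RocCurveDisplay` at the top but haven't used them yet. As an optional extra comparison (not required by the project brief), we can score both tuned models on ROC AUC, which uses predicted probabilities rather than the hard 0/1 predictions.

# In[ ]:

# Compare ROC AUC on the test set using each model's predicted
# probability of the positive class (column 1 of predict_proba)
rf_probs = rf1.best_estimator_.predict_proba(X_test)[:, 1]
xgb_probs = xgb1.best_estimator_.predict_proba(X_test)[:, 1]

print('RF test ROC AUC: ', roc_auc_score(y_test, rf_probs))
print('XGB test ROC AUC:', roc_auc_score(y_test, xgb_probs))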
# Plot a confusion matrix of the random forest model's predictions on the test data.

# In[86]:

# Generate array of values for confusion matrix and plot it
cm = confusion_matrix(y_test, rf_preds, labels=rf1.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=rf1.classes_,
                              )
disp.plot(values_format='');

# ##### Feature importance
#
# Use the `feature_importances_` attribute of the best estimator object to inspect the features of the final model. We can then sort them and plot the most important ones.

# In[87]:

# Plot feature importances
importances = rf1.best_estimator_.feature_importances_
rf_importances = pd.Series(importances, index=X_test.columns)
rf_importances = rf_importances.sort_values(ascending=False)[:15]

fig, ax = plt.subplots(figsize=(8, 5))
rf_importances.plot.bar(ax=ax)
ax.set_title('Feature importances')
ax.set_ylabel('Mean decrease in impurity')
fig.tight_layout();

# ## Execute
#
# ### **Conclusion**
#
# In this step, we use the results of the models above to formulate conclusions to the following questions.
#
# 1. **Should we recommend using this model? Why or why not?**
#    Yes, this model performs acceptably. Its F1 score was 0.7235 and it had an overall accuracy of 0.6865. It correctly identified ~78% of the actual generous tippers in the test set, which is 48% better than a random guess. It may be worthwhile to test the model with a select group of taxi drivers to get feedback.
#
# 2. **What was the highest-scoring model doing? How was it making predictions?**
#    Unfortunately, random forest is not the most transparent machine learning algorithm. We know that `VendorID`, `predicted_fare`, `mean_duration`, and `mean_distance` are the most important features, but we don't know how they influence tipping. This would require further exploration. It is interesting that `VendorID` is the most predictive feature. This seems to indicate that one of the two vendors tends to attract more generous customers. It may be worth performing statistical tests on the different vendors to examine this further.
#
# 3. **Are there new features that can be engineered that might improve model performance?**
#    There are almost always additional features that can be engineered, but hopefully the most obvious ones were generated during the first round of modeling. In our case, we could try creating three new columns that indicate whether the trip distance is short, medium, or far. We could also engineer a column that gives the ratio (distance from the fare amount to the nearest higher multiple of \$5) / (fare amount). For example, if the fare were \$12, the value in this column would be 0.25, because the distance from \$12 to the nearest higher multiple of \$5 (\$15) is \$3, and \$3 divided by \$12 is 0.25. The intuition for this feature is that people might be likely to simply round up their tip, so journeys with fares just under a multiple of \$5 may have lower tip percentages than those with fares just over a multiple of \$5. We could also do the same thing for fares relative to the nearest \$10.
#
# 4. **What features would be useful to have that would likely improve the performance of the model?**
#    It would probably be very helpful to have past tipping behavior for each customer. It would also be valuable to have accurate tip values for customers who pay with cash, and to have a lot more data. With enough data, we could create a unique feature for each pickup/dropoff combination.
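# As a sketch of the rounding-ratio feature proposed in question 3 (the column name `fare_rounding_ratio` is our own, and this assumes the `fare_amount` column is still available at feature-engineering time, e.g. in `df0`):

# In[ ]:

# Distance from the fare to the next-higher multiple of $5, as a share of the fare.
# A $12 fare gives (15 - 12) / 12 = 0.25; fares that are exact multiples of $5 give 0.
# Real data would need guarding against zero or negative fares before dividing.
fare = df0['fare_amount']
fare_rounding_ratio = (np.ceil(fare / 5) * 5 - fare) / fare

# Example check with the $12 fare from the text
example = pd.Series([12.0])
print((np.ceil(example / 5) * 5 - example) / example)  # 0.25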