#!/usr/bin/env python
# coding: utf-8

# ## <center> Predicting Critical Health Code Violations: Dining in Las Vegas

# Restaurants are frequent settings for foodborne illness outbreaks. Periodic inspection of restaurants is crucial to ensure commercial food establishments carry out safe food handling procedures. Predictive analytics can help identify problematic restaurants and maximize the utility of limited enforcement resources. To that end, I develop a machine learning model in this notebook. The resulting model can predict restaurants and other dining venues likely to face critical health code challenges in Las Vegas during the next inspection period.
#
# The final model uses five numeric features to predict whether a restaurant receives a C grade or below during the inspection (i.e., a critical violation of sanitary practices). Three predictors refer to employee characteristics; the remaining two features describe the degree of violations. The model is a good tool if the purpose is to identify as many problematic restaurants as possible: it correctly identified 82% of problematic restaurants in a holdout test data set. At the same time, the model misclassified many compliant restaurants as non-compliant ones.

# # Initial Data Exploration and Observations

# I converted all uppercase characters in column names to lowercase to facilitate data manipulation. I also renamed the target variable to 'target.'
#
# The concise summary of the dataframe indicates that many rows have some missing values. The heat map shows that the missing values do not appear to occur for a systematic reason. As such, they can be imputed without introducing significant bias into the data.
#
# While the dataframe contains many features, my unreported exploratory analyses indicate that most of them are neither predictive of the target nor necessary. For instance, the information contained in 'current_grade' is already reflected in 'current_demerits.' I therefore focus my analyses on numeric features, which tend to contain richer information than categorical data. I use the following five numeric features:
# - current_demerits
# - employee_count
# - inspection_demerits
# - median_employee_age
# - median_employee_tenure

# In[1]:


#import Python libraries
import pandas as pd
pd.set_option('display.max_rows', None, 'display.max_columns', None, 'display.width', None)
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
sns.set(rc={'figure.figsize':(17, 4)})
import warnings
warnings.filterwarnings('ignore')


# In[2]:


#read in the dataset and display three randomly selected rows
df = pd.read_csv('TRAIN_SET_2021.csv')
df.columns = df.columns.str.lower()
df.rename(columns={'next_inspection_grade_c_or_below': 'target'}, inplace=True)
df.sample(3)


# In[3]:


#display the number of missing values per column
df.isnull().sum()


# In[4]:


# create heatmap for missing values
sns.heatmap(df.isnull(), cbar=False, cmap="YlGnBu")
plt.title('Missing Values', fontsize=20)
plt.show()


# # Cleaning Target Variable

# To explore the target column, I created a frequency distribution table. The table reveals that the most common values are coded 0 and 1. Additionally, the column contains 40 missing values and some erroneous entries. To address these issues, I slice the dataframe to keep only the rows coded 0 or 1 and convert the column to the integer type. In the resulting dataframe, 16% of the values are coded 1 (a C grade or below at the next inspection), while the remaining 84% are coded 0.
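# An equivalent, slightly more general way to express this cleaning step is sketched below on a copy of the data (so the cells that follow are unaffected): coercing the column to numeric turns erroneous text entries into missing values, after which only the 0/1 rows are kept.

# In[ ]:


# Illustrative sketch only, applied to a copy of the dataframe:
# coerce the target to numeric (erroneous entries become NaN), keep the 0/1 rows,
# and convert to integer; the actual cleaning used in this notebook follows below.
tmp = df.copy()
tmp['target'] = pd.to_numeric(tmp['target'], errors='coerce')
tmp = tmp[tmp['target'].isin([0, 1])]
tmp['target'] = tmp['target'].astype('int')
tmp['target'].value_counts(normalize=True)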
# In[5]:


# create frequency distribution table for the target column
df.target.value_counts(dropna=False)


# In[6]:


#slice the df: keep only rows where target is coded '0' or '1', then convert to integer
df = df[(df.target == '1') | (df.target == '0')]
df.target = df.target.astype('int')


# In[7]:


# create frequency distribution table for target column after slicing df
round(df.target.value_counts(normalize=True, dropna=False)*100, 0)


# # Cleaning Features

# The frequency distribution table for the current_demerits column indicates that it contains missing, unusually high, non-integer (float), and negative values. The column should contain only natural numbers not exceeding 87. To address this, I replace values outside this range with missing values, fill the missing values with the column mean, and convert the column to the integer type. I follow a similar procedure to clean the remaining five features.

# In[8]:


# create frequency distribution table for current_demerits column
df.current_demerits.value_counts(dropna=False)


# In[9]:


#cleaning current_demerits
df.current_demerits.replace([1214, 89, 88, 363, 100, 98, -8.00, 3.14], np.nan, inplace=True)
df.current_demerits.fillna(round(df.current_demerits.mean(), 0), inplace=True)
df.current_demerits = df.current_demerits.astype('int')

#cleaning employee_count
df.employee_count.replace([687, 111447, -7, 902], np.nan, inplace=True)
df.employee_count = df.employee_count.astype('float')
df.employee_count.fillna(round(df.employee_count.mean(), 0), inplace=True)
df.employee_count = df.employee_count.astype('int')

#cleaning inspection_demerits
df.inspection_demerits.replace('Routine Inspection', np.nan, inplace=True)
df.inspection_demerits = df.inspection_demerits.astype('float')
df.inspection_demerits.fillna(round(df.inspection_demerits.mean(), 0), inplace=True)
df.inspection_demerits = df.inspection_demerits.astype('int')

#cleaning number_of_violations
df.number_of_violations.replace('Nevada', np.nan, inplace=True)
df.number_of_violations = df.number_of_violations.astype('float')
df.number_of_violations.fillna(round(df.number_of_violations.mean(), 0), inplace=True)
df.number_of_violations = df.number_of_violations.astype('int')

#cleaning median_employee_age
df.median_employee_age.fillna(round(df.median_employee_age.mean(), 0), inplace=True)

#cleaning median_employee_tenure
df.median_employee_tenure.fillna(round(df.median_employee_tenure.mean(), 0), inplace=True)


# # Feature Selection

# The next thing to look for is potential collinearity between some of these feature columns. Collinearity occurs when two feature columns are highly correlated and risk duplicating information. If two features convey the same information using two different measures or metrics, there is no need to keep both.
#
# I generate a correlation matrix heatmap with Seaborn to visually compare the correlations and look for problematic pairwise feature correlations. Based on the heatmap, one pair of columns is strongly correlated: 'inspection_demerits' and 'number_of_violations.' These two features reflect very similar information. Given that 'inspection_demerits' contains more nuanced information, I drop 'number_of_violations' from further analysis.
#
# I also examined the possibility of removing features with low variance. When the values in a feature column have low variance, they don't meaningfully contribute to the model's predictive capability. To make apples-to-apples comparisons between columns, I rescaled all of the columns to vary between 0 and 1, which is known as min-max scaling or rescaling (a minimal sketch of this check follows below). In the end, I decided to keep all five features in the model.
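# The cell below is an illustrative sketch of that low-variance check, not part of the modeling pipeline: it rescales the candidate feature columns to the [0, 1] range with scikit-learn's MinMaxScaler and lists their variances; a near-zero variance after rescaling would flag a column as a candidate for removal.

# In[ ]:


# min-max rescale the candidate features and inspect their variances (illustration only)
from sklearn.preprocessing import MinMaxScaler

candidate_features = ['current_demerits', 'employee_count', 'median_employee_age',
                      'median_employee_tenure', 'inspection_demerits', 'number_of_violations']
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df[candidate_features]),
                      columns=candidate_features)
scaled.var().sort_values()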
# In[10]:


# Explore correlations among features
corr = df[['target', 'current_demerits', 'employee_count', 'median_employee_age',
           'median_employee_tenure', 'inspection_demerits', 'number_of_violations']].corr()
plt.figure(figsize=(10, 10))
ax = sns.heatmap(corr, vmin=-1, vmax=1, center=0, cmap=sns.diverging_palette(20, 220, n=200), square=True)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, horizontalalignment='right')
ax.set_title('Pairwise Correlations Among Features', fontsize=20)
plt.show()


# # ML Modeling

# To determine the predictive capacity of the ML model, I split the data into a training set and a testing set. Given that restaurants receiving a C grade or below (16%) are far fewer than the remaining cases (84%), I use the ClusterCentroids method to better train the model to recognize the minority class. This method addresses the class imbalance by undersampling the majority class: clusters of majority-class samples are replaced with the cluster centroids found by a KMeans algorithm (see the toy illustration below).
#
# I use the pipeline tool combined with GridSearchCV to assemble several steps that can be cross-validated while setting different parameters. The preprocessing step in the pipeline includes RobustScaler, which scales features using statistics that are robust to outliers. The classifier step includes C-Support Vector Classification, Random Forest, Logistic Regression, and Gradient Boosting classifiers.
#
# GridSearchCV runs through all the parameter combinations fed into the parameter grid and selects the best combination based on a scoring metric of your choice. Recall is the appropriate performance metric when the goal is to identify all positive samples, that is, when it is crucial to avoid false negatives. Health code violations are a good case for this: it is vital to find all non-compliant restaurants, even at the cost of falsely flagging some well-maintained venues. Thus, I use recall as the evaluation metric when ranking results.
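# As a quick, self-contained illustration of what ClusterCentroids does (on synthetic data, not the restaurant data), the cell below resamples an imbalanced toy dataset with roughly the same 84/16 split and prints the class counts before and after undersampling.

# In[ ]:


# Toy illustration of ClusterCentroids undersampling on synthetic, imbalanced data
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids

X_toy, y_toy = make_classification(n_samples=500, weights=[0.84, 0.16], random_state=0)
print('Class counts before resampling:', Counter(y_toy))

# the majority class is replaced by KMeans cluster centroids until the classes are balanced
X_cc, y_cc = ClusterCentroids(random_state=0).fit_resample(X_toy, y_toy)
print('Class counts after resampling: ', Counter(y_cc))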
# In[11]:


#import ML libraries
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import classification_report, plot_confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from imblearn.under_sampling import ClusterCentroids
from imblearn.pipeline import Pipeline


# In[12]:


y = df.target
X = df[['current_demerits', 'employee_count', 'median_employee_age',
        'median_employee_tenure', 'inspection_demerits']]


# In[13]:


# Create the pipeline
pipe = Pipeline([('preprocessing', RobustScaler()), ('classifier', SVC())])


# In[14]:


# Specify the hyperparameter space
parameters = [
    {'classifier': [RandomForestClassifier(n_jobs=-1, random_state=0, class_weight='balanced')],
     'preprocessing': [None],
     'classifier__n_estimators': [100, 200, 300],
     'classifier__max_features': ['auto', 3, 4, 5],
     'classifier__max_depth': [None, 2, 3, 100],
     'classifier__criterion': ['gini', 'entropy'],
     'classifier__min_samples_leaf': [3, 4, 5],
     'classifier__min_samples_split': [8, 10, 12]},
    {'classifier': [SVC(random_state=0, class_weight='balanced')],
     'preprocessing': [RobustScaler(), None],
     'classifier__gamma': [0.01, 0.1, 1, 10, 100],
     'classifier__C': [0.01, 0.1, 1, 10, 100]},
    {'classifier': [GradientBoostingClassifier(random_state=0)],
     'preprocessing': [None],
     'classifier__learning_rate': [0.01, 0.1, 1, 10],
     'classifier__n_estimators': [100, 200, 300],
     'classifier__max_depth': [2, 3, 5]},
    {'classifier': [LogisticRegression(n_jobs=-1, max_iter=1000)],
     'preprocessing': [RobustScaler(), None],
     'classifier__C': [0.01, 0.1, 1, 10, 100]}
]


# In[15]:


# Instantiate GridSearchCV object
grid = GridSearchCV(pipe, param_grid=parameters, cv=5, scoring='recall')


# In[16]:


# Define resampling method
method = ClusterCentroids(random_state=0)


# In[17]:


# Create training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

# Apply resampling to the training data only
X_resampled, y_resampled = method.fit_resample(X_train, y_train)


# In[18]:


# Fit to the training set
model = grid.fit(X_resampled, y_resampled)


# # Results

# The exhaustive search identified the best parameters: SVC with the regularization parameter C = 0.1 and the kernel coefficient gamma = 0.01 (which governs the flexibility of the non-linear decision boundary), in combination with RobustScaler. This model produced a recall score of 0.82 for problematic restaurants. Because the test set recall matches the cross-validation recall obtained during training, the model is unlikely to overfit and can be considered reliable.

# In[19]:


# Get performance metrics
predict = model.predict(X_test)
print("Best params:\n{}\n".format(model.best_params_))
print("Best cross-validation score: {:.2f}".format(model.best_score_))
print("Test set score: {:.2f}".format(model.score(X_test, y_test)))
plot_confusion_matrix(model, X_test, y_test)
print(classification_report(y_test, predict))
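# To show how the model could help prioritize limited enforcement resources, the sketch below ranks the test-set venues by the fitted model's decision scores; higher scores correspond to venues the model considers more likely to be non-compliant. This assumes the selected estimator exposes decision_function, as the winning SVC pipeline does.

# In[ ]:


# Illustrative use of the fitted model: rank test-set venues from most to least risky
# according to the decision function of the best estimator found by the grid search.
risk_scores = model.decision_function(X_test)
ranked = X_test.assign(risk_score=risk_scores).sort_values('risk_score', ascending=False)
ranked.head(10)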
# # Conclusion

# The final model uses five numeric features to predict whether a restaurant receives a C grade or below during the inspection. Three predictors refer to employee characteristics; the remaining two describe the degree of violations. The model is a good tool if the purpose is to identify as many problematic restaurants as possible: it correctly identified 82% of problematic restaurants in a holdout test data set. At the same time, the model misclassified many compliant restaurants as non-compliant ones.
#
# There is plenty of room to improve the model's predictive capacity by supplementing it with additional data sources. For instance, employee characteristics in the current model had decent effect sizes. Additional descriptive information about employees, such as experience and training in safe food preparation practices, may improve the model substantially. Beyond employees' ability, their motivation and incentives to practice safe food handling (or to cut corners and turn out lower-quality work in the kitchen) are also essential components that may explain and predict (in)compliance by restaurants. Other projects have leveraged social media to predict the outcomes of health inspections; for instance, Uppoor et al. used data from Yelp reviews ([see PDF](https://cseweb.ucsd.edu/classes/wi15/cse255-a/reports/fa15/036.pdf)). However, the authors reported that critical violations were more challenging to predict than minor problems.