Predicting Critical Violations of Health Code: Dining in Las Vegas

Photo by Rumman Amin on unsplash.com

Restaurants are frequent settings for foodborne illness outbreaks. Periodic inspection of restaurants is crucial to ensure commercial food establishments carry out safe food handling procedures. Predictive analytics can help identify problematic restaurants and maximize the utility of limited enforcement resources.

This notebook describes a prototype of a major insurance firm's MVP machine learning model. The model predicts restaurants and other dining venues likely to face critical health code challenges in Las Vegas during the next inspection period.

The final model uses five numeric features to predict whether a restaurant will receive a C grade or below during the inspection (i.e., a critical violation of sanitary practices). Three predictors refer to employee characteristics; the remaining two features describe the degree of violations. The model is a good tool if the purpose is to identify as many problematic restaurants as possible: it correctly identified 82% of problematic restaurants in a holdout test data set. At the same time, the model misclassified many compliant restaurants as non-compliant ones.

Initial Data Exploration and Observations

I converted the column names to lowercase to facilitate data manipulation and renamed the target variable ('next_inspection_grade_c_or_below') to 'target.'

The concise summary of the dataframe indicates that many rows have missing values. The heat map shows that the missing values do not appear to follow a systematic pattern. As such, they can be imputed without introducing significant bias into the data.

While the dataframe contains many features, my unreported exploratory analyses indicate that most of them are either not predictive of the target or redundant. For instance, the information contained in 'current_grade' is already reflected in 'current_demerits.' I therefore concentrate on numeric features, which tend to carry richer information than categorical ones, and work with the following five:

  • current_demerits
  • employee_count
  • inspection_demerits
  • median_employee_age
  • median_employee_tenure
In [1]:
#import Python libraries
import pandas as pd
pd.set_option('display.max_rows', None, 'display.max_columns', None, 'display.width', None)
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(rc={'figure.figsize':(17, 4)})
import warnings
warnings.filterwarnings('ignore')
In [2]:
#read in the dataset into dataframe and display three randomly selected rows
df = pd.read_csv('TRAIN_SET_2021.csv')
df.columns = df.columns.str.lower()
df.rename(columns = {'next_inspection_grade_c_or_below':'target'}, inplace = True) 
df.sample(3)
Out[2]:
restaurant_serial_number restaurant_permit_number restaurant_name restaurant_location restaurant_category address city state zip current_demerits current_grade employee_count median_employee_age median_employee_tenure inspection_time inspection_type inspection_demerits violations_raw record_updated lat_long_raw first_violation second_violation third_violation first_violation_type second_violation_type third_violation_type number_of_violations target
15041 DA0635676 PR0024565 I Love Yogurt - DELETED 7/21/15 MS I LOVE YOGURT Restaurant 11700 W Charleston Blvd Las Vegas Nevada 89135-1573 6.0 A 20.0 22.921223 5.466602 7/17/2010 14:00 Routine Inspection 10 4,24,37,114 2/21/2013 22:26 (36.1600724, 115.350056) 4.0 24.0 37.0 Major Non-Major Non-Major 4 0
14184 DA0954112 PR0022133 Aces and Ales Restaurant Aces and Ales Restaurant 3740 S Nellis Blvd Las Vegas Nevada 89121-3112 0.0 A 14.0 30.288332 2.020868 2/22/2012 14:20 Routine Inspection 10 211,214,221,229 2/21/2013 22:26 (36.121679, 115.064799) 211.0 214.0 221.0 Major Major Major 4 0
87 DA0313521 PR0007357 Boulder Station Buffet Dishroom Boulder Station Hotel & Casino Special Kitchen 4111 Boulder Hwy Las Vegas Nevada 89121-2510 3.0 A 23.0 18.000000 3.166994 7/1/2010 9:00 Routine Inspection 6 14,36,37 2/21/2013 22:26 (36.1337103, 115.0850568) 14.0 36.0 37.0 Major Non-Major Non-Major 3 0
In [3]:
#display number of missing variables per column
df.isnull().sum()
Out[3]:
restaurant_serial_number      0
restaurant_permit_number      0
restaurant_name              65
restaurant_location         200
restaurant_category         130
address                      70
city                        236
state                       209
zip                          59
current_demerits            216
current_grade               308
employee_count               93
median_employee_age          34
median_employee_tenure      297
inspection_time             183
inspection_type             221
inspection_demerits         254
violations_raw              165
record_updated              119
lat_long_raw                 15
first_violation             212
second_violation             85
third_violation              61
first_violation_type        146
second_violation_type       267
third_violation_type        173
number_of_violations        169
target                       40
dtype: int64
In [4]:
# create heatmap for missing values
sns.heatmap(df.isnull(), cbar=False, cmap="YlGnBu")
plt.title('Missing Values', fontsize=20)
plt.show()

Cleaning Target Variable

To explore the target column, I created a frequency distribution table. The table reveals that the most common values are coded 0 and 1. Additionally, the column contains 40 missing values and a few erroneous entries. To address these issues, I slice the dataframe to keep only rows where the target is 0 or 1 and convert the column to the integer type. In the resulting dataframe, 84% of values are coded 0, while the remaining 16% are coded 1.

In [5]:
# create frequency distribution table for the target column
df.target.value_counts(dropna=False)
Out[5]:
0       13143
1        2484
NaN        40
-3          1
Goat        1
7           1
9           1
4           1
3           1
Name: target, dtype: int64
In [6]:
#slice the df
df = df[(df.target == '1') | (df.target == '0')]
df.target = df.target.astype('int')
In [7]:
# create frequency distribution table for target column after slicing df
round(df.target.value_counts(normalize=True, dropna=False)*100,0)
Out[7]:
0    84.0
1    16.0
Name: target, dtype: float64

Cleaning Features

The frequency distribution table for the current_demerits column indicates that it contains missing, unusually high, negative, and non-integer values. The column should contain only natural numbers not exceeding 87. To address this, I mark values outside this range as missing, fill missing values with the mean, and convert the column to the integer type. I follow a similar procedure to clean the remaining five columns.

In [8]:
# create frequency distribution table for current_demerits column
df.current_demerits.value_counts(dropna=False)
Out[8]:
 0.00       3920
 3.00       3115
 8.00       2437
 6.00       2198
 9.00       1868
 5.00        763
 10.00       449
 NaN         216
 7.00        111
 19.00        73
 4.00         60
 20.00        50
 1.00         48
 2.00         35
 14.00        34
 17.00        23
 27.00        16
 11.00        16
 12.00        15
 25.00        13
 22.00        13
 32.00        13
 16.00        11
 18.00        11
 31.00        10
 30.00         9
 46.00         9
 100.00        8
 51.00         7
 13.00         7
 23.00         7
 42.00         7
 24.00         6
 35.00         6
 26.00         6
 39.00         6
 28.00         5
 15.00         5
 38.00         5
 43.00         3
 21.00         2
 37.00         1
 33.00         1
 87.00         1
 88.00         1
 98.00         1
 363.00        1
 89.00         1
 1214.00       1
-8.00          1
 48.00         1
 3.14          1
Name: current_demerits, dtype: int64
In [9]:
#cleaning current_demerits
df.current_demerits.replace([1214, 89, 88, 363, 100, 98, -8.00, 3.14] , np.nan, inplace=True )
df.current_demerits.fillna(round(df.current_demerits.mean(), 0), inplace=True)
df.current_demerits = df.current_demerits.astype('int')

#cleaning employee_count
df.employee_count.replace([687, 111447, -7, 902] , np.nan, inplace=True )
df.employee_count = df.employee_count.astype('float')
df.employee_count.fillna(round(df.employee_count.mean(), 0), inplace=True)
df.employee_count = df.employee_count.astype('int')

#cleaning inspection_demerits
df.inspection_demerits.replace('Routine Inspection', np.nan, inplace=True )
df.inspection_demerits = df.inspection_demerits.astype('float')
df.inspection_demerits.fillna(round(df.inspection_demerits.mean(), 0), inplace=True)
df.inspection_demerits = df.inspection_demerits.astype('int')

#cleaning number_of_violations
df.number_of_violations.replace('Nevada', np.nan, inplace=True )
df.number_of_violations = df.number_of_violations.astype('float')
df.number_of_violations.fillna(round(df.number_of_violations.mean(), 0), inplace=True)
df.number_of_violations = df.number_of_violations.astype('int')

#cleaning median_employee_age
df.median_employee_age.fillna(round(df.median_employee_age.mean(), 0), inplace=True)

#cleaning median_employee_tenure
df.median_employee_tenure.fillna(round(df.median_employee_tenure.mean(), 0), inplace=True)
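The cell above repeats the same replace–impute–cast pattern for each column. As an illustrative aside (not executed in this notebook), the pattern could be wrapped in a small helper; the range bounds in the usage comment are taken from the criterion described above and should be adjusted per column.

In [ ]:
# Illustrative sketch only: roughly the same cleaning pattern expressed as a helper.
# Values are coerced to numeric, out-of-range entries become NaN, NaNs are
# mean-imputed, and the result is cast to int.
def clean_numeric(series, lo, hi):
    s = pd.to_numeric(series, errors='coerce')   # non-numeric entries -> NaN
    s = s.where(s.between(lo, hi))               # out-of-range values -> NaN
    return s.fillna(round(s.mean(), 0)).astype('int')

# Example usage (bounds are an assumption based on the criterion above):
# df.current_demerits = clean_numeric(df.current_demerits, 0, 87)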

Feature Selection

The next thing to look for is potential collinearity between some of these feature columns. Collinearity is when two feature columns are highly correlated and risk duplicating information. If two features convey the same information using two different measures or metrics, there is no need to keep both.

I generate a correlation matrix heatmap using Seaborn to visually compare the correlations and look for problematic pairwise feature correlations. The heatmap shows that 'inspection_demerits' and 'number_of_violations' are strongly correlated and reflect very similar information. Given that 'inspection_demerits' contains more nuanced information, I drop 'number_of_violations' from further analysis.

I also examined the possibility of removing features with low variance: when the values in a feature column barely vary, they contribute little to the model's predictive capability. To make apples-to-apples comparisons between columns, I rescaled all of them to the 0–1 range (min-max scaling) before comparing variances, as sketched after the correlation heatmap below. In the end, I decided to keep all five features in the model.

In [10]:
# Explore correlations among features
corr = df[['target', 'current_demerits', 'employee_count', 'median_employee_age', 
           'median_employee_tenure','inspection_demerits', 'number_of_violations']].corr()
plt.figure(figsize=(10,10))
ax = sns.heatmap(corr,vmin=-1, vmax=1, center=0,cmap=sns.diverging_palette(20, 220, n=200), square=True)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, horizontalalignment='right')
ax.set_title('Pairwise Correlations Among Features', fontsize=20)
plt.show()
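
For reference, here is a minimal sketch of the low-variance check mentioned above. It is not part of the final pipeline and assumes the cleaned dataframe df from the previous cells.

In [ ]:
# Sketch of the low-variance check (illustrative; not part of the final pipeline).
# Min-max scaling puts every feature on a 0-1 range so their variances are comparable.
from sklearn.preprocessing import MinMaxScaler

features = ['current_demerits', 'employee_count', 'median_employee_age',
            'median_employee_tenure', 'inspection_demerits']
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df[features]), columns=features)
print(scaled.var().sort_values())  # inspect variances; all five features were ultimately kept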

ML Modeling

To determine the predictive capacity of the ML model, I split the data into a training set and a testing set. Given that restaurants receiving a C grade or below (16%) are far outnumbered by the remaining cases (84%), I use the ClusterCentroids method to better train the model to recognize the minority class. This method addresses the imbalance by undersampling the majority class: clusters of majority samples are replaced by the centroids of a KMeans clustering.

I use the pipeline tool combined with GridSearchCV to assemble several steps that can be cross-validated while setting different parameters. The preprocessing step in the pipeline includes RobustScaler, which scales features using statistics that are robust to outliers. The classifier step includes C-Support Vector Classification (SVC), Random Forest, Logistic Regression, and Gradient Boosting classifiers.
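
As a quick illustration of what RobustScaler does (a toy example, not part of the original pipeline): it centers each feature on its median and scales by the interquartile range, so extreme outliers like those seen earlier barely affect how typical values are scaled.

In [ ]:
# Toy illustration (made-up values): RobustScaler centers on the median and
# scales by the IQR, so an extreme value such as 1214 does not distort how
# the bulk of the data is scaled.
from sklearn.preprocessing import RobustScaler
demo = np.array([[0.], [3.], [6.], [9.], [1214.]])
print(RobustScaler().fit_transform(demo).ravel())  # typical values land near [-1, 1]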

GridSearchCV runs through all the parameter combinations fed into the parameter grid and identifies the best one based on a scoring metric of your choice. Recall is the appropriate metric when the goal is to identify all positive samples, that is, when it is crucial to avoid false negatives. A health code violation is a good case for this: it is vital to find all non-compliant restaurants, even at the cost of falsely flagging some well-maintained venues. Thus, I use recall as the evaluation metric when ranking results.
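
To make the metric concrete, here is a small toy example (made-up labels, not from the data): recall for the positive class is TP / (TP + FN), the share of truly problematic restaurants the model flags.

In [ ]:
# Toy example (made-up labels): recall = TP / (TP + FN).
from sklearn.metrics import recall_score
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = critical violation
y_pred_toy = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]   # 3 of 4 violators caught, 2 false alarms
print(recall_score(y_true_toy, y_pred_toy))   # 0.75 -- missed violators lower recall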

In [11]:
#import ML libraries
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import classification_report, plot_confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from imblearn.under_sampling import ClusterCentroids
from imblearn.pipeline import Pipeline
In [12]:
y = df.target
X = df[['current_demerits', 'employee_count', 'median_employee_age', 'median_employee_tenure', 'inspection_demerits']] 
In [13]:
# Create the pipeline
pipe = Pipeline([('preprocessing', RobustScaler()), ('classifier', SVC())])
In [14]:
# Specify the hyperparameter space
parameters = [
{'classifier': [RandomForestClassifier(n_jobs=-1, random_state=0, class_weight='balanced')],
'preprocessing': [None],     
'classifier__n_estimators': [100, 200, 300],
'classifier__max_features': ['auto', 3, 4, 5],
'classifier__max_depth': [None, 2, 3, 100],    
'classifier__criterion': ['gini', 'entropy'],
'classifier__min_samples_leaf': [3, 4, 5],
'classifier__min_samples_split': [8, 10, 12]},
    
{'classifier': [SVC(random_state=0, class_weight='balanced')], 
'preprocessing': [RobustScaler(), None],
'classifier__gamma': [0.01, 0.1, 1, 10, 100],
'classifier__C': [0.01, 0.1, 1, 10, 100]},  

{'classifier': [GradientBoostingClassifier(random_state=0)], 
'preprocessing': [None],
'classifier__learning_rate': [0.01, 0.1, 1, 10],
'classifier__n_estimators': [100, 200, 300],
'classifier__max_depth': [2, 3, 5]},

{'classifier': [LogisticRegression(n_jobs=-1, max_iter=1000)], 
'preprocessing': [RobustScaler(), None], 
'classifier__C': [0.01, 0.1, 1, 10, 100]}
]
In [15]:
# Instantiate GridSearchCV object
grid = GridSearchCV(pipe, param_grid=parameters, cv=5, scoring='recall')
In [16]:
# Define resampling method
method = ClusterCentroids(random_state=0)
In [17]:
# Create training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

# Apply resampling to the training data only
X_resampled, y_resampled = method.fit_resample(X_train, y_train)
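
A quick sanity check (not in the original notebook) confirms what ClusterCentroids does here: the majority class in the training split is undersampled down to the size of the minority class.

In [ ]:
# Optional check (illustrative): class counts before and after resampling.
from collections import Counter
print('before:', Counter(y_train))
print('after: ', Counter(y_resampled))   # both classes now equal in size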
In [18]:
# Fit to the training set
model = grid.fit(X_resampled, y_resampled)

Results

The exhaustive search identified the best estimator: SVC with regularization parameter C = 0.1 and kernel coefficient gamma = 0.01, combined with RobustScaler. This model achieved a recall of 0.82 for problematic restaurants. Because the test-set recall matches the cross-validation recall obtained during training, the model is unlikely to be overfitting and can be considered reliable.

In [19]:
# Get performance metrics
predict = model.predict(X_test)

print("Best params:\n{}\n".format(model.best_params_))
print("Best cross-validation score: {:.2f}".format(model.best_score_))
print("Test set score: {:.2f}".format(model.score(X_test, y_test)))
plot_confusion_matrix(model, X_test, y_test)
print (classification_report(y_test, predict))
Best params:
{'classifier': SVC(C=0.1, class_weight='balanced', gamma=0.01, random_state=0), 'classifier__C': 0.1, 'classifier__gamma': 0.01, 'preprocessing': RobustScaler()}

Best cross-validation score: 0.82
Test set score: 0.82
              precision    recall  f1-score   support

           0       0.83      0.17      0.28      3286
           1       0.16      0.82      0.26       621

    accuracy                           0.27      3907
   macro avg       0.49      0.49      0.27      3907
weighted avg       0.72      0.27      0.28      3907

Conclusion

The final model uses five numeric features to predict whether a restaurant will receive a C grade or below during the inspection. Three predictors refer to employee characteristics; the remaining two features describe the degree of violations. The model is a good tool if the purpose is to identify as many problematic restaurants as possible: it correctly identified 82% of problematic restaurants in a holdout test data set. At the same time, the model misclassified many compliant restaurants as non-compliant ones.

There is plenty of room to improve the model's predictive capacity by supplementing it with additional data sources. For instance, employee characteristics in the current model had decent effect sizes. Additional descriptive information about employees, such as experience and training in safe food preparation practices, may improve the model substantially. Beyond employees' ability, their motivation and incentives to practice safe food handling (or to cut corners and sacrifice quality in the kitchen) are also essential components that may explain and predict (non)compliance by restaurants. Other projects have leveraged social media to predict the outcomes of health inspections. For instance, Uppoor et al. used data from Yelp reviews (see PDF). However, the authors reported that critical violations were more challenging to predict than minor problems.