#!/usr/bin/env python # coding: utf-8 # # Work Visa: Ensemble Methods # The number of applications for US work visas is growing. To assist with the review process, we will develop an ML solution that screens out a large share of candidates automatically. This model will be trained on a data set of previous applications and their resulting case statuses. # # This data has features describing both the applicant and their sponsoring employer: # # * case_id: unique identifier for each application # # * continent: employee continent of origin # # * education_of_employee: level of education # # * has_job_experience: binary flag # # * requires_job_training: binary flag # # * no_of_employees: size of sponsoring employer's company # # * yr_of_estab: sponsoring company's year of establishment # # * region_of_employment: applicant's intended region of employment in the US # # * prevailing_wage: regional average wage for a given domain of labor; useful to keep the job market competitive without underpaying foreign workers # # * unit_of_wage: frequency of payment # # * full_time_position: binary flag; Y: full time # # * case_status: binary flag (Certified or Denied) # ## Importing necessary libraries and data # In[ ]: get_ipython().system('pip uninstall scikit-learn -y') get_ipython().system('pip install -U scikit-learn') # In[ ]: # math and data import numpy as np import pandas as pd # plotting import matplotlib.pyplot as plt import seaborn as sns sns.set_theme() # model building from sklearn import metrics from sklearn.model_selection import train_test_split, GridSearchCV # models from sklearn import tree from sklearn.linear_model import LogisticRegression from sklearn.ensemble import BaggingClassifier, RandomForestClassifier from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier from sklearn.ensemble import StackingClassifier from xgboost import XGBClassifier # In[ ]: visa=pd.read_csv('dataset.csv') # In[ ]: visa_copy=visa.copy() # ## Data Overview # # In[ ]: visa.sample(10) # In[ ]: visa.shape # In[ ]: visa.info() # There are twelve columns with 25480 records each. While no NaN values are recorded, there may still be missing or erroneous data. Most of the columns are object type, which we can convert to categorical during our pre-processing later. # ## Exploratory Data Analysis (EDA) # In[ ]: visa.describe().T # * We see that the minimum number of employees is -26, an impossibility! This indicates that we already have at least one apparent error in our data. # # * The median number of employees is around 2100, so a midsize company. The largest in our records has over 600,000 employees. # # * The oldest company we have records for was founded in 1800, but the mean year of establishment is around 1979. # # * The median prevailing wage is \$70,308.21, and the mean is several thousand dollars higher, indicating a right skew toward higher wages. There is at least one shockingly low value, as the minimum is just \$2.14. Perhaps this is an hourly wage, though even then \$2/hr is quite low. This hints that we must diligently explore outliers for potential errors. # In[ ]: visa.describe(include='object').T # * With 25480 unique case IDs, we can be fairly confident that there are no duplicates in our records. # # * The data covers 6 continents of origin, with Asia being the most frequent. # # * The plurality of applicants have a Bachelor's level of education. We will explore the other levels on record shortly. # # * The majority of applicants do have some job experience, and around 88% of jobs do not require training.
# # * Nearly all wages are recorded as yearly salaries, though the data dictionary indicates that hourly, weekly, and monthly are options too. # In[ ]: for col in visa.drop('case_id',axis=1).select_dtypes('object').columns: print('Values in',col,'feature') print(visa[col].value_counts()) print('-'*45) # * The only continent of origin not represented is (predictably) Antarctica. After Asia, the most common continents are Europe and North America (likely Canada and Mexico). # # * Other than a Bachelor's degree, many applicants have a Master's. Some also have a high school diploma or a doctorate, but these are less common. # # * There are five options for region of employment: Northeast, South, West, Midwest, and Island (in decreasing level of frequency). Northeast and South have nearly the same number of records. # # * The dependent variable in this study is ```case_status```, with outcomes Certified or Denied. About two-thirds of cases are Certified. # In[ ]: # quick plotting function def plott(col=None): '''Quick plot a countplot for categorical data and a histogram for numeric data.''' plt.figure(figsize=(8,5)) if col==None: return elif visa[col].dtype=='object': plt.title('Countplot of '+col,fontsize=14) sns.countplot(data=visa,x=col, order=visa[col].value_counts().index.tolist()); else: plt.title('Histogram of '+col,fontsize=14) sns.histplot(data=visa,x=col); # In[ ]: plott('continent') # We find that Asia is by far the most frequent continent of origin, while Oceania is the least. # In[ ]: plott('education_of_employee') # Bachelor's and Master's education are almost equally common, with both around 10000 records each. # In[ ]: plott('has_job_experience') # In[ ]: visa['has_job_experience'].value_counts(normalize=True) # The majority of applicants have some job experience, but over 10000 (about 40%) have none. # In[ ]: plott('requires_job_training') # Unlike the previous feature, which had fairly balanced classes, most applicants do NOT require job training. # In[ ]: plott('no_of_employees') # In[ ]: plt.figure(figsize=(16,5)) plt.suptitle('Feature: no_of_employees',fontsize=18) plt.subplot(1,2,1) plt.title('Boxplot (with fliers)',fontsize=14) sns.boxplot(data=visa,x='no_of_employees') plt.subplot(1,2,2) plt.title('Boxplot (fliers removed)',fontsize=14) sns.boxplot(data=visa,x='no_of_employees',showfliers=False) plt.show() # With such skewed data, the histogram lends little insight. Boxplots give a better idea of data concentration and distribution. Most of our records are concentrated below around 7000, but there are numerous extreme values on the high end. # # We will not treat these extreme values though. It is entirely reasonable that many records reflect applicants to large companies. # In[ ]: plott('yr_of_estab') # In[ ]: visa['yr_of_estab'].describe() # We find that 75% of companies in our records were founded in 1976 or later. Like the previous feature, we have many extreme values, this time on the lower end. As before, all these values are entirely sensible, so we will not alter them. # # However, we will convert this feature to 'years since founded', which will flip the distribution from left-skewed to right-skewed but otherwise leave the data unchanged. # In[ ]: plott('region_of_employment') # Northeast, South, and West are all common regions. Island is decidedly uncommon, with fewer than 500 records. # In[ ]: plott('prevailing_wage') # Aside from the spike around \$0, prevailing wage follows a right-skewed normal distribution. 
It may well be that the spike is due to other wage units, such as hourly. # In[ ]: plott() plt.title('Prevailing Yearly Wage',fontsize=14) sns.histplot(data=visa.loc[visa['unit_of_wage']=='Year'], x='prevailing_wage'); # Indeed, the distribution looks far less erratic when focused on yearly prevailing wage alone. # In[ ]: plott('unit_of_wage') # Most of our records are for yearly wages. Weekly and monthly are vastly less common, as was shown earlier in the value counts table. # In[ ]: plott('full_time_position') # Most positions are full time. # In[ ]: plott('case_status') # Only about one-third of cases are Denied. # In[ ]: visa['case_status'].value_counts(normalize=True) # Let's dig deeper into prevailing wage. # In[ ]: visa.groupby(by='unit_of_wage')['prevailing_wage'].describe().T # I am skeptical about the records tagged with 'Week' for ```unit_of_wage```. According to our records, 75% of jobs with a weekly wage pay at least \$51,000 per WEEK! As we have limited visibility into the source of these data, we will preserve these records, but ideally I would like to do further research to justify these entries. # In[ ]: plott() plt.title('Prevailing Wage by Unit',fontsize=14) sns.boxplot(data=visa, x='prevailing_wage', y='unit_of_wage'); # The boxplot above is further evidence that the weekly (and even monthly) data looks erroneous. Many of the weekly wages would result in yearly earnings well over a million USD! Perhaps this is personal bias, but it seems far-fetched to have this many exceptionally high-paying jobs in this data set. # # Without any evidence to the contrary, however, we will leave these records intact. # In[ ]: plott() plt.title('Prevailing Hourly Wage',fontsize=14) sns.boxplot(data=visa.loc[visa['unit_of_wage']=='Hour'], x='prevailing_wage'); # In[ ]: b=visa.loc[visa['unit_of_wage']=='Hour']['prevailing_wage'].idxmax() visa.loc[b,'full_time_position'] # In[ ]: a=visa.loc[visa['unit_of_wage']=='Hour']['prevailing_wage'].max() print('The maximum hourly wage is ${} for a full time position, equivalently ${} per year!'.format(a,2080*a)) # For the record, some hourly data points seem unreasonable too: the maximum wage is about \$1000 per hour, or around \$2 million per year. Again, there is no evidence that this data is necessarily erroneous; it simply stands out. # In[ ]: plott() plt.title('Prevailing Wage by Region',fontsize=14) sns.boxplot(data=visa, x='prevailing_wage', y='region_of_employment'); # The middle 50% of the data for prevailing wage is higher in the midwest and island regions. The northeast has the lowest first quartile. Every region has many extreme values on the high end. # In[ ]: sns.catplot(data=visa, x='case_status', col='education_of_employee', kind='count', col_wrap=2); # Looking toward our dependent variable, it appears that ```case_status``` is influenced by education. The trouble here is that it is difficult to compare the education levels, as the count scale differs for each panel. Is the _percentage_ Certified for Doctorate any greater or less than that for, say, Bachelor's? # # Let's instead examine the percentage Certified and Denied.
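# A quicker cross-check (a sketch using the same columns): ```pd.crosstab``` with ```normalize='index'``` returns these percentages in a single call. The cells below build the table manually so we can control the row order for plotting.

# In[ ]:

pd.crosstab(visa['education_of_employee'], visa['case_status'], normalize='index')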
# In[ ]: # dataframe of percentages a=visa.groupby('education_of_employee')['case_status'].value_counts(normalize=True) b=pd.DataFrame(index=['High School',"Bachelor's","Master's",'Doctorate'], columns=['Certified','Denied']) for (c,d) in a.index: b.loc[c,d]=a[(c,d)] b # In[ ]: # barplot of percentages plott() plt.title('Percent Certified by Education Level',fontsize=14) plt.bar(b.index, b['Certified'], label='Certified') plt.bar(b.index, b['Denied'], bottom=b['Certified'], label='Denied') plt.legend(loc='lower right') plt.xlabel('Education Level') plt.ylabel('Percent Certified/Denied') plt.show() # Indeed, here we can compare ```case_status``` on like scales. We find that applicants with a Doctorate are most often certified (over 87% of the time), while applicants with only a High School education are certified around a third of the time. We find a clear trend: visa status is directly connected with level of education, and higher levels of education are associated with a greater percentage of certified visas. # # To plot the rest of our features as percentages, we make the above code into a function. # In[ ]: def percent_status(col): '''Plot percent Certified/Denied for classes of a categorical variable.''' # generate dataframe of percentages a=visa.groupby(col)['case_status'].value_counts(normalize=True) # compute ascending order of classes ser=pd.Series(dtype='float') for name in visa[col].unique(): ser[name]=a[(name,'Certified')] # dataframe b=pd.DataFrame(index=ser.sort_values().index.tolist(), columns=['Certified','Denied']) for (c,d) in a.index: b.loc[c,d]=a[(c,d)] # plot percentages plott() plt.title('Percent Certified by '+col,fontsize=14) plt.bar(b.index,b['Certified'],label='Certified') plt.bar(b.index,b['Denied'],bottom=b['Certified'],label='Denied') plt.legend(loc='lower right') plt.xlabel(col) plt.ylabel('Percent Certified/Denied') plt.show() # In[ ]: percent_status('continent') # * Europe, Africa, and Asia see the greatest percentage of certified visas. # * South America sees the least. # * Every continent has a certification rate above 50%. # In[ ]: percent_status('has_job_experience') # Having job experience certainly helps, as the proportion of Certified visas is higher for applicants with job experience. Let's test whether this difference is significant, assuming a level of significance of 5%. Note that the two groups are independent, each outcome is binary (so the counts are binomially distributed), the data can be treated as a simple random sample, and the success/failure counts satisfy the usual sample size conditions (a quick numerical check follows below). The assumptions for a two proportion z-test are therefore met. Our null hypothesis is that our sample proportions come from populations with the same ratios of Certification to Denial; the alternative is that the ratios are truly different. # In[ ]: from statsmodels.stats.proportion import proportions_ztest def cert_ztest(col): '''Run a two proportions independent z-test for case_status.''' # collect data size=visa[col].value_counts() a=visa.groupby(col)['case_status'].value_counts() cert=[] for idx in size.index.tolist(): cert.append(a[idx,'Certified']) # run test t,p_val=proportions_ztest(cert,size) print('The p-value is',p_val) # In[ ]: cert_ztest('has_job_experience') # With an astoundingly low p-value, we can reject the null hypothesis and conclude that these sample proportions reflect a real-world difference for visa applications: a greater proportion of applications are certified when the applicant has previous job experience.
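# For completeness, here is the quick numerical check of the sample size conditions mentioned above (a sketch using the columns already in ```visa```): every success/failure count should be at least 10 for the normal approximation behind the z-test to be reasonable.

# In[ ]:

# count Certified and Denied outcomes within each job-experience group
counts=visa.groupby('has_job_experience')['case_status'].value_counts().unstack()
print(counts)
print('All counts >= 10:', bool((counts>=10).all().all()))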
# In[ ]: percent_status('requires_job_training') # In[ ]: visa.groupby('requires_job_training')['case_status'].value_counts(normalize=True) # Unlike the previous feature, the requirement of job training has very close proportions: 66.6% Certified (N) versus 67.9% (Y). There is apparently little difference in ```case_status``` based on the job training requirement. To confirm, we can run another two proportion z-test. (As before, the assumptions for the test are met.) We set the level of significance at 0.05. Our null hypothesis is that our sample proportions come from populations with the same ratios of Certification to Denial; the alternative is that the ratios are truly different. # In[ ]: cert_ztest('requires_job_training') # Our p-value exceeds 0.05, so we fail to reject the null hypothesis. We cannot conclude that there is a significant difference between the two proportions. Put another way, ```requires_job_training``` does not appear to impact ```case_status``` on its own. # In[ ]: percent_status('region_of_employment') # There seems to be little difference in visa certification rate among island, western, and northeastern jobs. The south and midwest look to have appreciably higher rates. # In[ ]: percent_status('unit_of_wage') # Applicants to hourly jobs see far less than 50% certification. This is well below the rest, with yearly salaried jobs being most likely to lead to visa certification. # In[ ]: percent_status('full_time_position') # In[ ]: visa.groupby('full_time_position')['case_status'].value_counts(normalize=True) # Similar to ```requires_job_training```, it appears here that there is not much difference in the resulting certification based on whether the position is full time. We find that 68.5% are Certified for part-time work and 66.6% are Certified for full-time work. Our assumptions being once more satisfied, we can run a two proportion z-test. We set the level of significance at 0.05. Our null hypothesis is that our sample proportions come from populations with the same ratios of Certification to Denial; the alternative is that the ratios are truly different. # In[ ]: cert_ztest('full_time_position') # With a p-value under 0.05, we can reject the null hypothesis: there is a statistically significant difference between these two proportions. In this sample, applications for part-time work are certified at a slightly higher rate than those for full-time work, though the practical difference is small. # In[ ]: plott() plt.title('Prevailing Wage and Case Status',fontsize=14) sns.boxplot(data=visa,x='prevailing_wage', y='case_status', order=['Certified','Denied']); # In[ ]: visa.groupby('case_status')['prevailing_wage'].describe().T # The mean prevailing wage for certified applications is higher than that for denied applications by about \$8500. The middle 50% for certified is concentrated higher too. Interestingly, the whiskers extend further for denied applications. Taken together with the wider IQR for denied applications, this shows that denied applications have greater spread in prevailing wage. # # This is also true across various regions. # In[ ]: plt.figure(figsize=(15,15)) for idx,region in enumerate(visa['region_of_employment'].unique()): plt.subplot(3,2,idx+1) plt.title('Prevailing wage in '+region) sns.boxplot(data=visa.loc[visa['region_of_employment']==region], x='prevailing_wage', y='case_status', order=['Certified','Denied']) plt.tight_layout() # Denied applications in the northeast seem to have some of the lowest prevailing wages.
The prevailing wages for the island region, on the other hand, appear generally higher than the rest, with certified cases having the greatest middle 50% among the regions. This is likely due to the higher cost of living on islands, as most goods must be imported. # ## Data Preprocessing # ### Feature Engineering # In[ ]: visa['years_in_business']=2016-visa['yr_of_estab'] visa.drop('yr_of_estab',axis=1,inplace=True) # Rather than recording the year each business was established, we will instead consider the number of years in business (computed relative to 2016). # ### Outlier detection and treatment # In[ ]: visa.loc[visa['no_of_employees']<1].shape # There are 33 records where the application lists fewer than one employee, including negative head counts. These clearly constitute errors, and with no clear method to impute these values, our best recourse is to drop the rows. # In[ ]: idx=visa.loc[visa['no_of_employees']<1].index visa.drop(idx,axis=0,inplace=True) # ### Data Prep # In[ ]: visa.drop('case_id',axis=1,inplace=True) # We drop ```case_id```, as the unique identifiers will not lend any predictive power to our model. # In[ ]: # convert object dtype to category for col in visa.select_dtypes('object').columns: visa[col]=pd.Categorical(visa[col]) # After making all object variables into category type, we encode them numerically. # In[ ]: # define categorical ordering order_struct={ 'education_of_employee':{'High School':1, "Bachelor's":2, "Master's":3, 'Doctorate':4}, 'has_job_experience':{'Y':1,'N':0}, 'requires_job_training':{'Y':1,'N':0}, 'full_time_position':{'Y':1,'N':0}, 'case_status':{'Certified':1,'Denied':0} } no_order=['continent','region_of_employment','unit_of_wage'] # In[ ]: # convert data to numeric visa.replace(order_struct,inplace=True) visa=pd.get_dummies(visa,columns=no_order) # We use one-hot encoding for the categorical variables with no inherent order. # In[ ]: X=visa.drop('case_status',axis=1) y=visa['case_status'] # In[ ]: # split X_train,X_test,y_train,y_test=train_test_split(X,y, test_size=0.3, stratify=y, random_state=57) # Now that we've split our data, it is ready for modeling. # ## Second EDA # # We only removed 33 records out of 25480, about 0.1% of the data. Accordingly, there is not enough change in our data to produce appreciably different visualizations. We only include plots for features that were directly altered/engineered. # In[ ]: plott('years_in_business') # In[ ]: plott() plt.title('Boxplot of Years Since Founding',fontsize=14) sns.boxplot(data=visa,x='years_in_business'); # * We find that most businesses in our records are less than 50 years old. # * There is a steady supply of businesses more than 75 years old, with roughly the same count for each year; the counts only start to trail off after around 175 years. # In[ ]: plt.figure(figsize=(16,5)) plt.suptitle('Feature: no_of_employees',fontsize=18) plt.subplot(1,2,1) plt.title('Boxplot (with fliers)',fontsize=14) sns.boxplot(data=visa,x='no_of_employees') plt.subplot(1,2,2) plt.title('Boxplot (fliers removed)',fontsize=14) sns.boxplot(data=visa,x='no_of_employees',showfliers=False) plt.show() # With so few records removed, there is little difference in the plots for number of employees. Put another way, the comments on this feature from the first EDA are still applicable. The middle 50% of observations is between about 1000 and 3500. There are also tiny companies with only a handful of employees, and massive companies with over half a million employees, in our data set.
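# Before moving on to the models, a brief sanity check (a sketch using the objects defined above): inspect the encoded feature dtypes and confirm that stratification preserved the class balance across the splits.

# In[ ]:

# dtypes present after the ordinal mapping and one-hot encoding
print(X_train.dtypes.value_counts())
# the class balance should be nearly identical in the two splits (roughly two-thirds Certified)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))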
# ## Building bagging and boosting models # When scoring our models, we must find a good balance between false positives and false negatives. # * A false positive is a certified visa that should have been denied. We want to reduce false positives because these workers may be ineffective and not benefit the company. # * On the other hand, a false negative constitutes a denied visa that should have been certified. This deprives the workforce of an asset, someone who could benefit US companies. # # With this in mind, we will score our models on F1, as it balances recall and precision, reducing both false positives and false negatives. # ### Functions # In[ ]: def scores(model): '''Print training and testing metrics for the specified model.''' score_idx=['Accuracy','Precision','Recall','F1'] X_train_pred=model.predict(X_train) score_list_train=[metrics.accuracy_score(y_train,X_train_pred), metrics.precision_score(y_train,X_train_pred), metrics.recall_score(y_train,X_train_pred), metrics.f1_score(y_train,X_train_pred)] X_test_pred=model.predict(X_test) score_list_test=[metrics.accuracy_score(y_test,X_test_pred), metrics.precision_score(y_test,X_test_pred), metrics.recall_score(y_test,X_test_pred), metrics.f1_score(y_test,X_test_pred)] df=pd.DataFrame(index=score_idx) df['Train']=score_list_train df['Test']=score_list_test return df # This function gets the scores of a model on training data and testing data. # In[ ]: model_comp_table=pd.DataFrame(columns=['Train Acc','Test Acc','Train F1','Test F1','Status']) y_train_test_lens=[y_train.shape[0],y_test.shape[0]] def tabulate(model,name): '''Compute train/test accuracy and F1 for a given model. Add to table.''' # run predictions with model X_train_pred=model.predict(X_train) X_test_pred=model.predict(X_test) # run overfitting test train_acc_count=np.logical_not(np.logical_xor(y_train,X_train_pred)).sum() test_acc_count=np.logical_not(np.logical_xor(y_test,X_test_pred)).sum() t,p_val=proportions_ztest([train_acc_count,test_acc_count],y_train_test_lens) # assign rating based on p-value rating=str() if p_val<0.05: rating='Overfit' else: rating='General' # collect data for new table row model_comp_table.loc[name]=[metrics.accuracy_score(y_train,X_train_pred), metrics.accuracy_score(y_test,X_test_pred), metrics.f1_score(y_train,X_train_pred), metrics.f1_score(y_test,X_test_pred), rating] return model_comp_table # This function adds accuracy and F1 scores to a DataFrame that we can use to compare models. It also runs a two proportion z-test to determine whether the training accuracy and testing accuracy are different enough for the model to be considered overfit. We assume a level of significance of 5% in the interpretation of the p-value. All conditions are necessarily met to justify the test. 
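# Both helper functions above report F1. As a reminder of why that metric fits our goal of balancing false positives and false negatives, F1 is the harmonic mean of precision and recall. A minimal sketch with toy labels (purely illustrative, not the visa data):

# In[ ]:

y_true_toy=np.array([1,1,1,0,0,1,0,1])
y_pred_toy=np.array([1,0,1,0,1,1,0,1])
p=metrics.precision_score(y_true_toy,y_pred_toy)
r=metrics.recall_score(y_true_toy,y_pred_toy)
# the harmonic mean of precision and recall matches metrics.f1_score by definition
print(2*p*r/(p+r), metrics.f1_score(y_true_toy,y_pred_toy))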
# In[ ]: def confusion_heatmap(model,show_scores=False): '''Heatmap of confusion matrix of model predictions on test data.''' actual=y_test predicted=model.predict(X_test) # generate confusion matrix cm=metrics.confusion_matrix(actual,predicted) cm=np.flip(cm).T # heatmap labels labels=['TP','FP','FN','TN'] cm_labels=np.array(cm).flatten() cm_percents=np.round((cm_labels/np.sum(cm))*100,3) annot_labels=[] for i in range(4): annot_labels.append(str(labels[i])+'\nCount:'+str(cm_labels[i])+'\n'+str(cm_percents[i])+'%') annot_labels=np.array(annot_labels).reshape(2,2) # print figure plt.figure(figsize=(8,5)) plt.title('Confusion Matrix',fontsize=20) sns.heatmap(data=cm, annot=annot_labels, annot_kws={'fontsize':'x-large'}, xticklabels=[1,0], yticklabels=[1,0], cmap='Greens', fmt='s') plt.xlabel('Actual',fontsize=14) plt.ylabel('Predicted',fontsize=14) plt.tight_layout(); return # The function prints a heatmap with the confusion matrix for a given model. # ### Bagging # #### Baseline: Decision Tree # In[ ]: dtree=tree.DecisionTreeClassifier(random_state=1) dtree.fit(X_train,y_train) # As a baseline comparison, we train a single decision tree. # In[ ]: scores(dtree) # Predictably, this model is comically overfit. # In[ ]: tabulate(dtree,'dTree (baseline)') # We add this model to a DataFrame that we will use to compare the different models. # #### Bagging Classifier # In[ ]: bag=BaggingClassifier(random_state=1) bag.fit(X_train,y_train) # The default estimator on a bagging classifier is a decision tree classifier, so this model consists of many parallel decision trees trained on samples taken with replacement (bootstrap samples). # In[ ]: scores(bag) # In[ ]: tabulate(bag,'Bagging Classifier') # This model is also overfit. The test performance is nonetheless slightly better than the single decision tree. # #### Random Forest Classifier # In[ ]: rf=RandomForestClassifier(random_state=1) rf.fit(X_train,y_train) # In[ ]: scores(rf) # As with the others, the random forest classifier overfits on the training data. Test performance is better than before, though this model is not generalized. # In[ ]: tabulate(rf,'Random Forest') # ### Boosting # #### AdaBoost # In[ ]: abc=AdaBoostClassifier(random_state=1) abc.fit(X_train,y_train) # In[ ]: scores(abc) # The basic AdaBoost Classifier has good performance and does not suffer from overfitting. Accuracy on training and testing is around 73% and F1 is about 82%. This is a much more promising baseline. # In[ ]: tabulate(abc,'AdaBoost') # #### Gradient Boosting # In[ ]: gbc=GradientBoostingClassifier(random_state=1) gbc.fit(X_train,y_train) # In[ ]: scores(gbc) # Gradient Boosting also does not appear overfit, with the added bonus of slightly better performance over AdaBoost. # * Train and test accuracy is around 75%. # * Train and test F1 is around 82%. # In[ ]: tabulate(gbc,'Gradient Boosting') # #### XGBoost # In[ ]: xgbc=XGBClassifier(random_state=1) xgbc.fit(X_train,y_train) # In[ ]: scores(xgbc) # XGBoost offers near identical performance to the gradient boosting above. # In[ ]: tabulate(xgbc,'XGBoost') # ## Will tuning the hyperparameters improve the model performance? # ### Bagging # #### Tuned Decision Tree # In[ ]: dtree_tuned=tree.DecisionTreeClassifier(random_state=1) params={'max_depth':np.arange(3,10), 'min_samples_leaf':np.arange(5,10), 'max_features':[None,'sqrt'], 'max_leaf_nodes':np.arange(5,25,5), 'min_impurity_decrease':[0.0,0.0005,0.001], 'class_weight':[None,'balanced']} # * We will test values of tree depth from 3 to 9. 
# * We will consider limiting the maximum number of features to consider when calculating the best split. # * We will try balancing the class weights, though I don't expect this will improve performance. # In[ ]: go=GridSearchCV(estimator=dtree_tuned, param_grid=params, scoring='f1', n_jobs=-1, cv=5, verbose=1) go.fit(X_train,y_train) # In[ ]: go.best_params_ # Grid Search Findings # * Our response variable is not so unbalanced that a rebalancing of class weights is required. # * A maximum tree depth of 7 seems to be ideal, with 20 as the upper bound for leaf nodes. # * The model requires at least 5 samples for each leaf. Along with limiting the depth and maximum number of leaves, this reduces overfitting. # * We get best performance when considering all features to determine the best split. # In[ ]: dtree_tuned=go.best_estimator_ dtree_tuned.fit(X_train,y_train) # We refit the best estimator on the full training set. (In the grid search, we employ cross validation, so there is a holdout set used to test each model. The upside is a more reliable search, but the cost is that we have yet to train the best estimator on the full training set.) # In[ ]: scores(dtree_tuned) # In[ ]: tabulate(dtree_tuned,'dTree (tuned)') # With around 75% accuracy, the tuned decision tree has similar performance to untuned boosting models. F1 score is around 82%. Importantly, tuning this tree has eliminated overfitting. # #### Tuned Bagging Classifier # In[ ]: params={'base_estimator':[tree.DecisionTreeClassifier(random_state=2)], 'base_estimator__max_depth':[None,1,2,3,4,5], 'n_estimators':[10,20,30,40,50], 'max_samples':[0.7,0.8,0.9,1.0], 'max_features':[0.7,0.8,0.9,1.0],} # * We will test different depths for the base estimator (Decision Tree Classifier). # * We will consider between 10 and 50 estimators. # * Max samples and max features will also be adjusted, testing different fractions to include. # In[ ]: bag_tuned=BaggingClassifier(random_state=1) go=GridSearchCV(estimator=bag_tuned, param_grid=params, scoring='f1', n_jobs=-1, cv=5, verbose=1) go.fit(X_train,y_train) # In[ ]: go.best_params_ # * Trees with a depth of 5 score best on F1. # * Our model performs better with fewer estimators. # * Taking less than 100% of the samples yields a higher-scoring model. # In[ ]: bag_tuned=go.best_estimator_ bag_tuned.fit(X_train,y_train) # In[ ]: scores(bag_tuned) # Note that the tuned bagging classifier has great recall. This model does a great job at reducing false negatives, meaning the US job market is not losing out on promising talent. On the other hand, precision is lower, which leaves us with an F1 score comparable to other generalized models. # In[ ]: tabulate(bag_tuned,'Bagging clfr (tuned)') # As an experiment, we try another grid search with a second option for the base estimator: Logistic Regression. # In[ ]: params={'base_estimator':[tree.DecisionTreeClassifier(random_state=2), LogisticRegression(random_state=2,max_iter=1000)], 'n_estimators':[10,20,30,40,50], 'max_samples':[0.7,0.8,0.9,1.0], 'max_features':[0.7,0.8,0.9,1.0],} # In[ ]: bag_tuned=BaggingClassifier(random_state=1) go=GridSearchCV(estimator=bag_tuned, param_grid=params, scoring='f1', n_jobs=-1, cv=5, verbose=1) go.fit(X_train,y_train) # In[ ]: go.best_params_ # After fitting the grid search, we find that the decision tree classifier was still the strongest base estimator for our metric. However, this time more estimators were required. # In[ ]: scores(go) # It appears our model now overfits, as we have not limited the tree depth. 
We will not consider this model. # #### Tuned Random Forest # In[ ]: params={'n_estimators':np.append(np.arange(150,350,50),1000), 'max_depth':np.arange(1,6), 'min_samples_split':np.arange(4,9), 'max_features':['sqrt','log2'], 'max_samples':[0.25,0.5,0.75,1.0]} # * We consider a number of estimators between 150 and 300. We also consider a huge model: 1000 trees. # * Various depths will be studied. # * As with the previous bagging model, we will test various values for the maximum number of features and samples. # In[ ]: rf_tuned=RandomForestClassifier(random_state=1,warm_start=True) go=GridSearchCV(estimator=rf_tuned, param_grid=params, scoring='f1', n_jobs=-1, cv=5, verbose=1) go.fit(X_train,y_train) # In[ ]: go.best_params_ # * Limiting the number of features at each split by the square root of the total number of features yielded better results than by using log base 2. # * A maximum tree depth of 5 worked best. # * We required 250 estimators in the highest scoring model. # In[ ]: # train best estimator omitting warm_start parameter rf_tuned=RandomForestClassifier(max_depth=5, max_features='sqrt', max_samples=0.75, min_samples_split=8, n_estimators=250, random_state=1) rf_tuned.fit(X_train,y_train) # In[ ]: scores(rf_tuned) # Similar to the tuned bagging classifier above, we're getting over 90% recall with this model. Unfortunately, accuracy is still below 75% and F1 is no higher than before. # In[ ]: tabulate(rf_tuned,'Random Forest (tuned)') # ### Boosting # #### Tuned AdaBoost # In[ ]: params={'base_estimator':[tree.DecisionTreeClassifier(random_state=2)], 'base_estimator__max_depth':[1,2,3], 'n_estimators':np.arange(10,80,10), 'learning_rate':np.linspace(0.1,1,10)} # Next we'll train an AdaBoost classifier, using a decision tree as the base estimator. We will consider between 10 and 70 estimators within the ensemble, various depths of the trees, and several different learning rates. # In[ ]: abc_tuned=AdaBoostClassifier(random_state=1) go=GridSearchCV(estimator=abc_tuned, param_grid=params, scoring='f1', n_jobs=-1, cv=5, verbose=1) go.fit(X_train,y_train) # In[ ]: go.best_params_ # With a depth of 3 and a slower learning rate, our best model required 50 estimators. # In[ ]: abc_tuned=go.best_estimator_ abc_tuned.fit(X_train,y_train) # In[ ]: scores(abc_tuned) # We get similar F1 performance to the tuned random forest, but better accuracy. Recall is lower than some other models. # In[ ]: tabulate(abc_tuned,'AdaBoost (tuned)') # #### Tuned Gradient Boosting # In[ ]: params={'init':[None,'zero',AdaBoostClassifier(random_state=2)], 'n_estimators':[50,75,100,150], 'max_features':[None,'sqrt']+np.linspace(0.7,1.0,4).tolist()} # We will next tune the gradient booster with several different initialization conditions. We will also check the number of estimators and the maximum number of features when looking for the best split. # In[ ]: gbc_tuned=GradientBoostingClassifier(random_state=1, warm_start=True) go=GridSearchCV(estimator=gbc_tuned, param_grid=params, scoring='f1', n_jobs=-1, cv=5, verbose=1) go.fit(X_train,y_train) # In[ ]: go.best_params_ # In[ ]: # train best estimator omitting warm_start parameter gbc_tuned=GradientBoostingClassifier(random_state=1, init=AdaBoostClassifier(random_state=2), max_features=0.7, n_estimators=100) gbc_tuned.fit(X_train,y_train) # In[ ]: scores(gbc_tuned) # As with AdaBoost, our accuracy is over 75%. The F1 score does not seem to be able to improve above 82%. 
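# Beyond the single train/test split, GridSearchCV also records the cross-validated F1 of the winning configuration, a useful hedge against split-to-split noise (a quick check on the gradient boosting search object ```go``` fitted above):

# In[ ]:

print('Mean CV F1:',go.cv_results_['mean_test_score'][go.best_index_])
print('Std CV F1: ',go.cv_results_['std_test_score'][go.best_index_])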
# In[ ]: tabulate(gbc_tuned,'Grad Boost (tuned)') # #### Tuned XGBoost # In[ ]: params={'eta':[0.1,0.2,0.3], 'gamma':[0,2], 'subsample':[0.5,0.75,1.0], 'colsample_bytree':[0.5,0.75,1.0], 'colsample_bylevel':[0.5,0.75,1.0], 'scale_pos_weight':[0.5,1.0]} # We will try many parameters for XGBoost. # * The parameter eta is the learning rate for the model. # * The parameter gamma sets the minimum loss reduction required to make a further partition on a leaf node of a tree. # * The subsample and column-sampling parameters control what fraction of the rows and features each tree sees during fitting. # * The last parameter (```scale_pos_weight```) can help with imbalanced classes in the response variable. # In[ ]: xgbc_tuned=XGBClassifier(random_state=1) go=GridSearchCV(estimator=xgbc_tuned, param_grid=params, scoring='f1', cv=5, verbose=1) go.fit(X_train,y_train) # In[ ]: go.best_params_ # * A lower learning rate ended up being better for F1. # * The classifier did not need to adjust class weights for better performance. # * While ```colsample_bytree``` is less than 100%, both ```colsample_bylevel``` and ```subsample``` are maxed out at 1. # In[ ]: xgbc_tuned=go.best_estimator_ xgbc_tuned.fit(X_train,y_train) # In[ ]: scores(xgbc_tuned) # After training the best estimator on the full training set, we find that performance is comparable, in fact nearly identical, to the other tuned boosting models. # In[ ]: tabulate(xgbc_tuned,'XGBoost (tuned)') # #### Stacking Classifier # In[ ]: ests=[('Decision Tree',dtree_tuned), ('Bagging Classifier',bag_tuned), ('Random Forest',rf_tuned), ('AdaBoost',abc_tuned), ('Gradient Boosting',gbc_tuned), ('XGBoost',xgbc_tuned)] # With all six tuned models in hand, we will now combine them in a stacking classifier in the hope of gaining additional predictive power. # In[ ]: stack=StackingClassifier(estimators=ests, final_estimator=XGBClassifier(random_state=1), cv=5) stack.fit(X_train,y_train) # In[ ]: scores(stack) # The F1 score is at best several thousandths higher than that of the other well-performing models. The improvement is not as drastic as could be hoped. # In[ ]: tabulate(stack,'Stacking') # ## Model Performance Comparison and Conclusions # ### Best Model # We will examine the best model and compare it with some of the other high-performing models. # In[ ]: mct=model_comp_table mct # We will define the 'best model' as the one with the highest F1 score on testing data. # In[ ]: mct.loc[mct['Status']=='General']['Test F1'].idxmax() # The best generalized model is the tuned gradient boosting classifier. # In[ ]: mct.loc[mct['Status']=='General']['Test Acc'].idxmax() # In[ ]: mct.loc['Grad Boost (tuned)'] # Incidentally, the tuned gradient boosting classifier is also the most accurate on test data, with over 75% accuracy. The F1 score of this model is a bit over 82%. # In[ ]: confusion_heatmap(gbc_tuned) # The tuned gradient boosting classifier allows few false negatives, but comparatively more false positives. This explains the higher recall score, as compared to precision. # # The percentage of true negatives is nearly identical to the percentage of false positives, around 16% each. This means our model does a poor job of distinguishing Denied cases, as its predictions on them are no better than random guessing. Put another way, there is only about a 50% chance that our model will correctly identify when a case is Denied. # In[ ]: # z-test num_obs=1239+1295 t,p_val=proportions_ztest([1239,1295],[num_obs,num_obs]) print('The p-value is',p_val) # We can run a two proportions z-test, comparing the true negative and false positive counts, to check whether the model identifies Denied cases any better than a coin flip.
With a p-value greater than 0.05, we cannot conclude that it does; when it comes to Denied cases, the model's predictions are statistically indistinguishable from random guessing. # In[ ]: # feature importance ser=pd.Series(gbc_tuned.feature_importances_,index=X_train.columns) ser=ser.sort_values(ascending=False) print('Feature importance') print('-'*10) ser # * Education of the applicant is most important, i.e., has the greatest predictive impact on ```case_status```. # * Job experience is the next most influential feature. # * Continent of origin has less importance, as do the years in business and the full-time status of the role. # In[ ]: plott() plt.title('Feature Importances in Tuned Gradient Boosting Classifier',fontsize=14) sns.barplot(x=ser,y=ser.index) plt.xlabel('Importance'); # Relative importance can be gleaned from the plot above. Education is massively more important than anything else. # ### Other Good Models # #### Tuned Decision Tree # In[ ]: mct.loc['dTree (tuned)'] # The tuned decision tree also performed well on test data, with around 75% accuracy and an 82% F1 score. This appears to be the performance ceiling for this feature set. # In[ ]: confusion_heatmap(dtree_tuned) # Our confusion matrix for the tuned decision tree looks quite similar to that for the tuned gradient boosting classifier. It seems there is a trend: our models struggle to identify Denied applications. # In[ ]: # feature importance ser=pd.Series(dtree_tuned.feature_importances_,index=X_train.columns) ser=ser.sort_values(ascending=False) plott() plt.title('Feature Importances in Tuned Decision Tree',fontsize=14) sns.barplot(x=ser,y=ser.index) plt.xlabel('Importance'); # Once more, education and job experience are the most important features in predicting ```case_status```. # #### Stacking Classifier # In[ ]: mct.loc['Stacking'] # Despite stacking multiple models together, nothing was gained by implementing the stacking classifier. Accuracy is still around 75% and F1 is at 82%. # # It is possible that most of the models misclassified the same records, i.e., had the same false positives and false negatives. If this is the case, then stacking them together would not add any predictive power. # # We can get an idea from the confusion matrix. # In[ ]: confusion_heatmap(stack) # Here, it appears the stacking classifier is slightly better than the earlier models at predicting Denied cases. We can check whether its performance is better than random guessing with a two proportions z-test (significance=0.05). # In[ ]: # z-test num_obs=1227+1307 t,p_val=proportions_ztest([1227,1307],[num_obs,num_obs]) print('The p-value is',p_val) # With a p-value less than 0.05, we find that there is a significant difference between the counts of true negatives and false positives. Thus the stacking classifier is modestly better at identifying Denied cases than random guessing. # ## Actionable Insights and Recommendations # * Our model can correctly predict 75% of case statuses. While this is fairly good, and certainly better than guessing, there's much room for improvement. # * Our chosen scoring metric was F1, as it provides a good balance of recall and precision. Models maxed out at 82% for the F1 score. Meta-ensemble methods like the stacking classifier could not further improve this score. # * Level of education has the greatest influence on case status prediction. One way for OFLC to manage the number of applications is to process an initial short-form application to immediately rule out candidates. If this approach were used, level of education would be one of the most helpful questions to ask.
# * Previous job experience also has a big impact on case status. This too would be good to check in an initial automated screening. If the candidate passes the screening, they would be permitted to submit a complete application to be reviewed by a case reviewer at OFLC. # * As was seen in the EDA, a higher level of education is associated with a greater chance of being Certified. Based on those certification rates, the strongest profile has a Doctorate and previous job experience, comes from Europe, and is seeking part-time employment in the Midwest. # * Using machine learning models like those developed here, OFLC could automate the initial screening itself. We could, for example, use a stacking classifier with a number of the estimators found above and logistic regression as the final estimator. We would then define two cutoffs: a high probability of Certification and a high probability of Denial. Statuses could be assigned automatically to the most clear-cut cases, leaving the more complicated cases near the decision boundary for the reviewers at OFLC to consider (a sketch of this triage logic follows below). In effect, we would cut down on the number of applications that need to be reviewed by a human, while still allowing the US job market to grow.
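# A minimal sketch of the two-cutoff triage described above, using the fitted ```stack``` model and the test split from earlier cells. The thresholds 0.8 and 0.2 are illustrative assumptions, not tuned values; a production version would calibrate them against an acceptable error rate.

# In[ ]:

# estimated probability of Certified for each application in the test split
proba=stack.predict_proba(X_test)[:,1]
# route clear-cut cases automatically; everything in between goes to a human reviewer
triage=pd.Series(np.select([proba>=0.80, proba<=0.20],
                           ['Auto-certify','Auto-deny'],
                           default='Human review'),
                 index=X_test.index)
print(triage.value_counts())
print('Share routed to human reviewers:',(triage=='Human review').mean())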