import pandas as pd
pd.options.display.max_columns=None
datasets = pd.read_csv('./inputs/HR-Employee-Attrition.csv')
datasets.head()
Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | Over18 | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | 2 | Female | 94 | 3 | 2 | Sales Executive | 4 | Single | 5993 | 19479 | 8 | Y | Yes | 11 | 3 | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | 3 | Male | 61 | 2 | 2 | Research Scientist | 2 | Married | 5130 | 24907 | 1 | Y | No | 23 | 4 | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | 4 | Male | 92 | 2 | 1 | Laboratory Technician | 3 | Single | 2090 | 2396 | 6 | Y | Yes | 15 | 3 | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | 4 | Female | 56 | 3 | 1 | Research Scientist | 3 | Married | 2909 | 23159 | 1 | Y | Yes | 11 | 3 | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | 1 | Male | 40 | 3 | 1 | Laboratory Technician | 2 | Married | 3468 | 16632 | 9 | Y | No | 12 | 3 | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
datasets.shape
(1470, 35)
A look at the data shows a mix of categorical and numerical features.
datasets.dtypes
Age                         int64
Attrition                   object
BusinessTravel              object
DailyRate                   int64
Department                  object
DistanceFromHome            int64
Education                   int64
EducationField              object
EmployeeCount               int64
EmployeeNumber              int64
EnvironmentSatisfaction     int64
Gender                      object
HourlyRate                  int64
JobInvolvement              int64
JobLevel                    int64
JobRole                     object
JobSatisfaction             int64
MaritalStatus               object
MonthlyIncome               int64
MonthlyRate                 int64
NumCompaniesWorked          int64
Over18                      object
OverTime                    object
PercentSalaryHike           int64
PerformanceRating           int64
RelationshipSatisfaction    int64
StandardHours               int64
StockOptionLevel            int64
TotalWorkingYears           int64
TrainingTimesLastYear       int64
WorkLifeBalance             int64
YearsAtCompany              int64
YearsInCurrentRole          int64
YearsSinceLastPromotion     int64
YearsWithCurrManager        int64
dtype: object
Target variable: Attrition
Convert Yes/No to 1/0.
datasets['Attrition_idx'] = datasets['Attrition']\
.apply(lambda x: 1 if x == 'Yes' else 0)
datasets.head()
Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | Over18 | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | Attrition_idx | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | 2 | Female | 94 | 3 | 2 | Sales Executive | 4 | Single | 5993 | 19479 | 8 | Y | Yes | 11 | 3 | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 | 1 |
1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | 3 | Male | 61 | 2 | 2 | Research Scientist | 2 | Married | 5130 | 24907 | 1 | Y | No | 23 | 4 | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 | 0 |
2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | 4 | Male | 92 | 2 | 1 | Laboratory Technician | 3 | Single | 2090 | 2396 | 6 | Y | Yes | 15 | 3 | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 | 1 |
3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | 4 | Female | 56 | 3 | 1 | Research Scientist | 3 | Married | 2909 | 23159 | 1 | Y | Yes | 11 | 3 | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 | 0 |
4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | 1 | Male | 40 | 3 | 1 | Laboratory Technician | 2 | Married | 3468 | 16632 | 9 | Y | No | 12 | 3 | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 | 0 |
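As an aside, an equivalent mapping with Series.map fails more loudly than the lambda above, which silently turns any label other than 'Yes' into 0; a minimal alternative sketch:

# Map the string labels directly; labels other than Yes/No become NaN instead of 0.
datasets['Attrition_idx'] = datasets['Attrition'].map({'Yes': 1, 'No': 0})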
col_names = datasets.columns
col_names
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department', 'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount', 'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager', 'Attrition_idx'], dtype='object')
Some variables carry no useful information: EmployeeCount, EmployeeNumber, Over18, StandardHours.
print(datasets.Over18.value_counts())
print(datasets.EmployeeCount.value_counts())
print(datasets.StandardHours.value_counts())
Y    1470
Name: Over18, dtype: int64
1    1470
Name: EmployeeCount, dtype: int64
80    1470
Name: StandardHours, dtype: int64
# Exclude the target and the uninformative columns from the feature list.
col_names = col_names\
.drop(['Attrition_idx', 'Attrition', 'Over18',
'EmployeeCount', 'EmployeeNumber', 'StandardHours'])
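The constant columns could also be detected programmatically instead of read off by eye; a small sketch (the variable name is illustrative, and the expected output follows from the value counts above):

# A column with a single unique value carries no signal for any model.
constant_cols = [c for c in datasets.columns if datasets[c].nunique() == 1]
print(constant_cols)  # ['EmployeeCount', 'Over18', 'StandardHours']
# EmployeeNumber appears to be a per-row identifier rather than a constant,
# so it is dropped separately.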
Now let's deal with the columns themselves. First, separate the categorical columns from the numerical ones.
categorical_features = []
numerical_features = []
target = 'Attrition_idx'
# Split the features into two groups by dtype.
for col in col_names:
if datasets[col].dtype == 'O':
categorical_features.append(col)
else:
numerical_features.append(col)
print('Number of categorical features :', len(categorical_features))
print('Number of numerical features :', len(numerical_features))
Number of categorical features : 7
Number of numerical features : 23
categorical_features
['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'OverTime']
numerical_features
['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EnvironmentSatisfaction', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']
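The same split can be written more compactly with select_dtypes; a sketch equivalent to the loop above:

categorical_features = datasets[col_names].select_dtypes(include=['object']).columns.tolist()
numerical_features = datasets[col_names].select_dtypes(exclude=['object']).columns.tolist()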
Convert the categorical data to one-hot vectors, using get_dummies from pandas.
categorical_datasets = pd.get_dummies(datasets[categorical_features])
categorical_datasets.head()
BusinessTravel_Non-Travel | BusinessTravel_Travel_Frequently | BusinessTravel_Travel_Rarely | Department_Human Resources | Department_Research & Development | Department_Sales | EducationField_Human Resources | EducationField_Life Sciences | EducationField_Marketing | EducationField_Medical | EducationField_Other | EducationField_Technical Degree | Gender_Female | Gender_Male | JobRole_Healthcare Representative | JobRole_Human Resources | JobRole_Laboratory Technician | JobRole_Manager | JobRole_Manufacturing Director | JobRole_Research Director | JobRole_Research Scientist | JobRole_Sales Executive | JobRole_Sales Representative | MaritalStatus_Divorced | MaritalStatus_Married | MaritalStatus_Single | OverTime_No | OverTime_Yes | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
4 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
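One optional variation: passing drop_first=True to get_dummies removes one redundant column per feature (e.g. OverTime_No is always 1 - OverTime_Yes). A decision tree is insensitive to this redundancy, but it matters for linear models; a sketch, not used below:

# Drop one dummy per categorical feature to avoid perfectly collinear columns.
categorical_datasets = pd.get_dummies(datasets[categorical_features], drop_first=True)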
numerical_datasets = datasets[numerical_features]
numerical_datasets.head()
Age | DailyRate | DistanceFromHome | Education | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | JobSatisfaction | MonthlyIncome | MonthlyRate | NumCompaniesWorked | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41 | 1102 | 1 | 2 | 2 | 94 | 3 | 2 | 4 | 5993 | 19479 | 8 | 11 | 3 | 1 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
1 | 49 | 279 | 8 | 1 | 3 | 61 | 2 | 2 | 2 | 5130 | 24907 | 1 | 23 | 4 | 4 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
2 | 37 | 1373 | 2 | 2 | 4 | 92 | 2 | 1 | 3 | 2090 | 2396 | 6 | 15 | 3 | 2 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
3 | 33 | 1392 | 3 | 4 | 4 | 56 | 3 | 1 | 3 | 2909 | 23159 | 1 | 11 | 3 | 3 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
4 | 27 | 591 | 2 | 1 | 1 | 40 | 3 | 1 | 2 | 3468 | 16632 | 9 | 12 | 3 | 4 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
Concatenate the categorical and numerical datasets; the combined table is the feature matrix fed to the model.
X = pd.concat([categorical_datasets, numerical_datasets], axis=1)
X.head()
BusinessTravel_Non-Travel | BusinessTravel_Travel_Frequently | BusinessTravel_Travel_Rarely | Department_Human Resources | Department_Research & Development | Department_Sales | EducationField_Human Resources | EducationField_Life Sciences | EducationField_Marketing | EducationField_Medical | EducationField_Other | EducationField_Technical Degree | Gender_Female | Gender_Male | JobRole_Healthcare Representative | JobRole_Human Resources | JobRole_Laboratory Technician | JobRole_Manager | JobRole_Manufacturing Director | JobRole_Research Director | JobRole_Research Scientist | JobRole_Sales Executive | JobRole_Sales Representative | MaritalStatus_Divorced | MaritalStatus_Married | MaritalStatus_Single | OverTime_No | OverTime_Yes | Age | DailyRate | DistanceFromHome | Education | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | JobSatisfaction | MonthlyIncome | MonthlyRate | NumCompaniesWorked | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 41 | 1102 | 1 | 2 | 2 | 94 | 3 | 2 | 4 | 5993 | 19479 | 8 | 11 | 3 | 1 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 49 | 279 | 8 | 1 | 3 | 61 | 2 | 2 | 2 | 5130 | 24907 | 1 | 23 | 4 | 4 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 37 | 1373 | 2 | 2 | 4 | 92 | 2 | 1 | 3 | 2090 | 2396 | 6 | 15 | 3 | 2 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 33 | 1392 | 3 | 4 | 4 | 56 | 3 | 1 | 3 | 2909 | 23159 | 1 | 11 | 3 | 3 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
4 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 27 | 591 | 2 | 1 | 1 | 40 | 3 | 1 | 2 | 3468 | 16632 | 9 | 12 | 3 | 4 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
y = datasets[target]
y.head()
0    1
1    0
2    1
3    0
4    0
Name: Attrition_idx, dtype: int64
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = \
train_test_split(X, y, test_size=0.2, random_state=42)
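Since the target turns out to be quite imbalanced (explored below), a stratified split that preserves the 0/1 ratio in both partitions is often safer; a sketch, not used for the results that follow:

# stratify=y keeps the class proportions identical in train and test.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)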
Build a classifier using a decision tree, and find the following hyperparameters via grid search.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
params = {
'max_depth': [5,7,9],
'min_samples_split': [2],
'min_samples_leaf': [1, 2, 3, 4]
}
grid_search_cv = \
GridSearchCV(
DecisionTreeClassifier(random_state=42),
params,
n_jobs=-1,
verbose=1,
cv=3)
grid_search_cv.fit(x_train, y_train)
Fitting 3 folds for each of 12 candidates, totalling 36 fits
[Parallel(n_jobs=-1)]: Done 36 out of 36 | elapsed: 0.2s finished
GridSearchCV(cv=3, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini',
            max_depth=None, max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best'),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'max_depth': [5, 7, 9], 'min_samples_split': [2],
                   'min_samples_leaf': [1, 2, 3, 4]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)
# From the search results, take the tree model with the best-scoring parameters.
tree_classifier = grid_search_cv.best_estimator_
tree_classifier
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best')
# Check the best cross-validation score.
grid_search_cv.best_score_
0.8273809523809523
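The winning parameter combination can also be read off directly (values as in the estimator repr above):

grid_search_cv.best_params_
# e.g. {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2}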
pred_train = tree_classifier.predict(x_train)
pred_test = tree_classifier.predict(x_test)
from sklearn.metrics import accuracy_score, classification_report
Let's look at the results on the training set.
# 1. Confusion Matrix
print('\n Train Confusion Matrix :')
display(pd.crosstab(y_train, pred_train, rownames=['Actual'], colnames=['Predict']))
# 2. Accuracy
print('\n Train accuracy :', accuracy_score(y_train, pred_train))
# 3. Classification Report
print('\n Classification Report : \n', classification_report(y_train, pred_train))
Train Confusion Matrix :
Predict | 0 | 1 |
---|---|---|
Actual | ||
0 | 951 | 27 |
1 | 98 | 100 |
Train accuracy : 0.8937074829931972

Classification Report :
              precision    recall  f1-score   support

          0       0.91      0.97      0.94       978
          1       0.79      0.51      0.62       198

avg / total       0.89      0.89      0.88      1176
Let's look at the results on the test set.
# 1. Confusion Matrix
print('\n Test Confusion Matrix :')
display(pd.crosstab(y_test, pred_test, rownames=['Actual'], colnames=['Predict']))
# 2. Accuracy
print('\n Test accuracy :', accuracy_score(y_test, pred_test))
# 3. Classification Report
print('\n Classification Report : \n', classification_report(y_test, pred_test))
Test Confusion Matrix :
Predict | 0 | 1 |
---|---|---|
Actual | ||
0 | 235 | 20 |
1 | 33 | 6 |
Test accuracy : 0.8197278911564626

Classification Report :
              precision    recall  f1-score   support

          0       0.88      0.92      0.90       255
          1       0.23      0.15      0.18        39

avg / total       0.79      0.82      0.80       294
The test accuracy of about 82% is not as meaningful as it sounds.
datasets.Attrition_idx.value_counts()
0    1233
1     237
Name: Attrition_idx, dtype: int64
1233/1470
0.8387755102040816
The counts show that class 0 outnumbers class 1 by roughly 5:1. A classifier that simply labels every sample 0 would therefore already score about 83.9% accuracy. In other words, the model is doing a poor job of actually identifying class 1 (employees who leave).
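This majority-class baseline can be checked directly with a dummy classifier; a minimal sketch (the variable name is illustrative; on this particular test split the majority class covers 255 of 294 samples, about 86.7%):

from sklearn.dummy import DummyClassifier

# Always predict the most frequent training class (0 = stays).
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(x_train, y_train)
print('Baseline test accuracy :', baseline.score(x_test, y_test))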
Suppose this model were used to prevent attrition by offering larger bonuses to employees who look likely to quit. That could go seriously wrong: among the employees the model predicts will stay, the share who actually quit is quite high.
Let's tune the model a little by adjusting the class weights. Raising the weight of class 1 (leavers), for example, makes the model better at catching employees who actually tend to quit, at the cost of flagging some employees who would never quit as potential leavers. That is, we get better at preventing attrition. (Compare lending: when issuing loans, it is better to reject a few marginally creditworthy applicants than to approve a low-credit one; the extra false positives are an affordable error.)
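scikit-learn can also derive the weights automatically from the class frequencies; a minimal sketch using class_weight='balanced' as an alternative to the manual sweep below (the variable name is illustrative):

# 'balanced' weights each class by n_samples / (n_classes * class_count),
# so the rarer class 1 (leavers) is upweighted by roughly 5x here.
balanced_tree = DecisionTreeClassifier(max_depth=5, random_state=42,
                                       class_weight='balanced')
balanced_tree.fit(x_train, y_train)
print(classification_report(y_test, balanced_tree.predict(x_test)))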
Test the model while varying the class weights.
import numpy as np
# One row per class-weight setting tried below (11 settings x 10 metrics).
tuning_results = pd.DataFrame(np.empty((11, 10)))
tuning_results.columns = ['class_0_weight', 'class_1_weight',
                          'train_accuracy', 'test_accuracy',
                          'precision_class_0', 'precision_class_1', 'precision_overall',
                          'recall_class_0', 'recall_class_1', 'recall_overall']
# Used below to locate each metric's position in the split report string.
print(classification_report(y_test, pred_test).split())
['precision', 'recall', 'f1-score', 'support', '0', '0.88', '0.92', '0.90', '255', '1', '0.23', '0.15', '0.18', '39', 'avg', '/', 'total', '0.79', '0.82', '0.80', '294']
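Note that splitting the formatted string is fragile (the report layout changes between scikit-learn versions). In scikit-learn 0.20 and later the report is also available as a nested dict; a sketch:

report = classification_report(y_test, pred_test, output_dict=True)
print(report['1']['precision'], report['1']['recall'])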
class_0_weight = [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99]
for i in range(len(class_0_weight)):
class_weights = {0: class_0_weight[i], 1: 1 - class_0_weight[i]}
tree_classifier = DecisionTreeClassifier(criterion='gini',
max_depth=5,
min_samples_split=2,
min_samples_leaf=1,
random_state=42,
class_weight=class_weights)
tree_classifier.fit(x_train, y_train)
pred_train = tree_classifier.predict(x_train)
pred_test = tree_classifier.predict(x_test)
tuning_results.loc[i, 'class_0_weight'] = class_weights[0]
tuning_results.loc[i, 'class_1_weight'] = class_weights[1]
tuning_results.loc[i, 'train_accuracy'] = round(accuracy_score(y_train, pred_train), 4)
tuning_results.loc[i, 'test_accuracy'] = round(accuracy_score(y_test, pred_test), 4)
c_r = classification_report(y_test, pred_test).split()
tuning_results.loc[i, 'precision_class_0'] = float(c_r[5])
tuning_results.loc[i, 'precision_class_1'] = float(c_r[10])
tuning_results.loc[i, 'precision_overall'] = float(c_r[17])
    tuning_results.loc[i, 'recall_class_0'] = float(c_r[6])
tuning_results.loc[i, 'recall_class_1'] = float(c_r[11])
tuning_results.loc[i, 'recall_overall'] = float(c_r[18])
print(class_weights)
print('Test accuracy :', accuracy_score(y_test, pred_test))
display(pd.crosstab(y_test, pred_test, rownames=['Actual'], colnames=['Predict']))
{0: 0.01, 1: 0.99}
Test accuracy : 0.2925170068027211
Predict | 0 | 1 |
---|---|---|
Actual | ||
0 | 50 | 205 |
1 | 3 | 36 |
{0: 0.1, 1: 0.9}
Test accuracy : 0.6972789115646258
Predict | 0 | 1 |
---|---|---|
Actual | ||
0 | 183 | 72 |
1 | 17 | 22 |
{0: 0.2, 1: 0.8}
Test accuracy : 0.7891156462585034
Predict | 0 | 1 |
---|---|---|
Actual | ||
0 | 216 | 39 |
1 | 23 | 16 |
{0: 0.3, 1: 0.7}
Test accuracy : 0.8095238095238095
Predict | 0 | 1 |
---|---|---|
Actual | ||
0 | 226 | 29 |
1 | 27 | 12 |
{0: 0.4, 1: 0.6}
Test accuracy : 0.7993197278911565
Predict | 0 | 1 |
---|---|---|
Actual | ||
0 | 225 | 30 |
1 | 29 | 10 |
{0: 0.5, 1: 0.5}
Test accuracy : 0.8197278911564626
Predict | 0 | 1 |
---|---|---|
Actual | ||
0 | 235 | 20 |
1 | 33 | 6 |
{0: 0.6, 1: 0.4}
Test accuracy : 0.8469387755102041
Predict | 0 | 1 |
---|---|---|
Actual | ||
0 | 247 | 8 |
1 | 37 | 2 |
{0: 0.7, 1: 0.30000000000000004}
Test accuracy : 0.8537414965986394
Predict | 0 | 1 |
---|---|---|
Actual | ||
0 | 248 | 7 |
1 | 36 | 3 |
{0: 0.8, 1: 0.19999999999999996}
Test accuracy : 0.8571428571428571
Predict | 0 | 1 |
---|---|---|
Actual | ||
0 | 250 | 5 |
1 | 37 | 2 |
{0: 0.9, 1: 0.09999999999999998}
Test accuracy : 0.8673469387755102
Predict | 0 | 1 |
---|---|---|
Actual | ||
0 | 253 | 2 |
1 | 37 | 2 |
{0: 0.99, 1: 0.010000000000000009}
Test accuracy : 0.8707482993197279
Predict | 0 | 1 |
---|---|---|
Actual | ||
0 | 255 | 0 |
1 | 38 | 1 |
tuning_results
class_0_weight | class_1_weight | train_accuracy | test_accuracy | precision_class_0 | precision_class_1 | precision_overall | recall_class_0 | recall_class_1 | recall_overall | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.01 | 0.99 | 0.3580 | 0.2925 | 0.94 | 0.15 | 0.84 | 0.20 | 0.92 | 0.29 |
1 | 0.10 | 0.90 | 0.7976 | 0.6973 | 0.92 | 0.23 | 0.82 | 0.72 | 0.56 | 0.70 |
2 | 0.20 | 0.80 | 0.8759 | 0.7891 | 0.90 | 0.29 | 0.82 | 0.85 | 0.41 | 0.79 |
3 | 0.30 | 0.70 | 0.8912 | 0.8095 | 0.89 | 0.29 | 0.81 | 0.89 | 0.31 | 0.81 |
4 | 0.40 | 0.60 | 0.8903 | 0.7993 | 0.89 | 0.25 | 0.80 | 0.88 | 0.26 | 0.80 |
5 | 0.50 | 0.50 | 0.8937 | 0.8197 | 0.88 | 0.23 | 0.79 | 0.92 | 0.15 | 0.82 |
6 | 0.60 | 0.40 | 0.8954 | 0.8469 | 0.87 | 0.20 | 0.78 | 0.97 | 0.05 | 0.85 |
7 | 0.70 | 0.30 | 0.8963 | 0.8537 | 0.87 | 0.30 | 0.80 | 0.97 | 0.08 | 0.85 |
8 | 0.80 | 0.20 | 0.8869 | 0.8571 | 0.87 | 0.29 | 0.79 | 0.98 | 0.05 | 0.86 |
9 | 0.90 | 0.10 | 0.8622 | 0.8673 | 0.87 | 0.50 | 0.82 | 0.99 | 0.05 | 0.87 |
10 | 0.99 | 0.01 | 0.8435 | 0.8707 | 0.87 | 1.00 | 0.89 | 1.00 | 0.03 | 0.87 |
As the weight on class 0 grows, the model predicts class 0 more and more often. Because so many samples get assigned to class 0, its recall rises (eventually reaching 1.00), but its precision falls, since more and more actual leavers are swept into the class 0 predictions. The opposite happens to class 1.
Looking at the results, a class 0 weight of 0.3 gives a reasonable balance of accuracy, precision, and recall.
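To make that choice concrete, the final model would be refit with those weights; a short sketch assuming the hyperparameters found by the grid search (the variable name is illustrative):

# Refit with the weights chosen from the sweep above.
final_tree = DecisionTreeClassifier(max_depth=5, min_samples_split=2,
                                    min_samples_leaf=1, random_state=42,
                                    class_weight={0: 0.3, 1: 0.7})
final_tree.fit(x_train, y_train)
print(classification_report(y_test, final_tree.predict(x_test)))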