In this guided project, we worked with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. You can find the TAFE exit survey here and the survey for the DETE here.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
dete_survey = pd.read_csv('dete_survey.csv')
tafe_survey = pd.read_csv('tafe_survey.csv')
dete_survey.head()
dete_survey.info()
dete_survey.isnull().sum()
tafe_survey.head()
tafe_survey.info()
tafe_survey.isnull().sum()
dete_survey
which contain 'Not Stated' rather than NaN. While cleaning data we have to check uniformity in the columns: the values in a column should share the same datatype and belong to the same group as the rest of the data. For example, if a column contains the marks of a student, we should make sure there are no values that represent grades or percentages. # missing values are marked by 'Not Stated', so we pass na_values while reading
dete_survey = pd.read_csv('dete_survey.csv',na_values = 'Not Stated')
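As a quick sanity check of `na_values` (using a hypothetical inline CSV rather than the actual survey file), the sentinel string is parsed as NaN at read time:

```python
import io

import pandas as pd

# Hypothetical miniature CSV mimicking the 'Not Stated' sentinel in dete_survey.csv
csv_text = "id,cease_date\n1,2012\n2,Not Stated\n"

# Without na_values, 'Not Stated' is kept as an ordinary string
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, it is parsed as a proper missing value
clean = pd.read_csv(io.StringIO(csv_text), na_values='Not Stated')
```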
# column names are better ordered in dete_survey, so we will rename columns in tafe_survey to match dete_survey
# first, we will standardize the column names in dete_survey
dete_survey.columns = dete_survey.columns.str.strip().str.lower().str.replace(' ','_')
dete_survey.columns
dete_survey.rename({'separationtype': 'separation_type'}, axis=1, inplace=True)
dete_survey.head()
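The strip/lower/replace chain can be illustrated on a small set of hypothetical messy headers:

```python
import pandas as pd

# Hypothetical frame with messy headers, similar in spirit to dete_survey
df = pd.DataFrame(columns=['Cease Date', ' DETE Start Date', 'Role Start Date'])

# Trim whitespace, lowercase, and snake_case every column name in one chain
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
```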
# renaming in tafe_survey
columns = {'Record ID': 'id',
'CESSATION YEAR':'cease_date',
'Reason for ceasing employment':'separation_type',
'Gender. What is your Gender?':'gender',
'CurrentAge. Current Age':'age',
'Employment Type. Employment Type':'employment_status',
'Classification. Classification':'position',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)':'institute_service',
'LengthofServiceCurrent. Length of Service at current workplace (in years)':'role_service',
'Contributing Factors. Dissatisfaction':'factors_diss',
'Contributing Factors. Job Dissatisfaction':'factors_job_diss'
}
tafe_survey = tafe_survey.rename(columns = columns)
tafe_survey.columns
# now we will delete some columns we don't need
# columns 28-48 hold coded values with no particular meaning, and there is no guide to weigh them,
# so they are of no use to us
# the last 5 columns in dete_survey are dominated by null values
dete_survey = dete_survey.drop(dete_survey.columns[-5:], axis = 1)
dete_survey = dete_survey.drop(dete_survey.columns[28:49], axis = 1)
# we choose columns to drop on the same basis as we did for dete_survey
tafe_survey = tafe_survey.drop(tafe_survey.columns[17:66], axis = 1)
dete_survey.head()
tafe_survey.head()
# checking for duplicates by id
dete_survey.duplicated(['id']).value_counts()
tafe_survey.duplicated(['id']).value_counts()
For this project we only need to focus on the people who resigned, so we'll filter both datasets to keep the rows whose separation_type contains the word 'Resignation'.
dete_survey['separation_type'].value_counts(dropna = False)
Separation due to all resignation types combined is around 53%, followed by Age Retirement at around 35%. We will see below that resignation in tafe_survey is just under 50%. These high proportions of resignations call for some answers, which we will try to provide after preparing our data.
tafe_survey['separation_type'].value_counts(dropna = False)
In the DETE survey dataset the column separation_type contains 3 values that include the word 'Resignation', whereas the TAFE survey dataset has only one.
# filtering out separations due to resignation
dete_survey['separation_type'] = dete_survey['separation_type'].str.split('-').str[0]
dete_resignations = dete_survey[dete_survey['separation_type'].str.contains(r'Resignation')].copy()
tafe_resignations = tafe_survey[tafe_survey['separation_type'].str.contains(r'Resignation', na=False)].copy()
tafe_resignations.head()
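The `na=False` argument matters here: without it, rows whose separation_type is missing would propagate NaN into the boolean mask and break the filter. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series(['Resignation-Other reasons', np.nan, 'Age Retirement'])

# na=False treats missing entries as non-matches instead of NaN
mask = s.str.contains('Resignation', na=False)
```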
Now we will check the validity of the dates and years in our dataset. Starting with `cease_date`: the last year of a person's employment should not be later than the current year (the year the data was collected). Similarly, for `dete_start_date`, we can assume that people working here won't be over 60 and that they started working in their 20s, so the smallest plausible value for this column is around 1935.
dete_resignations['cease_date'].value_counts()
The dates come in mixed formats (some are plain years, others include a month), so we will extract just the year.
# cleaning the column by extracting the years
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.extract(r'(?P<Years>[1-2][0-9]{3})')
dete_resignations['cease_date'] = dete_resignations['cease_date'].astype(float)
dete_resignations['cease_date'].value_counts()
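The year-extraction regex can be sanity-checked on the mixed formats we just saw; a small sketch with representative values:

```python
import pandas as pd

# Mixed date formats like those observed in dete_resignations['cease_date']
s = pd.Series(['2012', '05/2012', '09/2010'])

# Pull out the four-digit year (1000-2999) as a named group, then cast to float
years = s.str.extract(r'(?P<Years>[1-2][0-9]{3})')['Years'].astype(float)
```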
dete_resignations['dete_start_date'] = dete_resignations['dete_start_date'].astype(float)
dete_resignations['dete_start_date'].value_counts()
# nothing to change or extract here
tafe_resignations['cease_date'].value_counts()
We have verified the date columns and none of the dates violates our assumptions. We needed to verify this because we are going to sort and filter the employees based on their experience.
Notice that the tafe_resignations dataframe already contains a "service" column, which we renamed to institute_service. In order to analyze both surveys together, we'll have to create a corresponding institute_service column in dete_resignations.
dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
dete_resignations['institute_service'].value_counts()
Looking at the institute_service field from the DETE dataset, we observe that 42% of the employees worked at most 5 years.
dete_resignations['institute_service'].isnull().sum()
tafe_resignations['institute_service'].value_counts(dropna = False)
As we can see from the values above, employees with relatively less experience have resigned in higher proportion: a large share of the resigning employees in TAFE had worked for less than 2 years. Also, when we combine the datasets, institute_service in TAFE should have the same format as in DETE. We will deal with this later in this notebook.
After cleaning and preparing our data for analysis, we now have to classify employees as dissatisfied and then filter them according to their experience to answer our question. We will keep working with dete_resignations and tafe_resignations, as these datasets contain only the resigned employees.
In dete_resignations, I think the following columns contribute to the employee's decision:
dete_resignations.loc[:,'job_dissatisfaction':'workload']
Since these columns hold boolean values, if any one of them is True for a row, that employee will be classified as 'dissatisfied'.
Before proceeding, I think we should not include the columns from maternity/family to traumatic_incident in the dissatisfied classification, as they are not related to the institute.
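To see the row-wise `any` logic in isolation, here is a minimal sketch with hypothetical factor columns (the missing-value behavior of `skipna=False` varies across pandas versions, so only clear-cut rows are shown):

```python
import pandas as pd

# Hypothetical contributing-factor flags for three employees
factors = pd.DataFrame({
    'lack_of_recognition': [True, False, False],
    'workload': [False, False, True],
})

# A row counts as dissatisfied if any factor column is True
dissatisfied = factors.any(axis=1)
```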
# we will create a Boolean column 'dissatisfied'
# the role of 'any' is to return whether any element in the row is True
# (the DETE factor columns already hold boolean values, so no mapping is needed here)
dete_resignations['dissatisfied'] = dete_resignations[['job_dissatisfaction',
'dissatisfaction_with_the_department', 'physical_work_environment',
'lack_of_recognition', 'lack_of_job_security', 'work_location',
'employment_conditions', 'work_life_balance',
'workload']].any(axis=1, skipna=False)
dete_resignations_up = dete_resignations.copy()
dete_resignations_up['dissatisfied'].value_counts(dropna = False)
dete_resignations_up['dissatisfied'].head()
dete_resignations_up['dissatisfied'].value_counts()
Similarly, for tafe_resignations
we will include two columns:
Contributing Factors. Dissatisfaction
Contributing Factors. Job Dissatisfaction
def update_vals(x):
if x == '-':
return False
elif pd.isnull(x):
return np.nan
else:
return True
tafe_resignations['dissatisfied'] = tafe_resignations[['factors_diss','factors_job_diss']].applymap(update_vals).any(axis=1, skipna=False)
tafe_resignations['dissatisfied'].head()
tafe_resignations_up = tafe_resignations.copy()
tafe_resignations_up['dissatisfied'].value_counts(dropna = False)
We have performed various actions to clean and filter our data; now we are ready to merge it. Also, as we practiced in earlier lessons, while merging it is better that each dataset keeps its own identity, so we give each dataset an identity by dedicating a column with its title.
dete_resignations_up['institute'] = 'DETE'
tafe_resignations_up['institute'] = 'TAFE'
# combining
combined = pd.concat([dete_resignations_up, tafe_resignations_up], ignore_index=True)
combined.shape
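`pd.concat` stacks the rows, aligns on column names, fills columns missing from one frame with NaN, and `ignore_index` renumbers the result; a minimal sketch with made-up rows:

```python
import pandas as pd

# Two hypothetical frames sharing only some columns
a = pd.DataFrame({'id': [1], 'dissatisfied': [True], 'institute': ['DETE']})
b = pd.DataFrame({'id': [2], 'institute': ['TAFE']})

# Rows are stacked; 'dissatisfied' is NaN for the frame that lacks it
both = pd.concat([a, b], ignore_index=True)
```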
combined_null = combined.isnull().sum()
combined_null
# columns with 400 or more null values are candidates to drop
columns_drop = combined_null[combined_null >= 400]
# keep only the columns that have at least 200 non-null values
combined = combined.dropna(thresh = 200, axis = 1)
combined
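The `thresh` argument of `dropna` keeps a column only if it has at least that many non-null values; a minimal sketch with hypothetical columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'mostly_full': [1, 2, 3, 4],
    'mostly_empty': [np.nan, np.nan, np.nan, 4],
})

# Keep only the columns with at least 3 non-null values
kept = df.dropna(thresh=3, axis=1)
```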
combined['institute_service']
# the column is pretty inconsistent in its values (due to tafe_survey)
type(combined['institute_service'][2])
# the column mixes different dtypes, so a direct string operation is impossible
type(combined['institute_service'][640])
combined.head()
We would rather classify the employees by ranges of years of experience, as that is an effective way to form classes. Our classification is as follows:
def service_category(val):
if pd.isna(val):
return np.nan
elif val < 3:
return 'New'
elif val < 7:
return 'Experienced'
elif val < 11:
return 'Established'
else:
return 'Veteran'
# combined['institute_service'].astype('str')
combined['institute_service'].value_counts(dropna = False)
combined['institute_service'] = combined['institute_service'].astype('str')
type(combined['institute_service'][0])
# cleaning column entries / transforming them into the same format
def clean(val):
    # after astype('str'), missing values appear as the string 'nan'
    if pd.isna(val) or val == 'nan':
        return np.nan
    if '-' in val:
        # ranges like '1-2' or '11-20': keep the lower bound
        return float(val.split('-')[0])
    # entries like 'Less than 1 year' or '5.0': keep the first token
    return val.split()[0]
combined['institute_service'] = combined['institute_service'].apply(clean)
combined['institute_service']
combined.loc[combined['institute_service'] == 'Less', 'institute_service'] = 1.0
combined.loc[combined['institute_service'] == 'More', 'institute_service'] = 20.0
combined['institute_service'] = combined['institute_service'].astype('float')
combined['institute_service'].value_counts(dropna = False)
combined['service_cat'] = combined['institute_service'].apply(service_category)
combined['service_cat'].value_counts(dropna=False)
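As a design note, the same bucketing can also be expressed with `pd.cut` instead of an apply-based classifier; a sketch assuming the same boundaries (New below 3, Experienced below 7, Established below 11, Veteran otherwise):

```python
import pandas as pd

years = pd.Series([1.0, 5.0, 9.0, 20.0])

# Left-closed bins matching the service_category thresholds
cats = pd.cut(years,
              bins=[-float('inf'), 3, 7, 11, float('inf')],
              labels=['New', 'Experienced', 'Established', 'Veteran'],
              right=False)
```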
combined['age']
def convert_range(val):
if pd.notna(val):
if ' ' in val:
numbers = [int(i) for i in val.split() if i.isdigit()]
else:
numbers = [int(i) for i in val.split('-') if i.isdigit()]
return int(sum(numbers) / len(numbers))
return np.nan
combined['age'] = combined['age'].apply(convert_range)
combined['age'].value_counts(dropna = False)
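To make the range-to-midpoint conversion concrete, here is a self-contained sketch of the same convert_range logic with a couple of representative inputs:

```python
import numpy as np
import pandas as pd

def convert_range(val):
    # Average the numeric endpoints of a range like '41-45';
    # single-bound labels like '56 or older' collapse to that bound
    if pd.notna(val):
        if ' ' in val:
            numbers = [int(i) for i in val.split() if i.isdigit()]
        else:
            numbers = [int(i) for i in val.split('-') if i.isdigit()]
        return int(sum(numbers) / len(numbers))
    return np.nan
```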
def age_category(val):
if pd.isna(val):
return np.nan
elif val < 31:
return 'Junior'
elif val < 46:
return 'Middle'
else:
return 'Senior'
combined['age'] = combined['age'].astype('float')
combined['age_cat'] = combined['age'].apply(age_category)
combined['age_cat'].value_counts(dropna=False)
Since we have already used some columns to decide dissatisfaction, we no longer need them. Hence, we will identify and keep only the columns we need for our analysis.
combined.info()
# the remaining columns are needed for our analysis; we chose them according to the question we are
# trying to answer. We also dropped the boolean factor columns, since we used their information to
# create the new 'dissatisfied' column, which is more precise
combined = combined[['id', 'cease_date', 'position', 'employment_status', 'gender', 'age',
'institute_service', 'dissatisfied','institute','service_cat','age_cat']]
combined.sample(5)
combined
combined.isnull().sum()
index = combined['dissatisfied'].isna()
index
combined.loc[index,'dissatisfied']
About 1% of the values in dissatisfied are null, so we can fill them with the majority value, False.
combined.loc[index,'dissatisfied'] = False
#combined['dissatisfied'] = combined['dissatisfied'].fillna(False)
combined['dissatisfied'].value_counts(dropna = False)
combined.isnull().sum()
In the columns age (with its corresponding age_cat) and institute_service (with its corresponding service_cat), we observe a significant number of missing values.
# checking where the columns intersect
age_service_null_bool = (combined['service_cat'].isnull()) & (combined['age_cat'].isnull())
age_service_null = combined[age_service_null_bool]
age_service_null
These rows are null in all of these columns, so even if we wanted to fill in the missing values, we can't, as the employee's data is missing across the board.
combined.drop(combined[age_service_null_bool].index, inplace=True)
combined.reset_index(drop=True, inplace=True)
combined
We have now eliminated the rows whose age_cat and service_cat were both null. Since our question depends on these columns, we will try to fill the remaining missing values. Our next step is to fill each of these columns using information from the other.
service_null = combined[combined['service_cat'].isnull()]
service_null
To fill in the values in service_cat, we will use the most frequent service group within the corresponding age group.
Our boolean indexing will also require the institute to be 'DETE', because even though the data is combined, the correspondence between age and service category depends on the individual institution.
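The "most frequent service_cat per age group" lookup can also be computed directly with groupby; a sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical employees with age and service categories
df = pd.DataFrame({
    'age_cat': ['Junior', 'Junior', 'Junior', 'Middle', 'Middle'],
    'service_cat': ['Experienced', 'Experienced', 'New', 'Established', 'Established'],
})

# Most frequent service_cat within each age_cat
most_common = df.groupby('age_cat')['service_cat'].agg(lambda s: s.value_counts().idxmax())
```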
age_cat_list = ['Junior','Middle','Senior']
for i in age_cat_list:
print(i,":")
print(combined[(combined['institute'] == 'DETE')&(combined['age_cat'] == i)]['service_cat'].value_counts())
service_null['institute'].unique()
Since DETE is the only institute with missing service_cat values, we don't have to search for corresponding age groups in TAFE.
From above, we can say that:
max_values = ['Experienced', 'Established', 'Veteran']
for i,j in zip(age_cat_list,max_values):
sort_bool = (combined['institute'] == 'DETE') & (combined['age_cat'] == i) & (combined['institute_service'].isnull())
combined.loc[sort_bool,'service_cat'] = j
combined['service_cat'].isnull().sum()
combined[combined['age_cat'].isnull()]
We can continue similarly by finding the most frequent age category for each service category. Again, we will only check DETE.
service_cat_list = ['Veteran', 'New']
for i in service_cat_list:
print(i,":")
print(combined[(combined['institute'] == 'DETE')&(combined['service_cat'] == i)]['age_cat'].value_counts(dropna = False))
max_values = ['Senior','Middle']
for i,j in zip(service_cat_list,max_values):
sort_bool = (combined['institute'] == 'DETE') & (combined['service_cat'] == i) & (combined['age'].isnull())
combined.loc[sort_bool,'age_cat'] = j
combined[combined['age_cat'].isnull()]
combined.info()
combined.isnull().sum()
There are still some missing values in columns such as cease_date, position, employment_status, gender, age, and institute_service. But these values are not as important as dissatisfied, service_cat, and age_cat, and since those have 0 null values, we can move ahead with our analysis.
One could argue that we need to handle the null values of institute_service, but we really don't, as we have already used it to build the service categories.
Now we'll look at the number of dissatisfied employees who left the institute at different stages of their career. We'll sort the values in descending order and then visualize the results.
combined['dissatisfied'].value_counts()
diss_sc = combined.pivot_table(index='service_cat', values='dissatisfied', aggfunc=np.sum)['dissatisfied'].sort_values(ascending=False).to_frame()
DETE_sc = combined[combined['institute'] == 'DETE'].reset_index(drop=True)
diss_sc['DETE'] = DETE_sc.pivot_table(index='service_cat', values='dissatisfied', aggfunc=np.sum)['dissatisfied'].sort_values(ascending=False)
TAFE_sc = combined[combined['institute'] == 'TAFE'].reset_index(drop=True)
diss_sc['TAFE'] = TAFE_sc.pivot_table(index='service_cat', values='dissatisfied', aggfunc=np.sum)['dissatisfied'].sort_values(ascending=False)
diss_sc
diss_sc = diss_sc.rename(columns = {'dissatisfied':'TOTAL'})
diss_sc.index
# can change the name attribute
diss_sc.index.name = 'Service Category'
diss_sc
diss_sc = diss_sc.fillna(0)
diss_sc['TAFE'] = diss_sc['TAFE'].astype('int')
diss_sc
diss_sc.plot(kind='bar', rot = 0)
plt.title('Dissatisfied by Service Category', fontsize=20)
plt.xlabel('Service category', fontsize=12)
plt.ylabel('Dissatisfied Employees Number', fontsize=12)
plt.legend()
# use sns.despine() and frameon = False to remove any borders
Analysing the TOTAL column, we can see that the Veteran, New, and Experienced categories contribute in almost the same proportion, about 30% each.
In the DETE institute, the largest share of dissatisfied employees are Veteran, at 42.28%, while in the TAFE institute the largest share are New, at 55%.
In every service category except New, DETE dominates as the institute with the higher number of employees leaving due to dissatisfaction.
Overall, 73% of the employees who left because of dissatisfaction were from the DETE institute, while only 27% were from the TAFE institute.
combined[combined['institute']== 'TAFE']
fig = plt.figure(figsize=(18,10), dpi=1600)
ax1 = plt.subplot2grid((2,4),(0,0))
plt.pie(diss_sc['DETE'], labels = diss_sc.index, colors = ('b','g','y','r'))
plt.title('DETE')
ax2 = plt.subplot2grid((2, 4), (0, 1))
plt.pie(diss_sc['TAFE'], labels = diss_sc.index, colors = ('b','g','y','r'))
plt.title('TAFE')