In this project, we'll work with exit surveys from employees of two departments, the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia.
We are going to find answers to the following questions by cleaning and analysing the combined data from both surveys:
Below is a preview of a couple columns we'll work with from the dete_survey.csv:
Below is a preview of a couple columns we'll work with from the tafe_survey.csv:
We are going to use pandas and NumPy libraries and read the surveys files:
import pandas as pd
import numpy as np
dete_survey = pd.read_csv("dete_survey.csv")
tafe_survey = pd.read_csv("tafe_survey.csv")
Let's look at the data:
dete_survey.info()
tafe_survey.info()
dete_survey.head(5)
tafe_survey.head(4)
dete_survey["Employment Status"].value_counts(dropna = False)
dete_survey["Age"].value_counts(dropna = False)
tafe_survey['Employment Type. Employment Type'].value_counts(dropna = False)
tafe_survey['CurrentAge. Current Age'].value_counts(dropna = False)
dete_survey.isnull().sum()
tafe_survey.isnull().sum()
By looking at the above information we have figured out the following issues:
To solve the above issues we are going to start data cleaning.
dete_survey = pd.read_csv("dete_survey.csv", na_values = "Not Stated")
We can delete the columns from both dataframes that won't be used in our analysis.
dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis = 1)
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis = 1)
Looking at the column names we are going to use the following criteria to update the column names:
These criteria are applied to dete.
dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.strip().str.replace(' ', '_')
print(dete_survey_updated.columns)
Some columns' names are so long and we need to rename them in tafe:
tafe_survey_updated.rename(columns = {'Record ID': 'id',
'CESSATION YEAR': 'cease_date',
'Reason for ceasing employment': 'separationtype',
'Gender. What is your Gender?': 'gender',
'CurrentAge. Current Age': 'age',
'Employment Type. Employment Type': 'employment_status',
'Classification. Classification': 'position',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'},
inplace = True)
print(tafe_survey_updated.columns)
Extract Resignated employees
Our goal is to answer this question:
So we are going to look at separationtype column in both dataframes and look at the data that the separation type contains 'Resignation'
dete_survey_updated['separationtype'].value_counts(dropna = False)
tafe_survey_updated['separationtype'].value_counts(dropna = False)
dete_resignations = dete_survey_updated[dete_survey_updated['separationtype'].str.contains('Resignation')].copy()
tafe_resignations = tafe_survey_updated.loc[tafe_survey_updated['separationtype'].str.contains('Resignation', na = False)].copy()
Two new dataframes have been created that only contains data when separationtype is a kind of Resignation. Those dataframes are dete_resignations and tafe_resignations
We need also to check if data we want to use is corrupted. We start checking cease_date and dete_start_date in dete. cease date must be after start date and start date can not be before 1940.
looking at the cease_date column, we figure out that it needs to be cleaned. Some dates are only year while the others contain months.
dete_resignations['cease_date'].value_counts(dropna = False)
Year is sufficient for the cease_date so we extract Year from cease_date and save it cease_year column:
dete_resignations['cease_year'] = dete_resignations['cease_date'].str.extract(
r'(?P<Month>[0-1])?/?(?P<Year>[0-9]{4})', expand=True)['Year'].astype(float)
dete_resignations['cease_year'].value_counts(dropna = False).sort_index(ascending = True)
tafe_resignations['cease_date'].value_counts(dropna = False).sort_index(ascending = True)
Let's Look at dete_start_date and verify that all start dates are before cease dates.
dete_resignations['dete_start_date'].value_counts(dropna = False).sort_index(ascending = True)
dete_resignations[dete_resignations['cease_year'] < dete_resignations['dete_start_date']]
In dete dataframe there is no start year less than 1963 and no data that the start year is bigger than the cease date, but there are some Nan values in both columns. So it seems that the data is not corrupted.
The first goal of this project is getting answer to the following question.
To answer the above question we need to calculate the length of time an employee spent in a workplace which is referred to as the years of Service and also we need to find out more about dissatisfaction, since there are some different columns in both dataframes which give us information about it.
In tafe dataframe there is the institute_service column but in dete we should calculate it by subtracting start year from cease year.
dete_resignations["institute_service"] = dete_resignations["cease_year"] - dete_resignations["dete_start_date"]
Below are the columns we'll use to categorize employees as "dissatisfied" from each dataframe. tafe survey:
dete survey:
If the employees indicated any of the factors above caused them to resign, we'll mark them as dissatisfied in a new column.
tafe dissatisfied column: If any of the two columns is True, dissatisfied is True and if both columns are False, dissatisfied will be False. If both columns are Nan, the result will be Nan too.
First, let's look at the values of those columns and change them to True, False and Nan:
tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts(dropna = False)
tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts(dropna = False)
def update_vals(val):
if pd.isnull(val):
return np.nan
elif val == '-':
return False
else:
return True
tafe_resignations_Dissatisfactions = tafe_resignations[['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction']].applymap(update_vals)
At this stage we can make dissatisfied column for tafe:
tafe_resignations["dissatisfied"] = tafe_resignations_Dissatisfactions.any(axis = 1, skipna = False)
tafe_resignations_up = tafe_resignations.copy()
dete dissatisfied column
Since the values of the columns related to dissatisfaction are True and False we can make dissatisfied column directly and no cleaning is required.
columns = ['job_dissatisfaction',
'dissatisfaction_with_the_department',
'physical_work_environment',
'lack_of_recognition',
'lack_of_job_security',
'work_location',
'employment_conditions',
'work_life_balance',
'workload']
dete_resignations_Dissatisfactions = dete_resignations[columns]
dete_resignations["dissatisfied"] = dete_resignations_Dissatisfactions.any(axis = 1, skipna = False)
dete_resignations_up = dete_resignations.copy()
The result is two new dataframe, dete_resignations_up and tafe_resignations_up, with dissatisfied column.
We have already renamed the columns, drpped any data not needed for our analysis, verified the quality of data, created a new institute_service column and created a new column indicating if an employee resigned because they were dissatisfied in some way. Now it is the time of aggregating the data according to the institute_service column.
First, we add a column named institute, to each dataframe that will allow us to easily distinguish between the two dataframes. Then we wil combine the dataframes.
dete_resignations_up['institute'] = "DETE"
tafe_resignations_up['institute'] = "TAFE"
combined = pd.concat([dete_resignations_up, tafe_resignations_up])
combined['institute'].value_counts()
After combing data to combined dataframe, we will drop the columns with less than 500 not null values. The result will be saved in combined_updated dataframe.
combined_updated = combined.dropna(axis = 1, thresh = 500)
print("The number of combined columns:" + str(len(combined.columns)))
print("The number of combined_updated columns:" + str(len(combined_updated.columns)))
44 columns have been reamoved.
By looking at the values of institute_service column we figure out that this column is tricky to clean because it currently contains values in a couple of different forms:
combined_updated['institute_service'].value_counts()
We are going to categorized this column according to the following definition:
First, we are going to change all the values to a year by changing the ranges like 1-2 to the first number 1 and extract the years from 'Less than 1 year' and 'More than 20 years'.
combined_updated = combined_updated.copy()
combined_updated['institute_service'] = combined_updated['institute_service'].astype('str')
combined_updated['institute_service'] = combined_updated['institute_service'].str.replace('Less than 1 year', '1.0').str.replace('More than 20 years', '20.0')
combined_updated['institute_service']= combined_updated['institute_service'].astype(str).str.extract(
r'(?P<Year>\d+)[-|.]?\d+?', expand=True)['Year'].astype(float)
combined_updated['institute_service'].value_counts(dropna = False).sort_index()
Next, we'll map each value to one of the career stage definitions above.
def categorize(val):
if pd.isnull(val):
return np.nan
elif val<3:
return 'New'
elif val<=6:
return 'Experienced'
elif val<=10:
return 'Established'
else:
return 'Veteran'
combined_updated['service_cat'] = combined_updated['institute_service'].apply(categorize)
combined_updated['service_cat'].value_counts(dropna = False)
We have created a new column 'service_cat' that contains the category of each service.
Before starting the analysis we look at the dissatisfied column again and as there are only 8 NaN values and most of the values of this column is False. We are going to change NaNs to False. This changes can not have significant effect in our result.
combined_updated['dissatisfied'].value_counts(dropna = False)
combined_updated['dissatisfied'].fillna(False, inplace=True)
combined_updated['dissatisfied'].value_counts(dropna = False)
combined_updated['service_cat'].dropna(inplace = True)
pivot_combined_updated = combined_updated.pivot_table(values = 'dissatisfied', index = 'service_cat')
%matplotlib inline
import matplotlib.pyplot as plt
pivot_combined_updated.plot(kind = 'bar')
plt.title('dissatisfied vs service_cat')
Looking at the above bar plot, we find out that New and Experienced employees, who work in a shorter period of time, show less dissatisfaction than Established and Veteran with the longer service years.
Let's see how many people in each career stage resigned due to some kind of dissatisfaction.
grouped = combined_updated.groupby(['service_cat', 'dissatisfied'])['service_cat'].agg('count')
print(grouped)
The above data shows that Established and Veteran have almost the same amount for dissatisfied and satissfied but Expereinced and New show more amount for being satisfied and also resigned.
The second question of the project is mentioned her again:
Let's clean the age column and figure out how many people in each age group resgined due to some kind of dissatisfaction.
combined_updated['age'].value_counts(dropna = False)
It is possible to clean it by changing ranges for example '21 25' can be changed to '21-25', and also it is better to change '56-60' and '61 or older' to '56 or older'.
combined_updated['age'] = combined_updated['age'].str.replace(
' ', '-').str.replace('56-60', '56 or older').str.replace(
'61 or older', '56 or older')
combined_updated['age'].value_counts(dropna = False).sort_index()
pv_age_dissatisfied = combined_updated.pivot_table(values = 'dissatisfied', index = 'age')
pv_age_dissatisfied.plot(kind = 'bar')
plt.title('dissatisfied vs age')
grouped_age = combined_updated.groupby(['age', 'dissatisfied'])['age'].agg('count')
print(grouped_age)
The plot and information above show that the dissatisfaction of the resignated employees are more in the range of 26-30 and above 50 years old. The dissatisfaction of the younger employees (under 25) is less than the older employees (older than 50). Even in the ranges with higher dissatisfaction, it is difficulat to say that the dissatisfaction caused resignation because less than 50% of resignating employees were dissatisfied.
As the last step we are going to analyze dete and tafe surveys separately and see if there is any differenes between them.
pv_service_institute = combined_updated.pivot_table(
values = 'dissatisfied', index = ['service_cat', 'institute'])
ax = pv_service_institute.plot(kind = 'bar', legend = False)
ax.set_ylim(0, 1)
ax.set_ylabel("Percentage Dissatisfied")
plt.title('Dissatisfied vs service_cat, institute')
plt.show()
The highest dissatisfaction are in DETE Established and Veterans. More than 50 percent of them are dissatisfied that can cause the reasignations. TAFE dissatisfaction is less than DETE in all categories and less than 35% so we can not say that dissatisfaction is the reason of resignation in TAFE.
pv_age_institute = combined_updated.pivot_table(
values = 'dissatisfied', index = ['age', 'institute'])
ax = pv_age_institute.plot(kind = 'bar', legend = False)
ax.set_ylim(0, 1)
ax.set_ylabel("Percentage Dissatisfied")
plt.title('Dissatisfied vs service_cat, age')
plt.show()
DETE older employees (more than 50) and the employees between 26 and 35 show dissatisfaction of higher than 50% which can be the cause of resignation. Except for'20 or youner' employees which there is no record in DETE, for all ranges of ages DETE has more dissatisfaction than TAFE. Neither younger nor older TAFE employees resignation is due to some kind of dissatisfaction since the dissatisfied percentages are less than 35%.
pv_institute = combined_updated.pivot_table(
values = 'dissatisfied', index = 'institute')
ax = pv_institute.plot(kind = 'bar', legend = False)
ax.set_ylim(0, 1)
ax.set_ylabel("Percentage Dissatisfied")
plt.title('Dissatisfied vs institute')
plt.show()
The last plot demonstrates that DETE percentage of dissatisfaction is significantly higher than TAFE and we can say that resigning is due to dissatisfaction in DETE but not in TAFE.
In this project two data sets have been combined to get the answers to the following questions:
By cleaning data and analyzing them the following results have been acheived:
Analyzing Dete and Tafe separatly shows that resigning is due to dissatisfaction in Dete but not Tafe. The dissatisfaction of these employees can cause resignation according to Dete survey: