In this guided project, we'll work with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. You can find the TAFE exit survey here and the survey for the DETE here. We've made some slight modifications to these datasets to make them easier to work with, including changing the encoding to UTF-8 (the original ones are encoded using cp1252.)
In this project, we'll play the role of data analyst and pretend our stakeholders want to know the following:
Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
Are younger employees resigning due to some kind of dissatisfaction? What about older employees?
A data dictionary wasn't provided with the dataset. For this project, we'll use our general knowledge to define the columns.
Below is a preview of a couple columns we'll work with from the dete_survey.csv:
Below is a preview of a couple columns we'll work with from the tafe_survey.csv:
Importing the libraries we will be working with in this project
import pandas as pd
import numpy as np
Reading the csv files into pandas
dete_survey = pd.read_csv('dete_survey.csv')
tafe_survey = pd.read_csv('tafe_survey.csv')
Exploring the DETE dataset
dete_survey.info()
dete_survey.head()
# Checking missing data
dete_survey.isnull().sum()
Exploring the TAFE dataset
tafe_survey.info()
tafe_survey.head()
First Observations:
Fixing the first item "Not Stated" by reading the csv file again and passing the parameter na_values to read Not Started as NaN.
#read in the data again
dete_survey = pd.read_csv('dete_survey.csv', na_values='Not Stated')
#checking if the changes were correctly applied
dete_survey.head()
Dropping the columns from each dataframe that we will not use in our analysis
dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis=1)
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis=1)
Now both dataframes have only the columns that are relevant for this project. Let's confirm that by printing the first 5 rows.
dete_survey_updated.head()
tafe_survey_updated.head()
Exploring the Columns names
Because we eventually want to combine them, we'll have to standardize the column names. We will start by doing the following in the dete_survey_updated dataframe:
#cleaning strings
dete_survey_updated.columns = dete_survey_updated.columns.str.replace(" ", "_").str.strip().str.lower()
#Confirming the changes were successfully applied
dete_survey_updated.head()
Updating the names of some columns in the tafe_survey_updated dataset to match the dete_survey_updated
#Mapping columns
col_name_mapping = ({'Record ID': 'id', 'CESSATION YEAR': 'cease_date',
'Reason for ceasing employment': 'separationtype',
'Gender. What is your Gender?': 'gender',
'CurrentAge. Current Age': 'age',
'Employment Type. Employment Type': 'employment_status',
'Classification. Classification': 'position',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'
})
tafe_survey_updated.rename(columns=col_name_mapping, inplace=True)
#Printing first 5 rows
tafe_survey_updated.head()
Removing more of the data that we don't need for our analysis
The column separationtype contains a couple of different separation types. We will only analyze survey respondents who resigned. We're going to select the rows of respondents who have a Resignation separation type.
#Counting values in the separationtype column - DETE
dete_survey_updated['separationtype'].value_counts()
For dete_survey_updated, as it has 3 different types of resignation, we will use a regex to extra all rows that has the word "Resignation" as part of the string value in the column separationtype.
#Counting values in the separationtype column - TAFE
tafe_survey_updated['separationtype'].value_counts()
#we are adding the df.copy() to the end of the new dataframe to avoid the SettingWithCopy warning
pattern = r"[Rr]esignation"
dete_resignations = dete_survey_updated[dete_survey_updated['separationtype'].str.contains(pattern)].copy()
dete_resignations.head()
#We have to clean from Nan values to be able to do a boolean index
tafe_survey_updated_na = tafe_survey_updated[~tafe_survey_updated['separationtype'].isnull()]
tafe_resignations = tafe_survey_updated_na[tafe_survey_updated_na['separationtype'].str.contains(pattern)].copy()
#Priting first 5 rows
tafe_resignations.head()
Checking for inconsistent data in the columns cease_date and dete_start_date on both datasets
we'll focus on verifying that the years in the cease_date and dete_start_date columns make sense.
Since the cease_date is the last year of the person's employment and the dete_start_date is the person's first year of employment, it wouldn't make sense to have years after the current date.
Given that most people in this field start working in their 20s, it's also unlikely that the dete_start_date was before the year 1940.
dete_resignations
#Column cease_date
dete_resignations['cease_date'].value_counts()
Some of the values have month and year combined. We will remove the month and leave only the year
#Extracting the year and assigning it back to the series.
pattern = r"(?P<Years>[1-2][0-9]{3})"
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.extract(pattern)
#Converting dtype to float
dete_resignations['cease_date'] = dete_resignations['cease_date'].astype(float)
# checking if the changes were correctly applied
dete_resignations['cease_date'].value_counts()
#Column dete_start_date
dete_resignations['dete_start_date'].value_counts().sort_index(ascending=True)
The data in those columns seem fine, between the date range we expected - the oldest year is 1963 and there is no date greater than the current date
tafe_resignations Column cease_date
tafe_resignations['cease_date'].value_counts()
The column cease_date in tafe_resignation also looks okay and matches the dtype from the dataset dete_resignations.
In the Human Resources field, the length of time an employee spent in a workplace is referred to as their years of service.
The tafe_resignations dataframe already contains a "service" column, which we renamed to institute_service. In order to analyze both surveys together, we'll have to create a corresponding institute_service column in dete_resignations.
Let's check if we have data that can be used to calculate the length of time the employee spent in their workplace before moving on.
dete_resignations.head()
We can calculate the years of service for dete_resignations by subtracting the cease_date from the dete_start_date. Let's go ahead and create the institute_service column
dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
#Checking new column in our dataset
dete_resignations.head()
Analyzing the columns related to job satisfaction on both dataframes
Checking the values in the columns 'Contributing Factors. Dissatisfaction' and 'Contributing Factors. Job Dissatisfaction' in the tafe_resignations dataframe.
tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts()
tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts()
Creating column dissatisfied for tafe_resignations
#Creating a function to update values in both columns
def update_vals(value):
if pd.isnull(value):
return np.nan
elif value == '-':
return False
else:
return True
Applying the function we created to update the values in the new column 'dissatisfied' for TAFE dataset
cols = ['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction']
#creating new column dissatisfied
tafe_resignations['dissatisfied'] = tafe_resignations[cols].applymap(update_vals).any(axis=1, skipna=False)
#creating a copy of the dataset to avoid SettingWithCopy warn
tafe_resignations_up = tafe_resignations.copy()
#Displaying value counts
tafe_resignations_up['dissatisfied'].value_counts()
Creating a dissatisfied column to dete_resignations
dete_resignations['dissatisfied'] = dete_resignations[['job_dissatisfaction',
'dissatisfaction_with_the_department','physical_work_environment',
'lack_of_recognition','lack_of_job_security','work_location',
'employment_conditions','work_life_balance',
'workload']].any(axis = 1,skipna = False)
dete_resignations_up = dete_resignations.copy()
dete_resignations_up['dissatisfied'].value_counts(dropna= False)
We've successfully created the column dissatisfied for both datasets with values True, False and NaN. If the employee indicated any of the factors(column names listed above in the code), we will mark them as dissatisfied in the new column.
Now, we're finally ready to combine our datasets! Our end goal is to aggregate the data according to the institute_service column, so when you combine the data, think about how to get the data into a form that's easy to aggregate.
First, let's add a column institute to each dataframe that will allow us to easily distinguish between the two.
dete_resignations_up['institute'] = 'DETE'
tafe_resignations_up['institute'] = 'TAFE'
combined = pd.concat([dete_resignations_up,tafe_resignations_up],ignore_index = True)
#Printing fist 5 rows
combined.head()
#Quick Exploration of the new dataset
combined.info()
#Check for not null values
combined.notnull().sum().sort_values()
Dropping any columns with less than 500 non null values
combined_updated = combined.dropna(axis = 1, thresh = 500).copy()
combined_updated.head()
To recap, we've accomplished the following so far:
Before we can proceed with our analysis, we need to clean up the institue_service column. It contains values in different formats.
#Check for unique values
combined_updated['institute_service'].unique()
To analyze the data, we'll convert these numbers into categories:
Extracting the years of service from each value in the institute_service column.
#Convert column to string type
combined_updated['institute_service'] = combined_updated['institute_service'].astype(str)
#Changing the type to str to extract the year and then converting to float
combined_updated['institute_service'] = combined_updated['institute_service'].str.extract(r'(\d+)').astype('float')
#Check if changes applied correctly
combined_updated['institute_service'].unique()
Creating a function to map each value to one of the carrier stages definitions that categorize employees according to the amount of years spent in their workplace:
# Define function
def map_years_to_category (years):
if pd.isnull(years):
return 'Unknown' # in case years-of-service is unknown, let's make it 'Unknown'
elif years < 3:
return ('New')
elif years < 7:
return ('Eperienced')
elif years < 11:
return ('Established')
else:
return ('Veteran')
Creating a new column 'service_cat' to add the carrier categories values
combined_updated['service_cat'] = combined_updated['institute_service'].apply(map_years_to_category)
#Check if changed applied
combined_updated['service_cat'].unique()
Handling missing values in the column dissatisfied
#Checking the number of True, False and NaN in the dissatisfied column
combined_updated['dissatisfied'].value_counts(dropna=False)
#Replacing the NaN values with the value that occurs most frequently - False.
combined_updated['dissatisfied'].fillna(False, inplace=True)
#double checking the changes
combined_updated['dissatisfied'].value_counts(dropna=False)
Using the pd.pivot_table method to calculate the percentage of dissatisfied employees in each service_cat group.
pv_combined_updated = pd.pivot_table(combined_updated, index='service_cat', values='dissatisfied')
pv_combined_updated.head()
#Plotting the results
%matplotlib inline
pv_combined_updated.plot(kind = 'bar', ylim = (0,1), title = 'Dissatisfied % per Career Stage', legend=False)
Observations:
There seems to be a direct correlation between years of service and satisfaction level, the longer an employee works for the company, the higher is their dissatisfaction.
Calculating how many people in each age group resigned due to dissatisfaction
combined_updated['age'].unique()
The column age needs cleaning. Some of the age ranges are missing the dash and have double space and the range 56 or order will be renamed to 56-60. As there is a range 61 or older, it is reasonable to assume that 56 or older actually means 56-60.
We will also drop the nan values.
# Cleaning extra space and adding the -
combined_updated['age'] = combined_updated['age'].str.replace(" ","-").str.replace('56 or older', '56-60')
combined_updated['age'].value_counts(dropna=False).sort_index()
#Creating a pivot table to calculate the percentage of dissatisfaction by age group
pv_age = pd.pivot_table(combined_updated, index='age', values='dissatisfied')
pv_age
pv_age.plot(kind = 'bar', ylim = (0,0.7), title = 'Dissatisfaction by Age Group', legend=False)
Employees over 60 years old are the majority - 52.1% - of the dissatisfied age groups.
Let's now analyze each survey separately to see what institute DETE or TAFE had more employees leaving due to job dissatisfaction
pv_dete_tafe = pd.pivot_table(combined_updated, index='service_cat', columns='institute', values='dissatisfied', margins=True)
pv_dete_tafe
pv_dete_tafe.plot(kind = 'bar', ylim = (0,1), title = 'Dissatisfaction by Employee Group and Institute')
Let's conclude our analysis by answers the two initial questions we asked at the beginning of this project:
1 - Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
The data shows that the longer the years of service the higher is the dissatisfaction. If you take for instance employees with 7 - 10 years of work with, 51.6% of them pointed job dissatisfaction as a reason for their resignation. For the Veteran group (11+ years), 48.5% of resignations were due to dissatisfaction.
2 - Are younger employees resigning due to some kind of dissatisfaction? What about older employees?
We uncovered that the age group 61+ had the highest disssatisfaction rate, 52.1% of them left the company due to dissatisfaction. With the expception of the range 26-30, the younger groups' dissatisfaction was below 40%
Comparing both institutes DETE and TAFE: The DETE institute has the highest rate of employees leaving due to job dissatisfaction, 47.9%. For the TAFE institute, only 26.7 former employees pointed that they were leaving due to not being satisfied.