In this project, we'll clean and analyse exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia.

We'll play the role of data analysts and pretend our stakeholders want to know the following:

- Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Read in data for dete
dete_survey = pd.read_csv('dete_survey.csv')
pd.options.display.max_columns = 150 # Avoid truncated output so all columns are visible
dete_survey.head()
dete_survey.info()
# Read in data for tafe
tafe_survey = pd.read_csv('tafe_survey.csv')
tafe_survey.head()
tafe_survey.info()
From an initial observation of both datasets, we discovered that the `dete_survey` dataframe contains `Not Stated` values where data is missing, instead of `NaN`.

First, we'll correct the `Not Stated` values and drop some of the columns we don't need for our analysis.
# Read in the data again, but this time replace `Not Stated` values with `NaN`
dete_survey = pd.read_csv('dete_survey.csv', na_values='Not Stated')
dete_survey.head()
Since there are too many columns in the datasets, we will remove the columns we don't need for our analysis.
# Remove columns we don't need for our analysis
dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis=1)
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis=1)
dete_survey_updated.columns
tafe_survey_updated.columns
Each dataframe contains many of the same columns, but with different names. Below are some of the columns we'd like to use for our final analysis:
| dete_survey | tafe_survey | Definition |
|---|---|---|
| ID | Record ID | An id used to identify the participant of the survey |
| SeparationType | Reason for ceasing employment | The reason why the participant's employment ended |
| Cease Date | CESSATION YEAR | The year or month the participant's employment ended |
| DETE Start Date | | The year the participant began employment with DETE |
| | LengthofServiceOverall. Overall Length of Service at Institute (in years) | The length of the person's employment (in years) |
| Age | CurrentAge. Current Age | The age of the participant |
| Gender | Gender. What is your Gender? | The gender of the participant |
We want to concatenate both datasets, so we'll need to standardise the column names.

We want our column names to meet the following criteria:

- all lowercase
- no trailing whitespace
- spaces replaced with underscores
dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.strip().str.replace(' ', '_')
dete_survey_updated.columns
# Update column names to match the names in dete_survey_updated
mapping = {'Record ID': 'id',
'CESSATION YEAR': 'cease_date',
'Reason for ceasing employment': 'separationtype',
'Gender. What is your Gender?': 'gender',
'CurrentAge. Current Age': 'age',
'Employment Type. Employment Type': 'employment_status',
'Classification. Classification': 'position',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service',
'seperationtype': 'separationtype'}
tafe_survey_updated = tafe_survey_updated.rename(mapping, axis=1)
tafe_survey_updated.head()
dete_survey_updated.head()
dete_survey_updated['separationtype'].value_counts()
tafe_survey_updated['separationtype'].value_counts()
tafe_survey_updated['institute_service'].value_counts().sum()
Recall that our aim is to analyse why employees resigned.

By looking at the `separationtype` columns of both dataframes, we can see several different reasons for ending employment. However, we are only interested in the values that contain the string `Resignation`.

In `dete_survey_updated`, there are three different separation types that contain the string `Resignation`, so we will modify those strings to display only `Resignation`.
dete_survey_updated['separationtype'] = dete_survey_updated['separationtype'].str.split('-').str[0]
dete_survey_updated['separationtype'].value_counts()
# Select only the resignation separation types from each dataframe
dete_resignations = dete_survey_updated[dete_survey_updated['separationtype'] == 'Resignation'].copy()
tafe_resignations = tafe_survey_updated[tafe_survey_updated['separationtype'] == 'Resignation'].copy()
Before we start cleaning and manipulating the rest of our data, we need to verify that the data doesn't contain any major inconsistencies. It may not be possible to catch all of the errors, but by making sure the data seems reasonable to the best of our knowledge, we can stop ourselves from completing a data analysis project that winds up being useless because of bad data.
To ensure our data is of the highest quality possible, we will focus on verifying the columns one by one.
First, we'll look at the `cease_date` and `dete_start_date` columns. Since `cease_date` is the last year of the person's employment and `dete_start_date` is the person's first year of employment, it wouldn't make sense to have years after the current date.

Given that most people in this field start working in their 20s, it's also unlikely that `dete_start_date` was before the year 1940.

Hence, we will eliminate any rows with years higher than the current date or lower than 1940.
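A minimal sketch of such a range check on a toy series (the values and the 2014 cut-off year are made-up assumptions, not taken from the surveys):

```python
import pandas as pd

# Toy stand-in for a column of start years (values are made up)
start_dates = pd.Series([1975.0, 2005.0, 1938.0, 2090.0])

# Keep only plausible years: no earlier than 1940 and no later than
# the survey's collection period (assumed to be 2014 here)
valid = start_dates.between(1940, 2014)
plausible = start_dates[valid]
```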
dete_resignations['cease_date'].value_counts()
# Extract the years and convert them to a float type
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.split('/').str[-1]
dete_resignations['cease_date'] = dete_resignations['cease_date'].astype('float')
dete_resignations['cease_date'].value_counts()
dete_resignations['dete_start_date'].value_counts().sort_values(ascending=False)
tafe_resignations['cease_date'].value_counts().sort_values()
To calculate the years of service in `dete_survey`, we will subtract `dete_start_date` from `cease_date` and create a new column named `institute_service`.
dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
dete_resignations['institute_service'].head()
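One detail worth noting: the subtraction propagates missing values, so any row where either year is `NaN` ends up with an `NaN` service length. A small illustration with made-up years:

```python
import numpy as np
import pandas as pd

# Made-up years mirroring the two survey columns
years = pd.DataFrame({
    'cease_date': [2012.0, 2013.0, 2012.0],
    'dete_start_date': [2005.0, np.nan, 1994.0],
})

# Element-wise subtraction; NaN in either column yields NaN
years['institute_service'] = years['cease_date'] - years['dete_start_date']
```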
Now, we'll identify any employees who resigned because they were dissatisfied. Below are the columns we'll use to categorise employees as `dissatisfied` from each dataframe.

`tafe_survey_updated`:

- `Contributing Factors. Dissatisfaction`
- `Contributing Factors. Job Dissatisfaction`

`dete_survey_updated`:

- `job_dissatisfaction`
- `dissatisfaction_with_the_department`
- `physical_work_environment`
- `lack_of_recognition`
- `lack_of_job_security`
- `work_location`
- `employment_conditions`
- `work_life_balance`
- `workload`

If the employee indicated any of the factors above caused them to resign, we'll mark them as `dissatisfied` in a new column.
In order to do this, we need to:

1. Convert the values in the `Contributing Factors. Dissatisfaction` and `Contributing Factors. Job Dissatisfaction` columns in the `tafe_resignations` dataframe to `True`, `False`, or `NaN` values.
2. If any of those columns contain a `True` value, we'll add a `True` value to a new column named `dissatisfied`.

tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts()
tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts()
# Define a function to replace the values in the contributing columns with either True, False or NaN
def update_vals(x):
    if x == '-':
        return False
    elif pd.isnull(x):
        return np.nan
    else:
        return True
tafe_resignations['dissatisfied'] = tafe_resignations[['Contributing Factors. Dissatisfaction',
'Contributing Factors. Job Dissatisfaction']].applymap(update_vals).any(axis=1, skipna=False)
tafe_resignations_up = tafe_resignations.copy()
tafe_resignations_up['dissatisfied'].value_counts(dropna=False)
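As a quick sanity check on the mapping logic (the factor text below is illustrative), `update_vals` returns `False` for the `-` placeholder, `NaN` for missing values, and `True` for any other entry:

```python
import numpy as np
import pandas as pd

def update_vals(x):
    # '-' marks "not a contributing factor" in the TAFE survey
    if x == '-':
        return False
    elif pd.isnull(x):
        return np.nan
    else:
        return True

checks = [update_vals('-'), update_vals(np.nan),
          update_vals('Job Dissatisfaction')]
```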
dete_resignations['dissatisfied'] = dete_resignations[['job_dissatisfaction',
'dissatisfaction_with_the_department', 'physical_work_environment',
'lack_of_recognition', 'lack_of_job_security', 'work_location',
'employment_conditions', 'work_life_balance',
'workload']].any(axis=1, skipna=False)
dete_resignations_up = dete_resignations.copy()
dete_resignations_up['dissatisfied'].value_counts(dropna=False)
We'll now add an institute column so that we can differentiate the data from each survey after we combine them. Then, we'll combine the dataframes and drop any remaining columns we don't need.
# Add an institute column
dete_resignations_up['institute'] = 'DETE'
tafe_resignations_up['institute'] = 'TAFE'
combined = pd.concat([dete_resignations_up, tafe_resignations_up], ignore_index=True)
combined['institute'].value_counts()
# Verify the number of non null values in each column
combined.notnull().sum().sort_values()
combined.info()
# Drop columns with less than 500 non null values
combined_updated = combined.dropna(thresh=500, axis=1).copy()
combined_updated.info()
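The `thresh` parameter keeps a column only if it has at least that many non-null values. A toy example, using a threshold of 2 instead of 500 (the column names here are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'mostly_full': [1.0, 2.0, 3.0],   # 3 non-null values
    'mostly_empty': [np.nan, np.nan, 9.0],  # 1 non-null value
})

# Drop any column with fewer than 2 non-null values
trimmed = df.dropna(thresh=2, axis=1)
```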
Next, we'll clean the `institute_service` column and categorise employees according to their length of service at the institute.
combined_updated.head()
combined_updated['institute_service'].value_counts()
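One way to sketch that categorisation is `pd.cut` on the numeric years of service. The bin edges and stage labels below are assumptions for illustration, not taken from the surveys:

```python
import pandas as pd

# Made-up service lengths in years
service = pd.Series([1.0, 5.0, 8.0, 20.0])

# Hypothetical career-stage bins: [0, 3), [3, 7), [7, 11), [11, inf)
stages = pd.cut(service,
                bins=[0, 3, 7, 11, float('inf')],
                labels=['New', 'Experienced', 'Established', 'Veteran'],
                right=False)
```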