This project aims to clean and analyze employee exit surveys from the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. Both data sets can be found at the following links: https://data.gov.au/dataset/ds-qld-fe96ff30-d157-4a81-851d-215f2a0fe26d/details?q=dete exit survey and https://data.gov.au/dataset/ds-qld-89970a3b-182b-41ea-aea2-6f9f17b5907e/details?q=fate exit. The goal of the project is to determine whether employees resign because of dissatisfaction. To that end, I will start by inspecting and cleaning the datasets, drop columns not needed for the analysis, verify data quality, create new columns, and perform an initial analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dete_survey = pd.read_csv('dete_survey.csv')
tafe_survey = pd.read_csv('tafe_survey.csv')
dete_survey.info()
tafe_survey.info()
dete_survey.head(10)
tafe_survey.head(5)
dete_survey.isnull().sum()
tafe_survey.isnull().sum()
dete_survey = pd.read_csv('dete_survey.csv', na_values = 'Not Stated')# re-read to change Not Stated to NaN
dete_survey.head(10)
dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis = 1) # To drop any column not important for analysis
dete_survey_updated.head(5)
print(dete_survey_updated.columns) # To check that the columns were dropped
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis = 1)
tafe_survey_updated.head(5)
print(tafe_survey_updated.columns) # To check that the columns were dropped
Before I proceed with identifying missing values: in the region column, missing values are recorded as 'Not Stated' rather than NaN, because pd.read_csv() only recognizes its default set of strings (such as 'NA' and the empty string) as missing. Therefore, I re-read the dete_survey data with the na_values parameter set to 'Not Stated'. Besides that, some of the columns are not necessary for my analysis, so I used the DataFrame.drop() method to drop those columns from both data sets.
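The effect of na_values can be sketched on a made-up two-row CSV (this toy data is not from the real surveys): by default the custom placeholder survives as text, but passing it to na_values converts it to NaN.

```python
import io
import pandas as pd

# Toy CSV with a 'Not Stated' placeholder, mimicking the region column
csv_text = "id,region\n1,Central\n2,Not Stated\n"

default = pd.read_csv(io.StringIO(csv_text))
with_na = pd.read_csv(io.StringIO(csv_text), na_values='Not Stated')

print(default['region'].isnull().sum())  # 0 — 'Not Stated' kept as text
print(with_na['region'].isnull().sum())  # 1 — now parsed as NaN
```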
To update the column names, the DataFrame.columns attribute along with vectorized string methods was used on the dete_survey data, while the DataFrame.rename() method, with a dictionary passed as the columns argument, was used on tafe_survey because it works well for renaming any number of columns explicitly. It is important to rename the columns so both data sets look uniform, because that makes them easier to combine for further analysis later.
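The two renaming approaches can be compared on a toy frame (the two column names below are borrowed from the surveys; the frame itself is made up):

```python
import pandas as pd

df = pd.DataFrame(columns=['Record ID', 'CESSATION YEAR'])

# Approach 1 (dete_survey): vectorized string methods on the columns index
snake = df.columns.str.lower().str.strip().str.replace(' ', '_')

# Approach 2 (tafe_survey): an explicit mapping via DataFrame.rename()
renamed = df.rename(columns={'Record ID': 'id', 'CESSATION YEAR': 'cease_date'})

print(list(snake))            # ['record_id', 'cessation_year']
print(list(renamed.columns))  # ['id', 'cease_date']
```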
# Clean the column names
dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.strip().str.replace(' ', '_')
# Check that the column names were updated correctly
dete_survey_updated.columns
# Rename column, show column names for tafe_survey
columns = {
    'Record ID': 'id',
    'CESSATION YEAR': 'cease_date',
    'Reason for ceasing employment': 'separationtype',
    'Gender. What is your Gender': 'gender',
    'CurrentAge. Current Age': 'age',
    'Employment Type.Employment Type': 'employment_status',
    'Classification. Classification': 'position',
    'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
    'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service',
}
tafe_survey_updated = tafe_survey_updated.rename(columns, axis = 1)
tafe_survey_updated.columns
To filter the data set, the Series.value_counts() method is used to count the separation types and identify employees who resigned. Many reasons cause employees to exit work, but in this project I focus on those who resigned, comparing them on dimensions such as age versus role service (the number of years in the job before quitting).
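The cleanup applied to the DETE separation types below can be sketched on made-up values: each dash-separated label is collapsed to its first part, so the different 'Resignation-…' variants become a single 'Resignation' category.

```python
import pandas as pd

# Toy separation types mimicking the DETE survey values
sep = pd.Series(['Resignation-Other reasons',
                 'Resignation-Other employer',
                 'Age Retirement'])

first = sep.str.split('-').str[0]
print(first.tolist())  # ['Resignation', 'Resignation', 'Age Retirement']
```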
#Check the unique values for the separationtype column for tafe_survey_updated
tafe_survey_updated['separationtype'].value_counts()
# Check the unique values for the separationtype column for dete_survey_updated
dete_survey_updated['separationtype'].value_counts()
# Collapse separation types such as 'Resignation-Other reasons' down to 'Resignation'
dete_survey_updated['separationtype'] = dete_survey_updated['separationtype'].str.split('-').str[0]
# Check the values in the separationtype column were updated correctly
dete_survey_updated['separationtype'].value_counts()
# Select only the resignation separation types from each dataframe
dete_resignations = dete_survey_updated[dete_survey_updated['separationtype'] == 'Resignation'].copy()
tafe_resignations = tafe_survey_updated[tafe_survey_updated['separationtype'] == 'Resignation'].copy()
tafe_survey_updated['role_service'].value_counts() # To check the number of years of service by employees
Before cleaning and manipulating the data sets further, I will check the columns for inconsistencies. Here I focus on verifying the years in the cease_date and dete_start_date columns. Most employees start working in their 20s, so a dete_start_date before 1940 would not make sense.
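The sanity check can be sketched with a boolean mask on made-up years (the values below are illustrative, not from the survey): any start date before 1940 is flagged as implausible.

```python
import pandas as pd

# Toy start dates; 1881.0 is deliberately implausible
start_dates = pd.Series([1963.0, 2005.0, 1881.0, 1999.0])

suspicious = start_dates[start_dates < 1940]
print(suspicious.tolist())  # [1881.0] — the only implausible value
```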
# Check the unique values
dete_resignations['cease_date'].value_counts()
#Extract the years and convert them to a float type
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.split('/').str[-1]
dete_resignations['cease_date'] = dete_resignations['cease_date'].astype("float")
# Check the values again and look for outliers
dete_resignations['cease_date'].value_counts()
tafe_resignations['cease_date'].value_counts()
# Check the unique values and look for outliers
dete_resignations['dete_start_date'].value_counts().sort_values(ascending = True)
Based on my findings, there aren't any inconsistencies in the cease_date and dete_start_date columns.
Since tafe_resignations already has an institute_service column, I will create a similar column in dete_resignations by subtracting dete_start_date from cease_date, as seen below. This gives the two dataframes common column names so they can be joined later for further analysis.
# Calculate the length of time an employee spent in their respective workplace and create a new column
dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
# display results
dete_resignations['institute_service'].head()
To identify dissatisfied employees, the 'Contributing Factors. Dissatisfaction' and 'Contributing Factors. Job Dissatisfaction' columns are used in the tafe data, while in the dete data the job_dissatisfaction, dissatisfaction_with_the_department, physical_work_environment, lack_of_recognition, lack_of_job_security, work_location, employment_conditions, work_life_balance, and workload columns are used.
# Check for employees who resigned due to dissatisfaction
tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts()
tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts()
# Map '-' to False, NaN to NaN, and any other value to True
def update_vals(value):
    if pd.isnull(value):
        return np.nan
    elif value == '-':
        return False
    else:
        return True
tafe_resignations['dissatisfied'] = tafe_resignations[['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction']].applymap(update_vals).any(axis=1, skipna=False)
tafe_resignations_up = tafe_resignations.copy()
# Check the unique values after the updates
tafe_resignations_up['dissatisfied'].value_counts(dropna=False)
# Create the dissatisfied column in dete_resignations
dete_resignations['dissatisfied'] = dete_resignations[['job_dissatisfaction',
                                                       'dissatisfaction_with_the_department',
                                                       'physical_work_environment',
                                                       'lack_of_recognition',
                                                       'lack_of_job_security',
                                                       'work_location',
                                                       'employment_conditions',
                                                       'work_life_balance',
                                                       'workload']].any(axis=1, skipna=False)
dete_resignations_up = dete_resignations.copy() # To create a copy of data and avoid SettingWithCopy warning
# Check the unique values after the update in dete_resignations
dete_resignations_up['dissatisfied'].value_counts(dropna=False)
First, I used Series.value_counts() to count the employees who resigned due to dissatisfaction. Then a function was created to map '-' and NaN values in the columns contributing to dissatisfaction. The DataFrame.any(axis=1, skipna=False) method is used to create the dissatisfied column in both tafe_resignations and dete_resignations. Finally, DataFrame.copy() is used to create a copy of the results and avoid the SettingWithCopy warning.
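The mapping-and-any pattern can be sketched on a toy frame (the two rows below are made up; the column names merely echo the tafe factors). A row with a real contributing factor comes out True even when its other answer is NaN, because any True short-circuits the reduction:

```python
import numpy as np
import pandas as pd

def update_vals(value):
    # '-' means the factor was not selected; NaN means no answer at all
    if pd.isnull(value):
        return np.nan
    elif value == '-':
        return False
    else:
        return True

factors = pd.DataFrame({
    'dissatisfaction': ['-', 'Job Dissatisfaction'],
    'job_dissatisfaction': ['-', np.nan],
})
dissatisfied = factors.applymap(update_vals).any(axis=1, skipna=False)
print(dissatisfied.tolist())  # [False, True]
```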
# To add column to each dataframe
dete_resignations_up['institute'] = 'DETE'
tafe_resignations_up['institute'] = 'TAFE'
# Combine the dataframes by stacking them vertically (pd.concat's default axis=0)
combined = pd.concat([dete_resignations_up, tafe_resignations_up],ignore_index = True)
#Verify the number of non null values in each column
combined.notnull().sum().sort_values()
combined_updated = combined.dropna(thresh=500, axis = 1).copy()
Before combining the two dataframes (dete_resignations_up, tafe_resignations_up), I added an institute column to each. This makes it easy to tell which rows came from which dataframe after combining. Then I concatenated the dataframes vertically (the default axis=0) and verified the number of non-null values in each column. Finally, the DataFrame.dropna() method with thresh=500 is used to drop any column with fewer than 500 non-null values.
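Both steps can be sketched on two tiny made-up frames: concat stacks the rows and takes the union of the columns, and dropna(thresh=N, axis=1) then keeps only columns with at least N non-null values (here N=3 plays the role of the 500 threshold above).

```python
import pandas as pd

# Toy frames with one shared column ('id') and one unique column each
a = pd.DataFrame({'id': [1, 2], 'age': [34.0, 41.0]})
b = pd.DataFrame({'id': [3, 4], 'role': ['Teacher', 'Admin']})

combined = pd.concat([a, b], ignore_index=True)
print(combined.shape)  # (4, 3) — rows stacked, columns unioned

kept = combined.dropna(thresh=3, axis=1)
print(list(kept.columns))  # ['id'] — only 'id' has at least 3 non-null values
```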
# Extracting years of service from institute_service column
combined_updated['institute_service'].value_counts(dropna = False) # To check the institute_service groups per service role in job
# Extract the years of service and convert them to float
combined_updated['institute_service'] = combined_updated['institute_service'].astype('str').str.extract(r'(\d+)', expand=False) # expand=False returns a Series
combined_updated['institute_service'] = combined_updated['institute_service'].astype(float)
combined_updated['institute_service'].value_counts()
# Categorize years of service into career stages
def career_vals(x):
    if pd.isnull(x):
        return np.nan
    elif x >= 11:
        return 'Veteran'
    elif 7 <= x <= 10:
        return 'Established'
    elif 3 <= x <= 6:
        return 'Experienced'
    else:
        return 'New'
combined_updated['service_cat'] = combined_updated['institute_service'].apply(career_vals)
# Check the unique values after the updates
combined_updated['service_cat'].value_counts()
# Confirm the counts of True and False in the dissatisfied column
combined_updated['dissatisfied'].value_counts(dropna = False)
# Replace missing values in the dissatisfied column with the most frequent value (False)
combined_updated['dissatisfied'] = combined_updated['dissatisfied'].fillna(False)
dis_cat = combined_updated.pivot_table(index='service_cat', values='dissatisfied')
%matplotlib inline
dis_cat.plot(kind = 'bar', rot = 30)
Based on an initial analysis of the bar plot, most employees who resigned because they were dissatisfied were either new to the institute (193 employees) or experienced (172 employees), while veterans and established employees were the least likely to resign due to job dissatisfaction. However, veterans and established employees are more likely to exit their jobs due to retirement.
# To check tafe_survey age group
tafe_survey_updated['age'].value_counts(dropna=False) # To check the age groups per service role in job
# Check dete_survey age group
dete_survey_updated['age'].value_counts(dropna=False)
# Extract the age of employee and convert them to float for dete_survey
dete_survey_updated['age'] = dete_survey_updated['age'].astype('str').str.extract(r'(\d+)', expand=False) # expand=False returns a Series
dete_survey_updated['age'] = dete_survey_updated['age'].astype(float)
dete_survey_updated['age'].value_counts()
# Extract the age of employee and convert them to float for tafe_survey group
tafe_survey_updated['age'] = tafe_survey_updated['age'].astype('str').str.extract(r'(\d+)', expand=False) # expand=False returns a Series
tafe_survey_updated['age'] = tafe_survey_updated['age'].astype(float)
tafe_survey_updated['age'].value_counts()
# Add an institute column to each survey dataframe before combining
dete_survey_updated['institute'] = 'DETE'
tafe_survey_updated['institute'] = 'TAFE'
# Combine the dataframes by stacking them vertically (pd.concat's default axis=0)
combined_age = pd.concat([dete_survey_updated, tafe_survey_updated],ignore_index = True)
#Verify the number of non null values in each column
combined_age.notnull().sum().sort_values()
combined_age_updated = combined_age.dropna(thresh=500, axis = 1).copy()
# Categorize ages into retirement-related groups
def employee_retirement_age(age):
    age = float(age)  # Ensure a numeric comparison even if given a string
    if pd.isnull(age):
        return np.nan
    elif age >= 50:
        return 'Old Age'
    elif 40 <= age < 50:
        return 'Middle Age Adult'
    elif 20 <= age < 40:
        return 'Young Adult'
    else:
        return 'Adolescent'
combined_age_updated['Age_retirement_cat'] = combined_age_updated['age'].apply(employee_retirement_age)
# Check the unique values after the updates
combined_age_updated['Age_retirement_cat'].value_counts()
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.axis('equal')
Age_retirement_cat = ['Old Age', 'Adolescent', 'Middle Age Adult', 'Young Adult']
column =[60,49,39,19]
ax.pie(column, labels = Age_retirement_cat, autopct='%1.2f%%')
plt.show()
Based on the pie chart above, about 35.93% of employees are likely to exit their jobs due to old age, probably through retirement. 29.34% are of adolescent age and might still be too young to work long term, while both middle-aged adults (23.35%) and young adults (11.38%) are less likely to retire, though these groups are prone to changing jobs.
From my point of view, young and new employees are more likely to exit a job due to job dissatisfaction. For example, about 193 new employees and 29.34% of adolescent-aged employees are likely to quit. Veterans and established employees, at older ages, are likely to exit due to retirement. Both middle-aged and young experienced employees, however, are likely to change jobs either because of job dissatisfaction or to find new challenges at other companies.