#!/usr/bin/env python
# coding: utf-8
# # Employee Exit Surveys: Clean & Analyze
# ## Introduction
# BACKGROUND:
# - Exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. You can find the DETE survey here: https://data.gov.au/dataset/ds-qld-fe96ff30-d157-4a81-851d-215f2a0fe26d/details?q=exit%20survey. The TAFE survey is no longer available. We've made some slight modifications to these datasets to make them easier to work with, including changing the encoding to UTF-8 (the originals are encoded in cp1252; see the short re-encoding sketch at the end of this introduction).
#
#
#
# TARGET:
# - Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
# - Are younger employees resigning due to some kind of dissatisfaction? What about older employees?
#
# Below is a preview of a couple of columns we'll work with from the dete_survey.csv:
#
# - ID: An id used to identify the participant of the survey
# - SeparationType: The reason why the person's employment ended
# - Cease Date: The year or month the person's employment ended
# - DETE Start Date: The year the person began employment with the DETE
#
#
#
# Below is a preview of a couple of columns we'll work with from the tafe_survey.csv:
#
# - Record ID: An id used to identify the participant of the survey
# - Reason for ceasing employment: The reason why the person's employment ended
# - LengthofServiceOverall. Overall Length of Service at Institute (in years): The length of the person's employment (in years)
#
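# For reference, re-encoding one of the original cp1252 files to UTF-8 could be done with a minimal snippet like the one below. This is only a sketch: the input filename is hypothetical, and the step is not needed here because the provided CSVs are already UTF-8.
# In[ ]:
#with open('dete_survey_cp1252.csv', encoding='cp1252') as src, \
#     open('dete_survey.csv', 'w', encoding='utf-8') as dst:
#    dst.write(src.read())  # rewrite the file content with UTF-8 encoding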
# ### Import Packages
# In[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
get_ipython().run_line_magic('matplotlib', 'inline')
# ## Load Data
# In[2]:
dete_survey = pd.read_csv('dete_survey.csv',na_values='Not Stated')
tafe_survey = pd.read_csv('tafe_survey.csv')
# ## 1st Glance at data sets
# Get information about the content, shape, and size of the data sets
# ### DETE DATA
# In[3]:
dete_survey.info()
# Check NaN/null values in DETE dataframe
# In[4]:
dete_sorted = dete_survey.set_index('Region').sort_values(['Region', 'SeparationType'])
sns.heatmap(dete_sorted.isnull(), cbar=False)
# In[5]:
dete_survey.isnull().sum()
# In[6]:
dete_survey.columns[dete_survey.isnull().sum()>100]
# Check for columns with remaining 'Not Stated' elements (these should already have been converted to NaN via the na_values parameter of read_csv())
# In[7]:
for col in dete_survey.columns:
    if 'Not Stated' in dete_survey[col].unique().tolist():
        print(col)
# Check the total number of columns that contain NaN
# In[8]:
dete_survey.columns[dete_survey.isnull().sum()>0].shape
# Check Data Types of columns
# In[9]:
dt = dete_survey.dtypes
dt.value_counts()
# #### Observations DETE SURVEY DATA:
# - 56 columns and 821 rows
# - Column names seem clean, but some columns use camel-case naming
# - 36 columns contain NaN values (57%)
# - 8 columns contain more than 100 NaN values (13%) and are almost entirely empty (Career Aspirations, Feedback, Further PD, Communication, My say, Information, Kept informed)
# - most columns are strings/object (35), some are boolean (18), two are float (DETE Start Date, Role Start Date) and a single one is int64 (ID)
# ### TAFE DATA
# In[10]:
tafe_survey.info()
# In[11]:
tafe_survey.columns
# In[12]:
tafe_survey.sample()
# Check NaN/null values in TAFE dataframe
# In[13]:
tafe_sorted = tafe_survey.set_index('Institute').sort_values(['Institute', 'Reason for ceasing employment'])
sns.heatmap(tafe_sorted.isnull(), cbar=False)
# In[14]:
tafe_survey.columns[tafe_survey.isnull().sum()>0]
# In[15]:
tafe_survey.columns[tafe_survey.isnull().sum()>100].shape
# In[16]:
dt = tafe_survey.dtypes
dt.value_counts()
# #### Observations TAFE SURVEY DATA:
# - contains 72 columns and 702 rows
# - Column names are not cleaned: they contain a lot of special characters and are not formatted in any consistent way
# - 69 columns contain NaN values (96%)
# - 38 columns contain more than 100 NaN values (53%)
# - most columns are strings/object (70), the remaining two are float64
# ## Data set cleaning
# ### Clean DETE Data set
# Drop columns that are not needed for the analysis. Those columns cover workplace- and career-related topics.
# In[17]:
dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49],axis=1)
# Format column names: lowercase, strip whitespace, and replace spaces with underscores, to have common naming between the DETE and TAFE data sets
# In[18]:
dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.strip().str.replace(' ','_')
# In[19]:
dete_survey_updated.columns
# ### Clean TAFE Data set
# Drop columns that are not needed for the analysis. Those columns contain detailed questions/answers regarding the reasons for leaving the institute.
# In[20]:
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66],axis=1)
# Rename columns to have common naming between the DETE and TAFE data sets
# In[21]:
col_map={'Record ID':'id',
'CESSATION YEAR': 'cease_date',
'Reason for ceasing employment':'separationtype',
'Gender. What is your Gender?':'gender',
'CurrentAge. Current Age':'age',
'Employment Type. Employment Type':'employment_status',
'Classification. Classification':'position',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)':'institute_service',
'LengthofServiceCurrent. Length of Service at current workplace (in years)':'role_service'
}
# In[22]:
tafe_survey_updated.rename(col_map,axis=1,inplace=True)
# In[23]:
tafe_survey_updated.columns
# ### Check the number of resignation cases in each data set
# In[24]:
dete_survey_updated.separationtype.value_counts()
# Extract only resignation cases (the separationtype entry starts with "Resignation...")
# In[25]:
dete_resignations = dete_survey_updated[dete_survey_updated.separationtype.str.contains('^Resignation.*')].copy()
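# Note: the mask above assumes 'separationtype' has no missing values in this subset; if it did, str.contains would return NaN for those rows and the boolean selection would raise an error. A defensive variant (an assumption, not required here) would pass na=False:
# In[ ]:
## ALTERNATIVE CODE
#dete_resignations = dete_survey_updated[dete_survey_updated.separationtype.str.contains('^Resignation', na=False)].copy()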
# In[26]:
dete_resignations.info()
# In[27]:
tafe_survey_updated.separationtype.value_counts()
# Extract only cases where the separationtype column has the value "Resignation"
# In[28]:
tafe_resignations = tafe_survey_updated[tafe_survey_updated.separationtype=='Resignation'].copy()
# In[29]:
tafe_resignations.info()
# ### Check data Consistency (year/date related columns)
# In[30]:
dete_resignations.cease_date.value_counts()
# Convert "cease_date" column from string to float, extracting only year
# In[31]:
pattern = r"(2\d{3})"
cease_yr = dete_resignations['cease_date'].str.extract(pattern)
dete_resignations['cease_date'] = cease_yr.squeeze().astype('float') # convert to float and squeeze to column array/Series
# In[32]:
# check resulting cease date column
print(dete_resignations['cease_date'].value_counts().sort_index())
print('Contained NaN values: {}'.format(dete_resignations['cease_date'].isnull().sum()))
# In[33]:
#check start date column
print(dete_resignations['dete_start_date'].value_counts().sort_index())
print('Contained NaN values: {}'.format(dete_resignations['dete_start_date'].isnull().sum()))
# In[34]:
#check tafe cease_date
print(tafe_resignations['cease_date'].value_counts().sort_index())
print('Contained NaN values: {}'.format(tafe_resignations['cease_date'].isnull().sum()))
# In[36]:
fig,ax = plt.subplots(1,2)
dete_resignations.boxplot(column='cease_date',ax=ax[0])
tafe_resignations.boxplot(column='cease_date',ax=ax[1])
ax[0].set_title('DETE')
ax[1].set_title('TAFE')
fig.tight_layout()
# __Observations:__
# - The DETE cease_date values range from 2012 to 2014, with the interquartile range (Q1-Q3) falling in 2012-2013
# - The TAFE cease_date values range from 2010 to 2013, with the interquartile range (Q1-Q3) falling in 2010-2012
# In[37]:
tafe_resignations['institute_service'].value_counts()
# Create a new column with the time worked at DETE, calculated as the difference "cease_date" - "dete_start_date"
# In[38]:
dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
# In[39]:
#check results
print(dete_resignations['institute_service'].value_counts().sort_index())
print('Contained NaN values: {}'.format(dete_resignations['institute_service'].isnull().sum()))
# In[40]:
dete_resignations.boxplot(column='institute_service')
# __Observations:__
# - The time worked at DETE (column "institute_service") ranges from 0 to 49 years; the column contains 38 NaN values
# ## Extract employee resignations due to dissatisfaction
# Convert 'Contributing Factors. Dissatisfaction' column to boolean values
# In[41]:
## ALTERNATIVE CODE
#rp_dict = {'Contributing Factors. Dissatisfaction ':True,
# '-':False}
#tafe_resignations['Contributing Factors. Dissatisfaction'] = tafe_resignations['Contributing Factors. Dissatisfaction'].replace(rp_dict)
#tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts()
#rp_dict = {'Job Dissatisfaction':True,
# '-':False}
#tafe_resignations['Contributing Factors. Job Dissatisfaction'] = tafe_resignations['Contributing Factors. Job Dissatisfaction'].replace(rp_dict)
#tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts()
# In[42]:
print(tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts())
print(tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts())
# In[43]:
def update_vals(element):
    if pd.isnull(element):
        return np.nan
    elif element == '-':
        return False
    else:
        return True
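# A quick illustrative check of update_vals on hypothetical inputs:
# In[ ]:
[update_vals(x) for x in [np.nan, '-', 'Job Dissatisfaction']]
# expected: [nan, False, True]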
# Update the columns 'Contributing Factors. Job Dissatisfaction' and 'Contributing Factors. Dissatisfaction' with boolean values (True/False) or NaN
# In[44]:
tafe_resignations[['Contributing Factors. Job Dissatisfaction','Contributing Factors. Dissatisfaction']] = tafe_resignations[['Contributing Factors. Job Dissatisfaction','Contributing Factors. Dissatisfaction']].applymap(update_vals)
# Create a single column 'dissatisfied' containing a boolean value derived from the two columns:
# - Contributing Factors. Job Dissatisfaction
# - Contributing Factors. Dissatisfaction
# In[45]:
factors = ['Contributing Factors. Job Dissatisfaction','Contributing Factors. Dissatisfaction']
tafe_resignations['dissatisfied'] = tafe_resignations[factors].any(axis=1,skipna=False)
# Check that NaN elements are not skipped and remain represented as NaN
# In[46]:
ii_nan = (tafe_resignations[factors[0]].isnull()) | (tafe_resignations[factors[1]].isnull())
# In[47]:
tafe_resignations.loc[ii_nan,["dissatisfied","Contributing Factors. Job Dissatisfaction","Contributing Factors. Dissatisfaction"]]
# In[48]:
display(tafe_resignations[factors[0]].value_counts(dropna=False))
display(tafe_resignations[factors[1]].value_counts(dropna=False))
display(tafe_resignations['dissatisfied'].value_counts(dropna=False))
# In[49]:
tafe_resignations_up = tafe_resignations.copy()
# Create a single column 'dissatisfied' containing a boolean value derived from the 9 columns below, which indicate dissatisfaction:
# - job_dissatisfaction
# - dissatisfaction_with_the_department
# - physical_work_environment
# - lack_of_recognition
# - lack_of_job_security
# - work_location
# - employment_conditions
# - work_life_balance
# - workload
# In[50]:
factors = ['job_dissatisfaction','dissatisfaction_with_the_department',
'physical_work_environment','lack_of_recognition',
'lack_of_job_security','work_location','employment_conditions',
'work_life_balance','workload']
dete_resignations['dissatisfied'] = dete_resignations[factors].any(axis=1,skipna=False)
# In[51]:
ii_nan = dete_resignations[factors].isnull().any(axis=1)
# Check for NaN in the 9 dissatisfaction-related columns
# In[52]:
[dete_resignations[el].isnull().sum() for el in factors]
# In[53]:
dc_str = factors + ["dissatisfied"]
dete_resignations.loc[ii_nan,dc_str]
# In[54]:
dete_resignations_up = dete_resignations.copy()
# ### Frame 8: Combine data frames: DETE and TAFE
# In[55]:
dete_resignations_up['institute'] = 'DETE'
tafe_resignations_up['institute'] = 'TAFE'
# In[56]:
dete_resignations_up['age'].value_counts()
# In[57]:
tafe_resignations_up['age'].value_counts()
# Examine common columns between both data sets before joining
# In[58]:
set.intersection(set(dete_resignations_up.keys()),set(tafe_resignations_up.keys()))
# Concatenate data sets
# In[59]:
combined = pd.concat([dete_resignations_up, tafe_resignations_up],ignore_index=True)
combined.info()
# Drop all columns with fewer than 500 non-null values (dropna with thresh=500 keeps only columns that have at least 500 non-NaN entries)
# In[60]:
combined_updated = combined.dropna(axis=1,thresh=500)
combined_updated.info()
# Check the contents of the "institute_service" column
# In[61]:
combined_updated.institute_service.value_counts()
# In[62]:
combined_updated['institute_service'].astype('str').value_counts()
# ### Frame 9: Extract the years of service and calculate a float value
# In[63]:
# map function - calculates the years of service from the two extracted (float-converted) number columns
def calc_yrs(ycol):
    if ycol[1] == 0:  # second number is zero: original value was a single number (e.g. '3.0' or '11')
        return ycol[0]  # return the first number
    else:
        return (ycol[0] + ycol[1]) / 2  # original value was a range (e.g. '1-2'): return the average of the two numbers
# Extract the years of service in numeric/float format
# In[64]:
yrs_str = combined_updated['institute_service'].astype('str') # convert to string
patt = r"([0-9]+)[-\.]?([0-9]+)?" #regular exp pattern to extract up to 2 numbers
yrs_extr = yrs_str.str.extract(patt) # extract numbers
yrs_cln = yrs_extr.dropna(how='all') # remove all lines with both elements NaN - 88 lines(ref above)
yrs_calc = yrs_cln.fillna('0').astype('float').apply(calc_yrs,axis=1) #fill single column NaN with '0' >> convert to float >> deploy calc_yrs function
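# To illustrate how the extraction pattern and calc_yrs behave together, here is a quick check on a few hypothetical sample strings (this assumes the cell above has been run, so patt and calc_yrs are defined):
# In[ ]:
sample = pd.Series(['1-2', '3.0', '11-20', 'Less than 1 year', 'More than 20 years'])
sample.str.extract(patt).fillna('0').astype('float').apply(calc_yrs, axis=1)
# expected results: 1.5, 3.0, 15.5, 1.0, 20.0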
# Mapping function for career stages
# In[65]:
# map function to convert the years of service to career stages
def carr_stage(el):
    if el < 3.0:
        return 'New'
    elif (el >= 3.0) & (el < 7.0):
        return 'Experienced'
    elif (el >= 7.0) & (el < 11.0):
        return 'Established'
    elif el >= 11.0:
        return 'Veteran'
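# A quick illustrative check of the career-stage mapping on hypothetical year values:
# In[ ]:
pd.Series([1.5, 3.0, 7.0, 15.5]).map(carr_stage)
# expected: 'New', 'Experienced', 'Established', 'Veteran'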
# In[66]:
combined_updated.info()
# Create a new column based on the calculated years-of-service Series mapped to the career stages
# In[119]:
# add column "service_cat"
combined_updated.loc[:,'service_cat'] = yrs_calc.map(carr_stage).copy()
# In[68]:
combined_updated.info()
# ### Frame 10: First analysis
# In[69]:
combined_updated.dissatisfied.value_counts(dropna=False)
# Fill remaining NaN in the "dissatisfied" column with False, as it is the most frequently occurring value
# In[70]:
combined_updated.loc[:,'dissatisfied'] = combined_updated.dissatisfied.fillna(value=False).copy()
# In[71]:
combined_updated.dissatisfied.value_counts(dropna=False)
# Cross-check the values against the pivot_table default (mean) approach
# In[72]:
def dissatisfied_perc(group):
    # percentage of rows in the group flagged as dissatisfied
    n = len(group)
    diss_n = (group == True).sum()
    return diss_n / n * 100
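# Quick illustrative check of the helper on a hypothetical group:
# In[ ]:
dissatisfied_perc(pd.Series([True, False, False, True]))
# expected: 50.0 (half of the group is flagged as dissatisfied)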
# In[73]:
combined_updated.pivot_table(values='dissatisfied',index='service_cat',aggfunc=dissatisfied_perc)
# In[84]:
combined_updated.pivot_table(values='dissatisfied',index='service_cat')*100
# In[75]:
combined_updated.pivot_table(values='dissatisfied',index='service_cat').reindex(['New','Experienced','Established','Veteran']).plot(kind='bar')
# # Conclusion 1
# __Answering the first question__:
# " Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction?"
#
" What about employees who have been there longer?"
# * We observe a continuous increase of dissatisfaction with seniority, peaking with 7-10 years(Established).
#
#
Therefore we can say that __"New" employees are not resigning due to dissatisfaction with higher rates__ then other seniority levels.
#
The highest rates were found for the "Established" level.
# ### Compare DETE vs TAFE
# In[82]:
order = pd.MultiIndex.from_product([['DETE', 'TAFE'],
                                    ['New', 'Experienced', 'Established', 'Veteran']])
combined_updated.pivot_table(values='dissatisfied', index=['institute', 'service_cat']).reindex(order).plot(kind='bar', stacked=True)
# #### Conclusion
# * The DETE-sourced data seems to be responsible for the trend we observed in the combined data set
# * The TAFE data shows a roughly equal level of dissatisfaction across seniority levels, except for a small peak for the Established (7-10 yrs) employees
#
# __This can be confirmed in the alternative plot below, in which we display the data from DETE (blue) and TAFE (red) as stacked bar graphs__
# In[83]:
fig,axi = plt.subplots(1,1)
combined_updated[combined_updated.institute=='DETE'].pivot_table(values='dissatisfied',index='service_cat').reindex(['New','Experienced','Established','Veteran']).plot(kind='bar',ax=axi,color='b',stacked=True)
combined_updated[combined_updated.institute=='TAFE'].pivot_table(values='dissatisfied',index='service_cat').reindex(['New','Experienced','Established','Veteran']).plot(kind='bar',ax=axi,color='r',stacked=True)
# ### Analyse influence of Age
# In[77]:
combined_updated['age'].value_counts()
# In[128]:
rp_age = {'41 45':'41-45',
'46 50':'46-50',
'21 25':'21-25',
'36 40':'36-40',
'31 35':'31-35',
'26 30':'26-30'}
combined_updated['age'].replace(rp_age,inplace=True)
# In[127]:
combined_updated.pivot_table(values='dissatisfied',index='age').plot(kind='bar')
# # Conclusion 2
# Answering the second question:
#
"Are younger employees resigning due to some kind of dissatisfaction?
#
"What about older employees?"
#
# * Dissaticfaction seems to increase with age, peaking for age groups older than 56yrs (remark that "56 or older" bar could be equally added to the "56-60" and "61 or older" bars)
# * A first peakof dissatisfaction seems to occur for the 26-30yrs old
#
#
Difficult to detect a significant trend for younger employees (<30years), but dissatisfaction seems to peak with the age group of 26-30 years. As this is as well the age group where most people become parents for a first time being aged 25-34 years:
#
https://aifs.gov.au/facts-and-figures/births-in-australia
#
It would have been interesting to investigate the influence of an upcoming child birth on the resigning rate in that age group
#
#
We abosere a rather big jump in resigning rate for the age groups above 56years
# Quick visualization of remaining "NaN/Null" Values
# In[76]:
sns.heatmap(combined_updated.isnull(), cbar=False)