We will take a look at exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia.
Our aim is to answer the following questions:
We'll work with the two data sets combined (dete_survey.csv and tafe_survey.csv) to try to answer the questions above.
A data dictionary wasn't provided with the data set. For this project, we'll use our general knowledge to define a couple of columns.
dete_survey.csv:

Column name | Description |
---|---|
ID | An id used to identify the participant of the survey |
SeparationType | The reason why the person's employment ended |
Cease Date | The year or month the person's employment ended |
DETE Start Date | The year the person began employment with the DETE |
tafe_survey.csv:

Column name | Description |
---|---|
Record ID | An id used to identify the participant of the survey |
Reason for ceasing employment | The reason why the person's employment ended |
LengthofServiceOverall. Overall Length of Service at Institute (in years) | The length of the person's employment (in years) |
You can find the TAFE exit survey here and the survey for the DETE here.
Let's start by opening both data sets and taking a first look at them.
#import the pandas library
import pandas as pd
pd.options.display.max_columns = 150 # to avoid truncated output
#opening dete_survey.csv file
dete_survey = pd.read_csv('dete_survey.csv', encoding='UTF-8')
#opening tafe_survey.csv file
tafe_survey = pd.read_csv('tafe_survey.csv', encoding='UTF-8')
#display info about dete_survey columns
dete_survey.info()
The dete_survey data set contains 822 rows and 56 columns. Only the first column, ID, holds numeric values; the rest are object or bool type. Some columns have many missing values, like Torres Strait and South Sea. We have to examine the columns more closely.
#display first 5 rows in dete_survey
dete_survey.head()
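To back up the observation about missing values, a quick extra check (not part of the original steps) counts the NaN values per column in dete_survey, mirroring the check we run on tafe_survey below.
#count NaN values per column in dete_survey (columns with the most missing values first)
dete_survey.isnull().sum().sort_values(ascending=False).head(10)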
#display info about tafe_survey columns
tafe_survey.info()
#display first 5 rows in tafe_survey
tafe_survey.head()
#display NaN counts per column in tafe_survey
tafe_survey.isnull().sum().sort_values(ascending=False)
A first look at both data sets gives us the following observations:

- The dete_survey DataFrame contains 'Not Stated' values that indicate missing data, but they aren't represented as NaN.
- The tafe_survey DataFrame contains '-' characters that also indicate missing data, but they aren't marked as NaN either.

We will read the dete_survey data set again, but this time we'll add the na_values parameter.
#reopen dete_survey.csv with the na_values parameter; 'Not Stated' values are treated as NaN now
dete_survey = pd.read_csv('dete_survey.csv', na_values='Not Stated')
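As a quick sanity check (an extra step, not strictly required), we can confirm that the literal string 'Not Stated' no longer appears anywhere in the re-read DataFrame.
#count remaining occurrences of the literal string 'Not Stated'; should be 0 after using na_values
(dete_survey == 'Not Stated').sum().sum()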
We will remove some columns, because many of them won't have much influence on our goals. We will focus on the columns that contain similar factors in both data sets.
#drop columns from dete_survey that we won't use in the analysis
dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis=1)
#drop columns from tafe_survey that we won't use in the analysis
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis=1)
#check updated dete_survey columns
dete_survey_updated.info()
Our dete_survey data set now contains 35 columns, and almost half of them are bool type.
#check updated tafe_survey columns
tafe_survey_updated.info()
The tafe_survey data set has 23 columns after the change.
As we mentioned above, both data sets have similar contributing factors, but the column names are different. We'll change their names so we can combine the data sets.
First, we will standardize the column names in dete_survey_updated.
#standardize column names in dete_survey_updated
dete_survey_updated.columns = (dete_survey_updated.columns.str.lower() #convert to lowercase
.str.strip() #remove leading and trailing whitespace
.str.replace(' ','_') #Replace spaces with underscores ('_')
)
#display columns after standardization in dete_survey_updated
dete_survey_updated.columns
Next, we'll change some column names in tafe_survey_updated. We can see that many column names are written out as the whole survey question, and they're too long.
#changing column names in tafe_survey_updated
tafe_survey_updated = tafe_survey_updated.rename(columns={'Record ID': 'id',
'CESSATION YEAR': 'cease_date',
'Reason for ceasing employment': 'separationtype',
'Gender. What is your Gender?': 'gender',
'CurrentAge. Current Age': 'age',
'Employment Type. Employment Type': 'employment_status',
'Classification. Classification': 'position',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'
})
#display columns after renaming in tafe_survey_updated
tafe_survey_updated.columns
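Since we plan to combine the two DataFrames later, a small extra check shows which of the cleaned column names they already share.
#columns that now have the same name in both DataFrames (extra check)
set(dete_survey_updated.columns) & set(tafe_survey_updated.columns)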
Let's check the separationtype column in both data sets.
#checking unique values in dete_survey_updated
dete_survey_updated['separationtype'].value_counts()
#checking unique values in tafe_survey_updated
tafe_survey_updated['separationtype'].value_counts()
We see that there are 3 types of Resignation in dete_survey_updated. In tafe_survey_updated there is only one resignation value.
To make both data sets easier to work with, we will treat these 3 types of Resignation in dete_survey_updated as a single value: Resignation.
#convert all values containing the word Resignation into a single value
dete_survey_updated['separationtype'] = dete_survey_updated['separationtype'].str.split('-').str.get(0)
#display values in separationtype column in dete_survey_updated after changes
dete_survey_updated['separationtype'].value_counts()
After changes, we can select only the data for survey respondents who have a Resignation separation type in both data sets.
#selecting rows with Resignation value in separationtype column for each DataFrame
#make copy to avoid SettingWithCopy Warning
dete_resignations = dete_survey_updated[dete_survey_updated['separationtype'] == 'Resignation'].copy()
tafe_resignations = tafe_survey_updated[tafe_survey_updated['separationtype'] == 'Resignation'].copy()
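A quick extra check of how many rows each resignation subset contains:
#number of rows and columns in each resignation subset (extra check)
print(dete_resignations.shape)
print(tafe_resignations.shape)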
Let's look at the cease_date and dete_start_date columns to verify that the years look reasonable.
First, we have to convert the values to float in the dete_resignations DataFrame.
#check unique values
dete_resignations['cease_date'].value_counts()
Many values also contain the month; we can remove it and keep only the year.
#converting the cease_date column to float
dete_resignations['cease_date'] = (dete_resignations['cease_date'].str.split('/') #split by character '/'
.str.get(-1) #get last item in each value
.astype(float) #convert to float
)
#sort unique values in cease_date column
dete_resignations_end_sorted = dete_resignations['cease_date'].value_counts().sort_index(ascending = False)
#check values after converting
dete_resignations_end_sorted
#sort unique values in dete_start_date column
dete_resignations_start_sorted = dete_resignations['dete_start_date'].value_counts().sort_index(ascending = False)
dete_resignations_start_sorted
#sort unique values in cease_date column
tafe_resignations['cease_date'].value_counts().sort_index(ascending = False)
#plot values of cease_date column in tafe_resignations
tafe_resignations.boxplot(column=['cease_date'])
#plot values of cease_date column in dete_resignations
dete_resignations.boxplot(column=['cease_date'])
We compared the cease_date column in both DataFrames. All values seem plausible. However, the distributions are not identical. For example, tafe_resignations contains some values from 2009 that don't appear in dete_resignations. The most common resignation year is 2013 in dete_resignations and 2011 in tafe_resignations.
One of our goals, as a reminder, is:
To answer the question above, we need information about the period of employment. In tafe_resignations, a column with these values (institute_service) already exists.
Let's calculate the period of service for each employee in dete_resignations. We have two columns here: dete_start_date and cease_date. They contain floats, so we only have to subtract them.
Let's first check whether the dates are plausible. The period of employment can't be negative, so we will check if any value in the dete_start_date column is greater than the corresponding cease_date value.
#compare values in cease_date column with dete_start_date column
dete_resignations[dete_resignations.cease_date < dete_resignations.dete_start_date].copy()
Everything looks OK, so we can calculate the period of employment.
#subtract dete_start_date from cease_date and create a new column -> institute_service
dete_resignations['institute_service'] = dete_resignations.cease_date - dete_resignations.dete_start_date
#display 5 first rows
dete_resignations.head()
Now, we will identify dissatisfied employees. To do this, we have chosen the following columns, which in our opinion contribute most to dissatisfaction:
First, we'll look at the tafe_resignations DataFrame.
# check unique values in Contributing Factors. Dissatisfaction column
tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts(dropna=False) #set dropna parameter to see NaN values
# check unique values in Contributing Factors. Job Dissatisfaction
tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts(dropna=False) #set dropna parameter to see NaN values
There are only 2 distinct values in both columns. One is the '-' character, which means False; the other is the same as the column name, which means True. There are also a few NaN values.
We are going to create a new column that marks whether an employee's resignation was due to any kind of dissatisfaction.
First, we have to make a function to convert the values to True or False.
#import the numpy library to use NaN values
import numpy as np
#creating update_vals function
def update_vals(value):
if pd.isnull(value):
        return np.nan #keep NaN values as NaN
elif value == '-':
return False
else:
return True
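Before applying it to the real columns, we can try update_vals on a few hypothetical sample values to confirm the mapping: '-' becomes False, any other text becomes True, and NaN stays NaN.
#quick check of update_vals on hypothetical sample values (extra step)
pd.Series(['-', 'Job Dissatisfaction', np.nan]).map(update_vals)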
Our function is ready to apply; it converts the values to bool type. Next, we can use df.any(). If any of the chosen columns in tafe_resignations has a True value, the row in the dissatisfied column will be True as well. If all values are False, the resignation was not due to dissatisfaction and the row is marked as False. NaN values will remain unchanged (that's what the skipna parameter is for).
tafe_resignations['dissatisfied'] = (tafe_resignations[['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction']]
.applymap(update_vals) #apply function update_vals
.any(axis=1, skipna=False) #set row in dissatisfied column (True, False or NaN (skipna parameter))
)
tafe_resignations_up = tafe_resignations.copy() #make copy to avoid SettingWithCopy Warning
#display values in the dissatisfied column
tafe_resignations_up['dissatisfied'].value_counts(dropna=False)
#show columns in dete_resignations
dete_resignations.info()
All the columns we have chosen for our analysis are bool type, so all we have to do is create a new dissatisfied column in dete_resignations as well and use df.any().
#create dissatisfied column and apply True, False or NaN value to each row
dete_resignations['dissatisfied'] = dete_resignations[['job_dissatisfaction',
'dissatisfaction_with_the_department', 'physical_work_environment',
'lack_of_recognition', 'lack_of_job_security', 'work_location',
'employment_conditions', 'work_life_balance',
'workload']].any(axis=1, skipna=False)
dete_resignations_up = dete_resignations.copy() #make copy to avoid SettingWithCopy Warning
#display values in the dissatisfied column
dete_resignations_up['dissatisfied'].value_counts(dropna=False)
tafe_resignations_up.info()
It's time to combine both DataFrames. First, we will add a new column, institute, with the value DETE for dete_resignations_up and TAFE for tafe_resignations_up, so we'll know which rows come from which DataFrame.
dete_resignations_up['institute'] = 'DETE' #create new column and add DETE value to each row
tafe_resignations_up['institute'] = 'TAFE' #create new column and add TAFE value to each row
#combining DataFrames with pd.concat
combined = pd.concat([dete_resignations_up, tafe_resignations_up], ignore_index=True) #set ignore_index to True to reset the index
#check columns in combined
combined.info()
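To confirm that rows from both surveys made it into the combined DataFrame, a small extra check counts the values in the new institute column.
#count rows coming from each survey (extra check)
combined['institute'].value_counts()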
There are many columns we don't need in our analysis, so we will drop the columns that contain fewer than 500 non-null values.
#check which columns have fewer than 500 non-null values
combined.notnull().sum().sort_values()
#drop columns with fewer than 500 non-null values
combined_updated = combined.dropna(axis=1, thresh=500).copy() #make copy to avoid SettingWithCopy Warning
combined_updated.head()
We'll check all values in the institute_service column to see whether there are any tricky values.
#checking all values in institute_service column
combined_updated['institute_service'].value_counts(dropna=False) #check also NaN values
Most values are fine, but some of them are written as a range, like 1-2. There are also two text values: More than 20 years and Less than 1 year.
We have to convert all values into a single numeric value that indicates the period of employment.
We'll group all values into four categories:
This modification will help with our analysis. Moreover, the problem of choosing a number of years from a range goes away, because we simply take the first number. For example, for the range 7-10 we'll choose 7.
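A small illustration on a few sample values (an extra step, using the kinds of values we saw above) shows how the extraction below behaves: a range keeps its first number, a text value keeps its only number, and NaN stays NaN.
#illustrate the extraction on sample values before applying it to the real column (extra check)
sample = pd.Series(['1-2', '7-10', 'Less than 1 year', 'More than 20 years', np.nan])
sample.astype(str).str.extract(r'(\d+)', expand=False).astype(float)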
#extract only numbers from string and then convert it to float
combined_updated['institute_service'] = (combined_updated['institute_service'].astype(str)
.str.extract(r'(\d+)', expand=False) #extract the first number; expand=False returns a Series
.astype(float))
#check all values in institute_service column after converting
combined_updated['institute_service'].value_counts(dropna=False)
As we mentioned above, we will divide the values from the institute_service column into four categories.
We'll create a function that compares each value and assigns a category to it.
#creating function set_category
def set_category(value):
if value < 3:
return 'New'
elif value >= 3 and value <= 6:
return 'Experienced'
elif value > 6 and value <=10:
return 'Established'
elif pd.isnull(value):
return np.nan #return NaN value if it was NaN
else:
return 'Veteran'
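A quick check of set_category on a handful of hypothetical values confirms the categories we expect (New, Experienced, Established, Veteran, and NaN for missing values).
#quick check of set_category on hypothetical values (extra step)
[set_category(v) for v in [1.0, 5.0, 9.0, 20.0, np.nan]]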
#create new column service_cat and add category for each row
combined_updated['service_cat'] = combined_updated['institute_service'].apply(set_category)
#check values after change
combined_updated['service_cat'].value_counts(dropna=False)
We can start our initial analysis. Let's first check all values in the dissatisfied column.
#checking all values in dissatisfied column
combined_updated['dissatisfied'].value_counts(dropna=False) #check also NaN values
We notice that most values are False, almost 60% more than True values. There are also 8 NaN values. We'll replace them with False.
#replacing all missing values with False
combined_updated['dissatisfied'] = combined_updated['dissatisfied'].fillna(False)
#check all values in dissatisfied column after replacing
combined_updated['dissatisfied'].value_counts(dropna=False) #check also NaN values
#create pivot table to calculate the percentage of dissatisfied employees in each group
dissatisfied_perc = combined_updated.pivot_table(index='service_cat', values='dissatisfied')
#display plot in Jupyter
%matplotlib inline
#create bar plot
dissatisfied_perc.plot(kind='bar')
As we can see, the groups with the highest share of resignations due to dissatisfaction are Established (7-10 years at a company) and Veteran (11 or more years at a company). Roughly every second person in these groups who left their job did so because they were dissatisfied; that is, about half of the resignations among employees with 7 or more years of experience were due to dissatisfaction.
Let's also check the age of the employees who left the job due to dissatisfaction.
#check all values in age column
combined_updated['age'].value_counts(dropna=False)
As we can see, many age values are given as ranges. We decided to divide all employees into 8 categories:
#converting values in the age column to float
combined_updated['age'] = (combined_updated['age'].astype(str)
.str.extract(r'(\d+)', expand=False) #extract the first number as a Series
.astype(float)
)
#see new values after change
combined_updated['age'].value_counts(dropna=False) #display NaN values also
Now, we can create a function to assign each age value to a range.
#creating set_age function
def set_age(value):
    if value <= 25:
return '18 - 25'
elif value >= 26 and value <= 30:
return '26 - 30'
elif value >= 31 and value <= 35:
return '31 - 35'
elif value >= 36 and value <= 40:
return '36 - 40'
elif value >= 41 and value <= 45:
return '41 - 45'
elif value >= 46 and value <= 50:
        return '46 - 50'
elif value >= 51 and value <= 55:
return '51 - 55'
elif value >= 56:
return '56 and more'
elif pd.isnull(value): #return NaN
return np.nan
#create new column and apply each value
combined_updated['age_range'] = combined_updated['age'].apply(set_age)
#see new values in age_range column after change
combined_updated['age_range'].value_counts(dropna=False).sort_index()
#create pivot table
pv_age = pd.pivot_table(combined_updated, index='age_range', values='dissatisfied')
pv_age
#create plot
pv_age.plot(kind='bar', rot=45, ylim=(0,0.5), legend=False)
As we can see, employees who left the job due to dissatisfaction most often fall into 3 age ranges:
Let's create a plot for each service group to check the most common age range of dissatisfied employees.
#import the matplotlib library
import matplotlib.pyplot as plt
#creating plots
plt.figure(figsize=(12,10))
plt.subplot(2,2,1)
(combined_updated[(combined_updated['service_cat'] == 'New') & (combined_updated['dissatisfied'] == True)]['age_range'].value_counts(normalize=True)).plot(kind='bar', rot=45)
plt.title('New')
plt.subplot(2,2,2)
(combined_updated[(combined_updated['service_cat'] == 'Experienced') & (combined_updated['dissatisfied'] == True)]['age_range'].value_counts(normalize=True)).plot(kind='bar', rot=45)
plt.title('Experienced')
plt.subplot(2,2,3)
(combined_updated[(combined_updated['service_cat'] == 'Established') & (combined_updated['dissatisfied'] == True)]['age_range'].value_counts(normalize=True)).plot(kind='bar', rot=45)
plt.title('Established')
plt.subplot(2,2,4)
(combined_updated[(combined_updated['service_cat'] == 'Veteran') & (combined_updated['dissatisfied'] == True)]['age_range'].value_counts(normalize=True)).plot(kind='bar', rot=45)
plt.title('Veteran')
plt.show()
We can make some observations:

- In the New category, every fifth person who left the job and was dissatisfied is between 18 and 25 years old.

Let's ask one more question:
#create a pivot table
pv_institute = combined_updated.pivot_table(index = 'institute', values = 'dissatisfied')
pv_institute
Let's summarize our analysis. We're ready to answer the following questions:
As we can see, the groups with the highest share of resignations due to dissatisfaction are Established (7-10 years at a company) and Veteran (11 or more years at a company). Roughly every second person in these groups who left their job did so because they were dissatisfied; that is, about half of the resignations among employees with 7 or more years of experience were due to dissatisfaction.
Employees most often left the job due to some kind of dissatisfaction at age 51 or older and at age 26-30. In each of these groups, more than 40% of the employees who left were dissatisfied, whereas less than 30% of people aged 18-25 left the job due to dissatisfaction.
During the whole analysis we also found that:
- Employees from the DETE survey resigned due to dissatisfaction almost twice as often as employees from the TAFE survey.
- In the New category, every fifth person who left the job and was dissatisfied is between 18 and 25 years old.