In this project, we will worked with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. We have 2 dataset dete_survey.csv and tafe_survey.csv Below are some columns of dete_survey:
Below is a preview of a couple columns we'll work with from the tafe_survey.csv:
Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
Are younger employees resigning due to some kind of dissatisfaction? What about older employees?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dete_survey=pd.read_csv('dete_survey.csv')
tafe_survey=pd.read_csv('tafe_survey.csv')
dete_survey.info()
dete_survey.columns
len(dete_survey.columns)
dete_survey.head()
dete_survey.isnull().sum().sort_values()
In dete_survey ,there are 56 columns and large no. of columns in which there are null values 'Disability', 'South Sea', 'Aboriginal' etc.
tafe_survey.info()
len(tafe_survey.columns)
tafe_survey.head()
tafe_survey.isnull().sum().sort_values()
dete_survey=pd.read_csv('dete_survey.csv',na_values='Not Stated')
dete_survey_updated=dete_survey.drop(dete_survey.columns[28:49],axis=1)
tafe_survey=pd.read_csv('tafe_survey.csv',na_values='Not Stated')
tafe_survey_updated=tafe_survey.drop(tafe_survey.columns[17:66],axis=1)
We have made the changes in dete_survey and tafe_survey because the values in their columns is not stated , undefined values like 'N' and 'A' are stated which is meaningless
dete_survey_updated.columns=dete_survey_updated.columns.str.replace(' ','_').str.lower()
rename_columns={
'Record ID': 'id',
'CESSATION YEAR': 'cease_date',
'Reason for ceasing employment': 'separationtype',
'Gender. What is your Gender?': 'gender',
'CurrentAge. Current Age': 'age',
'Employment Type. Employment Type': 'employment_status',
'Classification. Classification': 'position',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'
}
tafe_survey_updated=tafe_survey_updated.rename(rename_columns, axis=1)
tafe_survey_updated.columns
we have rename the columns because names of the columns were complex which are required for our analysis so to simple our calculation we rename the necesary columns.
dete_survey_updated['separationtype']=dete_survey_updated[
'separationtype'].str.split('-').str[0]
dete_survey_updated['separationtype'].value_counts()
dete_resignations=dete_survey_updated[dete_survey_updated['separationtype']=='Resignation'].copy()
tafe_resignations=tafe_survey_updated[tafe_survey_updated['separationtype']=='Resignation'].copy()
we will take only those rows in which 'separtiontype' value is resignation and in dete_survey_updated there is 3 types of 'resignation' values , we combine all of them
dete_resignations['cease_date'].value_counts()
dete_resignations['cease_date']=dete_resignations['cease_date'].str.split('/').str[-1]
dete_resignations['cease_date']=dete_resignations['cease_date'].astype('float')
dete_resignations['cease_date'].value_counts()
dete_resignations['dete_start_date'].value_counts().sort_index()
%matplotlib inline
dete_resignations.boxplot(['dete_start_date','cease_date'])
plt.show()
dete_resignations['institute_service']=dete_resignations['cease_date']-dete_resignations['dete_start_date']
dete_resignations['institute_service'].value_counts()
tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts()
tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts()
def update_vals(df):
if pd.isnull(df):
return np.nan
if df=='-':
return False
else:
return True
tafe_resignations['dissatisfied']=tafe_resignations[
['Contributing Factors. Dissatisfaction','Contributing Factors. Job Dissatisfaction']].applymap(update_vals).any(1,skipna=False)
tafe_resignations_up=tafe_resignations.copy()
tafe_resignations_up['dissatisfied'].value_counts(dropna=False)
dete_resignations['dissatisfied']=dete_resignations[[
'job_dissatisfaction',
'dissatisfaction_with_the_department',
'physical_work_environment',
'lack_of_recognition',
'lack_of_job_security',
'work_location',
'employment_conditions',
'work_life_balance',
'workload']].applymap(update_vals).any(1,skipna=False)
dete_resignations_up=dete_resignations.copy()
dete_resignations_up['dissatisfied'].value_counts(dropna=False)
dete_resignations_up['institute']='DETE'
tafe_resignations_up['institute']='TAFE'
combined=pd.concat([dete_resignations_up,tafe_resignations_up])
combined.notnull().sum().sort_values()
combined_updated=combined.dropna(thresh=500,axis=1).copy()
combined_updated['institute_service'].value_counts(dropna=False)
combined_updated['institute_service_up'] = combined_updated['institute_service'].astype('str').str.extract(r'(\d+)')
combined_updated['institute_service_up'] = combined_updated['institute_service_up'].astype('float')
# Check the years extracted are correct
combined_updated['institute_service_up'].value_counts()
def set_service(val):
if val>=11:
return 'Veteran'
elif val>=7:
return 'Established'
elif val>=3:
return 'Experienced'
else:
return "New"
combined_updated['service_cat']=combined_updated['institute_service_up'].apply(set_service)
combined_updated['service_cat'].value_counts()
combined_updated['dissatisfied'].value_counts(dropna=False)
combined_updated['dissatisfied']=combined_updated['dissatisfied'].fillna(False)
table=combined_updated.pivot_table(values='dissatisfied',index='service_cat')
%matplotlib inline
table.plot(kind='bar')
According to the plot graph most dissatisfied employee are vetreran who are working in the company from 11 years or more...after that Established employees are dissatisfied then Experienced , lowest dissatisfied are New