The purpose of this analyses is to discover potential patterns regarding employee's who have terminated their employment at The Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) insititute in Queensland, Austrlia. The insights will be used as a basis for further analyses as well as for potential recommendations for improvements to reduce employee turnover for the two institutes. For this analyses, we will be using the employee exist survey's released by the insititutes, which can be found here (TAFE) and here (DETE).
The analyses as hypothetically requested by the institutes will be mainly focused on dissatisfaction as a factor of employee turnover. As such, we'll attempt to answer the following questions:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.offline as pyo
pyo.init_notebook_mode()
import plotly.express as px
C:\Users\garci\Anaconda3\lib\site-packages\statsmodels\tools\_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead. import pandas.util.testing as tm
Let's start by loading and getting a quick overview of the datasets by printing out the top 5 rows, and their corresponding shape.
df1 = pd.read_csv('dete_survey.csv')
df2 = pd.read_csv('tafe_survey.csv')
print(df1.shape) #tuple of row and columns of dataframe
df1.head() #prints top 5 rows
(822, 56)
ID | SeparationType | Cease Date | DETE Start Date | Role Start Date | Position | Classification | Region | Business Unit | Employment Status | ... | Kept informed | Wellness programs | Health & Safety | Gender | Age | Aboriginal | Torres Strait | South Sea | Disability | NESB | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Ill Health Retirement | 08/2012 | 1984 | 2004 | Public Servant | A01-A04 | Central Office | Corporate Strategy and Peformance | Permanent Full-time | ... | N | N | N | Male | 56-60 | NaN | NaN | NaN | NaN | Yes |
1 | 2 | Voluntary Early Retirement (VER) | 08/2012 | Not Stated | Not Stated | Public Servant | AO5-AO7 | Central Office | Corporate Strategy and Peformance | Permanent Full-time | ... | N | N | N | Male | 56-60 | NaN | NaN | NaN | NaN | NaN |
2 | 3 | Voluntary Early Retirement (VER) | 05/2012 | 2011 | 2011 | Schools Officer | NaN | Central Office | Education Queensland | Permanent Full-time | ... | N | N | N | Male | 61 or older | NaN | NaN | NaN | NaN | NaN |
3 | 4 | Resignation-Other reasons | 05/2012 | 2005 | 2006 | Teacher | Primary | Central Queensland | NaN | Permanent Full-time | ... | A | N | A | Female | 36-40 | NaN | NaN | NaN | NaN | NaN |
4 | 5 | Age Retirement | 05/2012 | 1970 | 1989 | Head of Curriculum/Head of Special Education | NaN | South East | NaN | Permanent Full-time | ... | N | A | M | Female | 61 or older | NaN | NaN | NaN | NaN | NaN |
5 rows × 56 columns
The data from the DETE survey consists of 822 rows and 56 columns. Clearly the columns are more than we will need to answer the questions set out in this analyses. As such, we'll keep columns that may be useful in answering our question. We'll print out the columns to select certain columns of interest and keep the relevant columns.
df1.columns
Index(['ID', 'SeparationType', 'Cease Date', 'DETE Start Date', 'Role Start Date', 'Position', 'Classification', 'Region', 'Business Unit', 'Employment Status', 'Career move to public sector', 'Career move to private sector', 'Interpersonal conflicts', 'Job dissatisfaction', 'Dissatisfaction with the department', 'Physical work environment', 'Lack of recognition', 'Lack of job security', 'Work location', 'Employment conditions', 'Maternity/family', 'Relocation', 'Study/Travel', 'Ill Health', 'Traumatic incident', 'Work life balance', 'Workload', 'None of the above', 'Professional Development', 'Opportunities for promotion', 'Staff morale', 'Workplace issue', 'Physical environment', 'Worklife balance', 'Stress and pressure support', 'Performance of supervisor', 'Peer support', 'Initiative', 'Skills', 'Coach', 'Career Aspirations', 'Feedback', 'Further PD', 'Communication', 'My say', 'Information', 'Kept informed', 'Wellness programs', 'Health & Safety', 'Gender', 'Age', 'Aboriginal', 'Torres Strait', 'South Sea', 'Disability', 'NESB'], dtype='object')
#From the above columns, the following were selected based on their relevancy to the dependent
#variabe (dissatisfaction) and/or independent variables (service year, gender, age, and seperation type):
columns = ['SeparationType', 'Cease Date', 'DETE Start Date',
'Interpersonal conflicts',
'Job dissatisfaction', 'Dissatisfaction with the department',
'Physical work environment', 'Lack of recognition',
'Lack of job security', 'Work location', 'Employment conditions',
'Work life balance', 'Workload',
'Opportunities for promotion', 'Staff morale', 'Workplace issue',
'Physical environment', 'Worklife balance',
'Stress and pressure support', 'Performance of supervisor',
'Peer support', 'Gender','Age']
dete = df1[columns].copy()
Let's now check out the condensed dataframe.
dete.sample(2)
SeparationType | Cease Date | DETE Start Date | Interpersonal conflicts | Job dissatisfaction | Dissatisfaction with the department | Physical work environment | Lack of recognition | Lack of job security | Work location | ... | Opportunities for promotion | Staff morale | Workplace issue | Physical environment | Worklife balance | Stress and pressure support | Performance of supervisor | Peer support | Gender | Age | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
252 | Age Retirement | 2012 | 2011 | False | False | False | False | False | False | False | ... | N | N | A | A | N | N | A | A | Male | 61 or older |
314 | Resignation-Other employer | 2013 | 2001 | False | False | False | False | False | False | False | ... | N | N | N | D | SA | A | N | A | Female | 36-40 |
2 rows × 23 columns
With the potential columns saved and kept into a new dataframe, let's check out the general information from it using the .info() Pandas method--displays shape, null values, types of columns.
dete.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 822 entries, 0 to 821 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SeparationType 822 non-null object 1 Cease Date 822 non-null object 2 DETE Start Date 822 non-null object 3 Interpersonal conflicts 822 non-null bool 4 Job dissatisfaction 822 non-null bool 5 Dissatisfaction with the department 822 non-null bool 6 Physical work environment 822 non-null bool 7 Lack of recognition 822 non-null bool 8 Lack of job security 822 non-null bool 9 Work location 822 non-null bool 10 Employment conditions 822 non-null bool 11 Work life balance 822 non-null bool 12 Workload 822 non-null bool 13 Opportunities for promotion 735 non-null object 14 Staff morale 816 non-null object 15 Workplace issue 788 non-null object 16 Physical environment 817 non-null object 17 Worklife balance 815 non-null object 18 Stress and pressure support 810 non-null object 19 Performance of supervisor 813 non-null object 20 Peer support 812 non-null object 21 Gender 798 non-null object 22 Age 811 non-null object dtypes: bool(10), object(13) memory usage: 91.6+ KB
Our dataframe consists of 822 rows, 23 columns with roughly half of the columns being of type bool and the other of object type. In additon, relatviely low number of null vlaues have been identified. We'll make sure to check out the null values in the preprocessing stage as well change the data types of certain columns such as the 'Cease Date' one given that it will make more sense of having it as a integer or Datetime Pandas object.
dete.describe(include='all').transpose()
count | unique | top | freq | |
---|---|---|---|---|
SeparationType | 822 | 9 | Age Retirement | 285 |
Cease Date | 822 | 25 | 2012 | 344 |
DETE Start Date | 822 | 51 | Not Stated | 73 |
Interpersonal conflicts | 822 | 2 | False | 788 |
Job dissatisfaction | 822 | 2 | False | 733 |
Dissatisfaction with the department | 822 | 2 | False | 761 |
Physical work environment | 822 | 2 | False | 806 |
Lack of recognition | 822 | 2 | False | 765 |
Lack of job security | 822 | 2 | False | 794 |
Work location | 822 | 2 | False | 795 |
Employment conditions | 822 | 2 | False | 788 |
Work life balance | 822 | 2 | False | 605 |
Workload | 822 | 2 | False | 735 |
Opportunities for promotion | 735 | 6 | A | 242 |
Staff morale | 816 | 6 | A | 335 |
Workplace issue | 788 | 6 | A | 357 |
Physical environment | 817 | 6 | A | 467 |
Worklife balance | 815 | 6 | A | 359 |
Stress and pressure support | 810 | 6 | A | 342 |
Performance of supervisor | 813 | 6 | A | 349 |
Peer support | 812 | 6 | A | 401 |
Gender | 798 | 2 | Female | 573 |
Age | 811 | 10 | 61 or older | 222 |
From the output, retirement is the most common reason for DETE employees leaving their jobs with most frequent age being above 61 years or older. Not suprisingly, most reported levels of dissatsifaction seem to be relatively unusual as indicated by the large amounts of False counts across the above columns. Lastly, Females seemed to represent the majority of the employees who have left their jobs at DETE.
Let's now shift our attention towards the TAFE survey dataset. We'll take the same steps that we did with the DETE dataframe by printing it's shape and first five rows.
print(df2.shape)
df2.head()
(702, 72)
Record ID | Institute | WorkArea | CESSATION YEAR | Reason for ceasing employment | Contributing Factors. Career Move - Public Sector | Contributing Factors. Career Move - Private Sector | Contributing Factors. Career Move - Self-employment | Contributing Factors. Ill Health | Contributing Factors. Maternity/Family | ... | Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination? | Workplace. Topic:Does your workplace promote and practice the principles of employment equity? | Workplace. Topic:Does your workplace value the diversity of its employees? | Workplace. Topic:Would you recommend the Institute as an employer to others? | Gender. What is your Gender? | CurrentAge. Current Age | Employment Type. Employment Type | Classification. Classification | LengthofServiceOverall. Overall Length of Service at Institute (in years) | LengthofServiceCurrent. Length of Service at current workplace (in years) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6.341330e+17 | Southern Queensland Institute of TAFE | Non-Delivery (corporate) | 2010.0 | Contract Expired | NaN | NaN | NaN | NaN | NaN | ... | Yes | Yes | Yes | Yes | Female | 26 30 | Temporary Full-time | Administration (AO) | 1-2 | 1-2 |
1 | 6.341337e+17 | Mount Isa Institute of TAFE | Non-Delivery (corporate) | 2010.0 | Retirement | - | - | - | - | - | ... | Yes | Yes | Yes | Yes | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 6.341388e+17 | Mount Isa Institute of TAFE | Delivery (teaching) | 2010.0 | Retirement | - | - | - | - | - | ... | Yes | Yes | Yes | Yes | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 6.341399e+17 | Mount Isa Institute of TAFE | Non-Delivery (corporate) | 2010.0 | Resignation | - | - | - | - | - | ... | Yes | Yes | Yes | Yes | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 6.341466e+17 | Southern Queensland Institute of TAFE | Delivery (teaching) | 2010.0 | Resignation | - | Career Move - Private Sector | - | - | - | ... | Yes | Yes | Yes | Yes | Male | 41 45 | Permanent Full-time | Teacher (including LVT) | 3-4 | 3-4 |
5 rows × 72 columns
Like the dete dataset there seems to be certain columns that will not be relevant for our analyses and as such we'll be selecing to work with only relevant ones.
df2.columns
Index(['Record ID', 'Institute', 'WorkArea', 'CESSATION YEAR', 'Reason for ceasing employment', 'Contributing Factors. Career Move - Public Sector ', 'Contributing Factors. Career Move - Private Sector ', 'Contributing Factors. Career Move - Self-employment', 'Contributing Factors. Ill Health', 'Contributing Factors. Maternity/Family', 'Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction', 'Contributing Factors. Interpersonal Conflict', 'Contributing Factors. Study', 'Contributing Factors. Travel', 'Contributing Factors. Other', 'Contributing Factors. NONE', 'Main Factor. Which of these was the main factor for leaving?', 'InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction', 'InstituteViews. Topic:2. I was given access to skills training to help me do my job better', 'InstituteViews. Topic:3. I was given adequate opportunities for personal development', 'InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL%', 'InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had', 'InstituteViews. Topic:6. The organisation recognised when staff did good work', 'InstituteViews. Topic:7. Management was generally supportive of me', 'InstituteViews. Topic:8. Management was generally supportive of my team', 'InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me', 'InstituteViews. Topic:10. Staff morale was positive within the Institute', 'InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly', 'InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently', 'InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly', 'WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit', 'WorkUnitViews. Topic:15. I worked well with my colleagues', 'WorkUnitViews. Topic:16. My job was challenging and interesting', 'WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work', 'WorkUnitViews. Topic:18. I had sufficient contact with other people in my job', 'WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job', 'WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job', 'WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT]', 'WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job', 'WorkUnitViews. Topic:23. My job provided sufficient variety', 'WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job', 'WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction', 'WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance', 'WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area', 'WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date', 'WorkUnitViews. Topic:29. There was adequate communication between staff in my unit', 'WorkUnitViews. Topic:30. Staff morale was positive within my work unit', 'Induction. Did you undertake Workplace Induction?', 'InductionInfo. Topic:Did you undertake a Corporate Induction?', 'InductionInfo. Topic:Did you undertake a Institute Induction?', 'InductionInfo. Topic: Did you undertake Team Induction?', 'InductionInfo. Face to Face Topic:Did you undertake a Corporate Induction; Category:How it was conducted?', 'InductionInfo. On-line Topic:Did you undertake a Corporate Induction; Category:How it was conducted?', 'InductionInfo. Induction Manual Topic:Did you undertake a Corporate Induction?', 'InductionInfo. Face to Face Topic:Did you undertake a Institute Induction?', 'InductionInfo. On-line Topic:Did you undertake a Institute Induction?', 'InductionInfo. Induction Manual Topic:Did you undertake a Institute Induction?', 'InductionInfo. Face to Face Topic: Did you undertake Team Induction; Category?', 'InductionInfo. On-line Topic: Did you undertake Team Induction?process you undertook and how it was conducted.]', 'InductionInfo. Induction Manual Topic: Did you undertake Team Induction?', 'Workplace. Topic:Did you and your Manager develop a Performance and Professional Development Plan (PPDP)?', 'Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?', 'Workplace. Topic:Does your workplace promote and practice the principles of employment equity?', 'Workplace. Topic:Does your workplace value the diversity of its employees?', 'Workplace. Topic:Would you recommend the Institute as an employer to others?', 'Gender. What is your Gender?', 'CurrentAge. Current Age', 'Employment Type. Employment Type', 'Classification. Classification', 'LengthofServiceOverall. Overall Length of Service at Institute (in years)', 'LengthofServiceCurrent. Length of Service at current workplace (in years)'], dtype='object')
#From the above columns, the following were selected based on their relevancy to the dependent
#variabe (dissatisfaction) and/or independent variables (service year, gender, age, and seperation type):
columns =['CESSATION YEAR',
'Reason for ceasing employment',
'Contributing Factors. Dissatisfaction',
'Contributing Factors. Job Dissatisfaction',
'Contributing Factors. Interpersonal Conflict',
'Main Factor. Which of these was the main factor for leaving?',
'Gender. What is your Gender?', 'CurrentAge. Current Age',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)',]
tafe =df2[columns].copy()
Let's get a quick overview of the general information of the newly condensed dataframe.
tafe.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 702 entries, 0 to 701 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CESSATION YEAR 695 non-null float64 1 Reason for ceasing employment 701 non-null object 2 Contributing Factors. Dissatisfaction 437 non-null object 3 Contributing Factors. Job Dissatisfaction 437 non-null object 4 Contributing Factors. Interpersonal Conflict 437 non-null object 5 Main Factor. Which of these was the main factor for leaving? 113 non-null object 6 Gender. What is your Gender? 596 non-null object 7 CurrentAge. Current Age 596 non-null object 8 LengthofServiceOverall. Overall Length of Service at Institute (in years) 596 non-null object dtypes: float64(1), object(8) memory usage: 49.5+ KB
The TAFE dataframe consists of 702 rows and 9 columns with all but one being of object type. A relatively large amount of null values; the Main Factor column missing a large section of it's data. Possible reasons may be due to it's values being reflected in the other columns such as the dissatisfactions ones.
Based on the available data, let's check out the descriptive statistics for our columns.
tafe.describe(include='all').transpose()
count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|
CESSATION YEAR | 695 | NaN | NaN | NaN | 2011.42 | 0.905977 | 2009 | 2011 | 2011 | 2012 | 2013 |
Reason for ceasing employment | 701 | 6 | Resignation | 340 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Contributing Factors. Dissatisfaction | 437 | 2 | - | 371 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Contributing Factors. Job Dissatisfaction | 437 | 2 | - | 360 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Contributing Factors. Interpersonal Conflict | 437 | 2 | - | 410 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Main Factor. Which of these was the main factor for leaving? | 113 | 11 | Dissatisfaction with %[Institute]Q25LBL% | 23 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Gender. What is your Gender? | 596 | 2 | Female | 389 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
CurrentAge. Current Age | 596 | 9 | 56 or older | 162 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
LengthofServiceOverall. Overall Length of Service at Institute (in years) | 596 | 7 | Less than 1 year | 147 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
From the output, about half of the data comes for the years 2011 to 2012 with resignations as the most common reason for ceasing employement amongst former TAFE employees. In addition, from the available values in the main factor for leaving column, dissatisfaction with the institution seems to be the most common factor leading to termination of employment. Lastly, similar to the DETE employees are mostly represented as being Females, in their later years (56 or older) with the top service dates being less than a year.
From the output, retirement is the most common reason for DETE employees leaving their jobs with most frequent age being above 61 years or older. Not suprisingly, most reported levels of dissatsifaction seem to be relatively unusual as indicated by the large amounts of False counts across the above columns. Lastly, Females seemed to represent the majority of the employees who have left their jobs at DETE.From the output, about half of the data comes for the years 2011 to 2012 with resignations as the most common reason for ceasing employement amongst former TAFE employees. In addition, from the available values in the main factor for leaving column, dissatisfaction with the institution seems to be the most common factor leading to termination of employment.Lastly, similar to the DETE employees are mostly represented as being Females, in their later years (56 or older) with the top service dates being less than a year. Our dataframe consists of 822 rows, 23 columns with roughly half of the columns being of type bool and the other of object type. In additon, relatviely low number of null vlaues have been identified. We'll make sure to check out the null values in the preprocessing stage as well change the data types of certain columns such as the 'Cease Date' one given that it will make more sense of having it as a integer or Datetime Pandas object.The TAFE dataframe consists of 702 rows and 9 columns with all but one being of object type. A relatively large amount of null values; the Main Factor column missing a large section of it's data. Possible reasons may be due to it's values being reflected in the other columns such as the dissatisfactions ones.
def missing_values(data):
miss = {}
if len(data.columns) <= 0:
print('Something went wrong. No columns detected.')
for i in data.columns: #iterate through all columns in dataset to search for each defined missing value
if data[i].isnull().any():
miss[i] = 'Null values ' + str(
round(
len(data[data[i].isnull() == True]) / len(data) * 100, 2))
if len(data[data[i] == 0]) > 1:
miss[i] = 'Zero values ' + str(
round(len(data[data[i] == 0]) / len(data) * 100, 2))
try:
if len(np.where(data[i] == '')) > 1:
miss[i] = 'Empty strings ' + str(
round(len(data[data[i] == '']) / len(data) * 100, 2))
except:
pass
if np.issubdtype(data[i].dtype,
np.number) and len(data[data[i] < 0]) > 1:
miss[i] = 'Negative values ' + str(
round(len(data[data[i] < 0]) / len(data) * 100, 2))
try:
if len(data[data[i].str.lower.isin(['n/a', 'na', 'an', 'n\a', '?'
])]) > 1:
miss[i] = 'other null values present'
except:
if len(data[data[i].isin(['N/A', '?', 'na', 'Na', 'N/a'])]) > 1:
miss[i] = 'other null values present'
return sorted(miss.items(),
key=lambda x: float(miss[x[0]].split(' ')[-1]),
reverse=True)
def find_anomalies(df, col, STD):
# Set upper and lower limit to 3 standard deviation
random_data_std = np.std(df[col])
random_data_mean = np.mean(df[col])
anomaly_cut_off = random_data_std * STD
lower_limit = float(random_data_mean - anomaly_cut_off)
upper_limit = float(random_data_mean + anomaly_cut_off)
print('Lower limit:', lower_limit)
print('Upper limit:', upper_limit)
# Generate outliers
outliers = df[(df[col] < lower_limit) | (df[col] > upper_limit)]
return outliers
To better work with the dataset's we'll rename them to be all lowercases and replace spaces with underscores.
tafe.columns = [i.lower().replace(' ', '_') for i in tafe]
dete.columns = [i.lower().replace(' ', '_') for i in dete]
Our next step will be to check for and determine what to do with missing values in both dataframes. We'll make use of the missing_values defined earlier to check for null values. Let's start with the dete dataframe.
miss =missing_values(dete)
miss
C:\Users\garci\Anaconda3\lib\site-packages\pandas\core\ops\array_ops.py:253: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
[('physical_work_environment', 'Zero values 98.05'), ('work_location', 'Zero values 96.72'), ('lack_of_job_security', 'Zero values 96.59'), ('interpersonal_conflicts', 'Zero values 95.86'), ('employment_conditions', 'Zero values 95.86'), ('lack_of_recognition', 'Zero values 93.07'), ('dissatisfaction_with_the_department', 'Zero values 92.58'), ('workload', 'Zero values 89.42'), ('job_dissatisfaction', 'Zero values 89.17'), ('work_life_balance', 'Zero values 73.6'), ('opportunities_for_promotion', 'Null values 10.58'), ('workplace_issue', 'Null values 4.14'), ('gender', 'Null values 2.92'), ('stress_and_pressure_support', 'Null values 1.46'), ('age', 'Null values 1.34'), ('peer_support', 'Null values 1.22'), ('performance_of_supervisor', 'Null values 1.09'), ('worklife_balance', 'Null values 0.85'), ('staff_morale', 'Null values 0.73'), ('physical_environment', 'Null values 0.61')]
The output reveals various columns with a high percentage of 0 values; however, this is due to the True/False values seen earlier in the general dataframe exploration stage. When looking at the null values in the dataframe, we can observe that they constitute .61% up to 10.58% of the given columns.
So now, we'll filter out the columns identified as having zero values and save those with nulls values to a new variable.
columns = [i[0] for i in miss if i[1].split(' ')[0] == 'Null']
Given that the dataframe has 10% of missing values, we'll impute them with a central tendency. To aide us in that decision, we'll use a for loop to print out the unique values and type of each relevant column.
for i in columns:
print(i, dete[i].apply(type).unique(), dete[i].unique())
opportunities_for_promotion [<class 'str'> <class 'float'>] ['A' 'N' 'SD' nan 'M' 'SA' 'D'] workplace_issue [<class 'str'> <class 'float'>] ['N' 'D' 'A' 'SD' 'SA' nan 'M'] gender [<class 'str'> <class 'float'>] ['Male' 'Female' nan] stress_and_pressure_support [<class 'str'> <class 'float'>] ['A' 'N' 'D' 'SA' 'SD' 'M' nan] age [<class 'str'> <class 'float'>] ['56-60' '61 or older' '36-40' '41-45' '31-35' '46-50' '51-55' '21-25' '26-30' nan '20 or younger'] peer_support [<class 'str'> <class 'float'>] ['A' 'N' nan 'SA' 'D' 'SD' 'M'] performance_of_supervisor [<class 'str'> <class 'float'>] ['A' 'N' 'D' 'SA' 'SD' 'M' nan] worklife_balance [<class 'str'> <class 'float'>] ['A' 'N' 'D' 'SA' 'SD' 'M' nan] staff_morale [<class 'str'> <class 'float'>] ['N' 'SD' 'A' 'D' 'SA' 'M' nan] physical_environment [<class 'str'> <class 'float'>] ['N' 'A' 'D' 'SD' 'SA' 'M' nan]
So it seems that the null containing columns are all categorical. As such, we'll impute using the mode.
for i in columns:
dete[i] = dete[i].replace('nan', np.nan).fillna(dete[i].mode()[0])
#sanity check
dete.isnull().sum()
separationtype 0 cease_date 0 dete_start_date 0 interpersonal_conflicts 0 job_dissatisfaction 0 dissatisfaction_with_the_department 0 physical_work_environment 0 lack_of_recognition 0 lack_of_job_security 0 work_location 0 employment_conditions 0 work_life_balance 0 workload 0 opportunities_for_promotion 0 staff_morale 0 workplace_issue 0 physical_environment 0 worklife_balance 0 stress_and_pressure_support 0 performance_of_supervisor 0 peer_support 0 gender 0 age 0 dtype: int64
We'll repeat the same process for the TAFE dataframe.
miss2 = missing_values(tafe)
From the output, the column with the 89% null values will be dropped.
tafe.drop(columns=miss2[0][0], inplace=True)
miss2 = miss2[1:]
miss2
[('contributing_factors._dissatisfaction', 'Null values 37.75'), ('contributing_factors._job_dissatisfaction', 'Null values 37.75'), ('contributing_factors._interpersonal_conflict', 'Null values 37.75'), ('gender._what_is_your_gender?', 'Null values 15.1'), ('currentage._current_age', 'Null values 15.1'), ('lengthofserviceoverall._overall_length_of_service_at_institute_(in_years)', 'Null values 15.1'), ('cessation_year', 'Null values 1.0'), ('reason_for_ceasing_employment', 'Null values 0.14')]
for i in miss2:
print(i[0], tafe[i[0]].apply(type).unique(), tafe[i[0]].unique(),
'\n') #string and float types in there
contributing_factors._dissatisfaction [<class 'float'> <class 'str'>] [nan '-' 'Contributing Factors. Dissatisfaction '] contributing_factors._job_dissatisfaction [<class 'float'> <class 'str'>] [nan '-' 'Job Dissatisfaction'] contributing_factors._interpersonal_conflict [<class 'float'> <class 'str'>] [nan '-' 'Interpersonal Conflict'] gender._what_is_your_gender? [<class 'str'> <class 'float'>] ['Female' nan 'Male'] currentage._current_age [<class 'str'> <class 'float'>] ['26 30' nan '41 45' '56 or older' '20 or younger' '46 50' '36 40' '21 25' '51-55' '31 35'] lengthofserviceoverall._overall_length_of_service_at_institute_(in_years) [<class 'str'> <class 'float'>] ['1-2' nan '3-4' '7-10' 'Less than 1 year' '11-20' '5-6' 'More than 20 years'] cessation_year [<class 'float'>] [2010. 2009. nan 2011. 2012. 2013.] reason_for_ceasing_employment [<class 'str'> <class 'float'>] ['Contract Expired' 'Retirement' 'Resignation' 'Retrenchment/ Redundancy' 'Termination' 'Transfer' nan]
We can observe that all but one (cessation_year) of the columns contain categorical values. Notice also the '-' value in some of the columns. This can be represented as a False value as the only other valid value represents the name of the columns thus implying a True occurrence. In addition, knowing that this is survey data, we can assume that the null values are indication for False occurrences. Therefore, we'll impute the first three columns (with '-' values) using a 0 value. The other categorical values we'll impute using the mode and lastly, and impute using the median for the cessation_year column.
bool_cat = [i[0] for i in miss2[:3]]
miss2 = miss2[3:]
tafe[bool_cat] = tafe[bool_cat].fillna(0)
Imputing using the mode for the categorical columns.
for i in miss2:
if i[0] == 'cessation_year': continue
else:
tafe[i[0]] = tafe[i[0]].replace('nan',
np.nan).fillna(tafe[i[0]].mode()[0])
Then impute using the median for the cessation_year column.
tafe['cessation_year'].fillna(tafe['cessation_year'].median(), inplace=True)
#sanity check
tafe.isnull().sum()
cessation_year 0 reason_for_ceasing_employment 0 contributing_factors._dissatisfaction 0 contributing_factors._job_dissatisfaction 0 contributing_factors._interpersonal_conflict 0 gender._what_is_your_gender? 0 currentage._current_age 0 lengthofserviceoverall._overall_length_of_service_at_institute_(in_years) 0 dtype: int64
We'll convert the years to integer type as it'll be needed further down the line.
tafe['cessation_year'] =tafe['cessation_year'].apply(lambda x: int(x)) #change given null check
Let's quickly look at the distribution of the years in the dataframe.
tafe['cessation_year'].value_counts()
2011 275 2012 235 2010 103 2013 85 2009 4 Name: cessation_year, dtype: int64
Most of the datas stem from 2011 and 2012 with some in 2010 and very data values in 2013 and 2009. Let's check if this is also reflected in the DETE dataframe.
dete['cease_date'].value_counts()
2012 344 2013 200 01/2014 43 12/2013 40 09/2013 34 Not Stated 34 06/2013 27 07/2013 22 10/2013 20 11/2013 16 08/2013 12 05/2013 7 05/2012 6 07/2014 2 04/2014 2 08/2012 2 04/2013 2 02/2014 2 09/2010 1 11/2012 1 07/2006 1 2010 1 09/2014 1 07/2012 1 2014 1 Name: cease_date, dtype: int64
Hmmm. The column needs some preprocessing before we can proceed to work with it. We'll use a lambda function that retrieves the relevant year and turn the "Not stated" value into a 0 of which we'll impute using the median.
dete['cease_date'] = dete['cease_date'].apply(
lambda x: int(x.split('/')[-1]) if x.split('/')[-1].isdigit() else 0)
dete.loc[dete['cease_date'] == 0, 'cease_date'] = dete['cease_date'].median()
Now let's compare how the cessation years differ amongst the two dataframes.
print(dete['cease_date'].value_counts(), tafe['cessation_year'].value_counts())
2013.0 414 2012.0 354 2014.0 51 2010.0 2 2006.0 1 Name: cease_date, dtype: int64 2011 275 2012 235 2010 103 2013 85 2009 4 Name: cessation_year, dtype: int64
Most of the years from the TAFE dataframe come from 2012 with some in 2010 and very few in 2013 and 2009; years amongst the two dataframe vary but not by much.
Lastly,we'll check for the start year in the DETE dataframe; this won't apply to TAFE as it doesn't contain such columns.
pd.DataFrame(dete['dete_start_date'].value_counts(ascending=False)).transpose()
Not Stated | 2011 | 2007 | 2008 | 2010 | 2012 | 2009 | 2006 | 1970 | 2013 | ... | 1983 | 1973 | 1985 | 1987 | 1982 | 1963 | 1968 | 1967 | 1966 | 1965 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
dete_start_date | 73 | 40 | 34 | 31 | 27 | 27 | 24 | 23 | 21 | 21 | ... | 9 | 8 | 8 | 7 | 4 | 4 | 3 | 2 | 1 | 1 |
1 rows × 51 columns
The start years for the employees in the DETE org. range from 1965 to 2011. Looking at the low count values, this doesn't seem unusual as there is a likelihood of an old staff that have been loyal to the organization since their start date. The mode, 'Not Stated', will need to be imputed using the median for the purpose of answering the business problem.
#print the median
dete.loc[dete['dete_start_date'] == 'Not Stated',
'dete_start_date'] = dete['dete_start_date'].str.replace(
'Not Stated', '0').apply(int).median()
Before proceeding to the next data preprocess step, we'll quickly rename the columns to make them easier to work with.
tafe.columns
Index(['cessation_year', 'reason_for_ceasing_employment', 'contributing_factors._dissatisfaction', 'contributing_factors._job_dissatisfaction', 'contributing_factors._interpersonal_conflict', 'gender._what_is_your_gender?', 'currentage._current_age', 'lengthofserviceoverall._overall_length_of_service_at_institute_(in_years)'], dtype='object')
tafe.columns = ['cease_date', 'seperationtype',
'dissatisfaction',
'job_dissatisfaction',
'interpersonal_conflict',
'gender', 'age',
'institute_service']
Let's take a look at the value counts for the types of separation.
tafe['seperationtype'].value_counts()
Resignation 341 Contract Expired 127 Retrenchment/ Redundancy 104 Retirement 82 Transfer 25 Termination 23 Name: seperationtype, dtype: int64
Interestingly enough, the most common type of seperation for TAFE employees come from resignation. Keeping the business problem in mind, we'll take rows with resignation as the reason for seperation.
tafe = tafe.query("seperationtype =='Resignation'").copy()
tafe.reset_index(drop=True, inplace=True)
tafe.head()
cease_date | seperationtype | dissatisfaction | job_dissatisfaction | interpersonal_conflict | gender | age | institute_service | |
---|---|---|---|---|---|---|---|---|
0 | 2010 | Resignation | - | - | - | Female | 56 or older | Less than 1 year |
1 | 2010 | Resignation | - | - | - | Male | 41 45 | 3-4 |
2 | 2010 | Resignation | - | - | - | Female | 56 or older | 7-10 |
3 | 2010 | Resignation | - | - | - | Male | 20 or younger | 3-4 |
4 | 2010 | Resignation | - | - | - | Male | 46 50 | 3-4 |
Let's check how this looks like for the DETE employees.
dete['separationtype'].value_counts()
Age Retirement 285 Resignation-Other reasons 150 Resignation-Other employer 91 Resignation-Move overseas/interstate 70 Voluntary Early Retirement (VER) 67 Ill Health Retirement 61 Other 49 Contract Expired 34 Termination 15 Name: separationtype, dtype: int64
Here, retirement is the most common separation reason but this is tied with resignation reasons if we combine the three resignation values. We'll merge three resignation-relevant values using a lambda function.
dete['seperationtype'] = dete['separationtype'].apply(
lambda x: x.split('-')[0] if x.startswith('Resi') == True else x)
Then filter for the relevant values.
dete = dete.query("seperationtype =='Resignation'").copy()
dete.reset_index(drop=True, inplace=True)
dete.head()
separationtype | cease_date | dete_start_date | interpersonal_conflicts | job_dissatisfaction | dissatisfaction_with_the_department | physical_work_environment | lack_of_recognition | lack_of_job_security | work_location | ... | staff_morale | workplace_issue | physical_environment | worklife_balance | stress_and_pressure_support | performance_of_supervisor | peer_support | gender | age | seperationtype | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Resignation-Other reasons | 2012.0 | 2005 | False | False | False | False | False | False | False | ... | N | N | A | A | N | N | A | Female | 36-40 | Resignation |
1 | Resignation-Other reasons | 2012.0 | 1994 | False | False | False | False | False | False | False | ... | SD | D | SD | D | D | A | A | Female | 41-45 | Resignation |
2 | Resignation-Other reasons | 2012.0 | 2009 | False | False | False | False | False | False | False | ... | D | D | N | SD | SD | D | D | Female | 31-35 | Resignation |
3 | Resignation-Other employer | 2012.0 | 1997 | True | True | True | False | False | False | False | ... | SD | SD | A | SD | SD | SD | SD | Female | 46-50 | Resignation |
4 | Resignation-Move overseas/interstate | 2012.0 | 2009 | False | False | False | False | False | False | False | ... | N | D | N | N | N | N | A | Male | 31-35 | Resignation |
5 rows × 24 columns
Moving on, since we're interested in how experience affects resignation, we'll need to make a column for the DETE dataframe as calculated by the difference in their ending and start dates.
dete['institute_service'] = dete['cease_date'].apply(
int) - dete['dete_start_date'].apply(int)
We'll then move on and retrieve columns with boolan values to create a dissatisfied column where: at least one dissatisfied column is identified as True.
dict(dete.dtypes[dete.dtypes ==bool]).keys()
dict_keys(['interpersonal_conflicts', 'job_dissatisfaction', 'dissatisfaction_with_the_department', 'physical_work_environment', 'lack_of_recognition', 'lack_of_job_security', 'work_location', 'employment_conditions', 'work_life_balance', 'workload'])
params = [
'interpersonal_conflicts', 'job_dissatisfaction',
'dissatisfaction_with_the_department', 'physical_work_environment',
'lack_of_recognition', 'lack_of_job_security', 'work_location',
'employment_conditions', 'work_life_balance', 'workload'
]
dete['dissatisfied'] = dete[params].any(1, skipna=False)
dete['dissatisfied'].value_counts()
False 157 True 154 Name: dissatisfied, dtype: int64
We'll do the same with the TAFE organization but this time using a function that accounts for the '-' value.
def update_vals(element):
if element == '-':
return False
elif pd.isnull(element):
return np.nan
else:
return True
tafe.columns
Index(['cease_date', 'seperationtype', 'dissatisfaction', 'job_dissatisfaction', 'interpersonal_conflict', 'gender', 'age', 'institute_service'], dtype='object')
param = ['dissatisfaction', 'job_dissatisfaction', 'interpersonal_conflict']
# Creating 'dissatisfied' column for tafe_resignations:
tafe[param] = (tafe[param].applymap(update_vals))
tafe['dissatisfied'] = tafe[param].any(1, skipna=False)
tafe['dissatisfied'].value_counts()
False 235 True 106 Name: dissatisfied, dtype: int64
With those changes done, we'll proceed to join the dataframe--to make it easier to work with. We'll create one last column indicating the institute before creating the dataframe.
dete['institute'] = 'DETE'
tafe['institute'] = 'TAFE'
combined = pd.concat([dete, tafe], ignore_index=True, sort=True)
Before we can proceed, we'll check for null values.
combined.isnull().sum().sort_values(ascending=False) / len(combined) * 100
workplace_issue 52.300613 performance_of_supervisor 52.300613 dete_start_date 52.300613 dissatisfaction_with_the_department 52.300613 employment_conditions 52.300613 interpersonal_conflicts 52.300613 lack_of_job_security 52.300613 workload 52.300613 opportunities_for_promotion 52.300613 peer_support 52.300613 lack_of_recognition 52.300613 physical_environment 52.300613 separationtype 52.300613 staff_morale 52.300613 worklife_balance 52.300613 stress_and_pressure_support 52.300613 physical_work_environment 52.300613 work_location 52.300613 work_life_balance 52.300613 interpersonal_conflict 47.699387 dissatisfaction 47.699387 dissatisfied 0.000000 cease_date 0.000000 institute 0.000000 gender 0.000000 institute_service 0.000000 job_dissatisfaction 0.000000 seperationtype 0.000000 age 0.000000 dtype: float64
There are many non-null vales present and as such we'll need to drop them from the dataframe. However, in this case, we'll only tolerate columns with at least 50% of non-null values. We set this threshold as it'll keep the columns of interest (see table above for irrelevant variables with 50% null values ).
thresh = len(combined) /2
thresh
326.0
combined_updated = combined.dropna(thresh = thresh, axis =1)
combined_updated.reset_index(drop=True, inplace=True)
combined_updated.sample(3)
age | cease_date | dissatisfaction | dissatisfied | gender | institute | institute_service | interpersonal_conflict | job_dissatisfaction | seperationtype | |
---|---|---|---|---|---|---|---|---|---|---|
433 | 36 40 | 2011.0 | False | False | Female | TAFE | Less than 1 year | False | False | Resignation |
422 | 21 25 | 2011.0 | False | False | Male | TAFE | 1-2 | False | False | Resignation |
586 | 56 or older | 2012.0 | False | True | Female | TAFE | 3-4 | False | True | Resignation |
To end this step, we'll keep the columns that pertain to the parameter of interest: dissatifaction.
combined_updated.columns
Index(['age', 'cease_date', 'dissatisfaction', 'dissatisfied', 'gender', 'institute', 'institute_service', 'interpersonal_conflict', 'job_dissatisfaction', 'seperationtype'], dtype='object')
combined_updated =combined_updated[['age', 'cease_date', 'dissatisfied', 'gender', 'institute', 'institute_service']]
combined_updated.head()
age | cease_date | dissatisfied | gender | institute | institute_service | |
---|---|---|---|---|---|---|
0 | 36-40 | 2012.0 | False | Female | DETE | 7 |
1 | 41-45 | 2012.0 | True | Female | DETE | 18 |
2 | 31-35 | 2012.0 | False | Female | DETE | 3 |
3 | 46-50 | 2012.0 | True | Female | DETE | 15 |
4 | 31-35 | 2012.0 | False | Male | DETE | 3 |
combined_updated['institute_service'].unique()
array([7, 18, 3, 15, 14, 5, 30, 32, 39, 17, 9, 6, 1, 35, 38, 36, 19, 4, 26, 10, 8, 2, 0, 23, 13, 16, 12, 21, 20, 24, 33, 22, 28, 49, 11, 41, 27, 42, 25, 29, 34, 31, 'Less than 1 year', '3-4', '7-10', '1-2', '11-20', '5-6', 'More than 20 years'], dtype=object)
Before we categorize the rows based on experience, we'll first want to retrieve and convert the years into integer type. For that, given the characters in the institute_service column, we'll use the extract string method to retrieve only the digits in that column.
combined_updated['institute_service'] = combined_updated[
'institute_service'].astype('str').str.extract(r'(\d+)').astype(int)
combined_updated.head()
age | cease_date | dissatisfied | gender | institute | institute_service | |
---|---|---|---|---|---|---|
0 | 36-40 | 2012.0 | False | Female | DETE | 7 |
1 | 41-45 | 2012.0 | True | Female | DETE | 18 |
2 | 31-35 | 2012.0 | False | Female | DETE | 3 |
3 | 46-50 | 2012.0 | True | Female | DETE | 15 |
4 | 31-35 | 2012.0 | False | Male | DETE | 3 |
Lastly, we'll create a function that assigns an experience level based on the institute_service value passed to it. NOTE: The threshold years chosen are based off the DataQuest project instructions.
def check(x):
if x < 3: return 'New'
elif x >= 3 and x < 7: return 'Experienced'
elif x >= 7 and x < 11: return 'Established'
elif x >= 11: return 'Veteran'
elif str(x) == ['nan']: return np.nan
combined_updated['service_cat'] = combined_updated['institute_service'].apply(
lambda x: check(x))
combined_updated.sample(5)
age | cease_date | dissatisfied | gender | institute | institute_service | service_cat | |
---|---|---|---|---|---|---|---|
299 | 36-40 | 2013.0 | False | Female | DETE | 13 | Veteran |
312 | 41 45 | 2010.0 | False | Male | TAFE | 3 | Experienced |
57 | 61 or older | 2012.0 | True | Female | DETE | 14 | Veteran |
442 | 21 25 | 2011.0 | True | Female | TAFE | 1 | New |
319 | 31 35 | 2010.0 | False | Female | TAFE | 11 | Veteran |
Before proceeding, let's quickly check for null values in the dataframe.
combined_updated.isnull().sum()
age 0 cease_date 0 dissatisfied 0 gender 0 institute 0 institute_service 0 service_cat 0 dtype: int64
That was a lot, in this step we accomplished the following:
Absent any null values we'll proceed to answer the first question: Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction?
To do this we'll create a pivot table that shows the share of people who reported dissatisfaction as their reason for their resignation.
pivot = combined_updated.pivot_table(index='service_cat',
values='dissatisfied').sort_values(
'dissatisfied', ascending=False)
Let's visualize the results via a bar graph.
fig = px.bar(pivot,
x=pivot.index,
y=pivot['dissatisfied'],
text=pivot['dissatisfied'])
fig.update_layout(title='Reported Dissatisfaction By Experience Level',
xaxis_title="",
yaxis_title="Percent Dissatisfied",
font=dict(
family="Courier New, monospace",
size=18,
))
fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig.show()
Employees in the later two stages of experience level have more than >10% reported levels of dissatisfaction than employees with relatively low experience. More efforts should be introduced to understand why those with longer years of service are leaving at a higher rate.
For the next question: Are younger employees resigning due to some kind of dissatisfaction? We'll take the similar approach done in the first question.
combined_updated['age'].unique()
array(['36-40', '41-45', '31-35', '46-50', '61 or older', '56-60', '51-55', '21-25', '26-30', '20 or younger', '56 or older', '41 45', '46 50', '36 40', '21 25', '31 35', '26 30'], dtype=object)
Here, given that the first values are all age related, we'll split by '-' and ' ' in order to retrieve the first indicated age value.
combined_updated['age'] = combined_updated['age'].apply(
lambda x: int(x.split('-')[0].split(' ')[0]))
combined_updated['age'].value_counts()
56 106 41 93 46 81 36 73 51 71 26 67 21 62 31 61 61 28 20 10 Name: age, dtype: int64
We'll then make a function that'll assign the age into the following groups:
def age_categories(element):
if element < 21:
return '21 or less'
elif 21 <= element < 30:
return '21-29'
elif 31 <= element < 40:
return '31-39'
elif 41 <= element < 50:
return '41-49'
elif 51 <= element < 60:
return '51-59'
else:
return 'Above 60'
combined_updated['age'] = combined_updated['age'].apply(age_categories)
Lastly, we'll check for null values in case something went wrong.
combined_updated.isnull().sum()
age 0 cease_date 0 dissatisfied 0 gender 0 institute 0 institute_service 0 service_cat 0 dtype: int64
With the age categories formed, we'll create a pivot table to get each groups reported dissatisfaction.
pivot2 = combined_updated.pivot_table(
index='age', values='dissatisfied').sort_values('dissatisfied',
ascending=False)
We'll again, visualize this via a bar graph.
fig = px.bar(pivot2,
x=pivot2.index,
y=pivot2['dissatisfied'],
text=pivot2['dissatisfied'],
color=pivot2.index)
fig.update_layout(title='Reported Dissatisfaction By Experience Level',
xaxis_title="",
yaxis_title="Percent Dissatisfied",
font=dict(
family="Courier New, monospace",
size=18,
),
showlegend=False)
fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig.show()
When we take age into account, reported dissatisfaction levels range from 30% to 46%. From the chart, dissatisfaction levels increases the older the age group an employee lands on. This makes sense as the previous chart reflected experience level as a factor for higher dissatisfaction levels. However, here we get a more clearer view of which age groups may account for such high %: employees above the ages of 51.
Let's check how the reported dissatisfaction levels for each group fare relative to the mean. We'll calculate their Z-score for each value and then apply the mean to retrieve the average Z-score for each age group. The Z-score will inform us us how other groups fare relative to the average dissatisfaction percentage.
combined_updated['Zdissatisfied'] = combined_updated['dissatisfied'].apply(
lambda x: (x - combined_updated['dissatisfied'].mean()) / combined_updated[
'dissatisfied'].std())
#.pivot_table(index='institute', values = 'dissatisfied'))
pivot4 = combined_updated.pivot_table(
index='age', values='Zdissatisfied').sort_values('Zdissatisfied',
ascending=False)
fig = px.bar(pivot4,
x=pivot4.index,
y=pivot4['Zdissatisfied'],
text=pivot4['Zdissatisfied'])
fig.update_layout(
title='Reported Dissatisfaction (Z-scores) By Experience Level',
xaxis_title="",
yaxis_title="Percent Dissatisfied",
font=dict(
family="Courier New, monospace",
size=18,
),
showlegend=False)
fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig.show()
print(combined_updated['dissatisfied'].mean())
0.3987730061349693
Clearly now, employees above the age of 51 have higher reported levels of dissatisfaction relative to the average dissatisfied level (39%). Those in the age groups of 21 to 49 seems to fair slightly better whilst those younger than 21 have the lowest dissatisfaction levels. Organizations should create more efforts to increase the large dissatisfaction levels in the older age groups.
How does dissatisfaction compare between the two institutions? We'll to answer this we'll take the Z-scores calculated earlier and average them for the two institutions.
pivot3 = combined_updated.pivot_table(index='institute',
values='Zdissatisfied').sort_values(
'Zdissatisfied', ascending=False)
#%config InlineBackend.figure_format = 'retina'
fig = px.bar(pivot3,
x=pivot3.index,
y=pivot3['Zdissatisfied'],
text=pivot3['Zdissatisfied'])
colors = ['#EE3224', 'lightsalmon']
fig.update_layout(title='DETE and TAFE Dissatisfaction Z-score',
xaxis_title="",
yaxis_title="Percent Dissatisfied (Z-score)",
font=dict(
family="Courier New, monospace",
size=18,
))
fig.update_traces(marker_color=colors,
texttemplate='%{text: .2f}',
textposition='outside')
fig.show()
print(combined_updated['dissatisfied'].mean())
0.3987730061349693
Relative to the mean (39%), DETE employees on average had 20% higher levels of dissatisfaction (around 50%). On the other hand, TAFE employees have 18% lower dissatisfaction levels relative to the mean. Given these figures, the DETE institute needs far more work in improving it's employees dissatisfaction levels.
Let's dig deeper into this by taking into account both gender and experience level. We'll from our groups based on institute, gender, and service category taking the percent of dissatisfaction for both.
df = combined_updated.groupby(['institute', 'gender',
'service_cat']).mean()['dissatisfied']
df = df.unstack()[['Established', 'Veteran', 'Experienced', 'New']]
Lastly, we'll visualize the output via a side-by-side bar graph.
sns.set_context('talk')
ax= df.plot(kind='bar',
fontsize=12,
rot=0,
color=['#EE3224', '#F78F1E', '#FFC222', '#ff7a69'],figsize=(15,7))
ax.set_xticklabels([' '.join(i) for i in df.index])
plt.legend(fontsize=14)
# Axis styling.
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_color('#DDDDDD')
ax.tick_params(bottom=False, left=False,)
ax.set_axisbelow(True)
ax.yaxis.grid(True, color='#EEEEEE')
ax.xaxis.grid(False)
# Add axis and chart labels.
ax.set_xlabel('Institute and Gender', labelpad=15)
ax.set_ylabel('Percent Dissatisfied', labelpad=15)
ax.set_title(
'Reported Dissatisfaction by Experience Level - Per Institute and Gender',
size=20,
pad=25)
#text annotations
for p in ax.patches:
ax.annotate(str(round(p.get_height(), 2)),
(p.get_x() * 1.005, p.get_height() * 1.02),
fontsize=12)
plt.show()
The chart confirms the previous chart's finding wherein DETE employees have a higher percentage of dissatisfaction in comparison to TAFE employees. In this case, we can see several relationship. First, male employees in the DETE institute are the highest dissatisfied group with those with in the highest experience level entailing 67% to 70% of them reporting dissatisfaction. In addition, female employees in the TAFE institute with the hgihest experience level--Established--report the highest dissatisfaction in comparison to the other TAFE's gender groups and their experience level sub-groups. Lastly, male employees in the lowest experience level for the TAFE institute report the hgihest level of dissatisfaction for it's group.
Further analysis should dive deeper into the reasons for high dissatisfaction levels for older and experience employees, in particular for male DETE employee and female TAFE employees. Lastly, the DETE institute should also investigate into why their reported overall level of dissatisfaction is much higher when compared to the TAFE institute.