Exploring Employee Exit Surveys

Intro.

The purpose of this analyses is to discover potential patterns regarding employee's who have terminated their employment at The Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) insititute in Queensland, Austrlia. The insights will be used as a basis for further analyses as well as for potential recommendations for improvements to reduce employee turnover for the two institutes. For this analyses, we will be using the employee exist survey's released by the insititutes, which can be found here (TAFE) and here (DETE).

The analyses as hypothetically requested by the institutes will be mainly focused on dissatisfaction as a factor of employee turnover. As such, we'll attempt to answer the following questions:

Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction?
Are younger employees resigning due to some kind of dissatisfaction?
How does dissatisfaction compare between the two institutions?

Packages Used¶

In [2]:

import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns
import plotly.offline as pyo
pyo.init_notebook_mode()
import plotly.express as px

C:\Users\garci\Anaconda3\lib\site-packages\statsmodels\tools\_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm

1. Initial Preprocessing and Data Overview¶

Let's start by loading and getting a quick overview of the datasets by printing out the top 5 rows, and their corresponding shape.

In [3]:

df1 = pd.read_csv('dete_survey.csv')
df2 = pd.read_csv('tafe_survey.csv')

In [4]:

print(df1.shape) #tuple of row and columns of dataframe
df1.head() #prints top 5 rows

(822, 56)

Out[4]:

	ID	SeparationType	Cease Date	DETE Start Date	Role Start Date	Position	Classification	Region	Business Unit	Employment Status	...	Kept informed	Wellness programs	Health & Safety	Gender	Age	Aboriginal	Torres Strait	South Sea	Disability	NESB
0	1	Ill Health Retirement	08/2012	1984	2004	Public Servant	A01-A04	Central Office	Corporate Strategy and Peformance	Permanent Full-time	...	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	Yes
1	2	Voluntary Early Retirement (VER)	08/2012	Not Stated	Not Stated	Public Servant	AO5-AO7	Central Office	Corporate Strategy and Peformance	Permanent Full-time	...	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	NaN
2	3	Voluntary Early Retirement (VER)	05/2012	2011	2011	Schools Officer	NaN	Central Office	Education Queensland	Permanent Full-time	...	N	N	N	Male	61 or older	NaN	NaN	NaN	NaN	NaN
3	4	Resignation-Other reasons	05/2012	2005	2006	Teacher	Primary	Central Queensland	NaN	Permanent Full-time	...	A	N	A	Female	36-40	NaN	NaN	NaN	NaN	NaN
4	5	Age Retirement	05/2012	1970	1989	Head of Curriculum/Head of Special Education	NaN	South East	NaN	Permanent Full-time	...	N	A	M	Female	61 or older	NaN	NaN	NaN	NaN	NaN

5 rows × 56 columns

The data from the DETE survey consists of 822 rows and 56 columns. Clearly the columns are more than we will need to answer the questions set out in this analyses. As such, we'll keep columns that may be useful in answering our question. We'll print out the columns to select certain columns of interest and keep the relevant columns.

In [5]:

df1.columns

Out[5]:

Index(['ID', 'SeparationType', 'Cease Date', 'DETE Start Date',
       'Role Start Date', 'Position', 'Classification', 'Region',
       'Business Unit', 'Employment Status', 'Career move to public sector',
       'Career move to private sector', 'Interpersonal conflicts',
       'Job dissatisfaction', 'Dissatisfaction with the department',
       'Physical work environment', 'Lack of recognition',
       'Lack of job security', 'Work location', 'Employment conditions',
       'Maternity/family', 'Relocation', 'Study/Travel', 'Ill Health',
       'Traumatic incident', 'Work life balance', 'Workload',
       'None of the above', 'Professional Development',
       'Opportunities for promotion', 'Staff morale', 'Workplace issue',
       'Physical environment', 'Worklife balance',
       'Stress and pressure support', 'Performance of supervisor',
       'Peer support', 'Initiative', 'Skills', 'Coach', 'Career Aspirations',
       'Feedback', 'Further PD', 'Communication', 'My say', 'Information',
       'Kept informed', 'Wellness programs', 'Health & Safety', 'Gender',
       'Age', 'Aboriginal', 'Torres Strait', 'South Sea', 'Disability',
       'NESB'],
      dtype='object')

In [6]:

#From the above columns, the following were selected based on their relevancy to the dependent 
#variabe (dissatisfaction) and/or independent variables (service year, gender, age, and seperation type):
columns = ['SeparationType', 'Cease Date', 'DETE Start Date',    
       'Interpersonal conflicts',
       'Job dissatisfaction', 'Dissatisfaction with the department',
       'Physical work environment', 'Lack of recognition',
       'Lack of job security', 'Work location', 'Employment conditions',
       'Work life balance', 'Workload', 
       'Opportunities for promotion', 'Staff morale', 'Workplace issue',
       'Physical environment', 'Worklife balance',
       'Stress and pressure support', 'Performance of supervisor',
       'Peer support',  'Gender','Age']
dete = df1[columns].copy()  

Let's now check out the condensed dataframe.

In [7]:

dete.sample(2)

Out[7]:

	SeparationType	Cease Date	DETE Start Date	Interpersonal conflicts	Job dissatisfaction	Dissatisfaction with the department	Physical work environment	Lack of recognition	Lack of job security	Work location	...	Opportunities for promotion	Staff morale	Workplace issue	Physical environment	Worklife balance	Stress and pressure support	Performance of supervisor	Peer support	Gender	Age
252	Age Retirement	2012	2011	False	False	False	False	False	False	False	...	N	N	A	A	N	N	A	A	Male	61 or older
314	Resignation-Other employer	2013	2001	False	False	False	False	False	False	False	...	N	N	N	D	SA	A	N	A	Female	36-40

2 rows × 23 columns

With the potential columns saved and kept into a new dataframe, let's check out the general information from it using the .info() Pandas method--displays shape, null values, types of columns.

In [8]:

dete.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 23 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   SeparationType                       822 non-null    object
 1   Cease Date                           822 non-null    object
 2   DETE Start Date                      822 non-null    object
 3   Interpersonal conflicts              822 non-null    bool  
 4   Job dissatisfaction                  822 non-null    bool  
 5   Dissatisfaction with the department  822 non-null    bool  
 6   Physical work environment            822 non-null    bool  
 7   Lack of recognition                  822 non-null    bool  
 8   Lack of job security                 822 non-null    bool  
 9   Work location                        822 non-null    bool  
 10  Employment conditions                822 non-null    bool  
 11  Work life balance                    822 non-null    bool  
 12  Workload                             822 non-null    bool  
 13  Opportunities for promotion          735 non-null    object
 14  Staff morale                         816 non-null    object
 15  Workplace issue                      788 non-null    object
 16  Physical environment                 817 non-null    object
 17  Worklife balance                     815 non-null    object
 18  Stress and pressure support          810 non-null    object
 19  Performance of supervisor            813 non-null    object
 20  Peer support                         812 non-null    object
 21  Gender                               798 non-null    object
 22  Age                                  811 non-null    object
dtypes: bool(10), object(13)
memory usage: 91.6+ KB

Our dataframe consists of 822 rows, 23 columns with roughly half of the columns being of type bool and the other of object type. In additon, relatviely low number of null vlaues have been identified. We'll make sure to check out the null values in the preprocessing stage as well change the data types of certain columns such as the 'Cease Date' one given that it will make more sense of having it as a integer or Datetime Pandas object.

In [9]:

dete.describe(include='all').transpose()

Out[9]:

	count	unique	top	freq
SeparationType	822	9	Age Retirement	285
Cease Date	822	25	2012	344
DETE Start Date	822	51	Not Stated	73
Interpersonal conflicts	822	2	False	788
Job dissatisfaction	822	2	False	733
Dissatisfaction with the department	822	2	False	761
Physical work environment	822	2	False	806
Lack of recognition	822	2	False	765
Lack of job security	822	2	False	794
Work location	822	2	False	795
Employment conditions	822	2	False	788
Work life balance	822	2	False	605
Workload	822	2	False	735
Opportunities for promotion	735	6	A	242
Staff morale	816	6	A	335
Workplace issue	788	6	A	357
Physical environment	817	6	A	467
Worklife balance	815	6	A	359
Stress and pressure support	810	6	A	342
Performance of supervisor	813	6	A	349
Peer support	812	6	A	401
Gender	798	2	Female	573
Age	811	10	61 or older	222

From the output, retirement is the most common reason for DETE employees leaving their jobs with most frequent age being above 61 years or older. Not suprisingly, most reported levels of dissatsifaction seem to be relatively unusual as indicated by the large amounts of False counts across the above columns. Lastly, Females seemed to represent the majority of the employees who have left their jobs at DETE.

Let's now shift our attention towards the TAFE survey dataset. We'll take the same steps that we did with the DETE dataframe by printing it's shape and first five rows.

In [10]:

print(df2.shape)
df2.head()

(702, 72)

Out[10]:

	Record ID	Institute	WorkArea	CESSATION YEAR	Reason for ceasing employment	Contributing Factors. Career Move - Public Sector	Contributing Factors. Career Move - Private Sector	Contributing Factors. Career Move - Self-employment	Contributing Factors. Ill Health	Contributing Factors. Maternity/Family	...	Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?	Workplace. Topic:Does your workplace promote and practice the principles of employment equity?	Workplace. Topic:Does your workplace value the diversity of its employees?	Workplace. Topic:Would you recommend the Institute as an employer to others?	Gender. What is your Gender?	CurrentAge. Current Age	Employment Type. Employment Type	Classification. Classification	LengthofServiceOverall. Overall Length of Service at Institute (in years)	LengthofServiceCurrent. Length of Service at current workplace (in years)
0	6.341330e+17	Southern Queensland Institute of TAFE	Non-Delivery (corporate)	2010.0	Contract Expired	NaN	NaN	NaN	NaN	NaN	...	Yes	Yes	Yes	Yes	Female	26 30	Temporary Full-time	Administration (AO)	1-2	1-2
1	6.341337e+17	Mount Isa Institute of TAFE	Non-Delivery (corporate)	2010.0	Retirement	-	-	-	-	-	...	Yes	Yes	Yes	Yes	NaN	NaN	NaN	NaN	NaN	NaN
2	6.341388e+17	Mount Isa Institute of TAFE	Delivery (teaching)	2010.0	Retirement	-	-	-	-	-	...	Yes	Yes	Yes	Yes	NaN	NaN	NaN	NaN	NaN	NaN
3	6.341399e+17	Mount Isa Institute of TAFE	Non-Delivery (corporate)	2010.0	Resignation	-	-	-	-	-	...	Yes	Yes	Yes	Yes	NaN	NaN	NaN	NaN	NaN	NaN
4	6.341466e+17	Southern Queensland Institute of TAFE	Delivery (teaching)	2010.0	Resignation	-	Career Move - Private Sector	-	-	-	...	Yes	Yes	Yes	Yes	Male	41 45	Permanent Full-time	Teacher (including LVT)	3-4	3-4

5 rows × 72 columns

Like the dete dataset there seems to be certain columns that will not be relevant for our analyses and as such we'll be selecing to work with only relevant ones.

In [11]:

df2.columns

Out[11]:

Index(['Record ID', 'Institute', 'WorkArea', 'CESSATION YEAR',
       'Reason for ceasing employment',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE',
       'Main Factor. Which of these was the main factor for leaving?',
       'InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction',
       'InstituteViews. Topic:2. I was given access to skills training to help me do my job better',
       'InstituteViews. Topic:3. I was given adequate opportunities for personal development',
       'InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL%',
       'InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had',
       'InstituteViews. Topic:6. The organisation recognised when staff did good work',
       'InstituteViews. Topic:7. Management was generally supportive of me',
       'InstituteViews. Topic:8. Management was generally supportive of my team',
       'InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me',
       'InstituteViews. Topic:10. Staff morale was positive within the Institute',
       'InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly',
       'InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently',
       'InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly',
       'WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit',
       'WorkUnitViews. Topic:15. I worked well with my colleagues',
       'WorkUnitViews. Topic:16. My job was challenging and interesting',
       'WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work',
       'WorkUnitViews. Topic:18. I had sufficient contact with other people in my job',
       'WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job',
       'WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job',
       'WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT]',
       'WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job',
       'WorkUnitViews. Topic:23. My job provided sufficient variety',
       'WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job',
       'WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction',
       'WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance',
       'WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area',
       'WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date',
       'WorkUnitViews. Topic:29. There was adequate communication between staff in my unit',
       'WorkUnitViews. Topic:30. Staff morale was positive within my work unit',
       'Induction. Did you undertake Workplace Induction?',
       'InductionInfo. Topic:Did you undertake a Corporate Induction?',
       'InductionInfo. Topic:Did you undertake a Institute Induction?',
       'InductionInfo. Topic: Did you undertake Team Induction?',
       'InductionInfo. Face to Face Topic:Did you undertake a Corporate Induction; Category:How it was conducted?',
       'InductionInfo. On-line Topic:Did you undertake a Corporate Induction; Category:How it was conducted?',
       'InductionInfo. Induction Manual Topic:Did you undertake a Corporate Induction?',
       'InductionInfo. Face to Face Topic:Did you undertake a Institute Induction?',
       'InductionInfo. On-line Topic:Did you undertake a Institute Induction?',
       'InductionInfo. Induction Manual Topic:Did you undertake a Institute Induction?',
       'InductionInfo. Face to Face Topic: Did you undertake Team Induction; Category?',
       'InductionInfo. On-line Topic: Did you undertake Team Induction?process you undertook and how it was conducted.]',
       'InductionInfo. Induction Manual Topic: Did you undertake Team Induction?',
       'Workplace. Topic:Did you and your Manager develop a Performance and Professional Development Plan (PPDP)?',
       'Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?',
       'Workplace. Topic:Does your workplace promote and practice the principles of employment equity?',
       'Workplace. Topic:Does your workplace value the diversity of its employees?',
       'Workplace. Topic:Would you recommend the Institute as an employer to others?',
       'Gender. What is your Gender?', 'CurrentAge. Current Age',
       'Employment Type. Employment Type', 'Classification. Classification',
       'LengthofServiceOverall. Overall Length of Service at Institute (in years)',
       'LengthofServiceCurrent. Length of Service at current workplace (in years)'],
      dtype='object')

In [12]:

#From the above columns, the following were selected based on their relevancy to the dependent 
#variabe (dissatisfaction) and/or independent variables (service year, gender, age, and seperation type):
columns =['CESSATION YEAR', 
       'Reason for ceasing employment', 
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict', 
       'Main Factor. Which of these was the main factor for leaving?',
       'Gender. What is your Gender?', 'CurrentAge. Current Age',
       'LengthofServiceOverall. Overall Length of Service at Institute (in years)',]
tafe =df2[columns].copy()

Let's get a quick overview of the general information of the newly condensed dataframe.

In [13]:

tafe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 9 columns):
 #   Column                                                                     Non-Null Count  Dtype  
---  ------                                                                     --------------  -----  
 0   CESSATION YEAR                                                             695 non-null    float64
 1   Reason for ceasing employment                                              701 non-null    object 
 2   Contributing Factors. Dissatisfaction                                      437 non-null    object 
 3   Contributing Factors. Job Dissatisfaction                                  437 non-null    object 
 4   Contributing Factors. Interpersonal Conflict                               437 non-null    object 
 5   Main Factor. Which of these was the main factor for leaving?               113 non-null    object 
 6   Gender. What is your Gender?                                               596 non-null    object 
 7   CurrentAge. Current Age                                                    596 non-null    object 
 8   LengthofServiceOverall. Overall Length of Service at Institute (in years)  596 non-null    object 
dtypes: float64(1), object(8)
memory usage: 49.5+ KB

The TAFE dataframe consists of 702 rows and 9 columns with all but one being of object type. A relatively large amount of null values; the Main Factor column missing a large section of it's data. Possible reasons may be due to it's values being reflected in the other columns such as the dissatisfactions ones.

Based on the available data, let's check out the descriptive statistics for our columns.

In [14]:

tafe.describe(include='all').transpose()

Out[14]:

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
CESSATION YEAR	695	NaN	NaN	NaN	2011.42	0.905977	2009	2011	2011	2012	2013
Reason for ceasing employment	701	6	Resignation	340	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Contributing Factors. Dissatisfaction	437	2	-	371	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Contributing Factors. Job Dissatisfaction	437	2	-	360	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Contributing Factors. Interpersonal Conflict	437	2	-	410	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Main Factor. Which of these was the main factor for leaving?	113	11	Dissatisfaction with %[Institute]Q25LBL%	23	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Gender. What is your Gender?	596	2	Female	389	NaN	NaN	NaN	NaN	NaN	NaN	NaN
CurrentAge. Current Age	596	9	56 or older	162	NaN	NaN	NaN	NaN	NaN	NaN	NaN
LengthofServiceOverall. Overall Length of Service at Institute (in years)	596	7	Less than 1 year	147	NaN	NaN	NaN	NaN	NaN	NaN	NaN

From the output, about half of the data comes for the years 2011 to 2012 with resignations as the most common reason for ceasing employement amongst former TAFE employees. In addition, from the available values in the main factor for leaving column, dissatisfaction with the institution seems to be the most common factor leading to termination of employment. Lastly, similar to the DETE employees are mostly represented as being Females, in their later years (56 or older) with the top service dates being less than a year.

Conclusion¶

From the output, retirement is the most common reason for DETE employees leaving their jobs with most frequent age being above 61 years or older. Not suprisingly, most reported levels of dissatsifaction seem to be relatively unusual as indicated by the large amounts of False counts across the above columns. Lastly, Females seemed to represent the majority of the employees who have left their jobs at DETE.From the output, about half of the data comes for the years 2011 to 2012 with resignations as the most common reason for ceasing employement amongst former TAFE employees. In addition, from the available values in the main factor for leaving column, dissatisfaction with the institution seems to be the most common factor leading to termination of employment.Lastly, similar to the DETE employees are mostly represented as being Females, in their later years (56 or older) with the top service dates being less than a year. Our dataframe consists of 822 rows, 23 columns with roughly half of the columns being of type bool and the other of object type. In additon, relatviely low number of null vlaues have been identified. We'll make sure to check out the null values in the preprocessing stage as well change the data types of certain columns such as the 'Cease Date' one given that it will make more sense of having it as a integer or Datetime Pandas object.The TAFE dataframe consists of 702 rows and 9 columns with all but one being of object type. A relatively large amount of null values; the Main Factor column missing a large section of it's data. Possible reasons may be due to it's values being reflected in the other columns such as the dissatisfactions ones.

2. Data Preprocessing¶

Missing Values¶

In [15]:

def missing_values(data):
    miss = {}
    if len(data.columns) <= 0:
        print('Something went wrong. No columns detected.')
    for i in data.columns:  #iterate through all columns in dataset to search for each defined missing value
        if data[i].isnull().any():
            miss[i] = 'Null values ' + str(
                round(
                    len(data[data[i].isnull() == True]) / len(data) * 100, 2))
        if len(data[data[i] == 0]) > 1:
            miss[i] = 'Zero values ' + str(
                round(len(data[data[i] == 0]) / len(data) * 100, 2))
        try:
            if len(np.where(data[i] == '')) > 1:
                miss[i] = 'Empty strings ' + str(
                    round(len(data[data[i] == '']) / len(data) * 100, 2))
        except:
            pass
        if np.issubdtype(data[i].dtype,
                         np.number) and len(data[data[i] < 0]) > 1:
            miss[i] = 'Negative values ' + str(
                round(len(data[data[i] < 0]) / len(data) * 100, 2))
        try:
            if len(data[data[i].str.lower.isin(['n/a', 'na', 'an', 'n\a', '?'
                                                ])]) > 1:
                miss[i] = 'other null values present'
        except:
            if len(data[data[i].isin(['N/A', '?', 'na', 'Na', 'N/a'])]) > 1:
                miss[i] = 'other null values present'
    return sorted(miss.items(),
                  key=lambda x: float(miss[x[0]].split(' ')[-1]),
                  reverse=True)


def find_anomalies(df, col, STD):
    # Set upper and lower limit to 3 standard deviation
    random_data_std = np.std(df[col])
    random_data_mean = np.mean(df[col])
    anomaly_cut_off = random_data_std * STD

    lower_limit = float(random_data_mean - anomaly_cut_off)
    upper_limit = float(random_data_mean + anomaly_cut_off)
    print('Lower limit:', lower_limit)
    print('Upper limit:', upper_limit)

    # Generate outliers
    outliers = df[(df[col] < lower_limit) | (df[col] > upper_limit)]

    return outliers

To better work with the dataset's we'll rename them to be all lowercases and replace spaces with underscores.

In [16]:

tafe.columns = [i.lower().replace(' ', '_') for i in tafe]
dete.columns = [i.lower().replace(' ', '_') for i in dete]

Our next step will be to check for and determine what to do with missing values in both dataframes. We'll make use of the missing_values defined earlier to check for null values. Let's start with the dete dataframe.

In [17]:

miss =missing_values(dete)
miss

C:\Users\garci\Anaconda3\lib\site-packages\pandas\core\ops\array_ops.py:253: FutureWarning:

elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison

Out[17]:

[('physical_work_environment', 'Zero values 98.05'),
 ('work_location', 'Zero values 96.72'),
 ('lack_of_job_security', 'Zero values 96.59'),
 ('interpersonal_conflicts', 'Zero values 95.86'),
 ('employment_conditions', 'Zero values 95.86'),
 ('lack_of_recognition', 'Zero values 93.07'),
 ('dissatisfaction_with_the_department', 'Zero values 92.58'),
 ('workload', 'Zero values 89.42'),
 ('job_dissatisfaction', 'Zero values 89.17'),
 ('work_life_balance', 'Zero values 73.6'),
 ('opportunities_for_promotion', 'Null values 10.58'),
 ('workplace_issue', 'Null values 4.14'),
 ('gender', 'Null values 2.92'),
 ('stress_and_pressure_support', 'Null values 1.46'),
 ('age', 'Null values 1.34'),
 ('peer_support', 'Null values 1.22'),
 ('performance_of_supervisor', 'Null values 1.09'),
 ('worklife_balance', 'Null values 0.85'),
 ('staff_morale', 'Null values 0.73'),
 ('physical_environment', 'Null values 0.61')]

The output reveals various columns with a high percentage of 0 values; however, this is due to the True/False values seen earlier in the general dataframe exploration stage. When looking at the null values in the dataframe, we can observe that they constitute .61% up to 10.58% of the given columns.

So now, we'll filter out the columns identified as having zero values and save those with nulls values to a new variable.

In [18]:

columns = [i[0] for i in miss if i[1].split(' ')[0] == 'Null']

Given that the dataframe has 10% of missing values, we'll impute them with a central tendency. To aide us in that decision, we'll use a for loop to print out the unique values and type of each relevant column.

In [19]:

for i in columns:
    print(i, dete[i].apply(type).unique(), dete[i].unique())

opportunities_for_promotion [<class 'str'> <class 'float'>] ['A' 'N' 'SD' nan 'M' 'SA' 'D']
workplace_issue [<class 'str'> <class 'float'>] ['N' 'D' 'A' 'SD' 'SA' nan 'M']
gender [<class 'str'> <class 'float'>] ['Male' 'Female' nan]
stress_and_pressure_support [<class 'str'> <class 'float'>] ['A' 'N' 'D' 'SA' 'SD' 'M' nan]
age [<class 'str'> <class 'float'>] ['56-60' '61 or older' '36-40' '41-45' '31-35' '46-50' '51-55' '21-25'
 '26-30' nan '20 or younger']
peer_support [<class 'str'> <class 'float'>] ['A' 'N' nan 'SA' 'D' 'SD' 'M']
performance_of_supervisor [<class 'str'> <class 'float'>] ['A' 'N' 'D' 'SA' 'SD' 'M' nan]
worklife_balance [<class 'str'> <class 'float'>] ['A' 'N' 'D' 'SA' 'SD' 'M' nan]
staff_morale [<class 'str'> <class 'float'>] ['N' 'SD' 'A' 'D' 'SA' 'M' nan]
physical_environment [<class 'str'> <class 'float'>] ['N' 'A' 'D' 'SD' 'SA' 'M' nan]

So it seems that the null containing columns are all categorical. As such, we'll impute using the mode.

In [20]:

for i in columns:
    dete[i] = dete[i].replace('nan', np.nan).fillna(dete[i].mode()[0])

In [21]:

#sanity check
dete.isnull().sum()

Out[21]:

separationtype                         0
cease_date                             0
dete_start_date                        0
interpersonal_conflicts                0
job_dissatisfaction                    0
dissatisfaction_with_the_department    0
physical_work_environment              0
lack_of_recognition                    0
lack_of_job_security                   0
work_location                          0
employment_conditions                  0
work_life_balance                      0
workload                               0
opportunities_for_promotion            0
staff_morale                           0
workplace_issue                        0
physical_environment                   0
worklife_balance                       0
stress_and_pressure_support            0
performance_of_supervisor              0
peer_support                           0
gender                                 0
age                                    0
dtype: int64

We'll repeat the same process for the TAFE dataframe.

In [22]:

miss2 = missing_values(tafe)

From the output, the column with the 89% null values will be dropped.

In [23]:

tafe.drop(columns=miss2[0][0], inplace=True)
miss2 = miss2[1:]
miss2

Out[23]:

[('contributing_factors._dissatisfaction', 'Null values 37.75'),
 ('contributing_factors._job_dissatisfaction', 'Null values 37.75'),
 ('contributing_factors._interpersonal_conflict', 'Null values 37.75'),
 ('gender._what_is_your_gender?', 'Null values 15.1'),
 ('currentage._current_age', 'Null values 15.1'),
 ('lengthofserviceoverall._overall_length_of_service_at_institute_(in_years)',
  'Null values 15.1'),
 ('cessation_year', 'Null values 1.0'),
 ('reason_for_ceasing_employment', 'Null values 0.14')]

In [24]:

for i in miss2:
    print(i[0], tafe[i[0]].apply(type).unique(), tafe[i[0]].unique(),
          '\n')  #string and float types in there

contributing_factors._dissatisfaction [<class 'float'> <class 'str'>] [nan '-' 'Contributing Factors. Dissatisfaction '] 

contributing_factors._job_dissatisfaction [<class 'float'> <class 'str'>] [nan '-' 'Job Dissatisfaction'] 

contributing_factors._interpersonal_conflict [<class 'float'> <class 'str'>] [nan '-' 'Interpersonal Conflict'] 

gender._what_is_your_gender? [<class 'str'> <class 'float'>] ['Female' nan 'Male'] 

currentage._current_age [<class 'str'> <class 'float'>] ['26  30' nan '41  45' '56 or older' '20 or younger' '46  50' '36  40'
 '21  25' '51-55' '31  35'] 

lengthofserviceoverall._overall_length_of_service_at_institute_(in_years) [<class 'str'> <class 'float'>] ['1-2' nan '3-4' '7-10' 'Less than 1 year' '11-20' '5-6'
 'More than 20 years'] 

cessation_year [<class 'float'>] [2010. 2009.   nan 2011. 2012. 2013.] 

reason_for_ceasing_employment [<class 'str'> <class 'float'>] ['Contract Expired' 'Retirement' 'Resignation' 'Retrenchment/ Redundancy'
 'Termination' 'Transfer' nan]

We can observe that all but one (cessation_year) of the columns contain categorical values. Notice also the '-' value in some of the columns. This can be represented as a False value as the only other valid value represents the name of the columns thus implying a True occurrence. In addition, knowing that this is survey data, we can assume that the null values are indication for False occurrences. Therefore, we'll impute the first three columns (with '-' values) using a 0 value. The other categorical values we'll impute using the mode and lastly, and impute using the median for the cessation_year column.

In [25]:

bool_cat = [i[0] for i in miss2[:3]]
miss2 = miss2[3:]

In [26]:

tafe[bool_cat] = tafe[bool_cat].fillna(0)

Imputing using the mode for the categorical columns.

In [27]:

for i in miss2:
    if i[0] == 'cessation_year': continue
    else:
        tafe[i[0]] = tafe[i[0]].replace('nan',
                                        np.nan).fillna(tafe[i[0]].mode()[0])

Then impute using the median for the cessation_year column.

In [28]:

tafe['cessation_year'].fillna(tafe['cessation_year'].median(), inplace=True)

In [29]:

#sanity check
tafe.isnull().sum()

Out[29]:

cessation_year                                                               0
reason_for_ceasing_employment                                                0
contributing_factors._dissatisfaction                                        0
contributing_factors._job_dissatisfaction                                    0
contributing_factors._interpersonal_conflict                                 0
gender._what_is_your_gender?                                                 0
currentage._current_age                                                      0
lengthofserviceoverall._overall_length_of_service_at_institute_(in_years)    0
dtype: int64

Data Wrangling -Years¶

We'll convert the years to integer type as it'll be needed further down the line.

In [30]:

tafe['cessation_year'] =tafe['cessation_year'].apply(lambda x: int(x)) #change given null check 

Let's quickly look at the distribution of the years in the dataframe.

In [31]:

tafe['cessation_year'].value_counts()

Out[31]:

2011    275
2012    235
2010    103
2013     85
2009      4
Name: cessation_year, dtype: int64

Most of the datas stem from 2011 and 2012 with some in 2010 and very data values in 2013 and 2009. Let's check if this is also reflected in the DETE dataframe.

In [32]:

dete['cease_date'].value_counts()

Out[32]:

2012          344
2013          200
01/2014        43
12/2013        40
09/2013        34
Not Stated     34
06/2013        27
07/2013        22
10/2013        20
11/2013        16
08/2013        12
05/2013         7
05/2012         6
07/2014         2
04/2014         2
08/2012         2
04/2013         2
02/2014         2
09/2010         1
11/2012         1
07/2006         1
2010            1
09/2014         1
07/2012         1
2014            1
Name: cease_date, dtype: int64

Hmmm. The column needs some preprocessing before we can proceed to work with it. We'll use a lambda function that retrieves the relevant year and turn the "Not stated" value into a 0 of which we'll impute using the median.

In [33]:

dete['cease_date'] = dete['cease_date'].apply(
    lambda x: int(x.split('/')[-1]) if x.split('/')[-1].isdigit() else 0)

In [34]:

dete.loc[dete['cease_date'] == 0, 'cease_date'] = dete['cease_date'].median()

Now let's compare how the cessation years differ amongst the two dataframes.

In [35]:

print(dete['cease_date'].value_counts(), tafe['cessation_year'].value_counts())

2013.0    414
2012.0    354
2014.0     51
2010.0      2
2006.0      1
Name: cease_date, dtype: int64 2011    275
2012    235
2010    103
2013     85
2009      4
Name: cessation_year, dtype: int64

Most of the years from the TAFE dataframe come from 2012 with some in 2010 and very few in 2013 and 2009; years amongst the two dataframe vary but not by much.

Lastly,we'll check for the start year in the DETE dataframe; this won't apply to TAFE as it doesn't contain such columns.

In [36]:

pd.DataFrame(dete['dete_start_date'].value_counts(ascending=False)).transpose()

Out[36]:

	Not Stated	2011	2007	2008	2010	2012	2009	2006	1970	2013	...	1983	1973	1985	1987	1982	1963	1968	1967	1966	1965
dete_start_date	73	40	34	31	27	27	24	23	21	21	...	9	8	8	7	4	4	3	2	1	1

1 rows × 51 columns

The start years for the employees in the DETE org. range from 1965 to 2011. Looking at the low count values, this doesn't seem unusual as there is a likelihood of an old staff that have been loyal to the organization since their start date. The mode, 'Not Stated', will need to be imputed using the median for the purpose of answering the business problem.

In [37]:

#print the median
dete.loc[dete['dete_start_date'] == 'Not Stated',
         'dete_start_date'] = dete['dete_start_date'].str.replace(
             'Not Stated', '0').apply(int).median()

Before proceeding to the next data preprocess step, we'll quickly rename the columns to make them easier to work with.

In [38]:

tafe.columns

Out[38]:

Index(['cessation_year', 'reason_for_ceasing_employment',
       'contributing_factors._dissatisfaction',
       'contributing_factors._job_dissatisfaction',
       'contributing_factors._interpersonal_conflict',
       'gender._what_is_your_gender?', 'currentage._current_age',
       'lengthofserviceoverall._overall_length_of_service_at_institute_(in_years)'],
      dtype='object')

In [39]:

tafe.columns = ['cease_date', 'seperationtype',
       'dissatisfaction',
       'job_dissatisfaction',
       'interpersonal_conflict',
       'gender', 'age',
       'institute_service']

Data Wrangling - Separation Type¶

Let's take a look at the value counts for the types of separation.

In [40]:

tafe['seperationtype'].value_counts()

Out[40]:

Resignation                 341
Contract Expired            127
Retrenchment/ Redundancy    104
Retirement                   82
Transfer                     25
Termination                  23
Name: seperationtype, dtype: int64

Interestingly enough, the most common type of seperation for TAFE employees come from resignation. Keeping the business problem in mind, we'll take rows with resignation as the reason for seperation.

In [41]:

tafe = tafe.query("seperationtype =='Resignation'").copy()
tafe.reset_index(drop=True, inplace=True)
tafe.head()

Out[41]:

	cease_date	seperationtype	dissatisfaction	job_dissatisfaction	interpersonal_conflict	gender	age	institute_service
0	2010	Resignation	-	-	-	Female	56 or older	Less than 1 year
1	2010	Resignation	-	-	-	Male	41 45	3-4
2	2010	Resignation	-	-	-	Female	56 or older	7-10
3	2010	Resignation	-	-	-	Male	20 or younger	3-4
4	2010	Resignation	-	-	-	Male	46 50	3-4

Let's check how this looks like for the DETE employees.

In [42]:

dete['separationtype'].value_counts()

Out[42]:

Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: separationtype, dtype: int64

Here, retirement is the most common separation reason but this is tied with resignation reasons if we combine the three resignation values. We'll merge three resignation-relevant values using a lambda function.

In [43]:

dete['seperationtype'] = dete['separationtype'].apply(
    lambda x: x.split('-')[0] if x.startswith('Resi') == True else x)

Then filter for the relevant values.

In [44]:

dete = dete.query("seperationtype =='Resignation'").copy()
dete.reset_index(drop=True, inplace=True)
dete.head()

Out[44]:

	separationtype	cease_date	dete_start_date	interpersonal_conflicts	job_dissatisfaction	dissatisfaction_with_the_department	physical_work_environment	lack_of_recognition	lack_of_job_security	work_location	...	staff_morale	workplace_issue	physical_environment	worklife_balance	stress_and_pressure_support	performance_of_supervisor	peer_support	gender	age	seperationtype
0	Resignation-Other reasons	2012.0	2005	False	False	False	False	False	False	False	...	N	N	A	A	N	N	A	Female	36-40	Resignation
1	Resignation-Other reasons	2012.0	1994	False	False	False	False	False	False	False	...	SD	D	SD	D	D	A	A	Female	41-45	Resignation
2	Resignation-Other reasons	2012.0	2009	False	False	False	False	False	False	False	...	D	D	N	SD	SD	D	D	Female	31-35	Resignation
3	Resignation-Other employer	2012.0	1997	True	True	True	False	False	False	False	...	SD	SD	A	SD	SD	SD	SD	Female	46-50	Resignation
4	Resignation-Move overseas/interstate	2012.0	2009	False	False	False	False	False	False	False	...	N	D	N	N	N	N	A	Male	31-35	Resignation

5 rows × 24 columns

Moving on, since we're interested in how experience affects resignation, we'll need to make a column for the DETE dataframe as calculated by the difference in their ending and start dates.

In [45]:

dete['institute_service'] = dete['cease_date'].apply(
    int) - dete['dete_start_date'].apply(int)

We'll then move on and retrieve columns with boolan values to create a dissatisfied column where: at least one dissatisfied column is identified as True.

In [46]:

dict(dete.dtypes[dete.dtypes ==bool]).keys()

Out[46]:

dict_keys(['interpersonal_conflicts', 'job_dissatisfaction', 'dissatisfaction_with_the_department', 'physical_work_environment', 'lack_of_recognition', 'lack_of_job_security', 'work_location', 'employment_conditions', 'work_life_balance', 'workload'])

In [47]:

params = [
    'interpersonal_conflicts', 'job_dissatisfaction',
    'dissatisfaction_with_the_department', 'physical_work_environment',
    'lack_of_recognition', 'lack_of_job_security', 'work_location',
    'employment_conditions', 'work_life_balance', 'workload'
]
dete['dissatisfied'] = dete[params].any(1, skipna=False)

In [48]:

dete['dissatisfied'].value_counts()

Out[48]:

False    157
True     154
Name: dissatisfied, dtype: int64

We'll do the same with the TAFE organization but this time using a function that accounts for the '-' value.

In [49]:

def update_vals(element):
    if element == '-':
        return False
    elif pd.isnull(element):
        return np.nan
    else:
        return True

In [50]:

tafe.columns

Out[50]:

Index(['cease_date', 'seperationtype', 'dissatisfaction',
       'job_dissatisfaction', 'interpersonal_conflict', 'gender', 'age',
       'institute_service'],
      dtype='object')

In [51]:

param = ['dissatisfaction', 'job_dissatisfaction', 'interpersonal_conflict']
# Creating 'dissatisfied' column for tafe_resignations:
tafe[param] = (tafe[param].applymap(update_vals))

tafe['dissatisfied'] = tafe[param].any(1, skipna=False)

In [52]:

tafe['dissatisfied'].value_counts()

Out[52]:

False    235
True     106
Name: dissatisfied, dtype: int64

Data Wrangling - Merging the Dataframes and Categorizing by Experience¶

With those changes done, we'll proceed to join the dataframe--to make it easier to work with. We'll create one last column indicating the institute before creating the dataframe.

In [53]:

dete['institute'] = 'DETE'
tafe['institute'] = 'TAFE'

In [54]:

combined = pd.concat([dete, tafe], ignore_index=True, sort=True)

Before we can proceed, we'll check for null values.

In [55]:

combined.isnull().sum().sort_values(ascending=False) / len(combined) * 100

Out[55]:

workplace_issue                        52.300613
performance_of_supervisor              52.300613
dete_start_date                        52.300613
dissatisfaction_with_the_department    52.300613
employment_conditions                  52.300613
interpersonal_conflicts                52.300613
lack_of_job_security                   52.300613
workload                               52.300613
opportunities_for_promotion            52.300613
peer_support                           52.300613
lack_of_recognition                    52.300613
physical_environment                   52.300613
separationtype                         52.300613
staff_morale                           52.300613
worklife_balance                       52.300613
stress_and_pressure_support            52.300613
physical_work_environment              52.300613
work_location                          52.300613
work_life_balance                      52.300613
interpersonal_conflict                 47.699387
dissatisfaction                        47.699387
dissatisfied                            0.000000
cease_date                              0.000000
institute                               0.000000
gender                                  0.000000
institute_service                       0.000000
job_dissatisfaction                     0.000000
seperationtype                          0.000000
age                                     0.000000
dtype: float64

There are many non-null vales present and as such we'll need to drop them from the dataframe. However, in this case, we'll only tolerate columns with at least 50% of non-null values. We set this threshold as it'll keep the columns of interest (see table above for irrelevant variables with 50% null values ).

In [56]:

thresh = len(combined) /2
thresh

Out[56]:

326.0

In [57]:

combined_updated = combined.dropna(thresh = thresh, axis =1)

In [58]:

combined_updated.reset_index(drop=True, inplace=True)

In [59]:

combined_updated.sample(3)

Out[59]:

	age	cease_date	dissatisfaction	dissatisfied	gender	institute	institute_service	interpersonal_conflict	job_dissatisfaction	seperationtype
433	36 40	2011.0	False	False	Female	TAFE	Less than 1 year	False	False	Resignation
422	21 25	2011.0	False	False	Male	TAFE	1-2	False	False	Resignation
586	56 or older	2012.0	False	True	Female	TAFE	3-4	False	True	Resignation

To end this step, we'll keep the columns that pertain to the parameter of interest: dissatifaction.

In [60]:

combined_updated.columns

Out[60]:

Index(['age', 'cease_date', 'dissatisfaction', 'dissatisfied', 'gender',
       'institute', 'institute_service', 'interpersonal_conflict',
       'job_dissatisfaction', 'seperationtype'],
      dtype='object')

In [61]:

combined_updated =combined_updated[['age', 'cease_date', 'dissatisfied', 'gender',  'institute', 'institute_service']]

In [62]:

combined_updated.head()

Out[62]:

	age	cease_date	dissatisfied	gender	institute	institute_service
0	36-40	2012.0	False	Female	DETE	7
1	41-45	2012.0	True	Female	DETE	18
2	31-35	2012.0	False	Female	DETE	3
3	46-50	2012.0	True	Female	DETE	15
4	31-35	2012.0	False	Male	DETE	3

In [63]:

combined_updated['institute_service'].unique()

Out[63]:

array([7, 18, 3, 15, 14, 5, 30, 32, 39, 17, 9, 6, 1, 35, 38, 36, 19, 4,
       26, 10, 8, 2, 0, 23, 13, 16, 12, 21, 20, 24, 33, 22, 28, 49, 11,
       41, 27, 42, 25, 29, 34, 31, 'Less than 1 year', '3-4', '7-10',
       '1-2', '11-20', '5-6', 'More than 20 years'], dtype=object)

Before we categorize the rows based on experience, we'll first want to retrieve and convert the years into integer type. For that, given the characters in the institute_service column, we'll use the extract string method to retrieve only the digits in that column.

In [64]:

combined_updated['institute_service'] = combined_updated[
    'institute_service'].astype('str').str.extract(r'(\d+)').astype(int)
combined_updated.head()

Out[64]:

	age	cease_date	dissatisfied	gender	institute	institute_service
0	36-40	2012.0	False	Female	DETE	7
1	41-45	2012.0	True	Female	DETE	18
2	31-35	2012.0	False	Female	DETE	3
3	46-50	2012.0	True	Female	DETE	15
4	31-35	2012.0	False	Male	DETE	3

Lastly, we'll create a function that assigns an experience level based on the institute_service value passed to it. NOTE: The threshold years chosen are based off the DataQuest project instructions.

In [65]:

def check(x):
    if x < 3: return 'New'
    elif x >= 3 and x < 7: return 'Experienced'
    elif x >= 7 and x < 11: return 'Established'
    elif x >= 11: return 'Veteran'
    elif str(x) == ['nan']: return np.nan

In [66]:

combined_updated['service_cat'] = combined_updated['institute_service'].apply(
    lambda x: check(x))

In [67]:

combined_updated.sample(5)

Out[67]:

	age	cease_date	dissatisfied	gender	institute	institute_service	service_cat
299	36-40	2013.0	False	Female	DETE	13	Veteran
312	41 45	2010.0	False	Male	TAFE	3	Experienced
57	61 or older	2012.0	True	Female	DETE	14	Veteran
442	21 25	2011.0	True	Female	TAFE	1	New
319	31 35	2010.0	False	Female	TAFE	11	Veteran

Before proceeding, let's quickly check for null values in the dataframe.

In [68]:

combined_updated.isnull().sum()

Out[68]:

age                  0
cease_date           0
dissatisfied         0
gender               0
institute            0
institute_service    0
service_cat          0
dtype: int64

Conclusion¶

That was a lot, in this step we accomplished the following:

Explored the data and figured out how to prepare it for analysis
Corrected some of the missing values
Dropped any data not needed for our analysis
Renamed our columns
Verified the quality of our data
Created a new institute_service column
Cleaned the Contributing Factors columns
Created a new column indicating if an employee resigned because they were dissatisfied in some way
Combined the data
Cleaned the institute_service column
Handled the missing values in the dissatisfied column
Aggregated the data

Data Analysis¶

Absent any null values we'll proceed to answer the first question: Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction?

To do this we'll create a pivot table that shows the share of people who reported dissatisfaction as their reason for their resignation.

In [69]:

pivot = combined_updated.pivot_table(index='service_cat',
                                     values='dissatisfied').sort_values(
                                         'dissatisfied', ascending=False)

Let's visualize the results via a bar graph.

In [70]:

fig = px.bar(pivot,
             x=pivot.index,
             y=pivot['dissatisfied'],
             text=pivot['dissatisfied'])
fig.update_layout(title='Reported Dissatisfaction By Experience Level',
                  xaxis_title="",
                  yaxis_title="Percent Dissatisfied",
                  font=dict(
                      family="Courier New, monospace",
                      size=18,
                  ))
fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig.show()

Employees in the later two stages of experience level have more than >10% reported levels of dissatisfaction than employees with relatively low experience. More efforts should be introduced to understand why those with longer years of service are leaving at a higher rate.

For the next question: Are younger employees resigning due to some kind of dissatisfaction? We'll take the similar approach done in the first question.

In [71]:

combined_updated['age'].unique()

Out[71]:

array(['36-40', '41-45', '31-35', '46-50', '61 or older', '56-60',
       '51-55', '21-25', '26-30', '20 or younger', '56 or older',
       '41  45', '46  50', '36  40', '21  25', '31  35', '26  30'],
      dtype=object)

Here, given that the first values are all age related, we'll split by '-' and ' ' in order to retrieve the first indicated age value.

In [72]:

combined_updated['age'] = combined_updated['age'].apply(
    lambda x: int(x.split('-')[0].split(' ')[0]))

In [73]:

combined_updated['age'].value_counts()

Out[73]:

56    106
41     93
46     81
36     73
51     71
26     67
21     62
31     61
61     28
20     10
Name: age, dtype: int64

We'll then make a function that'll assign the age into the following groups:

In [74]:

def age_categories(element):
    if element < 21:
        return '21 or less'
    elif 21 <= element < 30:
        return '21-29'
    elif 31 <= element < 40:
        return '31-39'
    elif 41 <= element < 50:
        return '41-49'
    elif 51 <= element < 60:
        return '51-59'
    else:
        return 'Above 60'


combined_updated['age'] = combined_updated['age'].apply(age_categories)

Lastly, we'll check for null values in case something went wrong.

In [75]:

combined_updated.isnull().sum()

Out[75]:

age                  0
cease_date           0
dissatisfied         0
gender               0
institute            0
institute_service    0
service_cat          0
dtype: int64

With the age categories formed, we'll create a pivot table to get each groups reported dissatisfaction.

In [76]:

pivot2 = combined_updated.pivot_table(
    index='age', values='dissatisfied').sort_values('dissatisfied',
                                                    ascending=False)

We'll again, visualize this via a bar graph.

In [77]:

fig = px.bar(pivot2,
             x=pivot2.index,
             y=pivot2['dissatisfied'],
             text=pivot2['dissatisfied'],
             color=pivot2.index)

fig.update_layout(title='Reported Dissatisfaction By Experience Level',
                  xaxis_title="",
                  yaxis_title="Percent Dissatisfied",
                  font=dict(
                      family="Courier New, monospace",
                      size=18,
                  ),
                  showlegend=False)
fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig.show()

When we take age into account, reported dissatisfaction levels range from 30% to 46%. From the chart, dissatisfaction levels increases the older the age group an employee lands on. This makes sense as the previous chart reflected experience level as a factor for higher dissatisfaction levels. However, here we get a more clearer view of which age groups may account for such high %: employees above the ages of 51.

Let's check how the reported dissatisfaction levels for each group fare relative to the mean. We'll calculate their Z-score for each value and then apply the mean to retrieve the average Z-score for each age group. The Z-score will inform us us how other groups fare relative to the average dissatisfaction percentage.

In [78]:

combined_updated['Zdissatisfied'] = combined_updated['dissatisfied'].apply(
    lambda x: (x - combined_updated['dissatisfied'].mean()) / combined_updated[
        'dissatisfied'].std())

In [79]:

#.pivot_table(index='institute', values = 'dissatisfied'))
pivot4 = combined_updated.pivot_table(
    index='age', values='Zdissatisfied').sort_values('Zdissatisfied',
                                                     ascending=False)
fig = px.bar(pivot4,
             x=pivot4.index,
             y=pivot4['Zdissatisfied'],
             text=pivot4['Zdissatisfied'])

fig.update_layout(
    title='Reported Dissatisfaction (Z-scores) By Experience Level',
    xaxis_title="",
    yaxis_title="Percent Dissatisfied",
    font=dict(
        family="Courier New, monospace",
        size=18,
    ),
    showlegend=False)
fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig.show()
print(combined_updated['dissatisfied'].mean())

0.3987730061349693

Clearly now, employees above the age of 51 have higher reported levels of dissatisfaction relative to the average dissatisfied level (39%). Those in the age groups of 21 to 49 seems to fair slightly better whilst those younger than 21 have the lowest dissatisfaction levels. Organizations should create more efforts to increase the large dissatisfaction levels in the older age groups.

How does dissatisfaction compare between the two institutions? We'll to answer this we'll take the Z-scores calculated earlier and average them for the two institutions.

In [80]:

pivot3 = combined_updated.pivot_table(index='institute',
                                      values='Zdissatisfied').sort_values(
                                          'Zdissatisfied', ascending=False)

In [81]:

#%config InlineBackend.figure_format = 'retina'
fig = px.bar(pivot3,
             x=pivot3.index,
             y=pivot3['Zdissatisfied'],
             text=pivot3['Zdissatisfied'])
colors = ['#EE3224', 'lightsalmon']

fig.update_layout(title='DETE and TAFE Dissatisfaction Z-score',
                  xaxis_title="",
                  yaxis_title="Percent Dissatisfied (Z-score)",
                  font=dict(
                      family="Courier New, monospace",
                      size=18,
                  ))

fig.update_traces(marker_color=colors,
                  texttemplate='%{text: .2f}',
                  textposition='outside')

fig.show()
print(combined_updated['dissatisfied'].mean())

0.3987730061349693

Relative to the mean (39%), DETE employees on average had 20% higher levels of dissatisfaction (around 50%). On the other hand, TAFE employees have 18% lower dissatisfaction levels relative to the mean. Given these figures, the DETE institute needs far more work in improving it's employees dissatisfaction levels.

Let's dig deeper into this by taking into account both gender and experience level. We'll from our groups based on institute, gender, and service category taking the percent of dissatisfaction for both.

In [82]:

df = combined_updated.groupby(['institute', 'gender',
                               'service_cat']).mean()['dissatisfied']

In [83]:

df = df.unstack()[['Established', 'Veteran', 'Experienced', 'New']]

Lastly, we'll visualize the output via a side-by-side bar graph.

In [87]:

sns.set_context('talk')
ax= df.plot(kind='bar',
             fontsize=12,
             rot=0,
             color=['#EE3224', '#F78F1E', '#FFC222', '#ff7a69'],figsize=(15,7))

ax.set_xticklabels([' '.join(i) for i in df.index])
plt.legend(fontsize=14)

# Axis styling.
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_color('#DDDDDD')
ax.tick_params(bottom=False, left=False,)
ax.set_axisbelow(True)
ax.yaxis.grid(True, color='#EEEEEE')
ax.xaxis.grid(False)

# Add axis and chart labels.
ax.set_xlabel('Institute and Gender', labelpad=15)
ax.set_ylabel('Percent Dissatisfied', labelpad=15)
ax.set_title(
    'Reported Dissatisfaction by Experience Level - Per Institute and Gender',
    size=20,
    pad=25)

#text annotations
for p in ax.patches:
    ax.annotate(str(round(p.get_height(), 2)),
                (p.get_x() * 1.005, p.get_height() * 1.02),
                fontsize=12)

plt.show()

The chart confirms the previous chart's finding wherein DETE employees have a higher percentage of dissatisfaction in comparison to TAFE employees. In this case, we can see several relationship. First, male employees in the DETE institute are the highest dissatisfied group with those with in the highest experience level entailing 67% to 70% of them reporting dissatisfaction. In addition, female employees in the TAFE institute with the hgihest experience level--Established--report the highest dissatisfaction in comparison to the other TAFE's gender groups and their experience level sub-groups. Lastly, male employees in the lowest experience level for the TAFE institute report the hgihest level of dissatisfaction for it's group.

Conclusion¶

Overall, employees with the highest experience level (Established and Veteran) report the highest dissatisfaction.
The older the age group, the higher the dissatisfaction percentage in particular for those older than 51.
DETE employees clearly report a higher level of dissatisfaction as a group in comparison to TAFE employees.
67 to 70% of male and experienced DETE employees report the highest dissatisfaction from all the other subgroups.
TAFE female employees with the highest experience level report the highest dissatisfaction from all the other subgroups within the TAFE institute.

Further analysis should dive deeper into the reasons for high dissatisfaction levels for older and experience employees, in particular for male DETE employee and female TAFE employees. Lastly, the DETE institute should also investigate into why their reported overall level of dissatisfaction is much higher when compared to the TAFE institute.