Credit to David Chung, May 7th, 2020
In this guided project, we'll work with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. You can find the DETE exit survey here and the TAFE survey here.
The encoding was changed from cp1252 to UTF-8 to make the files easier to work with.
In this project, we'll play the role of data analyst and pretend our stakeholders want to know the following: Are employees who worked for the institutes for only a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer? And are younger employees more likely to resign due to dissatisfaction than older ones?
Let's start by reading the datasets into pandas and exploring them.
# import libraries as below to begin with
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# read csv file into pandas
dete_survey = pd.read_csv("dete_survey.csv")
tafe_survey = pd.read_csv("tafe_survey.csv")
# review information of dete_survey
dete_survey.info()
dete_survey.head(5)
dete_survey.isnull().sum().head(5)
In dete_survey, we take a look at ID, SeparationType, Cease Date and DETE Start Date, and here are my observations: missing values in Cease Date and DETE Start Date are recorded as the string "Not Stated" rather than as NaN, so pandas doesn't recognize them as missing.
tafe_survey.info()
tafe_survey.head(5)
tafe_survey.isnull().sum()
In tafe_survey, we take a look at Record ID, Reason for ceasing employment and LengthofServiceOverall. Overall Length of Service at Institute (in years), and here are my observations: LengthofServiceOverall stores year ranges as strings (e.g. "1-2"), which we might replace with a representative number (such as the mean of each range) for future analysis.
# read the dete_survey file into pandas again
# but this time read the "Not Stated" values in as "NaN"
dete_survey = pd.read_csv("dete_survey.csv", na_values="Not Stated")
# drop columns which we won't use for analysis
# dete_survey
dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis=1)
# tafe_survey
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis=1)
# rename the remaining columns: lowercase them,
# strip whitespace and replace spaces with underscores
dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.strip().str.replace(" ","_")
# use DataFrame.rename() to update column names
# for tafe_survey_updated
cols_dict = {"Record ID":"id", "CESSATION YEAR":"cease_date",
"Reason for ceasing employment": "separationtype",
"Gender. What is your Gender?": "gender","CurrentAge. Current Age": "age",
"Employment Type. Employment Type": "employment_status",
"Classification. Classification": "position",
"LengthofServiceOverall. Overall Length of Service at Institute (in years)": "institute_service",
"LengthofServiceCurrent. Length of Service at current workplace (in years)": "role_service"}
tafe_survey_updated = tafe_survey_updated.rename(columns = cols_dict)
# look at the current state of both dataframes
print(dete_survey_updated.head(5))
print("-------------------------")
print(tafe_survey_updated.head(5))
The main reason for changing the column names in both surveys is so that we can make plots and compare values between the two directly. The column names of tafe_survey_updated differ far more from our target names than those of dete_survey_updated, so we need to build a rename dictionary for TAFE, while simple string methods are enough for DETE.
# review the unique values in both dataframe
# dete_survey_updated
print(dete_survey_updated["separationtype"].value_counts())
print()
# tafe_survey_updated
print(tafe_survey_updated["separationtype"].value_counts())
In order to analyze the dissatisfaction of employees with short or long service, we'd like to select only the rows with a resignation separation type. Remember there are three kinds of resignation in dete_survey_updated.
# dete_survey_updated
# select rows based on multiple values
# df[df["column_name].isin(["value_1", "value_2"])]
resignation_list = ["Resignation-Other reasons","Resignation-Other employer", "Resignation-Move overseas/interstate"]
dete_resignations = dete_survey_updated[dete_survey_updated["separationtype"].isin(resignation_list)].copy()
# tafe_survey_updated
tafe_resignations = tafe_survey_updated[tafe_survey_updated["separationtype"]=="Resignation"].copy()
We use Series.value_counts() to review the unique values. There are 311 employees who exited via resignation in dete_survey, and 340 in tafe_survey.
Since we'd like to analyze employees who exited via resignation, we have to select only the rows with a resignation value in the separationtype column.
dete_survey has three kinds of resignation, so we need to select rows based on multiple values. After some searching, we decide to use the df[df["column_name"].isin(["value_1", "value_2"])] method.
For tafe_survey, since there's only one kind of resignation, we can simply use boolean indexing to select rows.
# view the unique values in cease_date
# select only the year part and convert to float type
dete_resignations["cease_date"].value_counts()
dete_resignations["cease_date"] = dete_resignations["cease_date"].str[-4:].astype(float)
# Use series.value_counts().sort_values()
# to check the values
dete_resignations["cease_date"].value_counts().sort_values(ascending=False)
# Use series.value_counts().sort_index()
# to check the values by index
s1 = dete_resignations["dete_start_date"].value_counts().sort_index(ascending=False)
print(s1.head(10))
tafe_resignations["cease_date"].value_counts().sort_index(ascending=False)
The years in the cease_date column of dete_resignations are stored as strings. Some are in {yyyy} format and some in {mm/yyyy}, and we'd like to unify the format to {yyyy}. First, we use Series.str[-4:] to select the last four characters. Second, we chain Series.astype(float) to convert the type.
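A minimal illustration of this slicing step, using made-up values in both formats:
# illustrative only: slicing keeps the last four characters of either format
sample = pd.Series(["2012", "09/2013"])
print(sample.str[-4:].astype(float))  # 2012.0, 2013.0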
Then we review the values in cease_date and dete_start_date to check whether any fall outside acceptable bounds (a start date earlier than 1940, or a cease date later than the current date). Luckily, no values are out of bounds, so we can move on.
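For completeness, here's a minimal sketch of an explicit boundary check; the 1940 and 2020 bounds are taken from the text above, and we assume both columns are numeric years at this point:
# explicit boundary check (sketch): both filters should come back empty
too_early = dete_resignations[dete_resignations["dete_start_date"] < 1940]
too_late = dete_resignations[dete_resignations["cease_date"] > 2020]
print(len(too_early), len(too_late))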
# subtract the "dete_start_date" from "cease_date"
# assign the result to a new column "institute_service"
dete_resignations["institute_service"] = dete_resignations["cease_date"] - dete_resignations["dete_start_date"]
We have to keep in mind that our final goal is to answer whether employees with short or long service resigned due to some kind of dissatisfaction. So we need to calculate the length of service by subtracting dete_start_date from cease_date, and assign the result to a new column, institute_service.
# Convert the values to True, False or NaN
# # "Contributing Factors. Dissatisfaction" column
# tafe_resignations["Contributing Factors. Dissatisfaction"] = tafe_resignations["Contributing Factors. Dissatisfaction"].str.replace("-","False").str.replace("Contributing Factors. Dissatisfaction","True")
# # "Contributing Factors. Job Dissatisfaction" column
# tafe_resignations["Contributing Factors. Job Dissatisfaction"] = tafe_resignations["Contributing Factors. Job Dissatisfaction"].str.replace("-","False").str.replace("Job Dissatisfaction","True")
tafe_cols = ["Contributing Factors. Dissatisfaction",
"Contributing Factors. Job Dissatisfaction"]
dete_cols = ["job_dissatisfaction", "dissatisfaction_with_the_department",
"physical_work_environment","lack_of_recognition",
"lack_of_job_security","work_location",
"employment_conditions","work_life_balance","workload"]
# write a function that maps NaN to np.nan, "-" to False,
# and anything else to True
def update_vals(val):
    if pd.isnull(val):
        return np.nan
    elif val == "-":
        return False
    else:
        return True
# tafe_resignations
# use df.applymap(function).any(axis=1,skipna=False)
tafe_resignations["dissatisfied"] = tafe_resignations[tafe_cols].applymap(update_vals).any(axis=1, skipna=False)
tafe_resignations_up = tafe_resignations.copy()
# dete_resignations
dete_resignations["dissatisfied"] = dete_resignations[dete_cols].any(axis=1, skipna=False)
dete_resignations_up = dete_resignations.copy()
# TAFE
tafe_resignations_up["dissatisfied"].value_counts(dropna=False)
# DETE
dete_resignations_up["dissatisfied"].value_counts(dropna=False)
We write a function to transform the values into the format we expect, then collapse the dissatisfaction-related columns of each survey into a single dissatisfied column.
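As a quick sanity check, update_vals behaves like this on representative inputs (illustrative values only):
# illustrative check of update_vals on the three kinds of input
for v in [np.nan, "-", "Job Dissatisfaction"]:
    print(repr(v), "->", update_vals(v))  # nan -> nan, "-" -> False, text -> True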
# Add a column "institute" to dete_resignations_up
dete_resignations_up["institute"] = "DETE"
# Do the same to tafe_resignations_up
tafe_resignations_up["institute"] = "TAFE"
# Combine both dataframes, assign to combined
combined = pd.concat([dete_resignations_up ,tafe_resignations_up])
# use df.dropna(axis=1, thresh=500) to remove any columns
# with fewer than 500 non-null values
combined = combined.dropna(axis=1, thresh=500)
# df.notnull().sum() to check if result is what we expected
combined.notnull().sum()
After many cleaning steps, we're finally ready to combine both datasets. Our goal is to aggregate the data by the institute_service column (how many years an employee worked there).
We want to remove columns with too many null values. Since there are 651 rows, we decide to keep only columns with more than 500 non-null values for further analysis.
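Here's a tiny illustration of how thresh works; the demo frame is ours, not from the surveys:
# demo only: thresh keeps columns with at least that many non-null values
demo = pd.DataFrame({"full": [1, 2, 3], "sparse": [np.nan, np.nan, 3]})
print(demo.dropna(axis=1, thresh=2))  # "sparse" has only 1 non-null, so it's dropped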
Classify the employees into four groups by years of service: Newbie (less than 3 years), Sophomore (3-6 years), Tenured (7-10 years) and Sage (11 or more years):
# convert value into string type, extract number only
# convert the result back to float and assign to new column
years = combined["institute_service"].astype("str").str.extract(r"(\d+)", expand=True)
combined["institute_service_up"] = years.astype(float)
# write a function to classify service years into groups
# remember to handle null values separately
def classify_year(val):
    if pd.isnull(val):
        return np.nan
    elif val < 3:
        return "Newbie"
    elif 3 <= val < 7:
        return "Sophomore"
    elif 7 <= val < 11:
        return "Tenured"
    else:
        return "Sage"
# apply function and assign into new column
combined["service_cat"] = combined["institute_service_up"].apply(classify_year)
# check if the result is what we expect
# combined[combined["service_cat"] == "Tenured"]
# combined[combined["service_cat"] == "Sage"]
# combined[combined["service_cat"] == "Newbie"]
# combined[combined["service_cat"] == "Sophomore"]
If we look into the institute_service column, we can see the format is inconsistent and hard to analyze. So we convert the values to strings, extract only the digits, and convert back to floats, assigning the result to a new column, institute_service_up.
We write a function to classify the values of institute_service_up into four groups: Newbie, Sophomore, Tenured and Sage. We use Series.apply() to run the function and assign the result to a new column, service_cat.
Remember to check that the result is what we expected.
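To see what the extraction step actually does, here's a small illustration on made-up values resembling the raw formats:
# illustrative only: str.extract pulls out the first run of digits
sample = pd.Series(["1-2", "Less than 1 year", "More than 20 years", "5.0"])
print(sample.str.extract(r"(\d+)", expand=False))  # "1", "1", "20", "5"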
# confirm the number of True & False in dissatisfied column
# use dropna=False to also confirm number of missing value
combined["dissatisfied"].value_counts(dropna=False)
# use DataFrame.fillna() to replace missing value
# with the value that occurs most frequently
combined["dissatisfied"] = combined["dissatisfied"].fillna(False)
# use pd.pivot_table(df, index, values, aggfunc)
# the default aggfunc is mean
# since True is counted as 1 and False as 0,
# the mean here can be read as a percentage
dissatisfied_by_year = pd.pivot_table(combined, index=["service_cat"], values=["dissatisfied"])
dissatisfied_by_year
dissatisfied_by_year.plot(kind="bar", legend=None, title="Dissatisfaction Percentage by Service Year")
plt.tick_params(right=False, top=False)
plt.show()
We replace the missing values in the dissatisfied column with the most frequent value, False. Then we use pivot_table() to generate a table. True is counted as 1 and False as 0, and the default aggfunc of pivot_table is mean, so without setting aggfunc we get mean values, which can be read as percentages in this situation.
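A one-line illustration of why the mean reads as a percentage (demo values are ours):
# demo only: booleans average as 1s and 0s
print(pd.Series([True, False, True, True]).mean())  # 0.75, i.e. 75% True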
We can plot the table and draw a brief conclusion: the dissatisfaction percentage rises with years of service, with Tenured and Sage employees resigning due to dissatisfaction at the highest rates. According to this, if we were managers at these institutes, we should take the opinions of long-serving employees seriously, or we'll easily lose their loyalty.
# count how many people in each service group resigned due to dissatisfaction
grouped = combined.groupby(["service_cat","dissatisfied"])["service_cat"].agg("count")
print(grouped)
The number who exited due to dissatisfaction among Newbies: 57
The number who exited due to dissatisfaction among Sophomores: 59
The number who exited due to dissatisfaction among Tenured: 32
The number who exited due to dissatisfaction among Sages: 66
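If you prefer the counts side by side, the same grouped result can be reshaped; this is a small optional step, not part of the original analysis:
# optional: unstack so the True/False counts sit in separate columns
print(grouped.unstack())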
# review the value status and figure out how to clean data
combined["age"]
combined["age_up"] = combined["age"].str.extract(r"(\d+)", expand=True).astype(float)
combined["age_up"].value_counts()
# classify ages into four buckets; boundaries adjusted so that
# 30, 40 and 50 land in the bucket their label implies
def classify_age(val):
    if pd.isnull(val):
        return np.nan
    elif val <= 30:
        return "30 or less"
    elif val < 40:
        return "30s"
    elif val < 50:
        return "40s"
    else:
        return "50 or more"
combined["age_cat"] = combined["age_up"].apply(classify_age)
combined["age_cat"].value_counts(dropna=False)
# Use groupby to see how many people by age are dissatisfied
combined.groupby(["age_cat","dissatisfied"])["age_cat"].agg("count")
The number who exited due to dissatisfaction among those 30 or less: 49
The number who exited due to dissatisfaction among 30s: 48
The number who exited due to dissatisfaction among 40s: 66
The number who exited due to dissatisfaction among those 50 or more: 63
# recall the "institute" column stores
# which survey each row came from
# see total amounts
combined.groupby(["institute","dissatisfied"])["institute"].agg("count")
# see percentage
pd.pivot_table(combined, index=["institute"], values=["dissatisfied"])
Based on both total counts (149 > 91) and percentages (47% > 26%), we can see that more DETE employees than TAFE employees resigned due to dissatisfaction.
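To close the loop visually, here's a minimal sketch plotting the institute comparison in the same style as the earlier chart; the variable name dissatisfied_by_institute is ours:
# plot dissatisfaction percentage by institute, reusing the pivot from above
dissatisfied_by_institute = pd.pivot_table(combined, index=["institute"], values=["dissatisfied"])
dissatisfied_by_institute.plot(kind="bar", legend=None, title="Dissatisfaction Percentage by Institute")
plt.show()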