CLEAN AND ANALYZE EMPLOYEE EXITS SURVEYS

We'll work with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. You can find the TAFE exit survey here and the survey for the DETE here.

We'll try to answer the following questions:

  • Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction?

  • What about employees who have been there longer?

  • Are younger employees resigning due to some kind of dissatisfaction?

  • What about older employees?

Index of contents:

  • A. READING AND EXPLORING OUR DATAFRAMES

  • B. TRANSFORMING OUR DATA

    • B.1. Reshape our data
    • B.2. Renaming columns
    • B.3. Subseting our Dataframe
    • B.4. Checking for inconsistencies in our data
    • B.5. Transforming values
  • C. COMBINING OUR DATAFRAMES

    • C.1. Clean up an transform the "institute_service" column
    • C.2. Joining "service_cat" column to combined_service_cat dataframe
    • C.3. Clean up an transform the "age" column
    • C.4. Joining "ages_mean" column to combined_age dataframe
  • D. VISUALIZING CLEANED DATA

    • D.1. Visualizing data of combined_service_cat dataframe
  • E. CONCLUSIONS

Summary of conclusions:

  • Only a 30% approximately of "New" employees (Less than 3 years at their workplace) were dissatisfied with their jobs when they exit. Employees belonging to "Established" and Veteran categories, with more than 7 years and 11 years of servis respectively, were the most dissatisfied.

  • However grouping the data by the age of employees, there are very little differences between groups, in relation to their percentaje of dissatisfaction when they resigned.

A. READING AND EXPLORING OUR DATAFRAMES

In [166]:
import numpy as np
import pandas as pd

#read both files and assing them to variables. For this proyect both are coded in UTF-8 (the original ones are encoded using cp1252.) 
dete_survey = pd.read_csv("dete_survey.csv")
tafe_survey = pd.read_csv("tafe_survey.csv")
In [167]:
dete_survey.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 56 columns):
ID                                     822 non-null int64
SeparationType                         822 non-null object
Cease Date                             822 non-null object
DETE Start Date                        822 non-null object
Role Start Date                        822 non-null object
Position                               817 non-null object
Classification                         455 non-null object
Region                                 822 non-null object
Business Unit                          126 non-null object
Employment Status                      817 non-null object
Career move to public sector           822 non-null bool
Career move to private sector          822 non-null bool
Interpersonal conflicts                822 non-null bool
Job dissatisfaction                    822 non-null bool
Dissatisfaction with the department    822 non-null bool
Physical work environment              822 non-null bool
Lack of recognition                    822 non-null bool
Lack of job security                   822 non-null bool
Work location                          822 non-null bool
Employment conditions                  822 non-null bool
Maternity/family                       822 non-null bool
Relocation                             822 non-null bool
Study/Travel                           822 non-null bool
Ill Health                             822 non-null bool
Traumatic incident                     822 non-null bool
Work life balance                      822 non-null bool
Workload                               822 non-null bool
None of the above                      822 non-null bool
Professional Development               808 non-null object
Opportunities for promotion            735 non-null object
Staff morale                           816 non-null object
Workplace issue                        788 non-null object
Physical environment                   817 non-null object
Worklife balance                       815 non-null object
Stress and pressure support            810 non-null object
Performance of supervisor              813 non-null object
Peer support                           812 non-null object
Initiative                             813 non-null object
Skills                                 811 non-null object
Coach                                  767 non-null object
Career Aspirations                     746 non-null object
Feedback                               792 non-null object
Further PD                             768 non-null object
Communication                          814 non-null object
My say                                 812 non-null object
Information                            816 non-null object
Kept informed                          813 non-null object
Wellness programs                      766 non-null object
Health & Safety                        793 non-null object
Gender                                 798 non-null object
Age                                    811 non-null object
Aboriginal                             16 non-null object
Torres Strait                          3 non-null object
South Sea                              7 non-null object
Disability                             23 non-null object
NESB                                   32 non-null object
dtypes: bool(18), int64(1), object(37)
memory usage: 258.6+ KB
In [168]:
dete_survey.head(2)
Out[168]:
ID SeparationType Cease Date DETE Start Date Role Start Date Position Classification Region Business Unit Employment Status ... Kept informed Wellness programs Health & Safety Gender Age Aboriginal Torres Strait South Sea Disability NESB
0 1 Ill Health Retirement 08/2012 1984 2004 Public Servant A01-A04 Central Office Corporate Strategy and Peformance Permanent Full-time ... N N N Male 56-60 NaN NaN NaN NaN Yes
1 2 Voluntary Early Retirement (VER) 08/2012 Not Stated Not Stated Public Servant AO5-AO7 Central Office Corporate Strategy and Peformance Permanent Full-time ... N N N Male 56-60 NaN NaN NaN NaN NaN

2 rows × 56 columns

In [169]:
dete_survey.describe(include = "all").iloc[0:5, 0:12] # for a view of the 5 first lines from the 12 first columns
Out[169]:
ID SeparationType Cease Date DETE Start Date Role Start Date Position Classification Region Business Unit Employment Status Career move to public sector Career move to private sector
count 822.000000 822 822 822 822 817 455 822 126 817 822 822
unique NaN 9 25 51 46 15 8 9 14 5 2 2
top NaN Age Retirement 2012 Not Stated Not Stated Teacher Primary Metropolitan Education Queensland Permanent Full-time False False
freq NaN 285 344 73 98 324 161 135 54 434 800 742
mean 411.693431 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [170]:
dete_survey.describe(include = "all").iloc[0:5, 10:28] # columns 10-27
Out[170]:
Career move to public sector Career move to private sector Interpersonal conflicts Job dissatisfaction Dissatisfaction with the department Physical work environment Lack of recognition Lack of job security Work location Employment conditions Maternity/family Relocation Study/Travel Ill Health Traumatic incident Work life balance Workload None of the above
count 822 822 822 822 822 822 822 822 822 822 822 822 822 822 822 822 822 822
unique 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
top False False False False False False False False False False False False False False False False False False
freq 800 742 788 733 761 806 765 794 795 788 760 754 785 710 794 605 735 605
mean NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [171]:
dete_survey.describe(include = "all").iloc[0:5, 28:49] 
Out[171]:
Professional Development Opportunities for promotion Staff morale Workplace issue Physical environment Worklife balance Stress and pressure support Performance of supervisor Peer support Initiative ... Coach Career Aspirations Feedback Further PD Communication My say Information Kept informed Wellness programs Health & Safety
count 808 735 816 788 817 815 810 813 812 813 ... 767 746 792 768 814 812 816 813 766 793
unique 6 6 6 6 6 6 6 6 6 6 ... 6 6 6 6 6 6 6 6 6 6
top A A A A A A A A A A ... A A A A A A A A A A
freq 413 242 335 357 467 359 342 349 401 396 ... 345 246 348 293 399 400 436 401 253 386
mean NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 21 columns

In [172]:
dete_survey["Professional Development"].value_counts(dropna = False)
Out[172]:
A      413
SA     184
N      103
D       60
SD      33
M       15
NaN     14
Name: Professional Development, dtype: int64
In [173]:
dete_survey.describe(include = "all").iloc[0:5, 49:] 
Out[173]:
Gender Age Aboriginal Torres Strait South Sea Disability NESB
count 798 811 16 3 7 23 32
unique 2 10 1 1 1 1 1
top Female 61 or older Yes Yes Yes Yes Yes
freq 573 222 16 3 7 23 32
mean NaN NaN NaN NaN NaN NaN NaN
  • Findins in dete_survey:

Almost all data are strings or boolean, there is only a column with numeric values that is ID column.

There are "Not Stated" values instead of NaN (e.g. in DETE start date column).

The names of columns combines upper and lower character. Should be standarize to upper for better management.

We can clasified four groups of columns for their type of information and values:

columns 0-12: general information related to the position and the cease. Type of values, strings.

columns 12-27: causes of cease. Type of values, boolean.

columns 28-49: evaluation of the Department. Type of values, string ranking from M to A (with 6 categories between these both).

columns 49-56: personal information of the employee (Gender and Age). The last 5 columns are Y or N and have very few values. We should decide if drop them because doesn't apport any information for our aims.

There are many columns we don't need in this proyect. The columns which seem necesary for answering our initial questions are: DETE Start date, Cese dat, Age, and the group of columns related to causes of cease.

In [174]:
tafe_survey.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 72 columns):
Record ID                                                                                                                                                        702 non-null float64
Institute                                                                                                                                                        702 non-null object
WorkArea                                                                                                                                                         702 non-null object
CESSATION YEAR                                                                                                                                                   695 non-null float64
Reason for ceasing employment                                                                                                                                    701 non-null object
Contributing Factors. Career Move - Public Sector                                                                                                                437 non-null object
Contributing Factors. Career Move - Private Sector                                                                                                               437 non-null object
Contributing Factors. Career Move - Self-employment                                                                                                              437 non-null object
Contributing Factors. Ill Health                                                                                                                                 437 non-null object
Contributing Factors. Maternity/Family                                                                                                                           437 non-null object
Contributing Factors. Dissatisfaction                                                                                                                            437 non-null object
Contributing Factors. Job Dissatisfaction                                                                                                                        437 non-null object
Contributing Factors. Interpersonal Conflict                                                                                                                     437 non-null object
Contributing Factors. Study                                                                                                                                      437 non-null object
Contributing Factors. Travel                                                                                                                                     437 non-null object
Contributing Factors. Other                                                                                                                                      437 non-null object
Contributing Factors. NONE                                                                                                                                       437 non-null object
Main Factor. Which of these was the main factor for leaving?                                                                                                     113 non-null object
InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction                                                                           608 non-null object
InstituteViews. Topic:2. I was given access to skills training to help me do my job better                                                                       613 non-null object
InstituteViews. Topic:3. I was given adequate opportunities for personal development                                                                             610 non-null object
InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL%                                                              608 non-null object
InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had                                                                  615 non-null object
InstituteViews. Topic:6. The organisation recognised when staff did good work                                                                                    607 non-null object
InstituteViews. Topic:7. Management was generally supportive of me                                                                                               614 non-null object
InstituteViews. Topic:8. Management was generally supportive of my team                                                                                          608 non-null object
InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me                                                            610 non-null object
InstituteViews. Topic:10. Staff morale was positive within the Institute                                                                                         602 non-null object
InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly                                                                                   601 non-null object
InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently                                                                               597 non-null object
InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly                                                                                601 non-null object
WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit                                                  609 non-null object
WorkUnitViews. Topic:15. I worked well with my colleagues                                                                                                        605 non-null object
WorkUnitViews. Topic:16. My job was challenging and interesting                                                                                                  607 non-null object
WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work                                                                          610 non-null object
WorkUnitViews. Topic:18. I had sufficient contact with other people in my job                                                                                    613 non-null object
WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job                                                     609 non-null object
WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job                                                                                 609 non-null object
WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT]    608 non-null object
WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job                                                                              608 non-null object
WorkUnitViews. Topic:23. My job provided sufficient variety                                                                                                      611 non-null object
WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job                                                                      610 non-null object
WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction                                                          611 non-null object
WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance                                                      606 non-null object
WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area                         610 non-null object
WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date    609 non-null object
WorkUnitViews. Topic:29. There was adequate communication between staff in my unit                                                                               603 non-null object
WorkUnitViews. Topic:30. Staff morale was positive within my work unit                                                                                           606 non-null object
Induction. Did you undertake Workplace Induction?                                                                                                                619 non-null object
InductionInfo. Topic:Did you undertake a Corporate Induction?                                                                                                    432 non-null object
InductionInfo. Topic:Did you undertake a Institute Induction?                                                                                                    483 non-null object
InductionInfo. Topic: Did you undertake Team Induction?                                                                                                          440 non-null object
InductionInfo. Face to Face Topic:Did you undertake a Corporate Induction; Category:How it was conducted?                                                        555 non-null object
InductionInfo. On-line Topic:Did you undertake a Corporate Induction; Category:How it was conducted?                                                             555 non-null object
InductionInfo. Induction Manual Topic:Did you undertake a Corporate Induction?                                                                                   555 non-null object
InductionInfo. Face to Face Topic:Did you undertake a Institute Induction?                                                                                       530 non-null object
InductionInfo. On-line Topic:Did you undertake a Institute Induction?                                                                                            555 non-null object
InductionInfo. Induction Manual Topic:Did you undertake a Institute Induction?                                                                                   553 non-null object
InductionInfo. Face to Face Topic: Did you undertake Team Induction; Category?                                                                                   555 non-null object
InductionInfo. On-line Topic: Did you undertake Team Induction?process you undertook and how it was conducted.]                                                  555 non-null object
InductionInfo. Induction Manual Topic: Did you undertake Team Induction?                                                                                         555 non-null object
Workplace. Topic:Did you and your Manager develop a Performance and Professional Development Plan (PPDP)?                                                        608 non-null object
Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?                                                      594 non-null object
Workplace. Topic:Does your workplace promote and practice the principles of employment equity?                                                                   587 non-null object
Workplace. Topic:Does your workplace value the diversity of its employees?                                                                                       586 non-null object
Workplace. Topic:Would you recommend the Institute as an employer to others?                                                                                     581 non-null object
Gender. What is your Gender?                                                                                                                                     596 non-null object
CurrentAge. Current Age                                                                                                                                          596 non-null object
Employment Type. Employment Type                                                                                                                                 596 non-null object
Classification. Classification                                                                                                                                   596 non-null object
LengthofServiceOverall. Overall Length of Service at Institute (in years)                                                                                        596 non-null object
LengthofServiceCurrent. Length of Service at current workplace (in years)                                                                                        596 non-null object
dtypes: float64(2), object(70)
memory usage: 395.0+ KB
In [175]:
tafe_survey.head(2)
Out[175]:
Record ID Institute WorkArea CESSATION YEAR Reason for ceasing employment Contributing Factors. Career Move - Public Sector Contributing Factors. Career Move - Private Sector Contributing Factors. Career Move - Self-employment Contributing Factors. Ill Health Contributing Factors. Maternity/Family ... Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination? Workplace. Topic:Does your workplace promote and practice the principles of employment equity? Workplace. Topic:Does your workplace value the diversity of its employees? Workplace. Topic:Would you recommend the Institute as an employer to others? Gender. What is your Gender? CurrentAge. Current Age Employment Type. Employment Type Classification. Classification LengthofServiceOverall. Overall Length of Service at Institute (in years) LengthofServiceCurrent. Length of Service at current workplace (in years)
0 6.341330e+17 Southern Queensland Institute of TAFE Non-Delivery (corporate) 2010.0 Contract Expired NaN NaN NaN NaN NaN ... Yes Yes Yes Yes Female 26 30 Temporary Full-time Administration (AO) 1-2 1-2
1 6.341337e+17 Mount Isa Institute of TAFE Non-Delivery (corporate) 2010.0 Retirement - - - - - ... Yes Yes Yes Yes NaN NaN NaN NaN NaN NaN

2 rows × 72 columns

In [176]:
tafe_survey.describe(include = "all")[0:5]
Out[176]:
Record ID Institute WorkArea CESSATION YEAR Reason for ceasing employment Contributing Factors. Career Move - Public Sector Contributing Factors. Career Move - Private Sector Contributing Factors. Career Move - Self-employment Contributing Factors. Ill Health Contributing Factors. Maternity/Family ... Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination? Workplace. Topic:Does your workplace promote and practice the principles of employment equity? Workplace. Topic:Does your workplace value the diversity of its employees? Workplace. Topic:Would you recommend the Institute as an employer to others? Gender. What is your Gender? CurrentAge. Current Age Employment Type. Employment Type Classification. Classification LengthofServiceOverall. Overall Length of Service at Institute (in years) LengthofServiceCurrent. Length of Service at current workplace (in years)
count 7.020000e+02 702 702 695.000000 701 437 437 437 437 437 ... 594 587 586 581 596 596 596 596 596 596
unique NaN 12 2 NaN 6 2 2 2 2 2 ... 2 2 2 2 2 9 5 9 7 7
top NaN Brisbane North Institute of TAFE Non-Delivery (corporate) NaN Resignation - - - - - ... Yes Yes Yes Yes Female 56 or older Permanent Full-time Administration (AO) Less than 1 year Less than 1 year
freq NaN 161 432 NaN 340 375 336 420 403 411 ... 536 512 488 416 389 162 237 293 147 177
mean 6.346026e+17 NaN NaN 2011.423022 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 72 columns

In [177]:
tafe_survey["Contributing Factors. Career Move - Public Sector "].value_counts(dropna = False)
Out[177]:
-                              375
NaN                            265
Career Move - Public Sector     62
Name: Contributing Factors. Career Move - Public Sector , dtype: int64
  • Findins in tafe_survey:

The information of this file is much less clear:

The name of columns are too long for a good management of their information. It seems they used just the questions of this survey as columns names.

Almost of types of data are objects except for two columns: Record ID and CESSATION YEAR.

It seems it would be better select only the columns we will need for this proyect: CESSATION YEAR, LengthofServiceOverall, CurrentAge, and the gropu of Contributing factors.

B. TRANSFORMING OUR DATA

B.1. Reshape our data

We'll transform missing values of date_survey file named as "Not Stated" as Nan values.

Also we'll eliminate columns that we don't need.

In [178]:
dete_survey = pd.read_csv("dete_survey.csv", na_values = "Not Stated")

dete_survey.head(2) # only the 2 first rows
Out[178]:
ID SeparationType Cease Date DETE Start Date Role Start Date Position Classification Region Business Unit Employment Status ... Kept informed Wellness programs Health & Safety Gender Age Aboriginal Torres Strait South Sea Disability NESB
0 1 Ill Health Retirement 08/2012 1984.0 2004.0 Public Servant A01-A04 Central Office Corporate Strategy and Peformance Permanent Full-time ... N N N Male 56-60 NaN NaN NaN NaN Yes
1 2 Voluntary Early Retirement (VER) 08/2012 NaN NaN Public Servant AO5-AO7 Central Office Corporate Strategy and Peformance Permanent Full-time ... N N N Male 56-60 NaN NaN NaN NaN NaN

2 rows × 56 columns

In [179]:
dete_survey["DETE Start Date"].value_counts(dropna = False)
Out[179]:
NaN        73
 2011.0    40
 2007.0    34
 2008.0    31
 2010.0    27
 2012.0    27
 2009.0    24
 2006.0    23
 1970.0    21
 1975.0    21
 2013.0    21
 2005.0    20
 1990.0    20
 1999.0    19
 1996.0    19
 1992.0    18
 1991.0    18
 2000.0    18
 2004.0    18
 1989.0    17
 1978.0    15
 2003.0    15
 1988.0    15
 1976.0    15
 2002.0    15
 1974.0    14
 1997.0    14
 1998.0    14
 1979.0    14
 1995.0    14
 1980.0    14
 1993.0    13
 1972.0    12
 1986.0    12
 1977.0    11
 1971.0    10
 1984.0    10
 1994.0    10
 1969.0    10
 2001.0    10
 1983.0     9
 1981.0     9
 1973.0     8
 1985.0     8
 1987.0     7
 1982.0     4
 1963.0     4
 1968.0     3
 1967.0     2
 1965.0     1
 1966.0     1
Name: DETE Start Date, dtype: int64

We don't have "Not Stated" values in this column as before...

In [180]:
dete_survey_updated = dete_survey.drop(axis = 1, labels = dete_survey.columns[28:49]) #use a subset of the list of columns to drop

dete_survey_updated.info() # take a look of the remainings columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 35 columns):
ID                                     822 non-null int64
SeparationType                         822 non-null object
Cease Date                             788 non-null object
DETE Start Date                        749 non-null float64
Role Start Date                        724 non-null float64
Position                               817 non-null object
Classification                         455 non-null object
Region                                 717 non-null object
Business Unit                          126 non-null object
Employment Status                      817 non-null object
Career move to public sector           822 non-null bool
Career move to private sector          822 non-null bool
Interpersonal conflicts                822 non-null bool
Job dissatisfaction                    822 non-null bool
Dissatisfaction with the department    822 non-null bool
Physical work environment              822 non-null bool
Lack of recognition                    822 non-null bool
Lack of job security                   822 non-null bool
Work location                          822 non-null bool
Employment conditions                  822 non-null bool
Maternity/family                       822 non-null bool
Relocation                             822 non-null bool
Study/Travel                           822 non-null bool
Ill Health                             822 non-null bool
Traumatic incident                     822 non-null bool
Work life balance                      822 non-null bool
Workload                               822 non-null bool
None of the above                      822 non-null bool
Gender                                 798 non-null object
Age                                    811 non-null object
Aboriginal                             16 non-null object
Torres Strait                          3 non-null object
South Sea                              7 non-null object
Disability                             23 non-null object
NESB                                   32 non-null object
dtypes: bool(18), float64(2), int64(1), object(14)
memory usage: 123.7+ KB

We have eliminated all columns related to evaluate the department, which we didn't need.

In [181]:
tafe_survey_updated = tafe_survey.drop(columns = tafe_survey.columns[17:66]) #in this case we use columns parameter instead of label + axis

tafe_survey_updated.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 23 columns):
Record ID                                                                    702 non-null float64
Institute                                                                    702 non-null object
WorkArea                                                                     702 non-null object
CESSATION YEAR                                                               695 non-null float64
Reason for ceasing employment                                                701 non-null object
Contributing Factors. Career Move - Public Sector                            437 non-null object
Contributing Factors. Career Move - Private Sector                           437 non-null object
Contributing Factors. Career Move - Self-employment                          437 non-null object
Contributing Factors. Ill Health                                             437 non-null object
Contributing Factors. Maternity/Family                                       437 non-null object
Contributing Factors. Dissatisfaction                                        437 non-null object
Contributing Factors. Job Dissatisfaction                                    437 non-null object
Contributing Factors. Interpersonal Conflict                                 437 non-null object
Contributing Factors. Study                                                  437 non-null object
Contributing Factors. Travel                                                 437 non-null object
Contributing Factors. Other                                                  437 non-null object
Contributing Factors. NONE                                                   437 non-null object
Gender. What is your Gender?                                                 596 non-null object
CurrentAge. Current Age                                                      596 non-null object
Employment Type. Employment Type                                             596 non-null object
Classification. Classification                                               596 non-null object
LengthofServiceOverall. Overall Length of Service at Institute (in years)    596 non-null object
LengthofServiceCurrent. Length of Service at current workplace (in years)    596 non-null object
dtypes: float64(2), object(21)
memory usage: 126.2+ KB

In the case of tafe_survey the reduction of columns is even bigger.

B.2. Renaming columns

Because we eventually want to combine them, we'll have to standardize the column names of our interest.

We will use the following criteria to update the column names:

  • Make all the capitalization lowercase.
  • Remove any trailing whitespace from the end of the strings.
  • Replace spaces with underscores ('_').
In [182]:
dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.strip().str.replace(" ", "_")

print(dete_survey_updated.columns) # to check the result
Index(['id', 'separationtype', 'cease_date', 'dete_start_date',
       'role_start_date', 'position', 'classification', 'region',
       'business_unit', 'employment_status', 'career_move_to_public_sector',
       'career_move_to_private_sector', 'interpersonal_conflicts',
       'job_dissatisfaction', 'dissatisfaction_with_the_department',
       'physical_work_environment', 'lack_of_recognition',
       'lack_of_job_security', 'work_location', 'employment_conditions',
       'maternity/family', 'relocation', 'study/travel', 'ill_health',
       'traumatic_incident', 'work_life_balance', 'workload',
       'none_of_the_above', 'gender', 'age', 'aboriginal', 'torres_strait',
       'south_sea', 'disability', 'nesb'],
      dtype='object')
In [183]:
print(tafe_survey_updated.columns)
Index(['Record ID', 'Institute', 'WorkArea', 'CESSATION YEAR',
       'Reason for ceasing employment',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE',
       'Gender. What is your Gender?', 'CurrentAge. Current Age',
       'Employment Type. Employment Type', 'Classification. Classification',
       'LengthofServiceOverall. Overall Length of Service at Institute (in years)',
       'LengthofServiceCurrent. Length of Service at current workplace (in years)'],
      dtype='object')
In [184]:
# to update the columns name in tafe_survey_updated, according to dete_survey_updated columns names.
tafe_survey_updated = (tafe_survey_updated.rename(mapper = {'Record ID': 'id', 
                                                            'CESSATION YEAR': 'cease_date', 
                                                            'Reason for ceasing employment': 'separationtype', 
                                                            'Gender. What is your Gender?': 'gender', 
                                                            'CurrentAge. Current Age': 'age', 
                                                            'Employment Type. Employment Type': 'employment_status', 
                                                            'Classification. Classification': 'position', 
                                                            'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service', 
                                                            'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'},
                                                  axis = 1))

print(tafe_survey_updated.columns)
Index(['id', 'Institute', 'WorkArea', 'cease_date', 'separationtype',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE', 'gender',
       'age', 'employment_status', 'position', 'institute_service',
       'role_service'],
      dtype='object')

B.3. Subseting our Dataframe

Some of our end goals is to answer the following questions:

  • Are employees who have only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction?
  • What about employees who have been at the job longer?

For this project, we'll only analyze survey respondents who resigned, so their separation type column, contains the string 'Resignation'.

In [185]:
dete_survey_updated["separationtype"].value_counts(dropna = False)
Out[185]:
Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: separationtype, dtype: int64
In [186]:
tafe_survey_updated["separationtype"].value_counts(dropna = False)
Out[186]:
Resignation                 340
Contract Expired            127
Retrenchment/ Redundancy    104
Retirement                   82
Transfer                     25
Termination                  23
NaN                           1
Name: separationtype, dtype: int64
In [187]:
# to make a boolean index from dete_separation column with all 3 types of resignations
dete_separationtype_resign = ((dete_survey_updated["separationtype"] == "Resignation-Other reasons") | 
                              (dete_survey_updated["separationtype"] == "Resignation-Other employer") |
                              (dete_survey_updated["separationtype"] == "Resignation-Move overseas/interstate")
                             )

# make a subset of a copy of dete_survey_updated (for avoiding SettingWithCopy Warning later)
dete_resignations = dete_survey_updated.copy()[dete_separationtype_resign]

dete_resignations["separationtype"].value_counts(dropna = False)
Out[187]:
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Name: separationtype, dtype: int64
In [188]:
# the same operation with tafe_survey_updated dataframe
tafe_resignations = tafe_survey_updated.copy()[tafe_survey_updated["separationtype"] == "Resignation"]

tafe_resignations["separationtype"].value_counts(dropna = False)
Out[188]:
Resignation    340
Name: separationtype, dtype: int64

B.4. Checking for inconsistencies in our data

Now, before we start cleaning and manipulating the rest of our data, let's verify that the data doesn't contain any major inconsistencies.

In [189]:
dete_resignations["cease_date"].value_counts(dropna = False)
Out[189]:
2012       126
2013        74
01/2014     22
12/2013     17
06/2013     14
NaN         11
09/2013     11
11/2013      9
07/2013      9
10/2013      6
08/2013      4
05/2012      2
05/2013      2
07/2012      1
2010         1
09/2010      1
07/2006      1
Name: cease_date, dtype: int64

We need to clean this data from this column (cease_date): extract the year and convert its type to float.

In [190]:
pattern = r"(2[0-9]{3})" # regular expresion for extracting years from 2000 year. 

dete_resignations["cease_date"] = dete_resignations["cease_date"].str.extract(pattern).astype(float)

dete_resignations["cease_date"].value_counts(dropna = False) # to check changes
/dataquest/system/env/python3/lib/python3.4/site-packages/ipykernel/__main__.py:3: FutureWarning:

currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)

Out[190]:
 2013.0    146
 2012.0    129
 2014.0     22
NaN         11
 2010.0      2
 2006.0      1
Name: cease_date, dtype: int64
In [191]:
dete_resignations['dete_start_date'].value_counts(dropna = False).sort_index(ascending = True) # sort by year (index), acending
Out[191]:
 1963.0     1
 1971.0     1
 1972.0     1
 1973.0     1
 1974.0     2
 1975.0     1
 1976.0     2
 1977.0     1
 1980.0     5
 1982.0     1
 1983.0     2
 1984.0     1
 1985.0     3
 1986.0     3
 1987.0     1
 1988.0     4
 1989.0     4
 1990.0     5
 1991.0     4
 1992.0     6
 1993.0     5
 1994.0     6
 1995.0     4
 1996.0     6
 1997.0     5
 1998.0     6
 1999.0     8
 2000.0     9
 2001.0     3
 2002.0     6
 2003.0     6
 2004.0    14
 2005.0    15
 2006.0    13
 2007.0    21
 2008.0    22
 2009.0    13
 2010.0    17
 2011.0    24
 2012.0    21
 2013.0    10
NaN        28
Name: dete_start_date, dtype: int64
In [192]:
tafe_resignations['cease_date'].value_counts(dropna = False)
Out[192]:
 2011.0    116
 2012.0     94
 2010.0     68
 2013.0     55
NaN          5
 2009.0      2
Name: cease_date, dtype: int64

Values at these two column (dete_start_date and cease_date in tafe) are ok. There is no need to clean it.

Cheching all this columns we can see that there are no cease dates in dete before 2006, and it's odd because there are many star dates since 1963. Surely have had ceases during all this time that haven't been appointed until 2006 in this database.

B.5. Transforming values

We can calculate years of service in dete survey by subtracting the values in dete_start_date from the cease_date. We'll assign the result to a new column in dete named institute_service.

In [193]:
dete_resignations["institute_service"] = (dete_resignations["cease_date"] - 
                                         dete_resignations['dete_start_date'])

dete_resignations["institute_service"].describe() # to view its basics statistic values
Out[193]:
count    273.000000
mean      10.457875
std        9.931709
min        0.000000
25%        3.000000
50%        7.000000
75%       16.000000
max       49.000000
Name: institute_service, dtype: float64

Next we must choose what factors or columns appointed in these surveys are the adecuated for considerign that an employee quit because thet were dissatisfied.

  • From dete survey we'll choose the following columns:'interpersonal_conflicts','job_dissatisfaction', 'dissatisfaction_with_the_department','physical_work_environment', 'lack_of_recognition', 'lack_of_job_security', 'work_location', 'employment_conditions', 'work_life_balance' and 'workload'.
  • And from tafe survey:'Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction', 'Contributing Factors. Interpersonal Conflict'.

If the employee indicated any of the factors above caused them to resign, we'll mark them as dissatisfied in a new column.

In [194]:
# as example for a view of the original values of these columns in dete survey
dete_resignations['interpersonal_conflicts'].value_counts(dropna = False)
Out[194]:
False    291
True      20
Name: interpersonal_conflicts, dtype: int64
In [195]:
# as example for a view of the original values of these columns in tefe survey
tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts(dropna = False)
Out[195]:
-                                         277
Contributing Factors. Dissatisfaction      55
NaN                                         8
Name: Contributing Factors. Dissatisfaction, dtype: int64

At this point we'll create a function to transform the data in selected tafe columns from their actual values, to boolean values.

In [196]:
tafe_dissat_columns = (['Contributing Factors. Dissatisfaction',
                       'Contributing Factors. Job Dissatisfaction', 
                        'Contributing Factors. Interpersonal Conflict'])
def to_boolean(element):
    
    if element == "-":
        return False
    
    elif pd.isnull(element): #pandas function
        return np.nan #numpy function
    
    else:
        return True

tafe_resignations[tafe_dissat_columns] = tafe_resignations[tafe_dissat_columns].applymap(to_boolean)
    
# to view the results use df.apply() method 
tafe_resignations[tafe_dissat_columns].apply(pd.value_counts, dropna = False)
Out[196]:
Contributing Factors. Dissatisfaction Contributing Factors. Job Dissatisfaction Contributing Factors. Interpersonal Conflict
False 277 270 308
True 55 62 24
NaN 8 8 8
In [197]:
# selected columns at dete survey:
dete_dissat_columns = (['interpersonal_conflicts','job_dissatisfaction', 
                       'dissatisfaction_with_the_department','physical_work_environment', 
                       'lack_of_recognition', 'lack_of_job_security', 'work_location', 
                       'employment_conditions', 'work_life_balance', 'workload'])

# to check values of all selected columns :
dete_resignations[dete_dissat_columns].apply(pd.value_counts, dropna = False)
Out[197]:
interpersonal_conflicts job_dissatisfaction dissatisfaction_with_the_department physical_work_environment lack_of_recognition lack_of_job_security work_location employment_conditions work_life_balance workload
False 291 270 282 305 278 297 293 288 243 284
True 20 41 29 6 33 14 18 23 68 27

Following we'll create a dissatisfied column in BOTH the tafe_resignations and dete_resignations dataframes.

If any of the columns of its respective dataframes listed above contain a True value, we'll add a True value to a new column named dissatisfied. To accomplish this, we'll use the DataFrame.any() method to do the following:

  • Return True if any element in the selected columns above is True
  • Return False if none of the elements in the selected columns above is True
  • Return NaN if the value is NaN

We also will use the df.copy() method to create a copy of the results and avoid the SettingWithCopy Warning. Assign the results to dete_resignations_up and tafe_resignations_up.

In [198]:
#assign True to dissatisfied columns if any element in the selected columns above is True, and NaN if the value is NaN
dete_resignations["dissatisfied"] = dete_resignations[dete_dissat_columns].any(axis = 1, skipna = False)

#make a copy:
dete_resignations_up = dete_resignations.copy()

#to check the results
dete_resignations_up["dissatisfied"].value_counts(dropna = False)
Out[198]:
False    157
True     154
Name: dissatisfied, dtype: int64
In [199]:
# the same for tafe columns
tafe_resignations["dissatisfied"] = tafe_resignations[tafe_dissat_columns].any(axis = 1, skipna = False)

tafe_resignations_up = tafe_resignations.copy()

tafe_resignations_up["dissatisfied"].value_counts(dropna = False)
Out[199]:
False    235
True      97
NaN        8
Name: dissatisfied, dtype: int64

C. COMBINING OUR DATAFRAMES

Now, we're ready to combine our datasets! Our end goal is to aggregate the data according to the institute_service column, so when you combine the data, think about how to get the data into a form that's easy to aggregate.

  1. First, let's add a column to each dataframe that will allow us to easily distinguish between the two:
  • Add a column named institute to dete_resignations_up. Each row should contain the value DETE.

  • Add a column named institute to tafe_resignations_up. Each row should contain the value TAFE.

  1. Combine the dataframes. Assign the result to combined.
  1. We still have some columns left in the dataframe that we don't need to complete our analysis. We'll use the DataFrame.dropna() method to drop any columns with less than 500 non null values, and will assign the result to combined_updated.
In [200]:
dete_resignations_up["institute"] = "DETE"

# to check the results and take a view of dete_resignations_up dataframe
dete_resignations_up.head(2)
Out[200]:
id separationtype cease_date dete_start_date role_start_date position classification region business_unit employment_status ... gender age aboriginal torres_strait south_sea disability nesb institute_service dissatisfied institute
3 4 Resignation-Other reasons 2012.0 2005.0 2006.0 Teacher Primary Central Queensland NaN Permanent Full-time ... Female 36-40 NaN NaN NaN NaN NaN 7.0 False DETE
5 6 Resignation-Other reasons 2012.0 1994.0 1997.0 Guidance Officer NaN Central Office Education Queensland Permanent Full-time ... Female 41-45 NaN NaN NaN NaN NaN 18.0 True DETE

2 rows × 38 columns

In [201]:
tafe_resignations_up["institute"] = "DETE"

# to check the results and take a view of dete_resignations_up dataframe
tafe_resignations_up.head(2)
Out[201]:
id Institute WorkArea cease_date separationtype Contributing Factors. Career Move - Public Sector Contributing Factors. Career Move - Private Sector Contributing Factors. Career Move - Self-employment Contributing Factors. Ill Health Contributing Factors. Maternity/Family ... Contributing Factors. Other Contributing Factors. NONE gender age employment_status position institute_service role_service dissatisfied institute
3 6.341399e+17 Mount Isa Institute of TAFE Non-Delivery (corporate) 2010.0 Resignation - - - - - ... - - NaN NaN NaN NaN NaN NaN False DETE
4 6.341466e+17 Southern Queensland Institute of TAFE Delivery (teaching) 2010.0 Resignation - Career Move - Private Sector - - - ... - - Male 41 45 Permanent Full-time Teacher (including LVT) 3-4 3-4 False DETE

2 rows × 25 columns

In [202]:
# concatenate dataframes horizontally
combined = pd.concat([dete_resignations_up, tafe_resignations_up])

combined.shape
Out[202]:
(651, 53)
In [203]:
# Calculate the number of missing values in each column
combined.isnull().sum().sort_values(ascending = False)
Out[203]:
torres_strait                                          651
south_sea                                              648
aboriginal                                             644
disability                                             643
nesb                                                   642
business_unit                                          619
classification                                         490
region                                                 386
role_start_date                                        380
dete_start_date                                        368
role_service                                           361
dissatisfaction_with_the_department                    340
work_location                                          340
employment_conditions                                  340
workload                                               340
job_dissatisfaction                                    340
career_move_to_public_sector                           340
career_move_to_private_sector                          340
ill_health                                             340
interpersonal_conflicts                                340
physical_work_environment                              340
relocation                                             340
work_life_balance                                      340
lack_of_job_security                                   340
traumatic_incident                                     340
lack_of_recognition                                    340
maternity/family                                       340
study/travel                                           340
none_of_the_above                                      340
Contributing Factors. Other                            319
Contributing Factors. Career Move - Public Sector      319
Contributing Factors. NONE                             319
Contributing Factors. Travel                           319
Contributing Factors. Career Move - Self-employment    319
Contributing Factors. Dissatisfaction                  319
Contributing Factors. Maternity/Family                 319
Contributing Factors. Job Dissatisfaction              319
Contributing Factors. Interpersonal Conflict           319
Contributing Factors. Ill Health                       319
Contributing Factors. Study                            319
Contributing Factors. Career Move - Private Sector     319
Institute                                              311
WorkArea                                               311
institute_service                                       88
gender                                                  59
age                                                     55
employment_status                                       54
position                                                53
cease_date                                              16
dissatisfied                                             8
id                                                       0
institute                                                0
separationtype                                           0
dtype: int64

The columns of our interest: dissatisfied, age, institute_service, are lower in Nan values

We are going to drop any columns with less than 500 non null values, and will assign the result to combined_updated.

In [204]:
combined_updated = combined.copy().dropna(thresh = 500, axis=1) # create a copy for avoid SettingWithCopyWarning later

# to check results
combined_updated.isnull().sum().sort_values(ascending = False)
Out[204]:
institute_service    88
gender               59
age                  55
employment_status    54
position             53
cease_date           16
dissatisfied          8
separationtype        0
institute             0
id                    0
dtype: int64

C.1. Clean up an transform the "institute_service" column

This column is tricky to clean because it currently contains values in different forms.

In [205]:
combined_updated["institute_service"].value_counts(dropna = False)
Out[205]:
NaN                   88
Less than 1 year      73
1-2                   64
3-4                   63
5-6                   33
11-20                 26
5.0                   23
1.0                   22
7-10                  21
0.0                   20
3.0                   20
6.0                   17
4.0                   16
2.0                   14
9.0                   14
7.0                   13
More than 20 years    10
13.0                   8
8.0                    8
15.0                   7
20.0                   7
14.0                   6
12.0                   6
17.0                   6
22.0                   6
10.0                   6
16.0                   5
18.0                   5
11.0                   4
23.0                   4
24.0                   4
19.0                   3
32.0                   3
39.0                   3
21.0                   3
28.0                   2
30.0                   2
26.0                   2
36.0                   2
25.0                   2
29.0                   1
31.0                   1
27.0                   1
34.0                   1
35.0                   1
38.0                   1
41.0                   1
42.0                   1
49.0                   1
33.0                   1
Name: institute_service, dtype: int64

We will tranform this column:

  • Drop its rows with Nan values

  • Convert their values into categories

In [206]:
# get the index of rows with nan values on column "institute_service"
index_rows = combined_updated[combined_updated["institute_service"].isnull()].index

# drop rows with null values. Create a new dataframe for not lossing these values in combined_updated
combined_service_cat = combined_updated.copy().drop(labels = index_rows)
In [207]:
combined_service_cat["institute_service"].value_counts(dropna = False)
Out[207]:
Less than 1 year      67
3-4                   63
1-2                   61
5-6                   32
11-20                 25
1.0                   21
7-10                  20
3.0                   19
5.0                   19
0.0                   18
6.0                   16
4.0                   15
9.0                   14
2.0                   12
7.0                   12
More than 20 years    10
13.0                   8
8.0                    7
15.0                   7
22.0                   6
17.0                   6
12.0                   6
20.0                   6
14.0                   5
16.0                   5
10.0                   4
18.0                   4
23.0                   4
24.0                   4
21.0                   3
11.0                   3
39.0                   3
19.0                   2
25.0                   2
26.0                   2
28.0                   2
36.0                   2
32.0                   2
33.0                   1
35.0                   1
34.0                   1
49.0                   1
29.0                   1
42.0                   1
31.0                   1
27.0                   1
41.0                   1
38.0                   1
Name: institute_service, dtype: int64
In [208]:
# Change all values to string and remove ".0" character. Note that dot needs go between []
combined_service_cat["institute_service"] = (combined_service_cat["institute_service"].astype(str).
                                         str.replace(r"([.]0)", ""))

combined_service_cat["institute_service"].value_counts(dropna = False).sort_index(ascending = False) #sorted for a better view
Out[208]:
More than 20 years    10
Less than 1 year      67
9                     14
8                      7
7-10                  20
7                     12
6                     16
5-6                   32
5                     19
49                     1
42                     1
41                     1
4                     15
39                     3
38                     1
36                     2
35                     1
34                     1
33                     1
32                     2
31                     1
3-4                   63
3                     19
29                     1
28                     2
27                     1
26                     2
25                     2
24                     4
23                     4
22                     6
21                     3
20                     6
2                     12
19                     2
18                     4
17                     6
16                     5
15                     7
14                     5
13                     8
12                     6
11-20                 25
11                     3
10                     4
1-2                   61
1                     21
0                     18
Name: institute_service, dtype: int64

In the following code we'll extract all years of "institute_service" column in combined_service_cat dataframe, and will store them in a new dataframe called "years".

Lately we'll calculate years["service_cat"] column as the arithmetic mean of all years stracted.

Finally we'll map this columnd and will transform numbers into cathegories.

In [209]:
# regular expresion for extracting all the years of "institute_service" column

pattern = r"(?P<First_Year>[0-9][0-9]?)-?(?P<Second_Year>[1-9][0-9]?)?" # question mark, ?, after a group 
                                                            # to indicate that a match for those groups 
                                                            # is optional.Only extracting characters
                                                            # between brakets ()

# dataframe which store all years extracted
years = combined_service_cat.copy()["institute_service"].str.extractall(pattern) # make a copy of combined_updated

# a view of all values stored in columns of years
years.apply(pd.value_counts, dropna = False)
Out[209]:
First_Year Second_Year
0 18.0 NaN
1 149.0 NaN
10 4.0 20.0
11 28.0 NaN
12 6.0 NaN
13 8.0 NaN
14 5.0 NaN
15 7.0 NaN
16 5.0 NaN
17 6.0 NaN
18 4.0 NaN
19 2.0 NaN
2 12.0 61.0
20 16.0 25.0
21 3.0 NaN
22 6.0 NaN
23 4.0 NaN
24 4.0 NaN
25 2.0 NaN
26 2.0 NaN
27 1.0 NaN
28 2.0 NaN
29 1.0 NaN
3 82.0 NaN
31 1.0 NaN
32 2.0 NaN
33 1.0 NaN
34 1.0 NaN
35 1.0 NaN
36 2.0 NaN
38 1.0 NaN
39 3.0 NaN
4 15.0 63.0
41 1.0 NaN
42 1.0 NaN
49 1.0 NaN
5 51.0 NaN
6 16.0 32.0
7 32.0 NaN
8 7.0 NaN
9 14.0 NaN
NaN NaN 326.0
In [210]:
years.head()
Out[210]:
First_Year Second_Year
match
5 0 18 NaN
8 0 3 NaN
9 0 15 NaN
11 0 3 NaN
12 0 14 NaN
In [211]:
# because years has a MultiIndex, reset it, remove the index level "match" and convert it in a column.
years = years.reset_index(level=["match"])
In [212]:
years.head()
Out[212]:
match First_Year Second_Year
5 0 18 NaN
8 0 3 NaN
9 0 15 NaN
11 0 3 NaN
12 0 14 NaN
In [213]:
# drop column match, because is useless
years = years.drop("match", axis = 1)
In [214]:
# change all values of its column to float
years = years.astype(float)
In [215]:
# transfor Nan values in Second_Year to 0
years["Second_Year"] = years["Second_Year"].fillna(0)
In [216]:
# create a new column in year based in First_Year values
years["service_cat"] = years["First_Year"]

# for rows which have values in its Second_Year. The value of "service_cat" It's the arithmetic mean
years.loc[years["Second_Year"] > 0,"service_cat"] = ((years.loc[years["Second_Year"] > 0, "First_Year"]
                                                     + years.loc[years["Second_Year"] > 0, "Second_Year"])
                                                     / 2)

years.head()
Out[216]:
First_Year Second_Year service_cat
5 18.0 0.0 18.0
8 3.0 0.0 3.0
9 15.0 0.0 15.0
11 3.0 0.0 3.0
12 14.0 0.0 14.0
In [217]:
# there is no rows with Nan values, we drop them before
years["service_cat"].isnull().sum()
Out[217]:
0

To analyze the data, we'll convert numbers into categories.

We'll use the slightly modified definitions below:

  • New: Less than 3 years at a company
  • Experienced: 3-6 years at a company
  • Established: 7-10 years at a company
  • Veteran: 11 or more years at a company
In [218]:
# create a function to transform the values year's service_cat column into categories
def to_categories(val):
    
    if val < 3:
        return "New"
    elif val >=3 and val <=6:
        return "Experienced"
    elif val >6 and val <=10: # minimun lower than definition to evoid gaps
            return "Established"
    else:
        return "Veteran"


years["service_cat"] = years["service_cat"].map(to_categories)

years["service_cat"].value_counts()
Out[218]:
New            179
Experienced    164
Veteran        127
Established     57
Name: service_cat, dtype: int64
In [219]:
# drop this two columns of years that we don't need more
years = years.drop(["First_Year", "Second_Year"], axis = 1)

C.2. Joining "service_cat" column to combined_service_cat dataframe

Following we join the column "service_cat" of years dataframe to combined_service_cat dataframe

Firsly we'll check if they both continue having the same index after all the former changes...

In [220]:
# to check if this two dataframes have the same number of rows to combine
print(combined_service_cat.shape)
print(years.shape)
(527, 10)
(527, 1)
In [221]:
combined_service_cat.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 527 entries, 5 to 701
Data columns (total 10 columns):
age                  525 non-null object
cease_date           525 non-null float64
dissatisfied         527 non-null object
employment_status    527 non-null object
gender               522 non-null object
id                   527 non-null float64
institute            527 non-null object
institute_service    527 non-null object
position             524 non-null object
separationtype       527 non-null object
dtypes: float64(2), object(8)
memory usage: 45.3+ KB
In [222]:
# to check if these dataframe have the same index
years.index == combined_service_cat.index
Out[222]:
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True])
In [223]:
merged = pd.merge(left = combined_service_cat, right = years, left_index = True, right_index = True)
In [224]:
merged.shape
Out[224]:
(727, 11)
In [225]:
merged.head()
Out[225]:
age cease_date dissatisfied employment_status gender id institute institute_service position separationtype service_cat
4 41 45 2010.0 False Permanent Full-time Male 6.341466e+17 DETE 3-4 Teacher (including LVT) Resignation Experienced
5 41-45 2012.0 True Permanent Full-time Female 6.000000e+00 DETE 18 Guidance Officer Resignation-Other reasons Veteran
5 41-45 2012.0 True Permanent Full-time Female 6.000000e+00 DETE 18 Guidance Officer Resignation-Other reasons Established
5 56 or older 2010.0 False Contract/casual Female 6.341475e+17 DETE 7-10 Teacher (including LVT) Resignation Veteran
5 56 or older 2010.0 False Contract/casual Female 6.341475e+17 DETE 7-10 Teacher (including LVT) Resignation Established

why this code above doesn't work??

It is supose it merges 2 dataframes by its index. That is the same index for both. But instead of maintaining the number of rows, create rows duplicated with the same index number.

On the contrary the code bellow seems to work fine...

In [226]:
combined_service_cat = pd.concat([combined_service_cat, years], axis = 1)

combined_service_cat.head()
Out[226]:
age cease_date dissatisfied employment_status gender id institute institute_service position separationtype service_cat
5 41-45 2012.0 True Permanent Full-time Female 6.0 DETE 18 Guidance Officer Resignation-Other reasons Veteran
8 31-35 2012.0 False Permanent Full-time Female 9.0 DETE 3 Teacher Resignation-Other reasons Experienced
9 46-50 2012.0 True Permanent Part-time Female 10.0 DETE 15 Teacher Aide Resignation-Other employer Veteran
11 31-35 2012.0 False Permanent Full-time Male 12.0 DETE 3 Teacher Resignation-Move overseas/interstate Experienced
12 36-40 2012.0 False Permanent Full-time Female 13.0 DETE 14 Teacher Resignation-Other reasons Veteran
In [227]:
combined_service_cat.shape
Out[227]:
(527, 11)

We get a frame with the same rows of combined_updated dataframe plus the service_cat column

C.3. Clean up an transform the "age" column

In [228]:
# a preview of values in this column
combined_updated["age"].value_counts(dropna = False)
Out[228]:
51-55            71
NaN              55
41-45            48
41  45           45
46-50            42
36-40            41
46  50           39
26-30            35
21  25           33
31  35           32
26  30           32
36  40           32
21-25            29
31-35            29
56 or older      29
56-60            26
61 or older      23
20 or younger    10
Name: age, dtype: int64

As in the case of "institute_service" column, this column is tricky to clean because it currently contains values in different forms. Because values are similar, we can use the same method that before.

In [229]:
# get the index of rows with nan values on column "institute_service"
index_rows = combined_updated[combined_updated["age"].isnull()].index

# drop rows with null values. Create a new dataframe for not lossing these values in combined_updated
combined_age= combined_updated.copy().drop(labels = index_rows)
In [230]:
combined_age["age"].value_counts(dropna = False)
Out[230]:
51-55            65
41  45           45
41-45            44
46-50            41
36-40            39
46  50           39
21  25           33
26-30            33
36  40           32
31  35           32
26  30           32
56 or older      29
31-35            26
21-25            24
56-60            24
61 or older      20
20 or younger    10
Name: age, dtype: int64
In [231]:
# Change to string and spaces to "-"
combined_age["age"] = (combined_age["age"].astype(str).str.replace(" ", "-"))

combined_age["age"].value_counts(dropna = False)
Out[231]:
51-55            65
41--45           45
41-45            44
46-50            41
46--50           39
36-40            39
26-30            33
21--25           33
36--40           32
26--30           32
31--35           32
56-or-older      29
31-35            26
56-60            24
21-25            24
61-or-older      20
20-or-younger    10
Name: age, dtype: int64
In [232]:
# regular expresion for extracting all the ages of "age" column

pattern = r"(?P<First_Age>[0-9][0-9]?)-?--?(?P<Second_Age>[1-9][0-9]?)?" # question mark, ?, after a group 
                                                            # to indicate that a match for those groups 
                                                            # is optional.Only extracting characters
                                                            # between brakets ()

# dataframe which store all years extracted
ages = combined_age.copy()["age"].str.extractall(pattern) # make a copy of combined_age

# a view of all values stored in columns of ages
ages.apply(pd.value_counts, dropna = False)
Out[232]:
First_Age Second_Age
20 10.0 NaN
21 57.0 NaN
26 65.0 NaN
30 NaN 65.0
31 58.0 NaN
36 71.0 NaN
40 NaN 71.0
41 89.0 NaN
45 NaN 89.0
46 80.0 NaN
50 NaN 80.0
51 65.0 NaN
55 NaN 65.0
56 53.0 NaN
61 20.0 NaN
NaN NaN 59.0
35 NaN 58.0
25 NaN 57.0
60 NaN 24.0

It seems, looking at this table above, that all ages have been stracted.

In [233]:
ages.head()
Out[233]:
First_Age Second_Age
match
5 0 41 45
8 0 31 35
9 0 46 50
11 0 31 35
12 0 36 40
In [234]:
# because years has a MultiIndex, reset it, remove the index level "match" and convert it in a column.
ages = ages.reset_index(level=["match"])
In [235]:
ages.head()
Out[235]:
match First_Age Second_Age
5 0 41 45
8 0 31 35
9 0 46 50
11 0 31 35
12 0 36 40
In [236]:
# drop column match, because is useless
ages = ages.drop("match", axis = 1)
In [237]:
# change all values of its column to float
ages = ages.astype(float)
In [238]:
# transfor Nan values in Second_Age to 0
ages["Second_Age"] = ages["Second_Age"].fillna(0)
In [239]:
# create a new column in year based in First_Age values
ages["ages_mean"] = ages["First_Age"]

# for rows which have values in its Second_Age. The value of "ages_mean" It's the arithmetic mean
ages.loc[ages["Second_Age"] > 0,"ages_mean"] = ((ages.loc[ages["Second_Age"] > 0, "First_Age"]
                                                     + ages.loc[ages["Second_Age"] > 0, "Second_Age"])
                                                     / 2)

ages.head()
Out[239]:
First_Age Second_Age ages_mean
5 41.0 45.0 43.0
8 31.0 35.0 33.0
9 46.0 50.0 48.0
11 31.0 35.0 33.0
12 36.0 40.0 38.0
In [240]:
# drop this two columns of ages that we don't need more
ages = ages.drop(["First_Age", "Second_Age"], axis = 1)
In [241]:
# there is no rows with Nan values
ages["ages_mean"].isnull().sum()
Out[241]:
0
In [242]:
# a view of the distribution of ages' values
ages["ages_mean"].describe()
Out[242]:
count    568.000000
mean      41.005282
std       11.370852
min       20.000000
25%       33.000000
50%       43.000000
75%       48.000000
max       61.000000
Name: ages_mean, dtype: float64

According to this distribution, as in the case of years of service, we'll convert numbers in this column into categories.

We'll use the three categories below:

  • Young: Less than 33 years
  • Middle age: 33-48 years
  • Senior: More than 48 years
In [243]:
# create a function to transform the values year's service_cat column into categories
def to_categories(val):
    
    if val < 33:
        return "Young"
    elif val >=33 and val <=48:
        return "Middle age"
    else:
        return "Senior"

ages["ages_mean"] = ages["ages_mean"].map(to_categories)

ages["ages_mean"].value_counts()
Out[243]:
Middle age    298
Senior        138
Young         132
Name: ages_mean, dtype: int64

C.4. Joining "ages_mean" column to combined_age dataframe

We'll use here the same methon we use before to join "service_cat" column to combined_service_cat dataframe.

In [244]:
combined_age = pd.concat([combined_age, ages], axis = 1)

combined_age.tail()
Out[244]:
age cease_date dissatisfied employment_status gender id institute institute_service position separationtype ages_mean
693 26--30 2013.0 False Temporary Full-time Female 6.350599e+17 DETE 1-2 Administration (AO) Resignation Young
696 21--25 2013.0 False Temporary Full-time Male 6.350660e+17 DETE 5-6 Operational (OO) Resignation Young
697 51-55 2013.0 False Temporary Full-time Male 6.350668e+17 DETE 1-2 Teacher (including LVT) Resignation Senior
699 51-55 2013.0 False Permanent Full-time Female 6.350704e+17 DETE 5-6 Teacher (including LVT) Resignation Senior
701 26--30 2013.0 False Contract/casual Female 6.350730e+17 DETE 3-4 Administration (AO) Resignation Young

We can look at here that ages_mean occupies the last column, and their join with combine_age dataframe seems fine.

D. VISUALIZING CLEANED DATA

In [245]:
# for this last task we'll import matplop library
import matplotlib.pyplot as plt

%matplotlib inline

D.1. Visualizing data of combined_service_cat dataframe

Firstly we'll check the data in the columns of our interest.

In [246]:
combined_service_cat["service_cat"].value_counts(dropna = False)
Out[246]:
New            179
Experienced    164
Veteran        127
Established     57
Name: service_cat, dtype: int64
In [247]:
combined_service_cat["dissatisfied"].value_counts(dropna = False)
Out[247]:
False    316
True     211
Name: dissatisfied, dtype: int64

There are no missing values in these columns above

In [248]:
# this is needed because if not code bellow doesn't work
combined_service_cat["dissatisfied"] = combined_service_cat["dissatisfied"].astype(bool)
In [249]:
combined_service_pv = combined_service_cat.pivot_table(index = "service_cat", values = "dissatisfied")
In [250]:
# draw a bar plot
combined_service_pv.plot(kind = "bar", title = "Percentaje of dissatisfied by service categories")

plt.show()

GRAPH 1. Percentajes of dissatisfied grouping by service_cat (years of service) categories

D.2. Visualizing data of combined_age dataframe

Firstly we'll check the data in the columns of our interest.

In [251]:
combined_age["ages_mean"].value_counts(dropna = False)
Out[251]:
Middle age    298
Senior        138
Young         132
Name: ages_mean, dtype: int64
In [252]:
combined_age["dissatisfied"].value_counts(dropna = False)
Out[252]:
False    343
True     225
Name: dissatisfied, dtype: int64

There are no missing values in these columns above

In [254]:
# this is needed because if not code bellow doesn't work
combined_age["dissatisfied"] = combined_age["dissatisfied"].astype(bool)
In [255]:
combined_age_pv = combined_age.pivot_table(index = "ages_mean", values = "dissatisfied")
In [256]:
# draw a bar plot
combined_age_pv.plot(kind = "bar", title = "Percentaje of dissatisfied by age")

plt.show()

E. CONCLUSIONS

According to GRAPH 1 above: only a 30% approximately of "New" employees (Less than 3 years at their workplace) were dissatisfied with their jobs when they exit. The were the lowest dissatisfied of all groups, classified by ages of service, in this two surveys.

On the other hand, employees belonging to "Established" and Veteran cathegories, with more than 7 years and 11 years of servis respectively, were the most dissatisfied. With more than 50% employees dissatisfied in some way, when they resigned.

According to GRAPH 2 above: a 36% approximately of "Young" employees (less than 33 years old) were dissatisfied with their jobs when they exit.

On the other hand, employees classified as "Senior", with more than 48 years old, were the most dissatisfied of all groups of age. With more than 42% dissatisfied in some way, when they resigned.

Nevertheless, we can see there are very little differences between this three groups of age, in relation to their percentaje of dissatisfaction when they resigned.