CLEAN AND ANALYZE EMPLOYEE EXITS SURVEYS¶

We'll work with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. You can find the TAFE exit survey here and the survey for the DETE here.

We'll try to answer the following questions:

Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction?
What about employees who have been there longer?
Are younger employees resigning due to some kind of dissatisfaction?
What about older employees?

Index of contents:

A. READING AND EXPLORING OUR DATAFRAMES
B. TRANSFORMING OUR DATA
- B.1. Reshape our data
- B.2. Renaming columns
- B.3. Subseting our Dataframe
- B.4. Checking for inconsistencies in our data
- B.5. Transforming values
C. COMBINING OUR DATAFRAMES
- C.1. Clean up an transform the "institute_service" column
- C.2. Joining "service_cat" column to combined_service_cat dataframe
- C.3. Clean up an transform the "age" column
- C.4. Joining "ages_mean" column to combined_age dataframe
D. VISUALIZING CLEANED DATA
- D.1. Visualizing data of combined_service_cat dataframe
E. CONCLUSIONS

Summary of conclusions:

Only a 30% approximately of "New" employees (Less than 3 years at their workplace) were dissatisfied with their jobs when they exit. Employees belonging to "Established" and Veteran categories, with more than 7 years and 11 years of servis respectively, were the most dissatisfied.
However grouping the data by the age of employees, there are very little differences between groups, in relation to their percentaje of dissatisfaction when they resigned.

A. READING AND EXPLORING OUR DATAFRAMES¶

In [166]:

import numpy as np
import pandas as pd

#read both files and assing them to variables. For this proyect both are coded in UTF-8 (the original ones are encoded using cp1252.) 
dete_survey = pd.read_csv("dete_survey.csv")
tafe_survey = pd.read_csv("tafe_survey.csv")

In [167]:

dete_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 56 columns):
ID                                     822 non-null int64
SeparationType                         822 non-null object
Cease Date                             822 non-null object
DETE Start Date                        822 non-null object
Role Start Date                        822 non-null object
Position                               817 non-null object
Classification                         455 non-null object
Region                                 822 non-null object
Business Unit                          126 non-null object
Employment Status                      817 non-null object
Career move to public sector           822 non-null bool
Career move to private sector          822 non-null bool
Interpersonal conflicts                822 non-null bool
Job dissatisfaction                    822 non-null bool
Dissatisfaction with the department    822 non-null bool
Physical work environment              822 non-null bool
Lack of recognition                    822 non-null bool
Lack of job security                   822 non-null bool
Work location                          822 non-null bool
Employment conditions                  822 non-null bool
Maternity/family                       822 non-null bool
Relocation                             822 non-null bool
Study/Travel                           822 non-null bool
Ill Health                             822 non-null bool
Traumatic incident                     822 non-null bool
Work life balance                      822 non-null bool
Workload                               822 non-null bool
None of the above                      822 non-null bool
Professional Development               808 non-null object
Opportunities for promotion            735 non-null object
Staff morale                           816 non-null object
Workplace issue                        788 non-null object
Physical environment                   817 non-null object
Worklife balance                       815 non-null object
Stress and pressure support            810 non-null object
Performance of supervisor              813 non-null object
Peer support                           812 non-null object
Initiative                             813 non-null object
Skills                                 811 non-null object
Coach                                  767 non-null object
Career Aspirations                     746 non-null object
Feedback                               792 non-null object
Further PD                             768 non-null object
Communication                          814 non-null object
My say                                 812 non-null object
Information                            816 non-null object
Kept informed                          813 non-null object
Wellness programs                      766 non-null object
Health & Safety                        793 non-null object
Gender                                 798 non-null object
Age                                    811 non-null object
Aboriginal                             16 non-null object
Torres Strait                          3 non-null object
South Sea                              7 non-null object
Disability                             23 non-null object
NESB                                   32 non-null object
dtypes: bool(18), int64(1), object(37)
memory usage: 258.6+ KB

In [168]:

dete_survey.head(2)

Out[168]:

	ID	SeparationType	Cease Date	DETE Start Date	Role Start Date	Position	Classification	Region	Business Unit	Employment Status	...	Kept informed	Wellness programs	Health & Safety	Gender	Age	Aboriginal	Torres Strait	South Sea	Disability	NESB
0	1	Ill Health Retirement	08/2012	1984	2004	Public Servant	A01-A04	Central Office	Corporate Strategy and Peformance	Permanent Full-time	...	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	Yes
1	2	Voluntary Early Retirement (VER)	08/2012	Not Stated	Not Stated	Public Servant	AO5-AO7	Central Office	Corporate Strategy and Peformance	Permanent Full-time	...	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	NaN

2 rows × 56 columns

In [169]:

dete_survey.describe(include = "all").iloc[0:5, 0:12] # for a view of the 5 first lines from the 12 first columns

Out[169]:

	ID	SeparationType	Cease Date	DETE Start Date	Role Start Date	Position	Classification	Region	Business Unit	Employment Status	Career move to public sector	Career move to private sector
count	822.000000	822	822	822	822	817	455	822	126	817	822	822
unique	NaN	9	25	51	46	15	8	9	14	5	2	2
top	NaN	Age Retirement	2012	Not Stated	Not Stated	Teacher	Primary	Metropolitan	Education Queensland	Permanent Full-time	False	False
freq	NaN	285	344	73	98	324	161	135	54	434	800	742
mean	411.693431	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

In [170]:

dete_survey.describe(include = "all").iloc[0:5, 10:28] # columns 10-27

Out[170]:

	Career move to public sector	Career move to private sector	Interpersonal conflicts	Job dissatisfaction	Dissatisfaction with the department	Physical work environment	Lack of recognition	Lack of job security	Work location	Employment conditions	Maternity/family	Relocation	Study/Travel	Ill Health	Traumatic incident	Work life balance	Workload	None of the above
count	822	822	822	822	822	822	822	822	822	822	822	822	822	822	822	822	822	822
unique	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2
top	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False
freq	800	742	788	733	761	806	765	794	795	788	760	754	785	710	794	605	735	605
mean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

In [171]:

dete_survey.describe(include = "all").iloc[0:5, 28:49] 

Out[171]:

	Professional Development	Opportunities for promotion	Staff morale	Workplace issue	Physical environment	Worklife balance	Stress and pressure support	Performance of supervisor	Peer support	Initiative	...	Coach	Career Aspirations	Feedback	Further PD	Communication	My say	Information	Kept informed	Wellness programs	Health & Safety
count	808	735	816	788	817	815	810	813	812	813	...	767	746	792	768	814	812	816	813	766	793
unique	6	6	6	6	6	6	6	6	6	6	...	6	6	6	6	6	6	6	6	6	6
top	A	A	A	A	A	A	A	A	A	A	...	A	A	A	A	A	A	A	A	A	A
freq	413	242	335	357	467	359	342	349	401	396	...	345	246	348	293	399	400	436	401	253	386
mean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 21 columns

In [172]:

dete_survey["Professional Development"].value_counts(dropna = False)

Out[172]:

A      413
SA     184
N      103
D       60
SD      33
M       15
NaN     14
Name: Professional Development, dtype: int64

In [173]:

dete_survey.describe(include = "all").iloc[0:5, 49:] 

Out[173]:

	Gender	Age	Aboriginal	Torres Strait	South Sea	Disability	NESB
count	798	811	16	3	7	23	32
unique	2	10	1	1	1	1	1
top	Female	61 or older	Yes	Yes	Yes	Yes	Yes
freq	573	222	16	3	7	23	32
mean	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Findins in dete_survey:

Almost all data are strings or boolean, there is only a column with numeric values that is ID column.

There are "Not Stated" values instead of NaN (e.g. in DETE start date column).

The names of columns combines upper and lower character. Should be standarize to upper for better management.

We can clasified four groups of columns for their type of information and values:

columns 0-12: general information related to the position and the cease. Type of values, strings.

columns 12-27: causes of cease. Type of values, boolean.

columns 28-49: evaluation of the Department. Type of values, string ranking from M to A (with 6 categories between these both).

columns 49-56: personal information of the employee (Gender and Age). The last 5 columns are Y or N and have very few values. We should decide if drop them because doesn't apport any information for our aims.

There are many columns we don't need in this proyect. The columns which seem necesary for answering our initial questions are: DETE Start date, Cese dat, Age, and the group of columns related to causes of cease.

In [174]:

tafe_survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 72 columns):
Record ID 702 non-null float64
Institute 702 non-null object
WorkArea 702 non-null object
CESSATION YEAR 695 non-null float64
Reason for ceasing employment 701 non-null object
Contributing Factors. Career Move - Public Sector 437 non-null object
Contributing Factors. Career Move - Private Sector 437 non-null object
Contributing Factors. Career Move - Self-employment 437 non-null object
Contributing Factors. Ill Health 437 non-null object
Contributing Factors. Maternity/Family 437 non-null object
Contributing Factors. Dissatisfaction 437 non-null object
Contributing Factors. Job Dissatisfaction 437 non-null object
Contributing Factors. Interpersonal Conflict 437 non-null object
Contributing Factors. Study 437 non-null object
Contributing Factors. Travel 437 non-null object
Contributing Factors. Other 437 non-null object
Contributing Factors. NONE 437 non-null object
Main Factor. Which of these was the main factor for leaving? 113 non-null object
InstituteViews. Topic:1. I feel the senior leadership had a clear vision and direction 608 non-null object
InstituteViews. Topic:2. I was given access to skills training to help me do my job better 613 non-null object
InstituteViews. Topic:3. I was given adequate opportunities for personal development 610 non-null object
InstituteViews. Topic:4. I was given adequate opportunities for promotion within %Institute]Q25LBL% 608 non-null object
InstituteViews. Topic:5. I felt the salary for the job was right for the responsibilities I had 615 non-null object
InstituteViews. Topic:6. The organisation recognised when staff did good work 607 non-null object
InstituteViews. Topic:7. Management was generally supportive of me 614 non-null object
InstituteViews. Topic:8. Management was generally supportive of my team 608 non-null object
InstituteViews. Topic:9. I was kept informed of the changes in the organisation which would affect me 610 non-null object
InstituteViews. Topic:10. Staff morale was positive within the Institute 602 non-null object
InstituteViews. Topic:11. If I had a workplace issue it was dealt with quickly 601 non-null object
InstituteViews. Topic:12. If I had a workplace issue it was dealt with efficiently 597 non-null object
InstituteViews. Topic:13. If I had a workplace issue it was dealt with discreetly 601 non-null object
WorkUnitViews. Topic:14. I was satisfied with the quality of the management and supervision within my work unit 609 non-null object
WorkUnitViews. Topic:15. I worked well with my colleagues 605 non-null object
WorkUnitViews. Topic:16. My job was challenging and interesting 607 non-null object
WorkUnitViews. Topic:17. I was encouraged to use my initiative in the course of my work 610 non-null object
WorkUnitViews. Topic:18. I had sufficient contact with other people in my job 613 non-null object
WorkUnitViews. Topic:19. I was given adequate support and co-operation by my peers to enable me to do my job 609 non-null object
WorkUnitViews. Topic:20. I was able to use the full range of my skills in my job 609 non-null object
WorkUnitViews. Topic:21. I was able to use the full range of my abilities in my job. ; Category:Level of Agreement; Question:YOUR VIEWS ABOUT YOUR WORK UNIT] 608 non-null object
WorkUnitViews. Topic:22. I was able to use the full range of my knowledge in my job 608 non-null object
WorkUnitViews. Topic:23. My job provided sufficient variety 611 non-null object
WorkUnitViews. Topic:24. I was able to cope with the level of stress and pressure in my job 610 non-null object
WorkUnitViews. Topic:25. My job allowed me to balance the demands of work and family to my satisfaction 611 non-null object
WorkUnitViews. Topic:26. My supervisor gave me adequate personal recognition and feedback on my performance 606 non-null object
WorkUnitViews. Topic:27. My working environment was satisfactory e.g. sufficient space, good lighting, suitable seating and working area 610 non-null object
WorkUnitViews. Topic:28. I was given the opportunity to mentor and coach others in order for me to pass on my skills and knowledge prior to my cessation date 609 non-null object
WorkUnitViews. Topic:29. There was adequate communication between staff in my unit 603 non-null object
WorkUnitViews. Topic:30. Staff morale was positive within my work unit 606 non-null object
Induction. Did you undertake Workplace Induction? 619 non-null object
InductionInfo. Topic:Did you undertake a Corporate Induction? 432 non-null object
InductionInfo. Topic:Did you undertake a Institute Induction? 483 non-null object
InductionInfo. Topic: Did you undertake Team Induction? 440 non-null object
InductionInfo. Face to Face Topic:Did you undertake a Corporate Induction; Category:How it was conducted? 555 non-null object
InductionInfo. On-line Topic:Did you undertake a Corporate Induction; Category:How it was conducted? 555 non-null object
InductionInfo. Induction Manual Topic:Did you undertake a Corporate Induction? 555 non-null object
InductionInfo. Face to Face Topic:Did you undertake a Institute Induction? 530 non-null object
InductionInfo. On-line Topic:Did you undertake a Institute Induction? 555 non-null object
InductionInfo. Induction Manual Topic:Did you undertake a Institute Induction? 553 non-null object
InductionInfo. Face to Face Topic: Did you undertake Team Induction; Category? 555 non-null object
InductionInfo. On-line Topic: Did you undertake Team Induction?process you undertook and how it was conducted.] 555 non-null object
InductionInfo. Induction Manual Topic: Did you undertake Team Induction? 555 non-null object
Workplace. Topic:Did you and your Manager develop a Performance and Professional Development Plan (PPDP)? 608 non-null object
Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination? 594 non-null object
Workplace. Topic:Does your workplace promote and practice the principles of employment equity? 587 non-null object
Workplace. Topic:Does your workplace value the diversity of its employees? 586 non-null object
Workplace. Topic:Would you recommend the Institute as an employer to others? 581 non-null object
Gender. What is your Gender? 596 non-null object
CurrentAge. Current Age 596 non-null object
Employment Type. Employment Type 596 non-null object
Classification. Classification 596 non-null object
LengthofServiceOverall. Overall Length of Service at Institute (in years) 596 non-null object
LengthofServiceCurrent. Length of Service at current workplace (in years) 596 non-null object
dtypes: float64(2), object(70)
memory usage: 395.0+ KB

In [175]:

tafe_survey.head(2)

Out[175]:

	Record ID	Institute	WorkArea	CESSATION YEAR	Reason for ceasing employment	Contributing Factors. Career Move - Public Sector	Contributing Factors. Career Move - Private Sector	Contributing Factors. Career Move - Self-employment	Contributing Factors. Ill Health	Contributing Factors. Maternity/Family	...	Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?	Workplace. Topic:Does your workplace promote and practice the principles of employment equity?	Workplace. Topic:Does your workplace value the diversity of its employees?	Workplace. Topic:Would you recommend the Institute as an employer to others?	Gender. What is your Gender?	CurrentAge. Current Age	Employment Type. Employment Type	Classification. Classification	LengthofServiceOverall. Overall Length of Service at Institute (in years)	LengthofServiceCurrent. Length of Service at current workplace (in years)
0	6.341330e+17	Southern Queensland Institute of TAFE	Non-Delivery (corporate)	2010.0	Contract Expired	NaN	NaN	NaN	NaN	NaN	...	Yes	Yes	Yes	Yes	Female	26 30	Temporary Full-time	Administration (AO)	1-2	1-2
1	6.341337e+17	Mount Isa Institute of TAFE	Non-Delivery (corporate)	2010.0	Retirement	-	-	-	-	-	...	Yes	Yes	Yes	Yes	NaN	NaN	NaN	NaN	NaN	NaN

2 rows × 72 columns

In [176]:

tafe_survey.describe(include = "all")[0:5]

Out[176]:

	Record ID	Institute	WorkArea	CESSATION YEAR	Reason for ceasing employment	Contributing Factors. Career Move - Public Sector	Contributing Factors. Career Move - Private Sector	Contributing Factors. Career Move - Self-employment	Contributing Factors. Ill Health	Contributing Factors. Maternity/Family	...	Workplace. Topic:Does your workplace promote a work culture free from all forms of unlawful discrimination?	Workplace. Topic:Does your workplace promote and practice the principles of employment equity?	Workplace. Topic:Does your workplace value the diversity of its employees?	Workplace. Topic:Would you recommend the Institute as an employer to others?	Gender. What is your Gender?	CurrentAge. Current Age	Employment Type. Employment Type	Classification. Classification	LengthofServiceOverall. Overall Length of Service at Institute (in years)	LengthofServiceCurrent. Length of Service at current workplace (in years)
count	7.020000e+02	702	702	695.000000	701	437	437	437	437	437	...	594	587	586	581	596	596	596	596	596	596
unique	NaN	12	2	NaN	6	2	2	2	2	2	...	2	2	2	2	2	9	5	9	7	7
top	NaN	Brisbane North Institute of TAFE	Non-Delivery (corporate)	NaN	Resignation	-	-	-	-	-	...	Yes	Yes	Yes	Yes	Female	56 or older	Permanent Full-time	Administration (AO)	Less than 1 year	Less than 1 year
freq	NaN	161	432	NaN	340	375	336	420	403	411	...	536	512	488	416	389	162	237	293	147	177
mean	6.346026e+17	NaN	NaN	2011.423022	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 72 columns

In [177]:

tafe_survey["Contributing Factors. Career Move - Public Sector "].value_counts(dropna = False)

Out[177]:

-                              375
NaN                            265
Career Move - Public Sector     62
Name: Contributing Factors. Career Move - Public Sector , dtype: int64

Findins in tafe_survey:

The information of this file is much less clear:

The name of columns are too long for a good management of their information. It seems they used just the questions of this survey as columns names.

Almost of types of data are objects except for two columns: Record ID and CESSATION YEAR.

It seems it would be better select only the columns we will need for this proyect: CESSATION YEAR, LengthofServiceOverall, CurrentAge, and the gropu of Contributing factors.

B. TRANSFORMING OUR DATA¶

B.1. Reshape our data¶

We'll transform missing values of date_survey file named as "Not Stated" as Nan values.

Also we'll eliminate columns that we don't need.

In [178]:

dete_survey = pd.read_csv("dete_survey.csv", na_values = "Not Stated")

dete_survey.head(2) # only the 2 first rows

Out[178]:

	ID	SeparationType	Cease Date	DETE Start Date	Role Start Date	Position	Classification	Region	Business Unit	Employment Status	...	Kept informed	Wellness programs	Health & Safety	Gender	Age	Aboriginal	Torres Strait	South Sea	Disability	NESB
0	1	Ill Health Retirement	08/2012	1984.0	2004.0	Public Servant	A01-A04	Central Office	Corporate Strategy and Peformance	Permanent Full-time	...	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	Yes
1	2	Voluntary Early Retirement (VER)	08/2012	NaN	NaN	Public Servant	AO5-AO7	Central Office	Corporate Strategy and Peformance	Permanent Full-time	...	N	N	N	Male	56-60	NaN	NaN	NaN	NaN	NaN

2 rows × 56 columns

In [179]:

dete_survey["DETE Start Date"].value_counts(dropna = False)

Out[179]:

NaN        73
 2011.0    40
 2007.0    34
 2008.0    31
 2010.0    27
 2012.0    27
 2009.0    24
 2006.0    23
 1970.0    21
 1975.0    21
 2013.0    21
 2005.0    20
 1990.0    20
 1999.0    19
 1996.0    19
 1992.0    18
 1991.0    18
 2000.0    18
 2004.0    18
 1989.0    17
 1978.0    15
 2003.0    15
 1988.0    15
 1976.0    15
 2002.0    15
 1974.0    14
 1997.0    14
 1998.0    14
 1979.0    14
 1995.0    14
 1980.0    14
 1993.0    13
 1972.0    12
 1986.0    12
 1977.0    11
 1971.0    10
 1984.0    10
 1994.0    10
 1969.0    10
 2001.0    10
 1983.0     9
 1981.0     9
 1973.0     8
 1985.0     8
 1987.0     7
 1982.0     4
 1963.0     4
 1968.0     3
 1967.0     2
 1965.0     1
 1966.0     1
Name: DETE Start Date, dtype: int64

We don't have "Not Stated" values in this column as before...

In [180]:

dete_survey_updated = dete_survey.drop(axis = 1, labels = dete_survey.columns[28:49]) #use a subset of the list of columns to drop

dete_survey_updated.info() # take a look of the remainings columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 822 entries, 0 to 821
Data columns (total 35 columns):
ID                                     822 non-null int64
SeparationType                         822 non-null object
Cease Date                             788 non-null object
DETE Start Date                        749 non-null float64
Role Start Date                        724 non-null float64
Position                               817 non-null object
Classification                         455 non-null object
Region                                 717 non-null object
Business Unit                          126 non-null object
Employment Status                      817 non-null object
Career move to public sector           822 non-null bool
Career move to private sector          822 non-null bool
Interpersonal conflicts                822 non-null bool
Job dissatisfaction                    822 non-null bool
Dissatisfaction with the department    822 non-null bool
Physical work environment              822 non-null bool
Lack of recognition                    822 non-null bool
Lack of job security                   822 non-null bool
Work location                          822 non-null bool
Employment conditions                  822 non-null bool
Maternity/family                       822 non-null bool
Relocation                             822 non-null bool
Study/Travel                           822 non-null bool
Ill Health                             822 non-null bool
Traumatic incident                     822 non-null bool
Work life balance                      822 non-null bool
Workload                               822 non-null bool
None of the above                      822 non-null bool
Gender                                 798 non-null object
Age                                    811 non-null object
Aboriginal                             16 non-null object
Torres Strait                          3 non-null object
South Sea                              7 non-null object
Disability                             23 non-null object
NESB                                   32 non-null object
dtypes: bool(18), float64(2), int64(1), object(14)
memory usage: 123.7+ KB

We have eliminated all columns related to evaluate the department, which we didn't need.

In [181]:

tafe_survey_updated = tafe_survey.drop(columns = tafe_survey.columns[17:66]) #in this case we use columns parameter instead of label + axis

tafe_survey_updated.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702 entries, 0 to 701
Data columns (total 23 columns):
Record ID                                                                    702 non-null float64
Institute                                                                    702 non-null object
WorkArea                                                                     702 non-null object
CESSATION YEAR                                                               695 non-null float64
Reason for ceasing employment                                                701 non-null object
Contributing Factors. Career Move - Public Sector                            437 non-null object
Contributing Factors. Career Move - Private Sector                           437 non-null object
Contributing Factors. Career Move - Self-employment                          437 non-null object
Contributing Factors. Ill Health                                             437 non-null object
Contributing Factors. Maternity/Family                                       437 non-null object
Contributing Factors. Dissatisfaction                                        437 non-null object
Contributing Factors. Job Dissatisfaction                                    437 non-null object
Contributing Factors. Interpersonal Conflict                                 437 non-null object
Contributing Factors. Study                                                  437 non-null object
Contributing Factors. Travel                                                 437 non-null object
Contributing Factors. Other                                                  437 non-null object
Contributing Factors. NONE                                                   437 non-null object
Gender. What is your Gender?                                                 596 non-null object
CurrentAge. Current Age                                                      596 non-null object
Employment Type. Employment Type                                             596 non-null object
Classification. Classification                                               596 non-null object
LengthofServiceOverall. Overall Length of Service at Institute (in years)    596 non-null object
LengthofServiceCurrent. Length of Service at current workplace (in years)    596 non-null object
dtypes: float64(2), object(21)
memory usage: 126.2+ KB

In the case of tafe_survey the reduction of columns is even bigger.

B.2. Renaming columns¶

Because we eventually want to combine them, we'll have to standardize the column names of our interest.

We will use the following criteria to update the column names:

Make all the capitalization lowercase.
Remove any trailing whitespace from the end of the strings.
Replace spaces with underscores ('_').

In [182]:

dete_survey_updated.columns = dete_survey_updated.columns.str.lower().str.strip().str.replace(" ", "_")

print(dete_survey_updated.columns) # to check the result

Index(['id', 'separationtype', 'cease_date', 'dete_start_date',
       'role_start_date', 'position', 'classification', 'region',
       'business_unit', 'employment_status', 'career_move_to_public_sector',
       'career_move_to_private_sector', 'interpersonal_conflicts',
       'job_dissatisfaction', 'dissatisfaction_with_the_department',
       'physical_work_environment', 'lack_of_recognition',
       'lack_of_job_security', 'work_location', 'employment_conditions',
       'maternity/family', 'relocation', 'study/travel', 'ill_health',
       'traumatic_incident', 'work_life_balance', 'workload',
       'none_of_the_above', 'gender', 'age', 'aboriginal', 'torres_strait',
       'south_sea', 'disability', 'nesb'],
      dtype='object')

In [183]:

print(tafe_survey_updated.columns)

Index(['Record ID', 'Institute', 'WorkArea', 'CESSATION YEAR',
       'Reason for ceasing employment',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE',
       'Gender. What is your Gender?', 'CurrentAge. Current Age',
       'Employment Type. Employment Type', 'Classification. Classification',
       'LengthofServiceOverall. Overall Length of Service at Institute (in years)',
       'LengthofServiceCurrent. Length of Service at current workplace (in years)'],
      dtype='object')

In [184]:

# to update the columns name in tafe_survey_updated, according to dete_survey_updated columns names.
tafe_survey_updated = (tafe_survey_updated.rename(mapper = {'Record ID': 'id', 
                                                            'CESSATION YEAR': 'cease_date', 
                                                            'Reason for ceasing employment': 'separationtype', 
                                                            'Gender. What is your Gender?': 'gender', 
                                                            'CurrentAge. Current Age': 'age', 
                                                            'Employment Type. Employment Type': 'employment_status', 
                                                            'Classification. Classification': 'position', 
                                                            'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service', 
                                                            'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'},
                                                  axis = 1))

print(tafe_survey_updated.columns)

Index(['id', 'Institute', 'WorkArea', 'cease_date', 'separationtype',
       'Contributing Factors. Career Move - Public Sector ',
       'Contributing Factors. Career Move - Private Sector ',
       'Contributing Factors. Career Move - Self-employment',
       'Contributing Factors. Ill Health',
       'Contributing Factors. Maternity/Family',
       'Contributing Factors. Dissatisfaction',
       'Contributing Factors. Job Dissatisfaction',
       'Contributing Factors. Interpersonal Conflict',
       'Contributing Factors. Study', 'Contributing Factors. Travel',
       'Contributing Factors. Other', 'Contributing Factors. NONE', 'gender',
       'age', 'employment_status', 'position', 'institute_service',
       'role_service'],
      dtype='object')

B.3. Subseting our Dataframe¶

Some of our end goals is to answer the following questions:

Are employees who have only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction?
What about employees who have been at the job longer?

For this project, we'll only analyze survey respondents who resigned, so their separation type column, contains the string 'Resignation'.

In [185]:

dete_survey_updated["separationtype"].value_counts(dropna = False)

Out[185]:

Age Retirement                          285
Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Voluntary Early Retirement (VER)         67
Ill Health Retirement                    61
Other                                    49
Contract Expired                         34
Termination                              15
Name: separationtype, dtype: int64

In [186]:

tafe_survey_updated["separationtype"].value_counts(dropna = False)

Out[186]:

Resignation                 340
Contract Expired            127
Retrenchment/ Redundancy    104
Retirement                   82
Transfer                     25
Termination                  23
NaN                           1
Name: separationtype, dtype: int64

In [187]:

# to make a boolean index from dete_separation column with all 3 types of resignations
dete_separationtype_resign = ((dete_survey_updated["separationtype"] == "Resignation-Other reasons") | 
                              (dete_survey_updated["separationtype"] == "Resignation-Other employer") |
                              (dete_survey_updated["separationtype"] == "Resignation-Move overseas/interstate")
                             )

# make a subset of a copy of dete_survey_updated (for avoiding SettingWithCopy Warning later)
dete_resignations = dete_survey_updated.copy()[dete_separationtype_resign]

dete_resignations["separationtype"].value_counts(dropna = False)

Out[187]:

Resignation-Other reasons               150
Resignation-Other employer               91
Resignation-Move overseas/interstate     70
Name: separationtype, dtype: int64

In [188]:

# the same operation with tafe_survey_updated dataframe
tafe_resignations = tafe_survey_updated.copy()[tafe_survey_updated["separationtype"] == "Resignation"]

tafe_resignations["separationtype"].value_counts(dropna = False)

Out[188]:

Resignation    340
Name: separationtype, dtype: int64

B.4. Checking for inconsistencies in our data¶

Now, before we start cleaning and manipulating the rest of our data, let's verify that the data doesn't contain any major inconsistencies.

In [189]:

dete_resignations["cease_date"].value_counts(dropna = False)

Out[189]:

2012       126
2013        74
01/2014     22
12/2013     17
06/2013     14
NaN         11
09/2013     11
11/2013      9
07/2013      9
10/2013      6
08/2013      4
05/2012      2
05/2013      2
07/2012      1
2010         1
09/2010      1
07/2006      1
Name: cease_date, dtype: int64

We need to clean this data from this column (cease_date): extract the year and convert its type to float.

In [190]:

pattern = r"(2[0-9]{3})" # regular expresion for extracting years from 2000 year. 

dete_resignations["cease_date"] = dete_resignations["cease_date"].str.extract(pattern).astype(float)

dete_resignations["cease_date"].value_counts(dropna = False) # to check changes

/dataquest/system/env/python3/lib/python3.4/site-packages/ipykernel/__main__.py:3: FutureWarning:

currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)

Out[190]:

 2013.0    146
 2012.0    129
 2014.0     22
NaN         11
 2010.0      2
 2006.0      1
Name: cease_date, dtype: int64

In [191]:

dete_resignations['dete_start_date'].value_counts(dropna = False).sort_index(ascending = True) # sort by year (index), acending

Out[191]:

 1963.0     1
 1971.0     1
 1972.0     1
 1973.0     1
 1974.0     2
 1975.0     1
 1976.0     2
 1977.0     1
 1980.0     5
 1982.0     1
 1983.0     2
 1984.0     1
 1985.0     3
 1986.0     3
 1987.0     1
 1988.0     4
 1989.0     4
 1990.0     5
 1991.0     4
 1992.0     6
 1993.0     5
 1994.0     6
 1995.0     4
 1996.0     6
 1997.0     5
 1998.0     6
 1999.0     8
 2000.0     9
 2001.0     3
 2002.0     6
 2003.0     6
 2004.0    14
 2005.0    15
 2006.0    13
 2007.0    21
 2008.0    22
 2009.0    13
 2010.0    17
 2011.0    24
 2012.0    21
 2013.0    10
NaN        28
Name: dete_start_date, dtype: int64

In [192]:

tafe_resignations['cease_date'].value_counts(dropna = False)

Out[192]:

 2011.0    116
 2012.0     94
 2010.0     68
 2013.0     55
NaN          5
 2009.0      2
Name: cease_date, dtype: int64

Values at these two column (dete_start_date and cease_date in tafe) are ok. There is no need to clean it.

Cheching all this columns we can see that there are no cease dates in dete before 2006, and it's odd because there are many star dates since 1963. Surely have had ceases during all this time that haven't been appointed until 2006 in this database.

B.5. Transforming values¶

We can calculate years of service in dete survey by subtracting the values in dete_start_date from the cease_date. We'll assign the result to a new column in dete named institute_service.

In [193]:

dete_resignations["institute_service"] = (dete_resignations["cease_date"] - 
                                         dete_resignations['dete_start_date'])

dete_resignations["institute_service"].describe() # to view its basics statistic values

Out[193]:

count    273.000000
mean      10.457875
std        9.931709
min        0.000000
25%        3.000000
50%        7.000000
75%       16.000000
max       49.000000
Name: institute_service, dtype: float64

Next we must choose what factors or columns appointed in these surveys are the adecuated for considerign that an employee quit because thet were dissatisfied.

From dete survey we'll choose the following columns:'interpersonal_conflicts','job_dissatisfaction', 'dissatisfaction_with_the_department','physical_work_environment', 'lack_of_recognition', 'lack_of_job_security', 'work_location', 'employment_conditions', 'work_life_balance' and 'workload'.
And from tafe survey:'Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction', 'Contributing Factors. Interpersonal Conflict'.

If the employee indicated any of the factors above caused them to resign, we'll mark them as dissatisfied in a new column.

In [194]:

# as example for a view of the original values of these columns in dete survey
dete_resignations['interpersonal_conflicts'].value_counts(dropna = False)

Out[194]:

False    291
True      20
Name: interpersonal_conflicts, dtype: int64

In [195]:

# as example for a view of the original values of these columns in tefe survey
tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts(dropna = False)

Out[195]:

-                                         277
Contributing Factors. Dissatisfaction      55
NaN                                         8
Name: Contributing Factors. Dissatisfaction, dtype: int64

At this point we'll create a function to transform the data in selected tafe columns from their actual values, to boolean values.

In [196]:

tafe_dissat_columns = (['Contributing Factors. Dissatisfaction',
                       'Contributing Factors. Job Dissatisfaction', 
                        'Contributing Factors. Interpersonal Conflict'])
def to_boolean(element):
    
    if element == "-":
        return False
    
    elif pd.isnull(element): #pandas function
        return np.nan #numpy function
    
    else:
        return True

tafe_resignations[tafe_dissat_columns] = tafe_resignations[tafe_dissat_columns].applymap(to_boolean)
    
# to view the results use df.apply() method 
tafe_resignations[tafe_dissat_columns].apply(pd.value_counts, dropna = False)

Out[196]:

	Contributing Factors. Dissatisfaction	Contributing Factors. Job Dissatisfaction	Contributing Factors. Interpersonal Conflict
False	277	270	308
True	55	62	24
NaN	8	8	8

In [197]:

# selected columns at dete survey:
dete_dissat_columns = (['interpersonal_conflicts','job_dissatisfaction', 
                       'dissatisfaction_with_the_department','physical_work_environment', 
                       'lack_of_recognition', 'lack_of_job_security', 'work_location', 
                       'employment_conditions', 'work_life_balance', 'workload'])

# to check values of all selected columns :
dete_resignations[dete_dissat_columns].apply(pd.value_counts, dropna = False)

Out[197]:

	interpersonal_conflicts	job_dissatisfaction	dissatisfaction_with_the_department	physical_work_environment	lack_of_recognition	lack_of_job_security	work_location	employment_conditions	work_life_balance	workload
False	291	270	282	305	278	297	293	288	243	284
True	20	41	29	6	33	14	18	23	68	27

Following we'll create a dissatisfied column in BOTH the tafe_resignations and dete_resignations dataframes.

If any of the columns of its respective dataframes listed above contain a True value, we'll add a True value to a new column named dissatisfied. To accomplish this, we'll use the DataFrame.any() method to do the following:

Return True if any element in the selected columns above is True
Return False if none of the elements in the selected columns above is True
Return NaN if the value is NaN

We also will use the df.copy() method to create a copy of the results and avoid the SettingWithCopy Warning. Assign the results to dete_resignations_up and tafe_resignations_up.

In [198]:

#assign True to dissatisfied columns if any element in the selected columns above is True, and NaN if the value is NaN
dete_resignations["dissatisfied"] = dete_resignations[dete_dissat_columns].any(axis = 1, skipna = False)

#make a copy:
dete_resignations_up = dete_resignations.copy()

#to check the results
dete_resignations_up["dissatisfied"].value_counts(dropna = False)

Out[198]:

False    157
True     154
Name: dissatisfied, dtype: int64

In [199]:

# the same for tafe columns
tafe_resignations["dissatisfied"] = tafe_resignations[tafe_dissat_columns].any(axis = 1, skipna = False)

tafe_resignations_up = tafe_resignations.copy()

tafe_resignations_up["dissatisfied"].value_counts(dropna = False)

Out[199]:

False    235
True      97
NaN        8
Name: dissatisfied, dtype: int64

C. COMBINING OUR DATAFRAMES¶

Now, we're ready to combine our datasets! Our end goal is to aggregate the data according to the institute_service column, so when you combine the data, think about how to get the data into a form that's easy to aggregate.

First, let's add a column to each dataframe that will allow us to easily distinguish between the two:

Add a column named institute to dete_resignations_up. Each row should contain the value DETE.
Add a column named institute to tafe_resignations_up. Each row should contain the value TAFE.

Combine the dataframes. Assign the result to combined.
We still have some columns left in the dataframe that we don't need to complete our analysis. We'll use the DataFrame.dropna() method to drop any columns with less than 500 non null values, and will assign the result to combined_updated.

In [200]:

dete_resignations_up["institute"] = "DETE"

# to check the results and take a view of dete_resignations_up dataframe
dete_resignations_up.head(2)

Out[200]:

	id	separationtype	cease_date	dete_start_date	role_start_date	position	classification	region	business_unit	employment_status	...	gender	age	aboriginal	torres_strait	south_sea	disability	nesb	institute_service	dissatisfied	institute
3	4	Resignation-Other reasons	2012.0	2005.0	2006.0	Teacher	Primary	Central Queensland	NaN	Permanent Full-time	...	Female	36-40	NaN	NaN	NaN	NaN	NaN	7.0	False	DETE
5	6	Resignation-Other reasons	2012.0	1994.0	1997.0	Guidance Officer	NaN	Central Office	Education Queensland	Permanent Full-time	...	Female	41-45	NaN	NaN	NaN	NaN	NaN	18.0	True	DETE

2 rows × 38 columns

In [201]:

tafe_resignations_up["institute"] = "DETE"

# to check the results and take a view of dete_resignations_up dataframe
tafe_resignations_up.head(2)

Out[201]:

	id	Institute	WorkArea	cease_date	separationtype	Contributing Factors. Career Move - Public Sector	Contributing Factors. Career Move - Private Sector	Contributing Factors. Career Move - Self-employment	Contributing Factors. Ill Health	Contributing Factors. Maternity/Family	...	Contributing Factors. Other	Contributing Factors. NONE	gender	age	employment_status	position	institute_service	role_service	dissatisfied	institute
3	6.341399e+17	Mount Isa Institute of TAFE	Non-Delivery (corporate)	2010.0	Resignation	-	-	-	-	-	...	-	-	NaN	NaN	NaN	NaN	NaN	NaN	False	DETE
4	6.341466e+17	Southern Queensland Institute of TAFE	Delivery (teaching)	2010.0	Resignation	-	Career Move - Private Sector	-	-	-	...	-	-	Male	41 45	Permanent Full-time	Teacher (including LVT)	3-4	3-4	False	DETE

2 rows × 25 columns

In [202]:

# concatenate dataframes horizontally
combined = pd.concat([dete_resignations_up, tafe_resignations_up])

combined.shape

Out[202]:

(651, 53)

In [203]:

# Calculate the number of missing values in each column
combined.isnull().sum().sort_values(ascending = False)

Out[203]:

torres_strait                                          651
south_sea                                              648
aboriginal                                             644
disability                                             643
nesb                                                   642
business_unit                                          619
classification                                         490
region                                                 386
role_start_date                                        380
dete_start_date                                        368
role_service                                           361
dissatisfaction_with_the_department                    340
work_location                                          340
employment_conditions                                  340
workload                                               340
job_dissatisfaction                                    340
career_move_to_public_sector                           340
career_move_to_private_sector                          340
ill_health                                             340
interpersonal_conflicts                                340
physical_work_environment                              340
relocation                                             340
work_life_balance                                      340
lack_of_job_security                                   340
traumatic_incident                                     340
lack_of_recognition                                    340
maternity/family                                       340
study/travel                                           340
none_of_the_above                                      340
Contributing Factors. Other                            319
Contributing Factors. Career Move - Public Sector      319
Contributing Factors. NONE                             319
Contributing Factors. Travel                           319
Contributing Factors. Career Move - Self-employment    319
Contributing Factors. Dissatisfaction                  319
Contributing Factors. Maternity/Family                 319
Contributing Factors. Job Dissatisfaction              319
Contributing Factors. Interpersonal Conflict           319
Contributing Factors. Ill Health                       319
Contributing Factors. Study                            319
Contributing Factors. Career Move - Private Sector     319
Institute                                              311
WorkArea                                               311
institute_service                                       88
gender                                                  59
age                                                     55
employment_status                                       54
position                                                53
cease_date                                              16
dissatisfied                                             8
id                                                       0
institute                                                0
separationtype                                           0
dtype: int64

The columns of our interest: dissatisfied, age, institute_service, are lower in Nan values

We are going to drop any columns with less than 500 non null values, and will assign the result to combined_updated.

In [204]:

combined_updated = combined.copy().dropna(thresh = 500, axis=1) # create a copy for avoid SettingWithCopyWarning later

# to check results
combined_updated.isnull().sum().sort_values(ascending = False)

Out[204]:

institute_service    88
gender               59
age                  55
employment_status    54
position             53
cease_date           16
dissatisfied          8
separationtype        0
institute             0
id                    0
dtype: int64

C.1. Clean up an transform the "institute_service" column¶

This column is tricky to clean because it currently contains values in different forms.

In [205]:

combined_updated["institute_service"].value_counts(dropna = False)

Out[205]:

NaN                   88
Less than 1 year      73
1-2                   64
3-4                   63
5-6                   33
11-20                 26
5.0                   23
1.0                   22
7-10                  21
0.0                   20
3.0                   20
6.0                   17
4.0                   16
2.0                   14
9.0                   14
7.0                   13
More than 20 years    10
13.0                   8
8.0                    8
15.0                   7
20.0                   7
14.0                   6
12.0                   6
17.0                   6
22.0                   6
10.0                   6
16.0                   5
18.0                   5
11.0                   4
23.0                   4
24.0                   4
19.0                   3
32.0                   3
39.0                   3
21.0                   3
28.0                   2
30.0                   2
26.0                   2
36.0                   2
25.0                   2
29.0                   1
31.0                   1
27.0                   1
34.0                   1
35.0                   1
38.0                   1
41.0                   1
42.0                   1
49.0                   1
33.0                   1
Name: institute_service, dtype: int64

We will tranform this column:

Drop its rows with Nan values
Convert their values into categories

In [206]:

# get the index of rows with nan values on column "institute_service"
index_rows = combined_updated[combined_updated["institute_service"].isnull()].index

# drop rows with null values. Create a new dataframe for not lossing these values in combined_updated
combined_service_cat = combined_updated.copy().drop(labels = index_rows)

In [207]:

combined_service_cat["institute_service"].value_counts(dropna = False)

Out[207]:

Less than 1 year      67
3-4                   63
1-2                   61
5-6                   32
11-20                 25
1.0                   21
7-10                  20
3.0                   19
5.0                   19
0.0                   18
6.0                   16
4.0                   15
9.0                   14
2.0                   12
7.0                   12
More than 20 years    10
13.0                   8
8.0                    7
15.0                   7
22.0                   6
17.0                   6
12.0                   6
20.0                   6
14.0                   5
16.0                   5
10.0                   4
18.0                   4
23.0                   4
24.0                   4
21.0                   3
11.0                   3
39.0                   3
19.0                   2
25.0                   2
26.0                   2
28.0                   2
36.0                   2
32.0                   2
33.0                   1
35.0                   1
34.0                   1
49.0                   1
29.0                   1
42.0                   1
31.0                   1
27.0                   1
41.0                   1
38.0                   1
Name: institute_service, dtype: int64

In [208]:

# Change all values to string and remove ".0" character. Note that dot needs go between []
combined_service_cat["institute_service"] = (combined_service_cat["institute_service"].astype(str).
                                         str.replace(r"([.]0)", ""))

combined_service_cat["institute_service"].value_counts(dropna = False).sort_index(ascending = False) #sorted for a better view

Out[208]:

More than 20 years    10
Less than 1 year      67
9                     14
8                      7
7-10                  20
7                     12
6                     16
5-6                   32
5                     19
49                     1
42                     1
41                     1
4                     15
39                     3
38                     1
36                     2
35                     1
34                     1
33                     1
32                     2
31                     1
3-4                   63
3                     19
29                     1
28                     2
27                     1
26                     2
25                     2
24                     4
23                     4
22                     6
21                     3
20                     6
2                     12
19                     2
18                     4
17                     6
16                     5
15                     7
14                     5
13                     8
12                     6
11-20                 25
11                     3
10                     4
1-2                   61
1                     21
0                     18
Name: institute_service, dtype: int64

In the following code we'll extract all years of "institute_service" column in combined_service_cat dataframe, and will store them in a new dataframe called "years".

Lately we'll calculate years["service_cat"] column as the arithmetic mean of all years stracted.

Finally we'll map this columnd and will transform numbers into cathegories.

In [209]:

# regular expresion for extracting all the years of "institute_service" column

pattern = r"(?P<First_Year>[0-9][0-9]?)-?(?P<Second_Year>[1-9][0-9]?)?" # question mark, ?, after a group 
                                                            # to indicate that a match for those groups 
                                                            # is optional.Only extracting characters
                                                            # between brakets ()

# dataframe which store all years extracted
years = combined_service_cat.copy()["institute_service"].str.extractall(pattern) # make a copy of combined_updated

# a view of all values stored in columns of years
years.apply(pd.value_counts, dropna = False)

Out[209]:

	First_Year	Second_Year
0	18.0	NaN
1	149.0	NaN
10	4.0	20.0
11	28.0	NaN
12	6.0	NaN
13	8.0	NaN
14	5.0	NaN
15	7.0	NaN
16	5.0	NaN
17	6.0	NaN
18	4.0	NaN
19	2.0	NaN
2	12.0	61.0
20	16.0	25.0
21	3.0	NaN
22	6.0	NaN
23	4.0	NaN
24	4.0	NaN
25	2.0	NaN
26	2.0	NaN
27	1.0	NaN
28	2.0	NaN
29	1.0	NaN
3	82.0	NaN
31	1.0	NaN
32	2.0	NaN
33	1.0	NaN
34	1.0	NaN
35	1.0	NaN
36	2.0	NaN
38	1.0	NaN
39	3.0	NaN
4	15.0	63.0
41	1.0	NaN
42	1.0	NaN
49	1.0	NaN
5	51.0	NaN
6	16.0	32.0
7	32.0	NaN
8	7.0	NaN
9	14.0	NaN
NaN	NaN	326.0

In [210]:

years.head()

Out[210]:

		First_Year	Second_Year
	match
5	0	18	NaN
8	0	3	NaN
9	0	15	NaN
11	0	3	NaN
12	0	14	NaN

In [211]:

# because years has a MultiIndex, reset it, remove the index level "match" and convert it in a column.
years = years.reset_index(level=["match"])

In [212]:

years.head()

Out[212]:

	First_Year	Second_Year
5	18	NaN
8	3	NaN
9	15	NaN
11	3	NaN
12	14	NaN

In [213]:

# drop column match, because is useless
years = years.drop("match", axis = 1)

In [214]:

# change all values of its column to float
years = years.astype(float)

In [215]:

# transfor Nan values in Second_Year to 0
years["Second_Year"] = years["Second_Year"].fillna(0)

In [216]:

# create a new column in year based in First_Year values
years["service_cat"] = years["First_Year"]

# for rows which have values in its Second_Year. The value of "service_cat" It's the arithmetic mean
years.loc[years["Second_Year"] > 0,"service_cat"] = ((years.loc[years["Second_Year"] > 0, "First_Year"]
                                                     + years.loc[years["Second_Year"] > 0, "Second_Year"])
                                                     / 2)

years.head()

Out[216]:

	First_Year	service_cat
5	18.0	18.0
8	3.0	3.0
9	15.0	15.0
11	3.0	3.0
12	14.0	14.0

In [217]:

# there is no rows with Nan values, we drop them before
years["service_cat"].isnull().sum()

Out[217]:

To analyze the data, we'll convert numbers into categories.

We'll use the slightly modified definitions below:

New: Less than 3 years at a company
Experienced: 3-6 years at a company
Established: 7-10 years at a company
Veteran: 11 or more years at a company

In [218]:

# create a function to transform the values year's service_cat column into categories
def to_categories(val):
    
    if val < 3:
        return "New"
    elif val >=3 and val <=6:
        return "Experienced"
    elif val >6 and val <=10: # minimun lower than definition to evoid gaps
            return "Established"
    else:
        return "Veteran"


years["service_cat"] = years["service_cat"].map(to_categories)

years["service_cat"].value_counts()

Out[218]:

New            179
Experienced    164
Veteran        127
Established     57
Name: service_cat, dtype: int64

In [219]:

# drop this two columns of years that we don't need more
years = years.drop(["First_Year", "Second_Year"], axis = 1)

C.2. Joining "service_cat" column to combined_service_cat dataframe¶

Following we join the column "service_cat" of years dataframe to combined_service_cat dataframe

Firsly we'll check if they both continue having the same index after all the former changes...

In [220]:

# to check if this two dataframes have the same number of rows to combine
print(combined_service_cat.shape)
print(years.shape)

(527, 10)
(527, 1)

In [221]:

combined_service_cat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 527 entries, 5 to 701
Data columns (total 10 columns):
age                  525 non-null object
cease_date           525 non-null float64
dissatisfied         527 non-null object
employment_status    527 non-null object
gender               522 non-null object
id                   527 non-null float64
institute            527 non-null object
institute_service    527 non-null object
position             524 non-null object
separationtype       527 non-null object
dtypes: float64(2), object(8)
memory usage: 45.3+ KB

In [222]:

# to check if these dataframe have the same index
years.index == combined_service_cat.index

Out[222]:

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True])

In [223]:

merged = pd.merge(left = combined_service_cat, right = years, left_index = True, right_index = True)

In [224]:

merged.shape

Out[224]:

(727, 11)

In [225]:

merged.head()

Out[225]:

	age	cease_date	dissatisfied	employment_status	gender	id	institute	institute_service	position	separationtype	service_cat
4	41 45	2010.0	False	Permanent Full-time	Male	6.341466e+17	DETE	3-4	Teacher (including LVT)	Resignation	Experienced
5	41-45	2012.0	True	Permanent Full-time	Female	6.000000e+00	DETE	18	Guidance Officer	Resignation-Other reasons	Veteran
5	41-45	2012.0	True	Permanent Full-time	Female	6.000000e+00	DETE	18	Guidance Officer	Resignation-Other reasons	Established
5	56 or older	2010.0	False	Contract/casual	Female	6.341475e+17	DETE	7-10	Teacher (including LVT)	Resignation	Veteran
5	56 or older	2010.0	False	Contract/casual	Female	6.341475e+17	DETE	7-10	Teacher (including LVT)	Resignation	Established

why this code above doesn't work??

It is supose it merges 2 dataframes by its index. That is the same index for both. But instead of maintaining the number of rows, create rows duplicated with the same index number.

On the contrary the code bellow seems to work fine...

In [226]:

combined_service_cat = pd.concat([combined_service_cat, years], axis = 1)

combined_service_cat.head()

Out[226]:

	age	cease_date	dissatisfied	employment_status	gender	id	institute	institute_service	position	separationtype	service_cat
5	41-45	2012.0	True	Permanent Full-time	Female	6.0	DETE	18	Guidance Officer	Resignation-Other reasons	Veteran
8	31-35	2012.0	False	Permanent Full-time	Female	9.0	DETE	3	Teacher	Resignation-Other reasons	Experienced
9	46-50	2012.0	True	Permanent Part-time	Female	10.0	DETE	15	Teacher Aide	Resignation-Other employer	Veteran
11	31-35	2012.0	False	Permanent Full-time	Male	12.0	DETE	3	Teacher	Resignation-Move overseas/interstate	Experienced
12	36-40	2012.0	False	Permanent Full-time	Female	13.0	DETE	14	Teacher	Resignation-Other reasons	Veteran

In [227]:

combined_service_cat.shape

Out[227]:

(527, 11)

We get a frame with the same rows of combined_updated dataframe plus the service_cat column

C.3. Clean up an transform the "age" column¶

In [228]:

# a preview of values in this column
combined_updated["age"].value_counts(dropna = False)

Out[228]:

51-55            71
NaN              55
41-45            48
41  45           45
46-50            42
36-40            41
46  50           39
26-30            35
21  25           33
31  35           32
26  30           32
36  40           32
21-25            29
31-35            29
56 or older      29
56-60            26
61 or older      23
20 or younger    10
Name: age, dtype: int64

As in the case of "institute_service" column, this column is tricky to clean because it currently contains values in different forms. Because values are similar, we can use the same method that before.

In [229]:

# get the index of rows with nan values on column "institute_service"
index_rows = combined_updated[combined_updated["age"].isnull()].index

# drop rows with null values. Create a new dataframe for not lossing these values in combined_updated
combined_age= combined_updated.copy().drop(labels = index_rows)

In [230]:

combined_age["age"].value_counts(dropna = False)

Out[230]:

51-55            65
41  45           45
41-45            44
46-50            41
36-40            39
46  50           39
21  25           33
26-30            33
36  40           32
31  35           32
26  30           32
56 or older      29
31-35            26
21-25            24
56-60            24
61 or older      20
20 or younger    10
Name: age, dtype: int64

In [231]:

# Change to string and spaces to "-"
combined_age["age"] = (combined_age["age"].astype(str).str.replace(" ", "-"))

combined_age["age"].value_counts(dropna = False)

Out[231]:

51-55            65
41--45           45
41-45            44
46-50            41
46--50           39
36-40            39
26-30            33
21--25           33
36--40           32
26--30           32
31--35           32
56-or-older      29
31-35            26
56-60            24
21-25            24
61-or-older      20
20-or-younger    10
Name: age, dtype: int64

In [232]:

# regular expresion for extracting all the ages of "age" column

pattern = r"(?P<First_Age>[0-9][0-9]?)-?--?(?P<Second_Age>[1-9][0-9]?)?" # question mark, ?, after a group 
                                                            # to indicate that a match for those groups 
                                                            # is optional.Only extracting characters
                                                            # between brakets ()

# dataframe which store all years extracted
ages = combined_age.copy()["age"].str.extractall(pattern) # make a copy of combined_age

# a view of all values stored in columns of ages
ages.apply(pd.value_counts, dropna = False)

Out[232]:

	First_Age	Second_Age
20	10.0	NaN
21	57.0	NaN
26	65.0	NaN
30	NaN	65.0
31	58.0	NaN
36	71.0	NaN
40	NaN	71.0
41	89.0	NaN
45	NaN	89.0
46	80.0	NaN
50	NaN	80.0
51	65.0	NaN
55	NaN	65.0
56	53.0	NaN
61	20.0	NaN
NaN	NaN	59.0
35	NaN	58.0
25	NaN	57.0
60	NaN	24.0

It seems, looking at this table above, that all ages have been stracted.

In [233]:

ages.head()

Out[233]:

		First_Age	Second_Age
	match
5	0	41	45
8	0	31	35
9	0	46	50
11	0	31	35
12	0	36	40

In [234]:

# because years has a MultiIndex, reset it, remove the index level "match" and convert it in a column.
ages = ages.reset_index(level=["match"])

In [235]:

ages.head()

Out[235]:

	First_Age	Second_Age
5	41	45
8	31	35
9	46	50
11	31	35
12	36	40

In [236]:

# drop column match, because is useless
ages = ages.drop("match", axis = 1)

In [237]:

# change all values of its column to float
ages = ages.astype(float)

In [238]:

# transfor Nan values in Second_Age to 0
ages["Second_Age"] = ages["Second_Age"].fillna(0)

In [239]:

# create a new column in year based in First_Age values
ages["ages_mean"] = ages["First_Age"]

# for rows which have values in its Second_Age. The value of "ages_mean" It's the arithmetic mean
ages.loc[ages["Second_Age"] > 0,"ages_mean"] = ((ages.loc[ages["Second_Age"] > 0, "First_Age"]
                                                     + ages.loc[ages["Second_Age"] > 0, "Second_Age"])
                                                     / 2)

ages.head()

Out[239]:

	First_Age	Second_Age	ages_mean
5	41.0	45.0	43.0
8	31.0	35.0	33.0
9	46.0	50.0	48.0
11	31.0	35.0	33.0
12	36.0	40.0	38.0

In [240]:

# drop this two columns of ages that we don't need more
ages = ages.drop(["First_Age", "Second_Age"], axis = 1)

In [241]:

# there is no rows with Nan values
ages["ages_mean"].isnull().sum()

Out[241]:

In [242]:

# a view of the distribution of ages' values
ages["ages_mean"].describe()

Out[242]:

count    568.000000
mean      41.005282
std       11.370852
min       20.000000
25%       33.000000
50%       43.000000
75%       48.000000
max       61.000000
Name: ages_mean, dtype: float64

According to this distribution, as in the case of years of service, we'll convert numbers in this column into categories.

We'll use the three categories below:

Young: Less than 33 years
Middle age: 33-48 years
Senior: More than 48 years

In [243]:

# create a function to transform the values year's service_cat column into categories
def to_categories(val):
    
    if val < 33:
        return "Young"
    elif val >=33 and val <=48:
        return "Middle age"
    else:
        return "Senior"

ages["ages_mean"] = ages["ages_mean"].map(to_categories)

ages["ages_mean"].value_counts()

Out[243]:

Middle age    298
Senior        138
Young         132
Name: ages_mean, dtype: int64

C.4. Joining "ages_mean" column to combined_age dataframe¶

We'll use here the same methon we use before to join "service_cat" column to combined_service_cat dataframe.

In [244]:

combined_age = pd.concat([combined_age, ages], axis = 1)

combined_age.tail()

Out[244]:

	age	cease_date	dissatisfied	employment_status	gender	id	institute	institute_service	position	separationtype	ages_mean
693	26--30	2013.0	False	Temporary Full-time	Female	6.350599e+17	DETE	1-2	Administration (AO)	Resignation	Young
696	21--25	2013.0	False	Temporary Full-time	Male	6.350660e+17	DETE	5-6	Operational (OO)	Resignation	Young
697	51-55	2013.0	False	Temporary Full-time	Male	6.350668e+17	DETE	1-2	Teacher (including LVT)	Resignation	Senior
699	51-55	2013.0	False	Permanent Full-time	Female	6.350704e+17	DETE	5-6	Teacher (including LVT)	Resignation	Senior
701	26--30	2013.0	False	Contract/casual	Female	6.350730e+17	DETE	3-4	Administration (AO)	Resignation	Young

We can look at here that ages_mean occupies the last column, and their join with combine_age dataframe seems fine.

D. VISUALIZING CLEANED DATA¶

In [245]:

# for this last task we'll import matplop library
import matplotlib.pyplot as plt

%matplotlib inline

D.1. Visualizing data of combined_service_cat dataframe¶

Firstly we'll check the data in the columns of our interest.

In [246]:

combined_service_cat["service_cat"].value_counts(dropna = False)

Out[246]:

New            179
Experienced    164
Veteran        127
Established     57
Name: service_cat, dtype: int64

In [247]:

combined_service_cat["dissatisfied"].value_counts(dropna = False)

Out[247]:

False    316
True     211
Name: dissatisfied, dtype: int64

There are no missing values in these columns above

In [248]:

# this is needed because if not code bellow doesn't work
combined_service_cat["dissatisfied"] = combined_service_cat["dissatisfied"].astype(bool)

In [249]:

combined_service_pv = combined_service_cat.pivot_table(index = "service_cat", values = "dissatisfied")

In [250]:

# draw a bar plot
combined_service_pv.plot(kind = "bar", title = "Percentaje of dissatisfied by service categories")

plt.show()

GRAPH 1. Percentajes of dissatisfied grouping by service_cat (years of service) categories

D.2. Visualizing data of combined_age dataframe¶

Firstly we'll check the data in the columns of our interest.

In [251]:

combined_age["ages_mean"].value_counts(dropna = False)

Out[251]:

Middle age    298
Senior        138
Young         132
Name: ages_mean, dtype: int64

In [252]:

combined_age["dissatisfied"].value_counts(dropna = False)

Out[252]:

False    343
True     225
Name: dissatisfied, dtype: int64

There are no missing values in these columns above

In [254]:

# this is needed because if not code bellow doesn't work
combined_age["dissatisfied"] = combined_age["dissatisfied"].astype(bool)

In [255]:

combined_age_pv = combined_age.pivot_table(index = "ages_mean", values = "dissatisfied")

In [256]:

# draw a bar plot
combined_age_pv.plot(kind = "bar", title = "Percentaje of dissatisfied by age")

plt.show()

E. CONCLUSIONS¶

According to GRAPH 1 above: only a 30% approximately of "New" employees (Less than 3 years at their workplace) were dissatisfied with their jobs when they exit. The were the lowest dissatisfied of all groups, classified by ages of service, in this two surveys.

On the other hand, employees belonging to "Established" and Veteran cathegories, with more than 7 years and 11 years of servis respectively, were the most dissatisfied. With more than 50% employees dissatisfied in some way, when they resigned.

According to GRAPH 2 above: a 36% approximately of "Young" employees (less than 33 years old) were dissatisfied with their jobs when they exit.

On the other hand, employees classified as "Senior", with more than 48 years old, were the most dissatisfied of all groups of age. With more than 42% dissatisfied in some way, when they resigned.

Nevertheless, we can see there are very little differences between this three groups of age, in relation to their percentaje of dissatisfaction when they resigned.