#!/usr/bin/env python
# coding: utf-8

# # Guided Project: Clean and Analyse Exit Surveys
#
# In this guided project, we work with exit surveys from employees of the [Department of Education, Training and Employment](https://en.wikipedia.org/wiki/Department_of_Education_(Queensland)) (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. You can find the TAFE exit survey here and the survey for the DETE [here](https://data.gov.au/dataset/ds-qld-fe96ff30-d157-4a81-851d-215f2a0fe26d/details?q=exit%20survey).
#
# ## Aim:
# - To analyse the various reasons for resignation, mainly dissatisfaction.
# - To analyse resignations according to age, experience and time worked at the institute.

# In[1]:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

dete_survey = pd.read_csv('dete_survey.csv')
tafe_survey = pd.read_csv('tafe_survey.csv')

# In[2]:

dete_survey.head()

# In[3]:

dete_survey.info()

# - Most of the columns contain values of type object, including the fields corresponding to dates.
# - Some columns hold boolean values (these may help us classify the type of resignation).
# - The last five columns consist almost entirely of null values (they are likely to be dropped).

# In[4]:

dete_survey.isnull().sum()

# In[5]:

tafe_survey.head()

# In[6]:

tafe_survey.info()

# In[7]:

tafe_survey.isnull().sum()

# 1. Some columns in the two datasets are named differently but hold the same information. We need to rename them so the data stays consistent when we combine it and answer questions like:
#    - Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
#    - Are younger employees resigning due to some kind of dissatisfaction? What about older employees?
#
# 2. Some cells in dete_survey contain 'Not Stated' rather than NaN. While cleaning the data we have to check each column for uniformity: the values should share a datatype and belong to the same group as the rest of the data. For example, if a column contains students' marks, we should make sure there are no values that represent grades or percentages.
#
# 3. Many columns in both surveys are not needed for our analysis.
#
# 4. The two surveys do not have the same number of columns, so we will need to filter down to the useful ones.

# In[8]:

# solving point 2: missing values are marked by 'Not Stated'
dete_survey = pd.read_csv('dete_survey.csv', na_values='Not Stated')
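# As a quick check of our own (a minimal sketch, not one of the original guided steps), we can confirm that the literal placeholder string 'Not Stated' no longer appears anywhere after re-reading the file with na_values set.

# In[ ]:

# count occurrences of the placeholder string in each text column
placeholder_counts = (dete_survey
                      .select_dtypes(include='object')
                      .apply(lambda col: (col == 'Not Stated').sum()))
placeholder_counts[placeholder_counts > 0]  # expected to be empty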
# ## Cleaning Column Names

# In[9]:

# column names are tidier in dete_survey, so we will rename the columns in tafe_survey to match dete_survey
# first, we standardise the column names in dete_survey
dete_survey.columns = dete_survey.columns.str.strip().str.lower().str.replace(' ', '_')
dete_survey.columns

# In[10]:

dete_survey.rename({'separationtype': 'separation_type'}, axis=1, inplace=True)
dete_survey.head()

# In[11]:

# renaming in tafe_survey
columns = {'Record ID': 'id',
           'CESSATION YEAR': 'cease_date',
           'Reason for ceasing employment': 'separation_type',
           'Gender. What is your Gender?': 'gender',
           'CurrentAge. Current Age': 'age',
           'Employment Type. Employment Type': 'employment_status',
           'Classification. Classification': 'position',
           'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
           'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service',
           'Contributing Factors. Dissatisfaction': 'factors_diss',
           'Contributing Factors. Job Dissatisfaction': 'factors_job_diss'
           }
tafe_survey = tafe_survey.rename(columns=columns)

# In[12]:

tafe_survey.columns

# ## Dropping Columns

# In[13]:

# now we will drop some columns we don't need:
# the last 5 columns in dete_survey are dominated by null values,
# and columns 28-48 hold coded values with no particular meaning and no guide to weigh them,
# so they are of no use to us
dete_survey = dete_survey.drop(dete_survey.columns[-5:], axis=1)
dete_survey = dete_survey.drop(dete_survey.columns[28:49], axis=1)

# In[14]:

# we choose the columns to drop on the same basis as we did for dete_survey
tafe_survey = tafe_survey.drop(tafe_survey.columns[17:66], axis=1)

# In[15]:

dete_survey.head()

# In[16]:

tafe_survey.head()

# In[17]:

# checking for duplicates by id
dete_survey.duplicated(['id']).value_counts()

# In[18]:

tafe_survey.duplicated(['id']).value_counts()

# ## Data Filtering (Resigned Employees)
#
# For this project we only need to focus on the people who resigned, so we'll filter both datasets, keeping the rows whose separation_type contains the word 'Resignation'.

# In[19]:

dete_survey['separation_type'].value_counts(dropna=False)

# All resignation types together account for around 53% of the DETE separations, followed by Age Retirement at around 35%. Below we will see that resignations make up just under 50% of tafe_survey. Such a high proportion of resignations calls for some answers, which we will try to find after preparing our data.

# In[20]:

tafe_survey['separation_type'].value_counts(dropna=False)

# In the DETE survey, the separation_type column has three values containing the word 'Resignation', whereas the TAFE survey has only one.

# In[21]:

# keeping only the separations due to resignation
dete_survey['separation_type'] = dete_survey['separation_type'].str.split('-').str[0]
dete_resignations = dete_survey.copy()[dete_survey['separation_type'].str.contains(r'Resignation')]

# In[22]:

tafe_resignations = tafe_survey.copy()[tafe_survey['separation_type'].str.contains(r'Resignation', na=False)]
tafe_resignations.head()

# ## Data Verification / Checking Validity
#
# Now we will check the validity of the dates and years in our dataset.
# Starting with `cease_date`: the last year of a person's employment should not be later than the current year (the year the data was collected). Similarly, for `dete_start_date`, assuming that nobody working here is over 60 and that they started working in their 20s, the smallest value in this column should be around 1935.

# In[23]:

dete_resignations['cease_date'].value_counts()

# These are the values we need to check against the assumptions discussed above; first we extract the year.

# In[24]:

# cleaning the column by extracting the years
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.extract(r'(?P<cease_date>[1-2][0-9]{3})')
dete_resignations['cease_date'] = dete_resignations['cease_date'].astype(float)
dete_resignations['cease_date'].value_counts()

# In[25]:

dete_resignations['dete_start_date'] = dete_resignations['dete_start_date'].astype(float)
dete_resignations['dete_start_date'].value_counts()

# In[26]:

# no need to change or extract anything here
tafe_resignations['cease_date'].value_counts()

# We have verified the date columns and none of the dates violates our assumptions. We needed to verify this because we are going to sort and filter the employees based on their experience.
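# The same verification can be expressed as a small check (a sketch of our own; the bounds 1935 and 2014 are assumptions based on the reasoning above, not values taken from the survey documentation).

# In[ ]:

ASSUMED_EARLIEST_START = 1935  # assumed lower bound for start years
ASSUMED_SURVEY_YEAR = 2014     # assumed year the data was collected

assert dete_resignations['cease_date'].dropna().between(ASSUMED_EARLIEST_START, ASSUMED_SURVEY_YEAR).all()
assert dete_resignations['dete_start_date'].dropna().between(ASSUMED_EARLIEST_START, ASSUMED_SURVEY_YEAR).all()

# cross-check: no start date should come after the corresponding cease date
both_present = dete_resignations[['dete_start_date', 'cease_date']].dropna()
print('start dates after cease dates:', (both_present['dete_start_date'] > both_present['cease_date']).sum())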
# ## Calculating Experience / Service Time
#
# The tafe_resignations dataframe already contains a "service" column, which we renamed to institute_service. In order to analyse both surveys together, we'll have to create a corresponding institute_service column in dete_resignations.

# In[27]:

dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
dete_resignations['institute_service'].value_counts()

# - Excluding the null values of the institute_service field in the DETE dataset, we observe that 42% of the employees worked there at most 5 years.

# In[28]:

dete_resignations['institute_service'].isnull().sum()

# In[29]:

tafe_resignations['institute_service'].value_counts(dropna=False)

# As we can see from the values above, employees with relatively little experience have resigned in higher proportion.
# - Excluding null values, 47% of the employees who resigned from TAFE had worked there for less than 2 years (both this figure and the 42% above are reproduced in the sketch below).
#
# Also, when we combine the datasets, institute_service in TAFE should have the same format as in DETE. We will deal with this later in this notebook.
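# A minimal sketch to reproduce the two tenure figures quoted above; the category labels used for TAFE ('Less than 1 year' and '1-2') are the ones shown in the value counts.

# In[ ]:

dete_service = dete_resignations['institute_service'].dropna()
print('DETE, at most 5 years:', round((dete_service <= 5).sum() / len(dete_service) * 100, 1), '%')

tafe_service = tafe_resignations['institute_service'].dropna()
short_tenure = tafe_service.isin(['Less than 1 year', '1-2'])
print('TAFE, less than 2 years:', round(short_tenure.sum() / len(tafe_service) * 100, 1), '%')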
# After cleaning and preparing our data, the next step is to classify employees as dissatisfied, so that we can later group them by experience and answer our questions.

# ## Identification of Dissatisfied Employees
#
# We will keep working with dete_resignations and tafe_resignations, as these datasets contain only the employees who resigned.
# In dete_resignations, the following columns describe factors that contribute to an employee's decision to leave:
# - 13 Job dissatisfaction
# - 14 Dissatisfaction with the department
# - 15 Physical work environment
# - 16 Lack of recognition
# - 17 Lack of job security
# - 18 Work location
# - 19 Employment conditions
# - 20 Maternity/family
# - 21 Relocation
# - 22 Study/Travel
# - 23 Ill Health
# - 24 Traumatic incident
# - 25 Work life balance
# - 26 Workload

# In[30]:

dete_resignations.loc[:, 'job_dissatisfaction':'workload']

# Since these columns hold boolean values, an employee will be classified as 'dissatisfied' if even one of the columns in the corresponding row is True.
# Before proceeding, note that we should not count the columns from maternity/family through traumatic_incident towards 'dissatisfied', as they are not related to the institute.

# In[31]:

# we will create a boolean column 'dissatisfied'
# the role of 'any' is to return whether any element along a row is True
def update_vals(x):
    if x == '-':
        return False
    elif pd.isnull(x):
        return np.nan
    else:
        return True

# the DETE factor columns are already boolean, so we can apply any() directly;
# update_vals will be needed for the TAFE text columns below
dete_resignations['dissatisfied'] = dete_resignations[['job_dissatisfaction',
                                                       'dissatisfaction_with_the_department',
                                                       'physical_work_environment',
                                                       'lack_of_recognition',
                                                       'lack_of_job_security',
                                                       'work_location',
                                                       'employment_conditions',
                                                       'work_life_balance',
                                                       'workload']].any(axis=1, skipna=False)
dete_resignations_up = dete_resignations.copy()
dete_resignations_up['dissatisfied'].value_counts(dropna=False)

# In[32]:

dete_resignations_up['dissatisfied'].head()

# In[33]:

dete_resignations_up['dissatisfied'].value_counts()

# Similarly, for tafe_resignations we will use two columns:
# - Contributing Factors. Dissatisfaction
# - Contributing Factors. Job Dissatisfaction

# In[34]:

def update_vals(x):
    if x == '-':
        return False
    elif pd.isnull(x):
        return np.nan
    else:
        return True

# the TAFE factor columns contain text ('-' means the factor was not selected),
# so we first map them to booleans and then aggregate with any();
# with skipna=False, rows whose answers are missing are not silently counted as False
tafe_resignations['dissatisfied'] = tafe_resignations[['factors_diss', 'factors_job_diss']].applymap(update_vals).any(axis=1, skipna=False)
tafe_resignations['dissatisfied'].head()
tafe_resignations_up = tafe_resignations.copy()
tafe_resignations_up['dissatisfied'].value_counts(dropna=False)
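# To make the behaviour of update_vals and any(axis=1, skipna=False) concrete, here is a tiny illustration on made-up data (the frame below is not part of the surveys).

# In[ ]:

demo = pd.DataFrame({'factor_a': ['-', 'Job Dissatisfaction', np.nan],
                     'factor_b': ['-', '-', np.nan]})
demo.applymap(update_vals).any(axis=1, skipna=False)
# expected: row 0 -> False (no factor selected), row 1 -> True,
# row 2 -> NaN (no answer), which is why 'dissatisfied' still contains nulls that we handle later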
# ## Combining DataFrames
#
# We have performed various actions to clean and filter our data, so we are now ready to merge it. As practised in earlier lessons, each dataset should keep its own identity when merging, so we give each one a column holding its name.

# In[35]:

dete_resignations_up['institute'] = 'DETE'
tafe_resignations_up['institute'] = 'TAFE'

# In[36]:

# combining
combined = pd.concat([dete_resignations_up, tafe_resignations_up], ignore_index=True)
combined.shape

# In[37]:

combined_null = combined.isnull().sum()
combined_null

# In[38]:

# columns with at least 400 missing values are the ones we expect to drop
columns_drop = combined_null[combined_null >= 400]
columns_drop

# In[39]:

# keep only the columns that have at least 200 non-null values
combined = combined.dropna(thresh=200, axis=1)
combined

# In[40]:

combined['institute_service']

# In[41]:

# the column is quite inconsistent in its values (because of tafe_survey)
type(combined['institute_service'][2])

# In[42]:

# the column consists of different dtypes, so a direct string operation is impossible
type(combined['institute_service'][640])

# In[43]:

combined.head()

# ## Classifying Employees by Length of Service
#
# Rather than using exact years, we will classify employees into experience ranges, as ranges form more useful classes. Our classification is as follows:
# - New: Less than 3 years at a company
# - Experienced: 3-6 years at a company
# - Established: 7-10 years at a company
# - Veteran: 11 or more years at a company

# In[44]:

def service_category(val):
    if pd.isna(val):
        return np.nan
    elif val < 3:
        return 'New'
    elif val < 7:
        return 'Experienced'
    elif val < 11:
        return 'Established'
    else:
        return 'Veteran'

# In[45]:

# combined['institute_service'].astype('str')
combined['institute_service'].value_counts(dropna=False)

# In[46]:

combined['institute_service'] = combined['institute_service'].astype('str')
type(combined['institute_service'][0])

# In[47]:

# cleaning the column entries / transforming them into the same format
def clean(val):
    if pd.isna(val):
        return np.nan
    if '-' in val:
        # ranges such as '1-2' or '7-10': keep the upper bound of the range
        return float(val.split('-')[-1])
    else:
        # 'Less than 1 year', 'More than 20 years' and plain numbers: keep the first token
        return val.split()[0]

combined['institute_service'] = combined['institute_service'].apply(clean)
combined['institute_service']

# In[48]:

combined.loc[combined['institute_service'] == 'Less', 'institute_service'] = 1.0
combined.loc[combined['institute_service'] == 'More', 'institute_service'] = 20.0
combined['institute_service'] = combined['institute_service'].astype('float')
combined['institute_service'].value_counts(dropna=False)

# In[49]:

combined['institute_service'] = combined['institute_service'].astype('float')
combined['institute_service'].value_counts(dropna=False)

# In[50]:

combined['service_cat'] = combined['institute_service'].apply(service_category)
combined['service_cat'].value_counts(dropna=False)

# In[51]:

combined['age']

# In[52]:

def convert_range(val):
    if pd.notna(val):
        if ' ' in val:
            numbers = [int(i) for i in val.split() if i.isdigit()]
        else:
            numbers = [int(i) for i in val.split('-') if i.isdigit()]
        return int(sum(numbers) / len(numbers))
    return np.nan

combined['age'] = combined['age'].apply(convert_range)

# In[53]:

combined['age'].value_counts(dropna=False)

# In[54]:

def age_category(val):
    if pd.isna(val):
        return np.nan
    elif val < 31:
        return 'Junior'
    elif val < 46:
        return 'Middle'
    else:
        return 'Senior'

combined['age'] = combined['age'].astype('float')
combined['age_cat'] = combined['age'].apply(age_category)
combined['age_cat'].value_counts(dropna=False)

# ## Removing Irrelevant Columns / Keeping Relevant Columns
#
# Since we have already used some columns to decide dissatisfaction, we no longer need them. Hence, we will identify and keep only the columns we need for our analysis.

# In[55]:

combined.info()

# In[56]:

# we choose the columns according to the questions we are trying to answer;
# the boolean factor columns are dropped because their information has already been
# condensed into the new 'dissatisfied' column, which is more convenient to work with
combined = combined[['id', 'cease_date', 'position', 'employment_status', 'gender', 'age',
                     'institute_service', 'dissatisfied', 'institute', 'service_cat', 'age_cat']]
combined.sample(5)

# ## Handling missing values in combined

# In[57]:

combined.isnull().sum()

# In[58]:

index = combined['dissatisfied'].isna()
index

# In[59]:

combined.loc[index, 'dissatisfied']

# About 1% of the values in dissatisfied are null, so we can fill them with the majority value, which is False.
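# A quick check of the figure above (a small sketch of our own):

# In[ ]:

null_share = combined['dissatisfied'].isnull().sum() / len(combined) * 100
print(round(null_share, 2), '% of the dissatisfied values are missing')
combined['dissatisfied'].value_counts(dropna=False, normalize=True)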
# In[60]:

combined.loc[index, 'dissatisfied'] = False

# In[61]:

# combined['dissatisfied'] = combined['dissatisfied'].fillna(False)
combined['dissatisfied'].value_counts(dropna=False)

# In[62]:

combined.isnull().sum()

# In the columns age (with its corresponding age_cat) and institute_service (with its corresponding service_cat), we observe a significant number of missing values.

# In[63]:

# checking where the missing values of the two columns intersect
age_service_null_bool = (combined['service_cat'].isnull()) & (combined['age_cat'].isnull())
age_service_null = combined[age_service_null_bool]
age_service_null

# For these rows both columns are null, so even if we wanted to fill in the missing values, we could not: the employee's data is missing in all the relevant columns.

# In[64]:

combined.drop(combined[age_service_null_bool].index, inplace=True)
combined.reset_index(drop=True, inplace=True)

# In[65]:

combined

# We have now eliminated the rows where age_cat and service_cat are both null. Since our questions depend on these columns, we will try to fill the remaining missing values. Our next step is to fill each of the two columns using information from the other.

# In[66]:

service_null = combined[combined['service_cat'].isnull()]
service_null

# To fill in the missing values in service_cat, we will use the most frequent service category of the corresponding age group.
# Our boolean mask also requires institute to be 'DETE' because, even though the data is combined, the relationship between age and service category may differ between the two institutions.

# In[67]:

age_cat_list = ['Junior', 'Middle', 'Senior']
for i in age_cat_list:
    print(i, ":")
    print(combined[(combined['institute'] == 'DETE') & (combined['age_cat'] == i)]['service_cat'].value_counts())

# In[68]:

service_null['institute'].unique()

# Since DETE is the only institute with missing service_cat values, we do not have to look at the corresponding age groups for TAFE.
#
# From the counts above, we can say that:
# - We will fill in the values of the service_cat field with 'Experienced' where the age_cat field is Junior.
# - We will fill in the values of the service_cat field with 'Established' where the age_cat field is Middle.
# - We will fill in the values of the service_cat field with 'Veteran' where the age_cat field is Senior.

# In[69]:

max_values = ['Experienced', 'Established', 'Veteran']

# In[70]:

for i, j in zip(age_cat_list, max_values):
    sort_bool = (combined['institute'] == 'DETE') & (combined['age_cat'] == i) & (combined['service_cat'].isnull())
    combined.loc[sort_bool, 'service_cat'] = j

# In[71]:

combined['service_cat'].isnull().sum()

# In[72]:

combined[combined['age_cat'].isnull()]

# We can continue in the same way, finding the most frequent age category for the corresponding service category. Again, we only need to check DETE.

# In[73]:

service_cat_list = ['Veteran', 'New']
for i in service_cat_list:
    print(i, ":")
    print(combined[(combined['institute'] == 'DETE') & (combined['service_cat'] == i)]['age_cat'].value_counts(dropna=False))

# In[74]:

max_values = ['Senior', 'Middle']
for i, j in zip(service_cat_list, max_values):
    sort_bool = (combined['institute'] == 'DETE') & (combined['service_cat'] == i) & (combined['age_cat'].isnull())
    combined.loc[sort_bool, 'age_cat'] = j

# In[75]:

combined[combined['age_cat'].isnull()]

# In[76]:

combined.info()

# In[77]:

combined.isnull().sum()

# There are still some missing values in cease_date, position, employment_status, gender, age and institute_service. These columns are not as important as dissatisfied, service_cat and age_cat, and since the latter have no null values left, we can move ahead with our analysis.
#
# One might argue that we should also handle the null values of institute_service, but we do not need to, as we have already used it to derive the service categories.

# ## Data Analysis
#
# Now we'll look at the number of dissatisfied employees who left the institutes at different stages of their career. We'll sort the values in descending order and then visualise the results.

# In[78]:

combined['dissatisfied'].value_counts()

# In[79]:

diss_sc = combined.pivot_table(index='service_cat', values='dissatisfied',
                               aggfunc=np.sum)['dissatisfied'].sort_values(ascending=False).to_frame()
DETE_sc = combined[combined['institute'] == 'DETE'].reset_index(drop=True)
diss_sc['DETE'] = DETE_sc.pivot_table(index='service_cat', values='dissatisfied',
                                      aggfunc=np.sum)['dissatisfied'].sort_values(ascending=False)
TAFE_sc = combined[combined['institute'] == 'TAFE'].reset_index(drop=True)
diss_sc['TAFE'] = TAFE_sc.pivot_table(index='service_cat', values='dissatisfied',
                                      aggfunc=np.sum)['dissatisfied'].sort_values(ascending=False)
diss_sc

# In[80]:

diss_sc = diss_sc.rename(columns={'dissatisfied': 'TOTAL'})

# In[81]:

diss_sc.index

# In[82]:

# set the name attribute of the index
diss_sc.index.name = 'Service Category'

# In[83]:

diss_sc

# In[84]:

diss_sc = diss_sc.fillna(0)
diss_sc['TAFE'] = diss_sc['TAFE'].astype('int')
diss_sc

# In[85]:

diss_sc.plot(kind='bar', rot=0)
plt.title('Dissatisfied by Service Category', fontsize=20)
plt.xlabel('Service category', fontsize=12)
plt.ylabel('Number of Dissatisfied Employees', fontsize=12)
plt.legend()
# use sns.despine() and frameon=False to remove any borders

# - Looking at the TOTAL column, the Veteran, New and Experienced categories contribute in roughly equal proportions, each at about 30%.
#
# - In DETE, the largest share of dissatisfied employees are Veterans, at 42.28%, while in TAFE the largest share are New employees, at about 55%.
#
# - In every service category except New, DETE has the higher number of employees leaving due to dissatisfaction.
#
# - Overall, 73% of the employees who resigned because of dissatisfaction came from the DETE institute, while only 27% came from TAFE (see the sketch below).
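# A short sketch of how the percentages quoted above can be reproduced from diss_sc (the exact figures depend on the fills made earlier in the notebook):

# In[ ]:

diss_share = diss_sc[['DETE', 'TAFE']].copy()
# share of each service category within its own institute
diss_share['DETE_pct'] = (diss_share['DETE'] / diss_share['DETE'].sum() * 100).round(2)
diss_share['TAFE_pct'] = (diss_share['TAFE'] / diss_share['TAFE'].sum() * 100).round(2)
# split of all dissatisfied leavers between the two institutes
dete_total, tafe_total = diss_share['DETE'].sum(), diss_share['TAFE'].sum()
print('share of dissatisfied leavers from DETE:', round(dete_total / (dete_total + tafe_total) * 100, 1), '%')
diss_share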
# In[86]:

combined[combined['institute'] == 'TAFE']

# In[87]:

fig = plt.figure(figsize=(18, 10), dpi=1600)
ax1 = plt.subplot2grid((2, 4), (0, 0))
plt.pie(diss_sc.iloc[:, 1], labels=diss_sc.index, colors=('b', 'g', 'y', 'r'))
plt.title('DETE')
ax1 = plt.subplot2grid((2, 4), (0, 1))
plt.pie(diss_sc.iloc[:, 2], labels=diss_sc.index, colors=('b', 'g', 'y', 'r'))
plt.title('TAFE')

# ## Dissatisfaction by Age Category

# In[88]:

diss_ac = combined.pivot_table(index='age_cat', values='dissatisfied',
                               aggfunc=np.sum)['dissatisfied'].sort_values(ascending=False).to_frame()
DETE_sc = combined[combined['institute'] == 'DETE'].reset_index(drop=True)
diss_ac['DETE'] = DETE_sc.pivot_table(index='age_cat', values='dissatisfied',
                                      aggfunc=np.sum)['dissatisfied'].sort_values(ascending=False)
TAFE_sc = combined[combined['institute'] == 'TAFE'].reset_index(drop=True)
diss_ac['TAFE'] = TAFE_sc.pivot_table(index='age_cat', values='dissatisfied',
                                      aggfunc=np.sum)['dissatisfied'].sort_values(ascending=False)
diss_ac

# In[89]:

diss_ac.index.name = 'Age Category'
diss_ac

# In[90]:

diss_ac.plot(kind='bar', rot=0)
plt.title('Dissatisfied by Age Category', fontsize=20)
plt.xlabel('Age category', fontsize=12)
plt.ylabel('Number of Dissatisfied Employees', fontsize=12)
plt.legend()
# use sns.despine() and frameon=False to remove any borders

# - Looking at the totals, the Senior and Middle age groups are the most dissatisfied with their jobs: almost 80% of the employees who left due to dissatisfaction belong to these two groups.
#
# - In DETE, the largest share of dissatisfied employees are Seniors, at 43.6%, while in TAFE the Senior and Middle groups each account for about 37%.
#
# - In every age category, DETE has the higher number of employees leaving due to dissatisfaction, as the pie charts below show.

# In[91]:

fig = plt.figure(figsize=(18, 10), dpi=1600)
ax1 = plt.subplot2grid((2, 4), (0, 0))
plt.pie(diss_ac.iloc[0, 1:3], labels=diss_ac.columns[1:3], colors=('orange', 'g'))
plt.title('SENIOR')
ax1 = plt.subplot2grid((2, 4), (0, 1))
plt.pie(diss_ac.iloc[1, 1:3], labels=diss_ac.columns[1:3], colors=('orange', 'g'))
plt.title('MIDDLE')
ax1 = plt.subplot2grid((2, 4), (0, 2))
plt.pie(diss_ac.iloc[2, 1:3], labels=diss_ac.columns[1:3], colors=('orange', 'g'))
plt.title('JUNIOR')

# ## Conclusion
#
# - Generally, employees who have served for more years, or who are older, make up the majority of those who leave due to dissatisfaction.
#
# - Overall, 73% of the employees who resigned because of dissatisfaction came from the DETE institute, while only 27% came from TAFE.