Analysing Employee Exit Surveys by Combining Multiple Datasets
Author - Raghav_A
For this project, I was approached by two (imaginary) institutes - the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. Both the stakeholders were eager to know -
- Are employees who served for only a short period resigning due to some kind of dissatisfaction? What about employees who have served longer?
- Are younger employees resigning due to some kind of dissatisfaction? What about older employees?
Both the institutes made their own exit-survey data available to me. Also, they wanted the datasets to be combined before I answered these questions.
The datasets are also available in the public domain; I have provided the links below -
TAFE
- https://data.gov.au/dataset/ds-qld-89970a3b-182b-41ea-aea2-6f9f17b5907e/details?q=exit%20survey
DETE
- https://data.gov.au/dataset/ds-qld-fe96ff30-d157-4a81-851d-215f2a0fe26d/details?q=exit%20survey
Before familiarising myself with the datasets, I will import the relevant Python libraries.
Since I intend to work with dataframes and arrays specifically (the data is tabular and stored in .CSV files), I shall import the NumPy and pandas libraries for now.
Having read my files into DataFrame objects, I shall use the .head() and .info() methods to look inside the datasets and familiarise myself with them.
Here we go -
import pandas as pd
import numpy as np
dete_survey = pd.read_csv('dete_survey.csv')
tafe_survey = pd.read_csv('tafe_survey.csv')
dete_survey.head()
tafe_survey.head()
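The cells above only call .head(); since .info() was mentioned as well, here is the structural summary I would also run at this point (a small addition on my part) -
# Column dtypes and non-null counts for both datasets
dete_survey.info()
tafe_survey.info()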
# List the column names (with their positional index) in the DETE dataset
for index, colname in enumerate(dete_survey.columns):
    print(index, colname)
# List the column names (with their positional index) in the TAFE dataset
for index, colname in enumerate(tafe_survey.columns):
    print(index, colname)
Some things in life just aren't perfect.
Bad data almost always creeps into datasets, and the bigger the dataset, the higher the volume of bad data. Cleaning and reshaping the data thus becomes a crucial aspect of data analysis.
The following observations can be made immediately after glancing at the datasets above -
- The dete_survey dataset contains 'Not Stated' values that indicate missing data, but they aren't represented as NaN.
- Both the dete_survey and tafe_survey dataframes contain many columns that we don't need to complete our analysis.
Re-reading "Not Stated" values in the datasets as NaN values
From 1.1, a couple of Not Stated values can be observed. These values are of the str type, and it would be better to re-read them as NaN (NULL) values.
dete_survey = pd.read_csv('dete_survey.csv',na_values = 'Not Stated')
dete_survey[:3]
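As a quick sanity check (my addition, not part of the original run), we can confirm that the 'Not Stated' entries are now being counted as missing values -
# Columns with the most missing values; 'Not Stated' entries now show up as NaN
dete_survey.isnull().sum().sort_values(ascending=False).head()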
From 1.4, it can be observed that tafe_survey has too many columns, most of them with rather long names. It would be better to replace the column names in the tafe_survey dataset with some easy-to-write column names (for only the relevant columns) -
replace_dict = {'Record ID':'id',
'CESSATION YEAR':'cease_date',
'Reason for ceasing employment': 'separationtype',
'Gender. What is your Gender?': 'gender',
'CurrentAge. Current Age': 'age',
'Employment Type. Employment Type': 'employment_status',
'Classification. Classification': 'position',
'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'
}
tafe_survey.rename(columns = replace_dict, inplace = True)
tafe_survey[:3]
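One small defensive tweak worth knowing (my suggestion, not in the original): rename silently ignores dictionary keys that don't match any column, so a typo in replace_dict would go unnoticed. The call above could equivalently be written with errors='raise', which turns such a mismatch into a KeyError -
# Raise a KeyError if any key in replace_dict is not an actual column name
tafe_survey.rename(columns = replace_dict, errors = 'raise', inplace = True)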
Standardising column names in the DETE dataset - converting to lowercase, stripping whitespace, replacing spaces with the "_" character, and removing the "." character from column names
# Note: regex=False makes '.' a literal character; with regex=True, '.' would match (and delete) everything
dete_survey.columns = dete_survey.columns.str.lower().str.strip().str.replace(' ', '_').str.replace('.', '', regex=False)
dete_survey[:2]
tafe_survey.columns = tafe_survey.columns.str.lower().str.strip().str.replace(' ', '_').str.replace('.', '', regex=False)
tafe_survey[:2]
From 1.3 and 1.4, it can be observed that both datasets have a lot of columns that don't seem important from the point of view of our objective. It is best to declutter the datasets and remove those unnecessary columns -
# Drop Junk Columns in DETE Dataset
dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis = 1)
# Display Columns in DETE Dataset
dete_survey_updated.columns
# Drop Junk Columns in TAFE Dataset
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis = 1)
# Display Columns in TAFE Dataset
tafe_survey_updated.columns
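As a quick check (my addition), comparing the shapes before and after confirms how many columns were dropped from each dataset -
# (rows, columns) before vs after dropping the junk columns
print(dete_survey.shape, '->', dete_survey_updated.shape)
print(tafe_survey.shape, '->', tafe_survey_updated.shape)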
Filtering rows whose separation type is resignation
Since our objective is to carry out the analysis on only the subset of employees who resigned, we will single out the rows where the value in the separationtype column contains the Resignation keyword.
6.1 TAFE dataset
6.1.1 Analysing the distinct count of separationtype
of TAFE employees
Out of 702 employees who separated from TAFE, around 340 resigned.
tafe_survey['separationtype'].value_counts(dropna = False)
6.1.2 Filtering rows having Resignation
value in separationtype
column
tafe_resignations = tafe_survey_updated[tafe_survey_updated['separationtype']=='Resignation'].copy()
6.2 DETE Dataset
6.2.1 Analysing the distinct count of separationtype
of DETE employees
Of the 822 employees in DETE, 311 resigned.
dete_survey['separationtype'].value_counts(dropna = False)
6.2.2 Filtering rows having Resignation
value in separationtype
column
dete_resignations = dete_survey_updated[dete_survey_updated['separationtype'].str.contains(r'Resignation', na=False)].copy()
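Unlike TAFE, the DETE data records several resignation subtypes, which is why str.contains is used here instead of an exact match. A quick check (my addition) confirms that only Resignation-prefixed separation types remain -
# Every remaining separation type should start with 'Resignation'
dete_resignations['separationtype'].value_counts()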
Cleaning the date columns
The cease_date column from the DETE dataset
It appears that some values of the cease_date column in dete_resignations are formatted as mm/yyyy, while others are formatted as yyyy. In such a scenario, it is better to strip the mm/ prefix from the cease_date values, for ease of calculation in the analysis ahead.
dete_resignations.head()
# Keep only the last four characters (the 'yyyy' part) and convert to float
dete_resignations['cease_date'] = dete_resignations['cease_date'].astype(str).str[-4:].astype(float)
dete_resignations['cease_date'].value_counts(dropna = False)
Creating a years-of-service column in the DETE dataset (institute_service)
# Years of service of DETE employees (before resigning)
dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
dete_resignations['institute_service']
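A sanity check I would add here (not in the original run): the computed years of service should never be negative, since that would mean a cease date earlier than the start date -
# Count of rows with an impossible (negative) service length; ideally 0
(dete_resignations['institute_service'] < 0).sum()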
Now that the Resignation type of data has been filtered, we can figure out whether the reason for resignation was employee dissatisfaction or something else.
From the column names listed in 1.3 & 1.4, the columns associated with employee dissatisfaction need to be singled out first.
Once that is done, those columns will be used to create a single (Boolean) column called dissatisfied in both datasets, which shall have -
- a True value for an employee whose reason for resignation was some sort of dissatisfaction, and
- a False value for an employee whose reason for resignation was not dissatisfaction.
8.1 DETE Dataset
DETE Dataset resignation factor columns (Highlighted in Yellow)
# Columns that flag some form of dissatisfaction as a resignation factor
dete_dissed_df = dete_resignations[['job_dissatisfaction','dissatisfaction_with_the_department',
'physical_work_environment','lack_of_recognition',
'lack_of_job_security','work_location','employment_conditions',
'work_life_balance','workload'
]]
dete_dissed_df
dete_dissed_s = dete_dissed_df.any(axis = 1, skipna = False)
dete_dissed_s
dete_resignations['dissatisfied'] = dete_dissed_s.copy()
8.2 TAFE Dataset
TAFE Dataset resignation factor columns (Highlighted in Yellow)
tafe_dissed_df = tafe_resignations[['contributing_factors_dissatisfaction',
'contributing_factors_job_dissatisfaction'
]]
tafe_dissed_df.head(25)
# Map each contributing-factor value to a Boolean:
#   '-'  -> False (the factor did not contribute)
#   NaN  -> NaN   (no response)
#   anything else -> True (the factor contributed)
def update_vals(element):
    if element == '-':
        return False
    elif pd.isnull(element):
        return np.nan
    else:
        return True
temp_df = tafe_dissed_df.applymap(update_vals)
tafe_dissed_s = temp_df.any(axis=1, skipna = False)
temp_df
tafe_dissed_s.head(25)
tafe_resignations['dissatisfied'] = tafe_dissed_s.copy()
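A side note on API churn: DataFrame.applymap was deprecated in pandas 2.1 in favour of DataFrame.map. On a recent pandas version, the transformation above would be written as follows (a sketch, assuming pandas >= 2.1) -
# Element-wise mapping; DataFrame.map replaces the deprecated applymap
temp_df = tafe_dissed_df.map(update_vals)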
Adding an institute column
Before combining the data, a column named institute needs to be created in the DETE dataset, with 'DETE' as the value in each cell. (Note that the TAFE dataset already has an institute column; nevertheless, each cell of that column will be rewritten as 'TAFE'.)
dete_resignations['institute'] = 'DETE'
tafe_resignations['institute'] = 'TAFE'
The combined
dataset has 651 rows and 52 columns -
combined = pd.concat([dete_resignations,tafe_resignations], ignore_index = True)
combined
Assigning the NaN type to '-' values in the combined dataset
It is immediately observed that many cells in the combined dataset contain the value '-'. Although we can read this value as NULL, pandas can't - it treats such a cell as a plain str. Before proceeding, it is best to assign the NULL or NaN type to all such cells in the combined dataset.
The updated combined dataset is named combined_up -
# Replace literal '-' strings with NaN; leave everything else untouched
def update_nan(element):
    if element == '-':
        return np.nan
    else:
        return element
combined_up = combined.applymap(update_nan)
combined_up
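For what it's worth, the same substitution can be done without a custom function - pandas' built-in replace handles it in one line (an equivalent alternative, noted as my addition) -
# Replace every literal '-' cell with NaN in a single call
combined_up = combined.replace('-', np.nan)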
Dropping sparse columns from the combined dataset
On doing a NULL count on the combined_up dataset, it can be clearly observed that out of 52 columns, many consist mostly of NaN (NULL) values. As most of them aren't required for our analysis anyway, it is best to further de-clutter the dataset and drop any columns that have fewer than 500 non-null values.
By doing this, I am able to bring down the number of columns from 52 to the 10 relevant ones, without compromising on any useful data.
combined_up.isnull().sum()
combined_updated = combined_up.dropna(thresh=500, axis =1).copy()
combined_updated
combined_updated.isnull().sum()
Cleaning the institute_service (years of service) column
Recalling the objective, we need an initial analysis that depicts whether employees with few years of experience were more likely to resign due to dissatisfaction than those with more experience.
Thus, the institute_service column, which depicts the employees' years of service, needs to be cleaned next. But there's a slight hiccup - the institute_service column contains data in different forms: categorical values (such as 'Less than 1 year' and 'More than 20 years') as well as numerical values.
Since we have both categorical and numerical values in the column, it is best to map them all into a common set of categories for the analysis ahead. A slightly modified definition of "years of experience" is categorised as follows -
- New: less than 3 years of service
- Experienced: 3-6 years of service
- Established: 7-10 years of service
- Veteran: more than 10 years of service
yrsofservice = combined_updated['institute_service'].copy()
yrsofservice = yrsofservice.astype(str)
yrsofservice.value_counts().sort_index()
temp = yrsofservice.copy()
# Map a (stringified) years-of-service value to an experience category
def category_of_service(element):
    if element == 'nan':
        return np.nan
    elif 'Less than 1 year' in element:
        return 'New'
    elif 'More than 20 years' in element:
        return 'Veteran'
    elif int(element) < 3:
        return 'New'
    elif (int(element) >= 3) & (int(element) <= 6):
        return 'Experienced'
    elif (int(element) >= 7) & (int(element) <= 10):
        return 'Established'
    else:
        return 'Veteran'
# Keep only the leading number: '5.0' -> '5', '11-20' -> '11'
temp = temp.str.split('.').str[0].str.split('-').str[0]
service_cat = temp.apply(category_of_service)
service_cat
service_cat.value_counts().sort_index()
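To double-check the mapping (my addition), the raw service strings can be grouped by their assigned category and eyeballed -
# Which raw institute_service values ended up in each category
yrsofservice.groupby(service_cat).unique()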
Creating the service_cat column
On the basis of the years of experience, a new column called service_cat is created next in the combined_updated dataset.
After that, we have a look at how many NaN values there are in the dissatisfied column.
Since there are only 8 such values, we assign them the Boolean False (there are 403 False values as compared to 240 True values, and excluding or including these rows won't have a significant impact on the overall analysis due to their sheer low count).
combined_updated['service_cat'] = service_cat.copy()
combined_updated
# Rows where 'dissatisfied' is neither True nor False (i.e. NaN)
combined_updated[(combined_updated['dissatisfied']!=False)&(combined_updated['dissatisfied']!=True)]
combined_updated['dissatisfied'] = combined_updated['dissatisfied'].fillna(False)
# To check if there aren't any NaN values left in 'dissatisfied' column
combined_updated['dissatisfied'].value_counts(dropna = False)
After all this cleaning, I am finally ready to perform my analysis and (hopefully) arrive at some meaningful insights.
Using a straightforward bar graph to compare the percentage of resignations that were due to some sort of dissatisfaction of the exiting employee (since dissatisfied is Boolean, the pivot table's default mean aggregation yields exactly this proportion), the following significant findings are observed -
- New employees (0-3 years of experience) are the least likely to resign due to dissatisfaction.
- Established (7-10 yrs) and Veteran (11+ yrs) employees are the most likely to resign due to dissatisfaction.
combined_updated.pivot_table(values='dissatisfied', index = 'service_cat')
%matplotlib inline
combined_updated.pivot_table(values='dissatisfied', index = 'service_cat').plot.bar()
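One presentational tweak of mine (not in the original): the pivot table sorts the categories alphabetically (Established, Experienced, New, Veteran), so reindexing into the natural experience order makes the chart easier to read -
# Reorder the bars from least to most experienced before plotting
order = ['New', 'Experienced', 'Established', 'Veteran']
combined_updated.pivot_table(values='dissatisfied', index = 'service_cat').reindex(order).plot.bar()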
Although my imaginary stakeholders asked me to carry out a combined analysis for both the institutes, it is wise to know which institute contributes more to the dissatisfaction-driven resignations in our data.
Here, again, I use a straightforward bar chart to compare the percentage of resignations due to dissatisfaction. The findings are significant -
combined_updated.pivot_table(values='dissatisfied', index = 'institute').plot.bar()
Here again, I use a straightforward bar chart to compare the percentage of resignations due to dissatisfaction, this time across employment types. The findings are significant -
Bottom Line - Permanent employees are the most dissatisfied with their jobs. This is something that the institutes should look into.
combined_updated['employment_status'].value_counts(dropna = False)
combined_updated.pivot_table(values='dissatisfied', index = 'employment_status').plot.bar()
Again, I use a straightforward bar chart to compare the percentage of resignations due to dissatisfaction across genders. The findings this time are NOT significant -
combined_updated['gender'].value_counts(dropna = False)
combined_updated.pivot_table(values='dissatisfied', index = 'gender').plot.bar()
Again, I use a straightforward bar chart to compare the percentage of resignations due to dissatisfaction per job position.
Here, I divide the job positions into 3 main categories -
- Administration Position (287 rows)
- Teaching Position (156 rows)
- Others (155 rows)
The findings are significant -
- A notable share of resignations from Administration job roles are due to dissatisfaction.
- A notable share of resignations from Teaching job roles are due to dissatisfaction.
combined_updated['position'].value_counts(dropna = False)
# Bucket each raw position value into one of the 3 categories above
def position_cat_func(element):
    if pd.isnull(element):
        return np.nan
    elif 'Admin' in element:
        return 'Administration Position'
    elif 'Teach' in element:
        return 'Teaching Position'
    else:
        return 'Others'
temp_df = combined_updated.copy()
temp_df['position_cat'] = temp_df['position'].apply(position_cat_func)
# Plotting on Bar Chart
temp_df.pivot_table(values='dissatisfied', index = 'position_cat').plot.bar()
Again, I use a straightforward bar chart to compare the percentage of resignations due to dissatisfaction per age group.
Here, I divide the ages into the following categories -
- < 20
- 21 to 30
- 31 to 40
- 41 to 50
- 51-55
- > 56
The findings are significant -
# Classify a raw age value into one of the age categories above
def age_cat_classifier(element):
    if pd.isnull(element):
        return np.nan
    elif '20 or younger' in element:
        return '< 20'
    elif ('56 or' in element) | ('56-60' in element) | ('61 or' in element):
        return '> 56'
    elif '51' in element:
        # Match '51-55' specifically; checking for just '5' would also
        # wrongly capture ranges such as '21-25' or '41-45'
        return '51-55'
    else:
        # Collapse five-year bands into decades, e.g. '21-25' -> '21 to 30'
        return element[0] + '1 to ' + str(int(element[0])+1) + '0'
# Add a new column to temp_df to capture age-category ('age_cat')
temp_df['age_cat'] = temp_df['age'].apply(age_cat_classifier)
temp_df['age_cat'].value_counts()
temp_df.pivot_table(values='dissatisfied', index = 'age_cat').plot.bar()
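The same reordering trick applies here (my tweak): '< 20' and '> 56' don't sort where you'd expect alphabetically, so reindexing puts the age bands in their natural order -
# Plot the age bands from youngest to oldest
age_order = ['< 20', '21 to 30', '31 to 40', '41 to 50', '51-55', '> 56']
temp_df.pivot_table(values='dissatisfied', index = 'age_cat').reindex(age_order).plot.bar()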
Thanks!