#!/usr/bin/env python
# coding: utf-8

# # Clean and Analyze Employee Exit Surveys
# 
# In this project, we'll work with exit surveys from employees of the [Department of Education, Training and Employment](https://en.wikipedia.org/wiki/Department_of_Education_and_Training_(Queensland)) (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. The TAFE exit survey can be finded [here](https://data.gov.au/dataset/ds-qld-89970a3b-182b-41ea-aea2-6f9f17b5907e/details?q=exit%20survey) and the survey for the DETE [here](https://data.gov.au/dataset/ds-qld-fe96ff30-d157-4a81-851d-215f2a0fe26d/details?q=exit%20survey).
# 
# The aim of this project is to explore, transform and clean these datasets to answer the following questiongs to stakeholders:
# - Are employees who only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction? What about employees who have been there longer?
# - Are younger employees resigning due to some kind of dissatisfaction? What about older employees?
# 
# For this project, we'll use our general knowledge to define the columns.
# 
# Below is a preview of a couple columns we'll work with from the dete_survey.csv:
# 
# - ID: An id used to identify the participant of the survey
# - SeparationType: The reason why the person's employment ended
# - Cease Date: The year or month the person's employment ended
# - DETE Start Date: The year the person began employment with the DETE
# 
# Below is a preview of a couple columns we'll work with from the tafe_survey.csv:
# 
# - Record ID: An id used to identify the participant of the survey
# - Reason for ceasing employment: The reason why the person's employment ended
# - LengthofServiceOverall. Overall Length of Service at Institute (in years): The length of the person's employment (in years)
# 

# In[1]:


import pandas as pd
import numpy as np

dete_survey = pd.read_csv('dete_survey.csv')
tafe_survey = pd.read_csv('tafe_survey.csv')


# ## Datasets Description
# In this section, we will briefly present the description of both datasets.
# 
# **Dete Survey Description**

# In[2]:


dete_survey.info()


# In[3]:


dete_survey.head()


# In[4]:


dete_survey.isnull().sum()


# This information shows how the dataset contains 56 columns, most of which are string. A deeper analysis need to be done to select only the columns needed to achive our aim. 
# 
# Six columns (* Business Unit, Aboriginal, Torres Strait, South Sea, Disability, NESB*) have more than ~80% null values. Notice that, the dataset represent the missing values as *Not Stated*. 
# 
# Then, we will replace this value to *NaN* in order to the pandas library recognize it as null value.

# In[5]:


dete_survey = pd.read_csv('dete_survey.csv', na_values='Not Stated')


# In[6]:


dete_survey.isnull().sum()


# We can see how the number of missing values increase in some columns as *DETE Start Date, Role Start Date* and *Region*
# 
# **Tafe Survey Description**

# In[7]:


tafe_survey.info()


# In[8]:


tafe_survey.head()


# In[9]:


tafe_survey.isnull().sum()


# This information shows how the dataset contains 72 columns, most of which are string. We can see how the dataset contain many columns that we don't need to complete our analysis and many of them represent the same information that some columns from dete_survey dataset, but with different names.
# 
# There are multiple columns/answers that indicate an employee resigned because they were dissatisfied. 
# 
# Most of the columns have null values, but only a few of them have more than ~30% null values.
# 
# ## Cleaning the *dete_survey* and *tafe_survey*  datasets
# 
# **Identify Missing Values and Drop Unnecessary Columns**
# 
# Let's drop some columns from each dataframe that we won't use in our analysis to make the dataframes easier to work with. 

# In[10]:


dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis=1)
tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis=1)


# **Clean Column Names**
# 
# Let's turn the attention to the column names. Each dataframe contains many of the same columns, but the column names are different.
# 
# First, we will rename columns in the *dete_survey_updated* dataframe using the following criteria to update the column names:
# - Make all the capitalization lowercase.
# - Remove any trailing whitespace from the end of the strings.
# - Replace spaces with underscores ('_').

# In[11]:


dete_survey_updated.columns = dete_survey_updated.columns.str.replace(' ', '_').str.strip().str.lower()
dete_survey_updated.columns


# Second, we will rename columns in the tafe_survey_updated dataframe as follows:
# - 'Record ID': 'id'
# - 'CESSATION YEAR': 'cease_date'
# - 'Reason for ceasing employment': 'separationtype'
# - 'Gender. What is your Gender?': 'gender'
# - 'CurrentAge. Current Age': 'age'
# - 'Employment Type. Employment Type': 'employment_status'
# - 'Classification. Classification': 'position'
# - 'LengthofServiceOverall. Overall Length of Service at -Institute (in years)': 'institute_service'
# - 'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'
# 
# The columns renamed are the one that we will use in our analysis.

# In[12]:


tafe_survey_updated.rename(columns = { 'Record ID': 'id', 'CESSATION YEAR': 'cease_date', 'Reason for ceasing employment':'separationtype', 'Gender. What is your Gender?':'gender', 'CurrentAge. Current Age': 'age', 'Employment Type. Employment Type': 'employment_status', 'Classification. Classification': 'position', 'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service', 'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}, inplace=True)
tafe_survey_updated.columns


# **Filter the Data**
# 
# Let's remove more of the data we don't need. Since our goal is to analyse the employees who resigned, we will only consider the respondents  that contains the string  *'Resignation'* in the *'separationtype*' column.

# In[13]:


dete_survey_updated['separationtype'].value_counts()


# In[14]:


pattern_resignation = r"Resignation"
dete_resignations = dete_survey_updated.loc[dete_survey_updated['separationtype'].str.contains(pattern_resignation)].copy()
print(dete_resignations.shape)
dete_resignations['separationtype'].value_counts()


# In[15]:


tafe_survey_updated['separationtype'].value_counts()


# In[16]:


tafe_resignations = tafe_survey_updated.loc[tafe_survey_updated['separationtype'] == 'Resignation'].copy()
print(tafe_resignations.shape)
tafe_resignations['separationtype'].value_counts()


# In the last steps, we have saved in the *dete_resignations* and *tafe_resignations* dataframes the rows who represent the resignations from dete_survey_updated and tafe_survey_updated datasets, respectively.
# 
# **Verify the Data**
# 
# We'll focus on verifying that the years in the *cease_date* and *dete_start_date* columns make sense. We will check the following:
# - Since the cease_date is the last year of the person's employment and the dete_start_date is the person's first year of employment, it wouldn't make sense to have years after the current date.
# - Given that most people in this field start working in their 20s, it's also unlikely that the dete_start_date was before the year 1940.
# 

# In[17]:


dete_resignations['cease_date'].value_counts()


# As can be seen, we need to clean the *cease_date* column in the dete_resignations dataset.

# In[18]:


pattern_year = r"(?P<Years>[1-2][0-9]{3})"
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.extract(pattern_year).astype(float)
dete_resignations['cease_date'].value_counts()


# In[19]:


dete_resignations['dete_start_date'].value_counts().sort_index()


# In[20]:


tafe_resignations['cease_date'].value_counts().sort_index()


# We will plot the values of these columns with a boxplot to identify any values that look wrong.

# In[57]:


import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
import seaborn as sns

df_columns_plot = dete_resignations[['dete_start_date','cease_date']]
fig = plt.figure(figsize=(15, 3))
ax = fig.add_subplot(1,2,1)
df_columns_plot.boxplot(ax=ax)
ax.spines["right"].set_visible(False)    
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)    
ax.spines["bottom"].set_visible(False)
ax.set_ylim(1960,2016)
ax.set_yticks([1960,2016])
ax.grid(False)
ax.set_title('Distribution of dete_start_date and cease_date in the dete_resignations dataset')

ax = fig.add_subplot(1,2,2)
tafe_resignations['cease_date'].plot.box(ax=ax)
ax.spines["right"].set_visible(False)    
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)    
ax.spines["bottom"].set_visible(False)
ax.set_ylim(1960,2016)
ax.set_yticks([1960,2016])
ax.set_title('Distribution of cease_date in the tafe_resignations dataset')


# We can see how the years's values are logical. However, there are some outliers values from the *dete_start_date* column.
# 
# ## Transforming the datasets
# 
# **Create column to represent years of service**
# 
# To get all the information needed to achive our goals, we need to create a new column that represents the length of time an employee spent in a workplace, called years of service in the Human Resources field. 
# 
# Notice that the *tafe_resignations* dataframe already contains a "service" column, which we renamed to *institute_service*. In order to analyze both surveys together, we'll have to create a corresponding institute_service column in dete_resignations. To do so, we will create an institute_service column in dete_resignations, substracting the *dete_start_date* from *the cease_date*. 
# 

# In[22]:


dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']
print(dete_resignations[['dete_start_date', 'cease_date', 'institute_service']].head())
dete_resignations['institute_service'].value_counts(dropna=False)
tafe_resignations.info()


# **Identify Dissatisfied Employees**
# 
# We'll identify any employees who resigned because they were dissatisfied.
# 
# Below are the columns we'll use to categorize employees as "dissatisfied" from each dataframe. 
# 
# *tafe_survey_updated*:
# - Contributing Factors. Dissatisfaction
# - Contributing Factors. Job Dissatisfaction
# 
# *dete_survey_updated*:
# - job_dissatisfaction
# - dissatisfaction_with_the_department
# - physical_work_environment
# - lack_of_recognition
# - lack_of_job_security
# - work_location
# - employment_conditions
# - work_life_balance
# - workload
# 
# If the employee indicated any of the factors above caused them to resign, we'll mark them as dissatisfied in a new column.
# 
# To create the new column, we'll do the following:
# 
# 1. Convert the values in the 'Contributing Factors. Dissatisfaction' and 'Contributing Factors. Job Dissatisfaction' columns in the tafe_resignations dataframe to True, False, or NaN values.
# 2. If any of the columns listed above contain a True value, we'll add a True value to a new column named dissatisfied.

# In[23]:


tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts(dropna=False)


# In[24]:


tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts(dropna=False)


# In[25]:


def update_vals(element):
    if pd.isnull(element):
        return np.nan
    if element == '-':
        return False
    else:
        return True
factor_dissatisfaction_tafe = tafe_resignations[['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction']].applymap(update_vals)
tafe_resignations['dissatisfied'] = factor_dissatisfaction_tafe.any(axis=1, skipna=False)
tafe_resignations['dissatisfied'].value_counts(dropna=False)


# In[26]:


factor_dissatisfaction_dete = dete_resignations[['job_dissatisfaction', 'dissatisfaction_with_the_department', 'physical_work_environment','lack_of_recognition','lack_of_job_security','work_location','employment_conditions','work_life_balance','workload']]
dete_resignations['dissatisfied'] = factor_dissatisfaction_dete.any(axis=1, skipna=False)
dete_resignations['dissatisfied'].value_counts(dropna=False)


# In[27]:


dete_resignations_up = dete_resignations.copy()
tafe_resignations_up = tafe_resignations.copy()


# ## Combine the datasets
# 
# In the last steps, we have transformed the dataframes to be ready to combine them. 
# 
# First, let's add a column to each dataframe that will allow us to easily distinguish between the two.

# In[28]:


dete_resignations_up['institute'] = 'DETE'
tafe_resignations_up['institute'] = 'TAFE'
combined = pd.concat([dete_resignations_up, tafe_resignations_up], ignore_index = True, sort=True)
combined.shape


# We still have some columns left in the dataframe that we don't need to complete our analysis. For this reason, we will drop any columns with less than 500 non null values.

# In[29]:


combined_updated = combined.dropna(thresh=500, axis=1).copy()
combined_updated.shape


# We can see how the number of columns is reduced to 10.

# In[30]:


combined_updated.head()


# **Clean the *institute_service* column**
# 
# We'll have to clean up the institute_service column, since it currently contains values in a couple different forms

# In[31]:


combined_updated['institute_service'].value_counts(dropna=False)


# To analyze the data, we'll convert these numbers into categories. We'll base our anlaysis on this [article](https://www.businesswire.com/news/home/20171108006002/en/Age-Number-Engage-Employees-Career-Stage). 
# 
# We'll use the slightly modified definitions below:
# 
# - New: Less than 3 years at a company
# - Experienced: 3-6 years at a company
# - Established: 7-10 years at a company
# - Veteran: 11 or more years at a company
# 
# Let's categorize the values in the *institute_service* column using the definitions above.
# 
# First, we'll extract the years of service from each value in the institute_service column.
# 

# In[32]:


combined_updated['institute_service_up'] = combined_updated.loc[:,'institute_service'].astype('str').str.extract(r'(\d{1,2})').astype('float')


# Next, we'll map each value to one of the career stage definitions above store them at a new column *service_cat*

# In[33]:


def category(val):
    if pd.isnull(val):
        return np.nan
    elif val < 3:
        return 'New'
    elif 3 <= val <= 6:
        return 'Experienced'
    elif  7 <= val <= 10:
        return 'Established'
    elif  val >= 11:
        return 'Veteran'

combined_updated['service_cat'] = combined_updated['institute_service_up'].apply(category)    
combined_updated['service_cat'].value_counts(dropna = False)


# **Clean the *age* column**
# 
# We'll have to clean up the *age* column, since it currently contains values in a couple different forms

# In[34]:


combined_updated['age'].value_counts(dropna=False)


# To analyze the data, we'll convert these ages into categories.
# 
# We'll use the definitions below:
# 
# - Young: Less than 35 years old
# - Middle Age: 35-45 years old
# - Old: 46 or more 
# 
# Let's categorize the values in the *age* column using the definitions above.
# 
# First, we'll extract the years of each value in the *age* column.
# 

# In[44]:


combined_updated['age_up'] = combined_updated.loc[:,'age'].astype('str').str.extract(r'(\d{2})').astype('float')
combined_updated['age_up'].value_counts(dropna=False)


# Next, we'll map each value to one of the age categories defined above and store them at a new column *age_category*

# In[41]:


def age_category(val):
    if pd.isnull(val):
        return np.nan
    elif val < 35:
        return 'Young'
    elif 35 <= val <= 45:
        return 'Middle Age'
    elif  val >= 46:
        return 'Old'

combined_updated['age_category'] = combined_updated['age_up'].apply(age_category)    
combined_updated['age_category'].value_counts(dropna = False)


# **Fill missing values in  dissatisfied column**
# 
# Now, we will fill missing values in the dissatisfied column with the value that occurs most frequently in this column.

# In[37]:


combined_updated['dissatisfied'].value_counts(dropna = False)


# In[38]:


combined_updated['dissatisfied'] = combined_updated['dissatisfied'].fillna(False)
combined_updated['dissatisfied'].value_counts(dropna = False)


# ## Initial Analysis
# 
# **Percentage of dissatisfied employees in each *service_cat* group**
# 
# We will aggregate the dissatisfied column to calculate the percentage of dissatisfied employees in each *service_cat* group. Since a *True* value is considered to be 1, calculating the mean will also calculate the percentage of dissatisfied employees. 

# In[39]:


percentage_service_dissatisfied = combined_updated.pivot_table('dissatisfied','service_cat')
percentage_service_dissatisfied


# In[52]:


ax = percentage_service_dissatisfied.plot(kind='bar')
ax.set_title('Percentage of dissatisfied employees for each service_cat category')
ax.spines["right"].set_visible(False)    
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)    
ax.spines["bottom"].set_visible(False)
ax.tick_params(bottom=False, top=False, left=False, right=False, labelbottom=False)


# The figure shows how the higuer percentage of dissatisfied employees belongs to the *Established* and *Veteran* groups with values close to 50% of the employees. Then, the employees who have been longer period of time in the institues resigning due to some kind of dissatisfaction more than the one who have been shorter time.
# 
# **Percentage of dissatisfied employees in each age category group**
# 
# Let's agregate the dissatisfied column to calculate the percentage of dissatisfied employees in each *age* group.

# In[45]:


percentage_age_dissatisfied = combined_updated.pivot_table('dissatisfied','age_category')
percentage_age_dissatisfied


# In[53]:


ax = percentage_age_dissatisfied.plot(kind='bar')
ax.set_title('Percentage of dissatisfied employees for each age category')
ax.spines["right"].set_visible(False)    
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)    
ax.spines["bottom"].set_visible(False)
ax.tick_params(bottom=False, top=False, left=False, right=False, labelbottom=False)


# The figure shows how the higuer percentage of dissatisfied employees belongs to the older people. However, almost the 36% of the younger people resigning due to some kind of dissatisfaction.

# # Conclusion
# 
# In this project, we analyzed survey data from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Australia to analyse the characteristics of the employees who resigning due to some kind of dissatisfaction. We reached that older employees who spend more time in the institues present higher probabilities to resigning because dissatisfaction than younger people who work less time in these institues.
#