#!/usr/bin/env python
# coding: utf-8

# # Cleaning and Analyzing Employee Exit Surveys
#
# In the code below, we will:
# - work with exit surveys from employees of the Department of Education, Training and Employment (DETE) and the Technical and Further Education (TAFE) institute in Queensland, Australia. We want to know whether employees who worked for the institutes for only a short period of time are resigning due to some kind of dissatisfaction, and whether the same is true for employees who worked there for a long time. We also want to know whether both younger and older employees are resigning due to some kind of dissatisfaction.

# In[1]:

import pandas as pd  # Import the pandas library
import numpy as np  # Import the numpy library

# In[2]:

dete_survey = pd.read_csv('dete_survey.csv', index_col=0)  # Read dete_survey.csv into a pandas dataframe

# In[3]:

dete_survey

# In[4]:

tafe_survey = pd.read_csv('tafe_survey.csv', index_col=0)  # Read tafe_survey.csv into a pandas dataframe

# In[5]:

tafe_survey

# In[6]:

dete_survey.info()  # Prints information for the dete_survey dataframe such as the column names, their types, and how many non-null values each column has
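# - As a quick aside, the loading-and-inspection pattern above can be sketched on a tiny synthetic dataframe (the data below is made up for illustration, not taken from the real survey files):

```python
import io

import pandas as pd

# A toy CSV standing in for the real survey files (synthetic data,
# not actual DETE/TAFE responses)
csv_text = """ID,SeparationType,Cease Date
1,Resignation,2012
2,Age Retirement,Not Stated
3,Resignation,2013
"""

toy = pd.read_csv(io.StringIO(csv_text), index_col=0)  # first column becomes the index
toy.info()                 # column names, dtypes, and non-null counts
print(toy.isnull().sum())  # per-column count of missing values: all zero here,
                           # because 'Not Stated' is still just a string
```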
# In[7]:

tafe_survey.info()  # Prints the same information for the tafe_survey dataframe

# In[8]:

dete_survey.head()  # Prints the first five rows of the dete_survey dataframe

# In[9]:

tafe_survey.head()  # Prints the first five rows of the tafe_survey dataframe

# In[10]:

Separation = dete_survey['SeparationType']  # Selects only the SeparationType column and stores it in the Separation variable

# In[11]:

Separation.value_counts()  # Prints the number of times each value shows up in the SeparationType column

# In[12]:

Cease_date = dete_survey['Cease Date']  # Selects only the Cease Date column

# In[13]:

Cease_date  # Prints out the values of the Cease_date column

# In[14]:

Cease_date.value_counts()  # Prints the number of times each value shows up in the Cease Date column

# In[15]:

ceasing_employment = tafe_survey['Reason for ceasing employment']  # Selects only the Reason for ceasing employment column and stores it in a variable called ceasing_employment

# In[16]:

ceasing_employment  # Prints out the values of the ceasing_employment column

# In[17]:

ceasing_employment.value_counts()  # Prints out the number of times each value shows up in the Reason for ceasing employment column

# In[18]:

tafe_survey.isnull().sum()  # Counts the null values in each column of the dataframe

# In[19]:

dete_survey.isnull().sum()  # Counts the null values in each column of the dataframe

# - We imported the pandas and numpy libraries and saw that the columns in both dataframes have a lot of missing values, which we can fix. There are a lot of columns in both dataframes, but we only want to focus on a handful from each. We also printed the value counts for some of the columns in the datasets.

# In[20]:

dete_survey = pd.read_csv('dete_survey.csv', na_values='Not Stated', index_col=0)  # 'Not Stated' read in as NaN.
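# - The effect of the `na_values` argument can be sketched on made-up data (the values below are invented for illustration):

```python
import io

import pandas as pd

csv_text = """ID,Cease Date
1,2012
2,Not Stated
"""  # made-up data, not the real survey file

# Without na_values, the placeholder stays an ordinary string
raw = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(raw['Cease Date'].isnull().sum())  # 0

# With na_values, pandas parses 'Not Stated' as NaN while reading
clean = pd.read_csv(io.StringIO(csv_text), na_values='Not Stated', index_col=0)
print(clean['Cease Date'].isnull().sum())  # 1
```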
# In short, any value that is 'Not Stated' is replaced with NaN

# In[21]:

dete_survey  # Prints out the updated dataframe

# In[22]:

dete_survey_updated = dete_survey.drop(dete_survey.columns[28:49], axis=1)  # Drops columns 28 through 48

# In[23]:

tafe_survey_updated = tafe_survey.drop(tafe_survey.columns[17:66], axis=1)  # Drops columns 17 through 65

# In[24]:

tafe_survey_updated

# - First, we dropped columns 28 through 48 of the dete_survey dataframe and stored the result in a variable called `dete_survey_updated`. Second, we dropped columns 17 through 65 of the tafe_survey dataframe and stored the result in a variable called `tafe_survey_updated`. We made these changes because both dataframes contain many columns that we do not need for our analysis. Also, each dataframe contains many of the same columns, but under different names. Lastly, there are multiple columns/answers that indicate an employee resigned because they were dissatisfied, and we want to avoid repetition.

# In[25]:

dete_survey_updated.columns = dete_survey_updated.columns.str.strip().str.lower().str.replace(' ', '_')  # Strips any trailing whitespace from the column names, makes them lowercase, and replaces the spaces with underscores. Note that the strip must come before the replace, otherwise trailing spaces would be turned into trailing underscores

# In[26]:

mapping = {'Record ID': 'id',
           'CESSATION YEAR': 'cease_date',
           'Reason for ceasing employment': 'separationtype',
           'Gender. What is your Gender': 'gender',
           'CurrentAge. Current Age': 'age',
           'Employment Type. Employment Type': 'employment_status',
           'Classification. Classification': 'position',
           'LengthofServiceOverall. Overall Length of Service at Institute (in years)': 'institute_service',
           'LengthofServiceCurrent. Length of Service at current workplace (in years)': 'role_service'}

# In[27]:

tafe_survey_updated.rename(columns=mapping, inplace=True)  # Uses the DataFrame.rename() method to change some of the column names to simpler names

# In[28]:

tafe_survey_updated.head()  # Prints the first five rows of the tafe_survey_updated dataframe

# In[29]:

dete_survey_updated.head()  # Prints the first five rows of the dete_survey_updated dataframe

# - First, I renamed some of the columns in the dete_survey_updated dataframe: I made the capitalized column names lowercase, removed any trailing whitespace, and replaced the spaces with underscores. I then used the `DataFrame.rename()` method to update some of the columns in the tafe_survey_updated dataframe. We did this because both dataframes contain many of the same columns under different names, and since we eventually want to combine them, we had to standardize the column names. I also printed the first five rows of each updated dataframe to make sure the changes look good.
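# - The column-name clean-up and renaming pattern above can be sketched on a toy dataframe (the messy headers below are invented to resemble the survey columns, not copied from them):

```python
import pandas as pd

# Toy dataframe with messy headers resembling the survey columns
toy = pd.DataFrame(columns=['Record ID', 'CESSATION YEAR ', 'Classification. Classification'])

# Strip trailing whitespace first, then lowercase, then replace spaces
# with underscores; stripping after the replace would leave trailing
# underscores behind
toy.columns = toy.columns.str.strip().str.lower().str.replace(' ', '_')
print(list(toy.columns))  # ['record_id', 'cessation_year', 'classification._classification']

# Targeted renames via a mapping dict, as with tafe_survey_updated
toy = toy.rename(columns={'classification._classification': 'position'})
print(list(toy.columns))  # ['record_id', 'cessation_year', 'position']
```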
# In[30]:

dete_survey_updated['separationtype'].value_counts()  # Counts the number of times each unique value shows up in the separationtype column

# In[31]:

tafe_survey_updated['separationtype'].value_counts()  # Counts the number of times each unique value shows up in the separationtype column

# In[32]:

dete_survey_updated = dete_survey_updated.copy()  # Copies the dataframe and assigns the copy back to avoid the SettingWithCopy warning

# In[33]:

tafe_survey_updated = tafe_survey_updated.copy()  # Copies the dataframe and assigns the copy back to avoid the SettingWithCopy warning

# In[34]:

tafe_resignations = tafe_survey_updated[tafe_survey_updated['separationtype'] == 'Resignation']  # Selects the rows of the tafe_survey_updated dataframe where the separationtype is Resignation

# In[35]:

tafe_resignations

# In[36]:

dete_resignations = dete_survey_updated[(dete_survey_updated['separationtype'] == 'Resignation-Other reasons') |
                                        (dete_survey_updated['separationtype'] == 'Resignation-Other employer') |
                                        (dete_survey_updated['separationtype'] == 'Resignation-Move overseas/interstate')]
# Selects the rows whose separation type is Resignation-Other reasons, Resignation-Other employer,
# or Resignation-Move overseas/interstate and stores them in the dete_resignations variable

# In[37]:

dete_resignations  # Prints out the filtered dete_survey dataframe

# - We reviewed the counts of the unique values in the separationtype column of the dete_survey_updated dataframe and in the renamed Reason for ceasing employment column of the tafe_survey_updated dataframe. After that, we copied both dataframes with the `DataFrame.copy()` method, assigning the result back, to avoid the SettingWithCopy warning (calling `copy()` without assigning the result would do nothing). For the tafe_survey_updated dataframe, we selected the rows that have a value of Resignation in the separationtype column and stored them in a variable called `tafe_resignations`.
# For the dete_survey_updated dataframe, we selected the rows that have a value of Resignation-Other reasons, Resignation-Other employer, or Resignation-Move overseas/interstate in the separationtype column and stored them in a variable called `dete_resignations`. We did this because if we look at the unique values in the `separationtype` columns of each dataframe, we'll see that each contains a couple of different separation types. We only want to analyze survey respondents who resigned, so their separationtype must contain the string `Resignation`.

# In[38]:

dete_resignations['cease_date'].value_counts()  # Views how many times each of the unique values in the cease_date column shows up

# In[39]:

pattern = r"([1-2][0-9][0-9][0-9])"  # Matches a four-digit year: the first digit can be 1 or 2 and the remaining three digits can be anything from 0 through 9
dete_resignations['cease_date'] = dete_resignations['cease_date'].str.extract(pattern)  # Extracts the years from the cease_date column using the Series.str.extract() method

# In[40]:

dete_resignations['cease_date'] = dete_resignations['cease_date'].astype('float')  # Converts the values in the cease_date column to floats

# In[41]:

dete_resignations  # Most up-to-date dataframe

# In[42]:

dete_resignations['cease_date'].value_counts()  # Number of times each unique value shows up in the cease_date column

# In[43]:

dete_resignations['dete_start_date'].value_counts().sort_index(ascending=True)  # Number of times each unique value shows up in the dete_start_date column

# In[44]:

tafe_resignations['cease_date'].value_counts().sort_index(ascending=True)  # Counts the number of times each unique value shows up in the cessation year column, in ascending order

# In[45]:

tafe_survey_updated.info()

# In[46]:

dete_resignations.info()

# In[47]:

get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt

boxplot1 = dete_resignations['dete_start_date'].plot(kind='box')  # Boxplot of the dete_start_date column in the dete_resignations dataframe

# In[48]:

boxplot2 = dete_resignations['cease_date'].plot(kind='box')  # Boxplot of the cease_date column in the dete_resignations dataframe

# In[49]:

boxplot3 = tafe_resignations['cease_date'].plot(kind='box')  # Boxplot of the cessation year

# - First, we counted the number of times each unique value in the `cease_date` column of the `dete_resignations` dataframe showed up. After that, we extracted the years from the `cease_date` column using vectorized string methods. Since `cease_date` is the last year of the person's employment and `dete_start_date` is the person's first year of employment, it wouldn't make sense to have years after the current date. Then we converted the column to a float. After that, we checked the number of times each unique value showed up in the `cease_date` column of the `tafe_resignations` dataframe. Then we made a couple of box-and-whisker plots, which show that each of the plotted columns has some outliers.

# In[50]:

tafe_resignations

# In[51]:

dete_resignations

# In[52]:

dete_resignations['institute_service'] = dete_resignations['cease_date'] - dete_resignations['dete_start_date']  # Creates a new column called institute_service, which is simply the cease date minus the DETE start date

# - We created a new column called `institute_service` for the `dete_resignations` dataframe. We did this because, in order to analyze both surveys together, we had to create an `institute_service` column in `dete_resignations` corresponding to the one in `tafe_resignations`. For reference, in the human resources field, the length of time an employee spent in a workplace is referred to as their years of service. This is the question we want to answer:
# - Are employees who have only worked for the institutes for a short period of time resigning due to some kind of dissatisfaction?
#   What about employees who have been at the job longer?

# In[53]:

tafe_resignations['Contributing Factors. Dissatisfaction'].value_counts().sort_index(ascending=True)  # Prints the number of times each unique value shows up in the Contributing Factors. Dissatisfaction column of the tafe_resignations dataframe

# In[54]:

tafe_resignations['Contributing Factors. Job Dissatisfaction'].value_counts().sort_index(ascending=True)  # Prints the number of times each unique value shows up in the Contributing Factors. Job Dissatisfaction column of the tafe_resignations dataframe

# In[55]:

def update_vals(value):  # A function called update_vals with one parameter called value
    if pd.isnull(value):  # Missing values stay missing
        return np.nan
    if value == '-':  # '-' means the factor was not selected
        return False
    else:  # Any other value means the factor was selected
        return True

# In[56]:

tafe_resignations[['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction']] = tafe_resignations[['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction']].applymap(update_vals)
# Applies update_vals element-wise to the two dissatisfaction factor columns of the
# tafe_resignations dataframe, turning each value into True, False, or NaN, and assigns
# the result back. Without the assignment, applymap would return a new dataframe and
# the original columns would be left unchanged

# In[57]:

tafe_resignations['dissatisfied'] = tafe_resignations[['Contributing Factors. Dissatisfaction', 'Contributing Factors. Job Dissatisfaction']].any(axis=1, skipna=False)
# Creates a new column called dissatisfied for the tafe_resignations dataframe.
# A row is True if the employee selected at least one of the dissatisfaction factors,
# and False if they selected neither

# In[58]:

dete_resignations['dissatisfied'] = dete_resignations[['job_dissatisfaction', 'dissatisfaction_with_the_department', 'physical_work_environment', 'lack_of_recognition', 'lack_of_job_security', 'work_location', 'employment_conditions', 'work_life_balance', 'workload']].any(axis=1, skipna=False)
# Creates a new column called dissatisfied for the dete_resignations dataframe.
# A row is True if the employee selected at least one of the dissatisfaction-related factors,
# and False if they selected none of them

# In[59]:

tafe_resignations_up = tafe_resignations.copy()  # Copies the dataframe to avoid a SettingWithCopy warning

# In[60]:

tafe_resignations_up

# In[61]:

dete_resignations_up = dete_resignations.copy()  # Copies the dataframe to avoid a SettingWithCopy warning

# In[62]:

dete_resignations_up

# - First, we counted the number of times each unique value shows up in the `Contributing Factors. Dissatisfaction` and `Contributing Factors. Job Dissatisfaction` columns of the `tafe_resignations` dataframe. After that, we made a function called `update_vals()` with a parameter called `value` that does the following:
#     - If the value is `NaN`, return `np.nan`
#     - If the value is `-`, return `False`
#     - For any other value, return `True`
# - After that, we used the `DataFrame.applymap()` method on the factor columns of the `tafe_resignations` dataframe to apply `update_vals()` (the `dete_resignations` factor columns already hold booleans). Then we used the `DataFrame.any()` method to make a new column called `dissatisfied`.
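# - The recoding-and-flagging pattern above (`update_vals` plus `DataFrame.any()`) can be sketched on made-up factor columns (the column names and answers below are invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'factor_a': ['-', 'Job Dissatisfaction', 'Workload'],
                    'factor_b': ['-', '-', 'Workload']})  # made-up data

def update_vals(value):
    # '-' means the factor was not selected; any other answer means it was
    if pd.isnull(value):
        return np.nan
    if value == '-':
        return False
    return True

# Recode every cell to True/False/NaN, then flag a row as dissatisfied if ANY
# of its factor columns is True (note: applymap was renamed DataFrame.map in
# pandas 2.1+, where applymap still works but emits a deprecation warning)
flags = toy[['factor_a', 'factor_b']].applymap(update_vals)
toy['dissatisfied'] = flags.any(axis=1, skipna=False)
print(toy['dissatisfied'].tolist())  # [False, True, True]
```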
# Then we copied both dataframes using the `DataFrame.copy()` method to avoid the SettingWithCopy warning.

# In[63]:

dete_resignations_up['institute'] = 'DETE'  # Creates a new column called institute and gives each row the value DETE

# In[64]:

tafe_resignations_up['institute'] = 'TAFE'  # Creates a new column called institute and gives each row the value TAFE

# In[65]:

combined = pd.concat([dete_resignations_up, tafe_resignations_up])  # Stacks the dete_resignations_up and tafe_resignations_up dataframes vertically (row-wise)

# In[66]:

combined

# In[67]:

combined_updated = combined.dropna(thresh=500, axis=1)  # Drops any columns that have fewer than 500 non-null values

# In[68]:

combined_updated

# - First, we created a column called `institute` for the `dete_resignations_up` dataframe and gave each row the value `DETE` in that column. Then we created a column called `institute` for the `tafe_resignations_up` dataframe and gave each row the value `TAFE`. Then we combined both datasets and stored the result in a variable called `combined`. Then we dropped any columns that have fewer than 500 non-null values using the `DataFrame.dropna()` method.
# We used the `thresh` parameter and stored the result in a variable called `combined_updated`.

# In[69]:

combined_updated['institute_service'] = combined_updated['institute_service'].astype('str')  # Converts the values in the institute_service column to strings

# In[70]:

combined_updated

# In[71]:

pattern = r'(\d+)'  # The pattern we look for when extracting the years of service from the institute_service column: the first run of one or more digits

# In[72]:

combined_updated['institute_service'] = combined_updated['institute_service'].str.extract(pattern)  # Extracts the years of service using the pattern stored in the pattern variable

# In[73]:

combined_updated['institute_service'] = combined_updated['institute_service'].astype('float')  # Converts the extracted values to floats

# In[74]:

def update_vals(value):  # A function called update_vals() that maps each years-of-service value to one of the career stages listed in the instructions
    if pd.isnull(value):  # If the value is NaN, return NaN
        return np.nan
    if value < 3:  # Less than 3 years: New
        return 'New'
    if value >= 3 and value <= 6:  # Between 3 and 6 years: Experienced
        return 'Experienced'
    if value >= 7 and value <= 10:  # Between 7 and 10 years: Established
        return 'Established'
    else:  # 11 or more years: Veteran
        return 'Veteran'

# In[75]:

combined_updated['institute_service'].value_counts()

# In[76]:

combined_updated['service_cat'] = combined_updated['institute_service'].apply(update_vals)  # Creates a new column called service_cat, which is essentially the career stage each employee is in: New, Experienced, Established, or Veteran

# In[77]:

combined_updated

# - We first used the `Series.astype()` method to change the type of the `institute_service` column to `str`. After that, we used vectorized string methods to extract the years of service from the `institute_service` column.
# For the `pattern` variable, we used the value `r'(\d+)'` with the `Series.str.extract()` method. After that, we used the `Series.astype()` method to change the column to a `float` type. Then we mapped each value to one of the career stages listed in the instructions: `New`, `Experienced`, `Established`, or `Veteran`. We created a function called `update_vals()` with a parameter called `value`. If the value is null, it returns `NaN`. If it is less than `3`, it returns `New`. If it is between `3` and `6` years, it returns `Experienced`. If it is between `7` and `10` years, it returns `Established`. Lastly, if the value is `11` or more years, it returns `Veteran`. We then used the `Series.apply()` method to apply the function to the `institute_service` column and stored the result in a new column called `service_cat`.

# In[78]:

combined_updated['service_cat'].value_counts()  # Counts the number of times each value in the service_cat column appears

# In[79]:

combined_updated['dissatisfied']

# In[80]:

combined_updated['dissatisfied'].value_counts(dropna=False)  # Prints the number of times each value in the dissatisfied column shows up, including missing values

# In[81]:

combined_updated

# In[82]:

combined_updated['dissatisfied'].value_counts(dropna=False)

# In[83]:

combined_updated['dissatisfied'] = combined_updated['dissatisfied'].replace('-', np.nan)  # Replaces any '-' value in the dissatisfied column with NaN

# In[84]:

combined_updated['dissatisfied'].isna().sum()  # Counts the number of missing values in the dissatisfied column

# In[85]:

combined_updated['dissatisfied'] = combined_updated['dissatisfied'].fillna(False)  # Fills in the missing values with False

# In[86]:

combined_updated

# In[87]:

combined_updated['dissatisfied'].value_counts()

# In[88]:
combined_updated['dissatisfied'].isna().sum()  # Confirms that there are no longer any missing values in the dissatisfied column

# In[89]:

combined_updated

# In[90]:

combined_updated = pd.pivot_table(combined_updated, values='dissatisfied', index=['service_cat'])  # Aggregates the dissatisfied column by service_cat; pivot_table defaults to the mean, so averaging the boolean column gives the proportion of dissatisfied resignations in each career stage
combined_updated
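# - The final aggregation relies on `pd.pivot_table()` defaulting to the mean, so averaging the boolean `dissatisfied` column gives the share of dissatisfied resignations per category. A minimal sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'service_cat': ['New', 'New', 'Veteran', 'Veteran', 'Veteran'],
                    'dissatisfied': [True, False, True, True, False]})  # made-up data

# pivot_table aggregates with the mean by default, and the mean of a boolean
# column is the fraction of True values in each group
rates = pd.pivot_table(toy, values='dissatisfied', index=['service_cat'])
print(rates)
print(float(rates.loc['New', 'dissatisfied']))      # 0.5
print(float(rates.loc['Veteran', 'dissatisfied']))  # ~0.667
```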