#!/usr/bin/env python
# coding: utf-8
# # **Investigate Medical Appointment Dataset**
# ### A person makes a doctor's appointment, receives all the instructions, and doesn't show up. Who's to blame?
#
# In this project, I analyse why some patients do not show up for their medical appointments, and whether the data we have points to any reasons for it.
# I look for correlations between the different attributes and whether the patient shows up. The dataset contains 110,527 medical appointments and their 14 associated variables (PatientId, AppointmentID, Gender, ScheduledDay, AppointmentDay, Age, Neighbourhood, Scholarship, Hypertension, Diabetes, Alcoholism, Handcap, SMS_received, No-show).
#
# ## Objectives
# ### Questions to answer
#
# * What is the percentage of no-show?
# * What factors are important to know in order to predict if a patient will show up for their scheduled appointment?
# * Is gender related to whether a patient will show or not?
# * Are patients with scholarship more likely to miss their appointment?
# * Are patients who don't receive an SMS more likely to miss their appointment?
# * Is the time difference between the scheduling and appointment related to whether a patient will show?
# * Does age affect whether a patient will show up or not?
# * What is the percentage of patients missing their appointments for every neighbourhood?
#
# ***
# ## Setup
# In[1]:
#importing needed modules
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
#choose plots style
sns.set_style('darkgrid')
#make sure plots are inline with the notebook
get_ipython().run_line_magic('matplotlib', 'inline')
# ## Data Wrangling
#
# ### Loading the dataset and checking the columns we have
#
# In[6]:
### Load the data and print out a few lines. Perform operations to inspect data
### types and look for instances of missing or possibly errant data.
df = pd.read_csv('data/no-showappointments.csv')
df.head()
# In[7]:
### Get the shape and types of our data
print(df.shape)
pd.DataFrame(df.dtypes)
# In[8]:
### Get some statistics about our data
df.describe()
# In[9]:
### Check if there are any missing values in our data
df.info()
df.isna().any()
# In[10]:
### Check if there are any duplicated rows in our data
df.duplicated().any()
#
# Notes on data exploration
#
# Some columns need their type corrected, in particular the date columns. The data has no duplicated rows and no missing values. The No-show column can be confusing, so we invert it to make it more intuitive (show instead of no-show) and convert it from yes/no to an integer.
#
# ### Data Cleaning
#
# * Drop irrelevant columns
# * Modify column names
# * Correct data types
# * Invert the no-show column into a show column with integer values
# * Create a new column for the difference in days between scheduling and appointment
#
# In[12]:
### Drop irrelevant columns
df.drop(['PatientId','AppointmentID'],axis=1,inplace=True)
df.head()
# In[15]:
### Change all column names to lower case and replace all - with _
df.columns=df.columns.str.lower().str.replace('-','_')
pd.DataFrame(df.columns)
# In[16]:
### Convert the date columns to datetime type
df['scheduledday']=pd.to_datetime(df['scheduledday'])
df['appointmentday']=pd.to_datetime(df['appointmentday'])
# In[17]:
### Turn no_show column to show
print(df.no_show.unique())
df.no_show=df.no_show.map({'No':1,'Yes':0})
df.rename(columns={'no_show':'show'},inplace=True)
print(df.show.unique())
df.head()
# In[18]:
### Create a new column for days difference between scheduling and appointment
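#normalize() drops the time of day but keeps a datetime dtype, so the .dt accessor works on the difference
day_diff=(df.appointmentday.dt.normalize()-df.scheduledday.dt.normalize()).dt.days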
df.insert(3,'day_diff',day_diff)
df.day_diff.dtype
# In[19]:
### Check data one last time
df.dtypes
# The data is now cleaned, every column has the proper type, and we have a new day-difference column. We can start analyzing the data and look for correlations between the different variables and the show column.
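#
# The cleaning steps above can be sketched end to end on a tiny hand-made table. This is a minimal sketch: the rows in `toy_raw` are invented for illustration and only mimic the original column names and value formats, and the names `toy_raw` and `toy_clean` are hypothetical. The last row shows how an appointment dated before its scheduling day produces a negative difference.
# In[ ]:
import pandas as pd

#hypothetical raw rows (invented values, not taken from the dataset)
toy_raw = pd.DataFrame({
    'ScheduledDay': ['2016-04-29T18:38:08Z', '2016-04-27T08:36:51Z', '2016-05-10T10:51:53Z'],
    'AppointmentDay': ['2016-04-29T00:00:00Z', '2016-05-04T00:00:00Z', '2016-05-09T00:00:00Z'],
    'No-show': ['No', 'Yes', 'Yes'],
})
toy_clean = toy_raw.copy()
#lower-case the column names and replace - with _
toy_clean.columns = toy_clean.columns.str.lower().str.replace('-', '_')
#parse both date columns
for col in ['scheduledday', 'appointmentday']:
    toy_clean[col] = pd.to_datetime(toy_clean[col])
#invert no_show into an integer show column (1 = showed up, 0 = missed)
toy_clean['show'] = toy_clean.pop('no_show').map({'No': 1, 'Yes': 0})
#difference in whole days, ignoring the time of day
toy_clean['day_diff'] = (toy_clean.appointmentday.dt.normalize()
                         - toy_clean.scheduledday.dt.normalize()).dt.days
print(toy_clean[['show', 'day_diff']])
#count rows whose appointment precedes its scheduling day
print('rows with negative day_diff:', (toy_clean.day_diff < 0).sum())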
#
# ## Exploratory Data Analysis
# In[20]:
#define function to get the ratio of show in different categories
def plot_rat(x):
    df.groupby(x).show.mean().plot(kind='bar',
                                   edgecolor='black',
                                   figsize=(14,8)).set_ylabel('Ratio of show');
    display(df.groupby(x)[['show']].mean())
#     plt.legend()
# ### What is the percentage of no-show?
# In[21]:
#get some statistics about our data
df.describe()
# In[24]:
# percentage of show and no show
print(f"percentage of patients who didn't show up for their appointment is { (1-df.show.mean())*100 } %" )
no_show=len(df[df.show==0])/len(df.show)
show=len(df[df.show==1])/len(df.show)
plt.bar(['show','no show'],[show*100,no_show*100],color=['g','r']);
plt.title('Percentage of patients showing up or missing their appointment ');
plt.ylabel('Percentage');
plt.xlabel('show or no-show');
display(df.groupby('show')[['show']].count())
# ### What factors are important for us to know in order to predict if a patient will show up for their scheduled appointment?
# In[28]:
#create filters for show and no-show
show=(df.show == 1)
no_show=(df.show == 0)
total_miss=len(df[no_show])
total=len(df)
# ### Is gender related to whether a patient will show or not?
# In[32]:
#get the number of patients missing their appointments by gender
no_show_gender=df[no_show]['gender'].value_counts()
no_show_gender.plot(kind='pie');
plt.title('patients who missed their appointment by gender');
print('missed appointments by gender, as a percentage of all appointments:')
#get the share of all appointments that were missed, split by gender
pd.DataFrame(no_show_gender*100/total)
# In[33]:
df.groupby(['gender','show']).size().unstack('gender').plot(kind='bar').set_ylabel('number of patients')
# Finding
#
# #### Females account for nearly twice as many missed appointments as males. This compares counts rather than rates, so on its own it does not show that females are more likely to miss an appointment; that requires the no-show rate within each gender.
#
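# The difference between a count and a rate can be sketched on a small invented table. This is a minimal sketch: `toy_gender`, `missed_counts` and `no_show_rate` are hypothetical names, and the values are made up so that one gender has twice as many appointments as the other. The real data may behave differently.
# In[ ]:
import pandas as pd

#hypothetical appointments: 6 for 'F' and 3 for 'M' (invented values)
toy_gender = pd.DataFrame({
    'gender': ['F'] * 6 + ['M'] * 3,
    'show':   [1, 1, 1, 1, 0, 0, 1, 1, 0],
})
#number of missed appointments per gender
missed_counts = toy_gender[toy_gender.show == 0].gender.value_counts()
#share of each gender's own appointments that were missed
no_show_rate = 1 - toy_gender.groupby('gender').show.mean()
print(missed_counts)
print(no_show_rate)
# In this toy table 'F' has twice as many missed appointments as 'M', yet both genders miss the same share of their own appointments.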
# ### Are patients with scholarships more likely to miss their appointment?
#
# In[37]:
#what is the percentage of patients missing their appointment by scholarship
plot_rat(df.scholarship)
plt.title('Ratio of show or no-show by scholarship')
# df.groupby('scholarship')[['show']].mean()
# Finding
#
# #### It seems that patients with scholarships are actually more likely to miss their appointment.
#
# ### Are patients who don't receive SMS more likely to miss their appointment?
# In[52]:
#what is the percentage of patient who attended their appointment by sms_received
plot_rat(df.sms_received)
plt.title('Ratio of show or no-show by sms_received');
# Finding
#
# #### A strange finding: patients who received an SMS appear more likely to miss their appointment.
#
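# One possible explanation, which is an assumption and not something this analysis confirms, is that SMS reminders go mostly to appointments booked in advance, and those appointments may be missed more often regardless of the SMS. A minimal sketch of how to check this by comparing show ratios within each booking type; `toy_sms` and its values are invented for illustration.
# In[ ]:
import pandas as pd

#hypothetical rows: same-day bookings never receive an SMS here (invented values)
toy_sms = pd.DataFrame({
    'day_diff':     [0, 0, 0, 0, 10, 10, 10, 10],
    'sms_received': [0, 0, 0, 0, 0, 0, 1, 1],
    'show':         [1, 1, 1, 1, 0, 1, 0, 1],
})
#overall show ratio by sms_received, as in the plot above
overall_by_sms = toy_sms.groupby('sms_received').show.mean()
#label each row as a same-day or an advance booking
toy_sms['booking'] = (toy_sms.day_diff == 0).map({True: 'same_day', False: 'advance'})
#show ratio by sms_received within each booking type
within_booking = toy_sms.groupby(['booking', 'sms_received']).show.mean()
print(overall_by_sms)
print(within_booking)
# In this toy table the overall show ratio is lower with an SMS, but among advance bookings the ratio is the same with and without one.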
# ### Is the time difference between the scheduling and appointment related to whether a patient will show?
# In[58]:
#filter for positive day difference
df1=df[df.day_diff>=0]
# df1.day_diff.unique()
#turn day diff into categorical column Day_diff2
bin_edges=[-1,0,4,15,179]
names=['sameday','fewdays','more_than_4','more_than_15']
df['day_diff2']=pd.cut(df1.day_diff,bin_edges,labels=names)
#filter for no-show records and count values for each category of day_diff2
no_show_day_diff=df[no_show].day_diff2.value_counts()/len(df[no_show])*100
no_show_day_diff.reindex(names).plot(kind='bar');
plt.title('proportion of time difference for no_show appointments');
plt.xlabel('days difference between scheduling and appointment');
plt.ylabel('Percentage of no_show appointments');
print('the proportion of each time difference among patients who missed their appointments:')
pd.DataFrame(no_show_day_diff)
# Finding
# It appears that the longer the period between the scheduling and appointment the more likely the patient won't show up.
#
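# The table above gives each category's share of all missed appointments. A complementary view is the no-show rate inside each category, which speaks more directly to how likely a patient is to miss. A minimal sketch on invented values, reusing the same bin edges and labels; `toy_wait`, `toy_edges`, `toy_labels` and `rate_by_bucket` are hypothetical names.
# In[ ]:
import pandas as pd

#hypothetical appointments (invented values)
toy_wait = pd.DataFrame({
    'day_diff': [0, 0, 0, 0, 2, 3, 10, 12, 30, 40],
    'show':     [1, 1, 1, 1, 1, 0, 1, 0, 0, 0],
})
toy_edges = [-1, 0, 4, 15, 179]
toy_labels = ['sameday', 'fewdays', 'more_than_4', 'more_than_15']
#turn the day difference into the same four categories
toy_wait['bucket'] = pd.cut(toy_wait.day_diff, toy_edges, labels=toy_labels)
#share of appointments missed within each category
rate_by_bucket = 1 - toy_wait.groupby('bucket', observed=False).show.mean()
print(rate_by_bucket)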
# ### Does age affect whether a patient will show up or not?
# In[59]:
#plot the histograms of age for patients who showed up and who didn't
df[show].age.hist(alpha=0.5,label='show')
df[no_show].age.hist(alpha=0.5,label='no_show')
plt.legend()
plt.xlabel('age')
plt.ylabel('number of patients')
plt.title('Histogram of age values for patients who showed up or missed their appointment')
#get the age statistics (including the mean) for patients who showed up and who didn't
df.groupby('show')[['age']].describe()
# Finding
#
# #### There is no strong relation between age and whether the patient shows up, although younger patients appear somewhat more likely to miss their appointments.
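#
# Overlaid histograms show counts, so a group with more patients looks larger whether or not it misses more often. A minimal sketch of comparing the show ratio per age group instead. The cut points in `age_edges` are an illustrative assumption, not a choice made in this analysis, and `toy_age` holds invented values. Ages outside the edges, such as a negative age, would get no group.
# In[ ]:
import pandas as pd

#hypothetical patients (invented values)
toy_age = pd.DataFrame({
    'age':  [5, 15, 25, 35, 45, 55, 65, 75],
    'show': [1, 0, 0, 1, 1, 1, 1, 1],
})
age_edges = [0, 18, 40, 60, 120]   #illustrative cut points (assumption)
age_labels = ['0-17', '18-39', '40-59', '60+']
#right=False makes each bin include its lower edge and exclude its upper edge
toy_age['age_group'] = pd.cut(toy_age.age, age_edges, labels=age_labels, right=False)
#ratio of show within each age group
show_by_age = toy_age.groupby('age_group', observed=False).show.mean()
print(show_by_age)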
# ### What is the percentage of patients missing their appointments for every neighborhood?
#
# In[62]:
#get the number of records for each neighbourhood
rec_neigh=df['neighbourhood'].value_counts()
#get the number of records for patients missing their appointments for each neighbourhood
rec_neigh_no_show=df[no_show].neighbourhood.value_counts()
#percentage of patients missing their appointments for every neighbourhood
rec_neigh_no_show_percentage=rec_neigh_no_show/rec_neigh*100
pd.DataFrame(rec_neigh_no_show_percentage.sort_values(axis=0, ascending=False))
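# A neighbourhood with very few appointments can reach an extreme percentage by chance, and dividing two value_counts results leaves NaN for a neighbourhood with no missed appointments. A minimal sketch that keeps the appointment count next to the percentage. The data in `toy_neigh` is invented and the threshold `min_appointments` is a hypothetical choice.
import pandas as pd

#hypothetical appointments (invented values)
toy_neigh = pd.DataFrame({
    'neighbourhood': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
    'show':          [1, 1, 0, 1, 0, 1, 1],
})
#number of appointments and percentage missed for every neighbourhood
neigh_summary = toy_neigh.groupby('neighbourhood').show.agg(
    appointments='count',
    no_show_pct=lambda s: (1 - s.mean()) * 100,
)
min_appointments = 2   #hypothetical threshold (assumption)
#keep only neighbourhoods with enough appointments to compare
reliable = neigh_summary[neigh_summary.appointments >= min_appointments]
print(reliable.sort_values('no_show_pct', ascending=False))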
#
#
# ## Conclusions
#
# #### After analyzing the dataset here are some findings:
#
# 1. Percentage of patients who didn't show up for their appointment is 20.19%.
# 2. Females account for nearly twice as many missed appointments as males. This compares counts rather than rates, so on its own it does not show that females are more likely to miss an appointment.
# 3. It appears that the longer the period between the scheduling and appointment the more likely the patient won't show up.
# 4. It seems that patients with scholarships are actually more likely to miss their appointment.
# 5. A strange finding: patients who received an SMS appear more likely to miss their appointment.
# 6. There is no strong relation between age and whether the patient shows up, although younger patients appear somewhat more likely to miss their appointments.
#
# #### Analysis Shortcoming & Data Limitations
#
# * The data doesn't state the exact hour of the appointment, which would have been very useful for finding out which hours have the most missed appointments and which don't. It would also give a finer time difference between scheduling and appointment, since many appointments are scheduled on the same day.
# * The data doesn't state whether a given day is a holiday, which could indicate whether people tend to miss their appointments more on working days.
# * The age column had a negative value but according to the data creator, it means a baby not born yet (a pregnant woman).
# * When calculating the day difference between the scheduling and appointment days we found some negative values, which make no sense and might mean that the records in question contain wrong data.
#
# ## Thanks to
# [Mostafa Abdelaleem](https://www.linkedin.com/in/mostafa-abdelaleem/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkQuickLabsMedicalAppointmentDataAnalysis30426296-2022-01-01)
#
# [Mridul Bhandari](https://www.linkedin.com/in/mridul-bhandari/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkQuickLabsMedicalAppointmentDataAnalysis30426296-2022-01-01)
#
# For inspiring me