Investigate Medical Appointment Dataset¶

A person takes a doctor's appointment, receives all the instructions, and no-show. Who's to blame?¶

In this project, I will try to analyse why would some patient not show up for their medical appointment or whether there are reasons for that using the data we have. I will try to find some correlation between the different attributes and whether the patient shows up or not. The dataset I'm going to use contains 110.527 medical appointments and its 14 associated variables ( PatientId, AppointmentID, Gender, ScheduledDay, AppointmentDay, Age, Neighbourhood, Scholarship, Hypertension, Diabetes, Alcoholism, Handcap', SMS_received, No-show )

Objectives¶

Questions to answer¶

What is the percentage of no-show?
What factors are important for to know in order to predict if a patient will show up for their scheduled appointment?
- Is the time gender related to whether a patient will show or not?
- Are patients with scholarship more likely to miss their appointment?
- Are patients who don't recieve sms more likely to miss their appointment?
- Is the time difference between the scheduling and appointment related to whether a patient will show?
- Does age affect whether a patient will show up or not?
- What is the percentage of patients missing their appointments for every neighbourhood

Setup¶

In [1]:

#importing needed modules
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
#choose plots style
sns.set_style('darkgrid')
#make sure plots are inline with the notebook
%matplotlib inline

Data Wrangling¶

Loading the dataset and checking the columns we have¶

In [6]:

## Loading the dataset and checking the columns we have

### Load data and print out a few lines. Perform operations to inspect data
### Types and look for instances of missing or possibly errant data.
df = pd.read_csv('data/no-showappointments.csv')
df.head()

Out[6]:

	PatientId	AppointmentID	Gender	ScheduledDay	AppointmentDay	Age	Neighbourhood	Hipertension	Diabetes	No-show
0	2.987250e+13	5642903	F	2016-04-29T18:38:08Z	2016-04-29T00:00:00Z	62	JARDIM DA PENHA	1	0	No
1	5.589978e+14	5642503	M	2016-04-29T16:08:27Z	2016-04-29T00:00:00Z	56	JARDIM DA PENHA	0	0	No
2	4.262962e+12	5642549	F	2016-04-29T16:19:04Z	2016-04-29T00:00:00Z	62	MATA DA PRAIA	0	0	No
3	8.679512e+11	5642828	F	2016-04-29T17:29:31Z	2016-04-29T00:00:00Z	8	PONTAL DE CAMBURI	0	0	No
4	8.841186e+12	5642494	F	2016-04-29T16:07:23Z	2016-04-29T00:00:00Z	56	JARDIM DA PENHA	1	1	No

In [7]:

### Get the shape and types of our data
print(df.shape)
pd.DataFrame(df.dtypes)

(110527, 14)

Out[7]:

	0
PatientId	float64
AppointmentID	int64
Gender	object
ScheduledDay	object
AppointmentDay	object
Age	int64
Neighbourhood	object
Scholarship	int64
Hipertension	int64
Diabetes	int64
Alcoholism	int64
Handcap	int64
SMS_received	int64
No-show	object

In [8]:

### Get some statistics about our data
df.describe()

Out[8]:

	PatientId	AppointmentID	Age	Scholarship	Hipertension	Diabetes	Alcoholism	Handcap	SMS_received
count	1.105270e+05	1.105270e+05	110527.000000	110527.000000	110527.000000	110527.000000	110527.000000	110527.000000	110527.000000
mean	1.474963e+14	5.675305e+06	37.088874	0.098266	0.197246	0.071865	0.030400	0.022248	0.321026
std	2.560949e+14	7.129575e+04	23.110205	0.297675	0.397921	0.258265	0.171686	0.161543	0.466873
min	3.921784e+04	5.030230e+06	-1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	4.172614e+12	5.640286e+06	18.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	3.173184e+13	5.680573e+06	37.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	9.439172e+13	5.725524e+06	55.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000
max	9.999816e+14	5.790484e+06	115.000000	1.000000	1.000000	1.000000	1.000000	4.000000	1.000000

In [9]:

### Check if there is any missing values in our data
df.info()
df.isna().any()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB

Out[9]:

PatientId         False
AppointmentID     False
Gender            False
ScheduledDay      False
AppointmentDay    False
Age               False
Neighbourhood     False
Scholarship       False
Hipertension      False
Diabetes          False
Alcoholism        False
Handcap           False
SMS_received      False
No-show           False
dtype: bool

In [10]:

### Check if there is any duplicated rows in our data
df.duplicated().any()

Out[10]:

False

Notes on data exploration

It's clear that some columns need to have their type corrected like dates. Another great finding is that the data has no duplicated or missing values. Also, the column no-show can be a bit confusing and we can invert the values to make it more intuitive (show instead of no-show) and we can also turn it to integer instead of yes or no.

Data Cleaning¶

Drop irrelevant columns
Modify column names
Correct data types
Invert no-show column in to show with integer values
Create a new column for days difference between scheduling an appointment

In [12]:

### Drop irrelevant columns
df.drop(['PatientId','AppointmentID'],axis=1,inplace=True)
df.head()

Out[12]:

	Gender	ScheduledDay	AppointmentDay	Age	Neighbourhood	Hipertension	Diabetes	No-show
0	F	2016-04-29T18:38:08Z	2016-04-29T00:00:00Z	62	JARDIM DA PENHA	1	0	No
1	M	2016-04-29T16:08:27Z	2016-04-29T00:00:00Z	56	JARDIM DA PENHA	0	0	No
2	F	2016-04-29T16:19:04Z	2016-04-29T00:00:00Z	62	MATA DA PRAIA	0	0	No
3	F	2016-04-29T17:29:31Z	2016-04-29T00:00:00Z	8	PONTAL DE CAMBURI	0	0	No
4	F	2016-04-29T16:07:23Z	2016-04-29T00:00:00Z	56	JARDIM DA PENHA	1	1	No

In [15]:

### Change all cloumns name to lower case and replace all - with _
df.columns=df.columns.str.lower().str.replace('-','_')
pd.DataFrame(df.columns)

Out[15]:

	0
0	gender
1	scheduledday
2	appointmentday
3	age
4	neighbourhood
5	scholarship
6	hipertension
7	diabetes
8	alcoholism
9	handcap
10	sms_received
11	no_show

In [16]:

### Change data columns to data type
df['scheduledday']=pd.to_datetime(df['scheduledday'])
df['appointmentday']=pd.to_datetime(df['appointmentday'])

In [17]:

### Turn no_show column to show
print(df.no_show.unique())
df.no_show=df.no_show.map({'No':1,'Yes':0})
df.rename(columns={'no_show':'show'},inplace=True)
print(df.show.unique())
df.head()

['No' 'Yes']
[1 0]

Out[17]:

	gender	scheduledday	appointmentday	age	neighbourhood	hipertension	diabetes	show
0	F	2016-04-29 18:38:08+00:00	2016-04-29 00:00:00+00:00	62	JARDIM DA PENHA	1	0	1
1	M	2016-04-29 16:08:27+00:00	2016-04-29 00:00:00+00:00	56	JARDIM DA PENHA	0	0	1
2	F	2016-04-29 16:19:04+00:00	2016-04-29 00:00:00+00:00	62	MATA DA PRAIA	0	0	1
3	F	2016-04-29 17:29:31+00:00	2016-04-29 00:00:00+00:00	8	PONTAL DE CAMBURI	0	0	1
4	F	2016-04-29 16:07:23+00:00	2016-04-29 00:00:00+00:00	56	JARDIM DA PENHA	1	1	1

In [18]:

### Create a new column for days difference between scheduling and appointment
day_diff=(df.appointmentday.dt.date-df.scheduledday.dt.date).dt.days 
df.insert(3,'day_diff',day_diff)
df.day_diff.dtype

Out[18]:

dtype('int64')

In [19]:

### Check data one last time
df.dtypes

Out[19]:

gender                         object
scheduledday      datetime64[ns, UTC]
appointmentday    datetime64[ns, UTC]
day_diff                        int64
age                             int64
neighbourhood                  object
scholarship                     int64
hipertension                    int64
diabetes                        int64
alcoholism                      int64
handcap                         int64
sms_received                    int64
show                            int64
dtype: object

Now that we have our data cleaned and with the proper type for every column and also created a new Time difference column we can start analyzing our data and try to find the correlation between different variables and the show column.

Exploratory Data Analysis¶

In [20]:

#define function to get the ratio of show in different categories
def plot_rat(x):
    df.groupby(x).show.mean().plot(kind='bar',
                                    edgecolor='black',
                                    figsize=(14,8)).set_ylabel('Ratio of show');
    display(df.groupby(x)[['show']].mean())
#     plt.legend()

What is the percentage of no-show?¶

In [21]:

#get some statistics about our data
df.describe()

Out[21]:

	day_diff	age	scholarship	hipertension	diabetes	alcoholism	handcap	sms_received	show
count	110527.000000	110527.000000	110527.000000	110527.000000	110527.000000	110527.000000	110527.000000	110527.000000	110527.000000
mean	10.183702	37.088874	0.098266	0.197246	0.071865	0.030400	0.022248	0.321026	0.798067
std	15.254996	23.110205	0.297675	0.397921	0.258265	0.171686	0.161543	0.466873	0.401444
min	-6.000000	-1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000000	18.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000
50%	4.000000	37.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000
75%	15.000000	55.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	1.000000
max	179.000000	115.000000	1.000000	1.000000	1.000000	1.000000	4.000000	1.000000	1.000000

In [24]:

# percentage of show and no show
print(f"percentage of patients who didn't show up for their appointment is { (1-df.show.mean())*100 } %" )
no_show=len(df[df.show==0])/len(df.show)
show=len(df[df.show==1])/len(df.show)
plt.bar(['show','no show'],[show*100,no_show*100],color=['g','r']);
plt.title('Percentage of patients showing up or missing their appointment ');
plt.ylabel('Percentage');
plt.xlabel('show or no-show');
display(df.groupby('show')[['show']].count())

percentage of patients who didn't show up for their appointment is 20.193255946510803 %

	show
show
0	22319
1	88208

What factors are important for us to know in order to predict if a patient will show up for their scheduled appointment?¶

In [28]:

#create filters for show and no-show
show=(df.show == 1)
no_show=(df.show == 0)
total_miss=len(df[no_show])
total=len(df)

In [32]:

#get the number of patients missing their appointments by gender
no_show_gender=df[no_show]['gender'].value_counts()
no_show_gender.plot(kind='pie');
plt.title('patients who missed their appointment by gender');
print('percentage of Females and Males who missed their appointment:')
#get the percentage of patients missing their appointments by gender
pd.DataFrame(no_show_gender*100/total)

percentage of Females and Males who missed their appointment:

Out[32]:

	gender
F	13.204013
M	6.989242

In [33]:

df.groupby(['gender','show']).size().unstack('gender').plot(kind='bar').set_ylabel('number of patients')

Out[33]:

Text(0, 0.5, 'number of patients')

Finding

The percentage of females missing their appointment is nearly two times the number of males. So females are more likely to miss their appointment.¶

Are patients with scholarships more likely to miss their appointment?¶

In [37]:

#what is the percentage of patients missing their appointment by scholarship
plot_rat(df.scholarship)
plt.title('Ratio of show or no-show by scholarship')
# df.groupby('scholarship')[['show']].mean()

	show
scholarship
0	0.801928
1	0.762637

Out[37]:

Text(0.5, 1.0, 'Ratio of show or no-show by scholarship')

Finding

It seems that patients with no scholarships are actually more likely to miss their appointment¶

Are patients who don't receive SMS more likely to miss their appointment?¶

In [52]:

#what is the percentage of patient who attended their appointment by sms_received
plot_rat(df.sms_received)
plt.title('Ratio of show or no-show by sms_received');

	show
sms_received
0	0.832967
1	0.724255

Finding

A strange finding here suggests that patients who received an SMS are more likely to miss their appointment !!¶

In [58]:

#filter for positive day difference
df1=df[df.day_diff>=0]
# df1.day_diff.unique()
#turn day diff into categorical column Day_diff2
bin_edges=[-1,0,4,15,179]
names=['sameday','fewdays','more_than_4','more_than_15']
df['day_diff2']=pd.cut(df1.day_diff,bin_edges,labels=names)
#filter for no-show records and count values for each category of day_diff2

no_show_day_diff=df[no_show].day_diff2.value_counts()/len(df[no_show])*100
no_show_day_diff.reindex(names).plot(kind='bar');
plt.title('propotion of time difference for no_show appointments');
plt.xlabel('days difference between scheduling and appointment');
plt.ylabel('Ratio of no_show');
print('the propotion of different time difference for patients who missed their appiontments:')
pd.DataFrame(no_show_day_diff)

the propotion of different time difference for patients who missed their appiontments:

Out[58]:

	day_diff2
more_than_15	38.460505
more_than_4	32.922622
fewdays	20.565438
sameday	8.029034

Finding

It appears that the longer the period between the scheduling and appointment the more likely the patient won't show up.

Does age affect whether a patient will show up or not?¶

In [59]:

#plot the histograns of age for patients who showed up and who didn't
df[show].age.hist(alpha=0.5,label='show')
df[no_show].age.hist(alpha=0.5,label='no_show')
plt.legend()
plt.xlabel('age')
plt.ylabel('ratio')
plt.title('Histogram of age values for patients who showed up or missed their appointment')
#ger the mean age for patients who showed up and who didn't
df[no_show][['age']].describe()

Out[59]:

	age
count	22319.000000
mean	34.317667
std	21.965941
min	0.000000
25%	16.000000
50%	33.000000
75%	51.000000
max	115.000000

Finding

There is no clear relation between the age and whether the patient shows up or not but younger patients are more likely to miss their appointments.¶

What is the percentage of patients missing their appointments for every neighborhood?¶

In [62]:

#get the number of records for each neighbourhood
rec_neigh=df['neighbourhood'].value_counts()
#get the number of records for patients missing their appointments for each neighbourhood
rec_neigh_no_show=df[no_show].neighbourhood.value_counts()
#percentage of patients missing their appointments for every neighbourhood
rec_neigh_no_show_percentage=rec_neigh_no_show/rec_neigh
pd.DataFrame(rec_neigh_no_show_percentage.sort_values(axis=0, ascending=False))

Out[62]:

	neighbourhood
ILHAS OCEÂNICAS DE TRINDADE	1.000000
SANTOS DUMONT	0.289185
SANTA CECÍLIA	0.274554
SANTA CLARA	0.264822
ITARARÉ	0.262664
...	...
SOLON BORGES	0.147122
MÁRIO CYPRESTE	0.145553
AEROPORTO	0.125000
ILHA DO BOI	0.085714
PARQUE INDUSTRIAL	NaN

81 rows × 1 columns

Conclusions¶

After analyzing the dataset here are some findings:¶

Percentage of patients who didn't show up for their appointment is 20.19%.
The percentage of females missing their appointment is nearly two times the number of males. So females are more likely to miss their appointment.
It appears that the longer the period between the scheduling and appointment the more likely the patient won't show up.
It seems that patients with scholarships are actually more likely to miss their appointment.
A strange finding here suggests that patients who received an SMS are more likely to miss their appointment !!
There is no clear relation between the age and whether the patients show up or not but younger patients are more likely to miss their appointments.

Analysis Shortcoming & Data Limitations¶

The data doesn't state the exact hour of the appointment which would have been very useful to try to find out which hours have the most missing appointments and which doesn't. It could also be very useful to know the difference between scheduling and the appointment since many of the scheduling are on the same day.
The data doesn't state if any day is a vacation or not which can indicate if people tend to miss their appointments more on working days.
The age column had a negative value but according to the data creator, it means a baby not born yet (a pregnant woman).
When calculating the day difference between the scheduling and appointment days we had some negative value which makes no sense and might mean that the records of questions have wrong data.

Thanks to¶

Mostafa Abdelaleem

Mridul Bhandari

For inspiring me

Investigate Medical Appointment Dataset¶

A person takes a doctor's appointment, receives all the instructions, and no-show. Who's to blame?¶

Objectives¶

Questions to answer¶

Setup¶

Data Wrangling¶

Loading the dataset and checking the columns we have¶

Notes on data exploration

Data Cleaning¶

Exploratory Data Analysis¶

What is the percentage of no-show?¶

What factors are important for us to know in order to predict if a patient will show up for their scheduled appointment?¶

Is the time gender related to whether a patient will show or not?¶

Finding

The percentage of females missing their appointment is nearly two times the number of males. So females are more likely to miss their appointment.¶

Are patients with scholarships more likely to miss their appointment?¶

Finding

It seems that patients with no scholarships are actually more likely to miss their appointment¶

Are patients who don't receive SMS more likely to miss their appointment?¶

Finding

A strange finding here suggests that patients who received an SMS are more likely to miss their appointment !!¶

Is the time difference between the scheduling and appointment related to whether a patient will show?¶

Finding

Does age affect whether a patient will show up or not?¶

Finding

There is no clear relation between the age and whether the patient shows up or not but younger patients are more likely to miss their appointments.¶

What is the percentage of patients missing their appointments for every neighborhood?¶

Conclusions¶

After analyzing the dataset here are some findings:¶

Analysis Shortcoming & Data Limitations¶

Thanks to¶