Analysis and visualisation of (simulated) malaria cases for Makeover Monday.
Data from VisualizeNoMalaria via Makeover Monday.
import collections
from datetime import datetime
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline
import numpy as np
import pandas as pd
import scipy.stats
Load the data, drop the disclaimer column, and rename the columns while we're here.
malaria_raw = pd.read_excel('Simulated VisualizeNoMalaria Counts.xlsx').drop('Disclaimer', axis=1)
malaria_raw.columns = ['country', 'province', 'district', 'ruralurban', 'date', 'report', 'cases']
malaria_raw.head()
| | country | province | district | ruralurban | date | report | cases |
---|---|---|---|---|---|---|---|
0 | Zambia | Southern | Chikankata | Rural | 2014-01-01 | Health Facility | 0 |
1 | Zambia | Southern | Chikankata | Rural | 2014-01-01 | Community Health Worker | 288 |
2 | Zambia | Southern | Chikankata | Rural | 2014-02-01 | Health Facility | 0 |
3 | Zambia | Southern | Chikankata | Rural | 2014-02-01 | Community Health Worker | 251 |
4 | Zambia | Southern | Chikankata | Rural | 2014-03-01 | Health Facility | 0 |
See how many rows there are for each value of each category.
malaria_raw.country.value_counts()
Zambia    3586
Name: country, dtype: int64
malaria_raw.province.value_counts()
Southern    3586
Name: province, dtype: int64
malaria_raw.district.value_counts()
Monze          400
Kalomo         400
Kazungula      400
Mazabuka       400
Choma          400
Pemba          200
Chikankata     200
Gwembe         200
Siavonga       200
Namwala        200
Zimba          200
Sinazongwe     200
Livingstone    186
Name: district, dtype: int64
malaria_raw.ruralurban.value_counts()
Rural    2404
Urban    1182
Name: ruralurban, dtype: int64
malaria_raw.report.value_counts()
Health Facility            1793
Community Health Worker    1793
Name: report, dtype: int64
Country and province each take a single value (Zambia and Southern), so they carry no information.
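Single-valued columns like these can also be found programmatically rather than by reading each `value_counts()` output. The sketch below uses a small made-up frame (`toy`, with illustrative values only) in place of `malaria_raw`; the names `constant_cols` and `toy_trimmed` are introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for malaria_raw; the values are illustrative only.
toy = pd.DataFrame({
    'country': ['Zambia'] * 4,
    'province': ['Southern'] * 4,
    'district': ['Choma', 'Choma', 'Monze', 'Monze'],
    'cases': [10, 0, 3, 7],
})

# A column with exactly one distinct value cannot distinguish any rows.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# Drop those columns to leave only the informative ones.
toy_trimmed = toy.drop(columns=constant_cols)
print(constant_cols)
print(toy_trimmed.columns.tolist())
```

The same two lines should work unchanged on `malaria_raw`, assuming it has been loaded as above.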
malaria_raw.groupby(['district', 'ruralurban']).size()
district     ruralurban
Chikankata   Rural         200
Choma        Rural         200
             Urban         200
Gwembe       Rural         200
Kalomo       Rural         200
             Urban         200
Kazungula    Rural         200
             Urban         200
Livingstone  Rural           4
             Urban         182
Mazabuka     Rural         200
             Urban         200
Monze        Rural         200
             Urban         200
Namwala      Rural         200
Pemba        Rural         200
Siavonga     Rural         200
Sinazongwe   Rural         200
Zimba        Rural         200
dtype: int64
Just a quick few plots to see what the data looks like.
malaria_raw.groupby('date').sum().plot()
malaria_raw.groupby(['date', 'report']).sum().unstack().plot()
ax = malaria_raw.groupby(['date', 'district']).sum().unstack().plot()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
malaria_raw.groupby('district').sum().sort_values(by='cases')
district | cases |
---|---|
Livingstone | 4790 |
Namwala | 6439 |
Mazabuka | 8068 |
Monze | 9243 |
Chikankata | 13917 |
Zimba | 14984 |
Choma | 32397 |
Kazungula | 33731 |
Pemba | 35081 |
Kalomo | 35529 |
Siavonga | 40703 |
Gwembe | 64059 |
Sinazongwe | 158874 |
ax = malaria_raw.groupby(['date', 'ruralurban']).sum().unstack().plot()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax = malaria_raw.groupby(['date', 'district', 'report']).sum().unstack([-2, -1]).plot(figsize=(15, 15))
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5));
Atypical things are happening in 2014. Let's look at 2014 on its own, and separately at 2015 onwards.
malaria_2014 = malaria_raw[malaria_raw.date.dt.year == 2014]
malaria_2014.head()
| | country | province | district | ruralurban | date | report | cases |
---|---|---|---|---|---|---|---|
0 | Zambia | Southern | Chikankata | Rural | 2014-01-01 | Health Facility | 0 |
1 | Zambia | Southern | Chikankata | Rural | 2014-01-01 | Community Health Worker | 288 |
2 | Zambia | Southern | Chikankata | Rural | 2014-02-01 | Health Facility | 0 |
3 | Zambia | Southern | Chikankata | Rural | 2014-02-01 | Community Health Worker | 251 |
4 | Zambia | Southern | Chikankata | Rural | 2014-03-01 | Health Facility | 0 |
malaria_2015p = malaria_raw[malaria_raw.date.dt.year >= 2015]
malaria_2015p.head()
| | country | province | district | ruralurban | date | report | cases |
---|---|---|---|---|---|---|---|
24 | Zambia | Southern | Chikankata | Rural | 2015-01-01 | Health Facility | 0 |
25 | Zambia | Southern | Chikankata | Rural | 2015-01-01 | Community Health Worker | 87 |
26 | Zambia | Southern | Chikankata | Rural | 2015-02-01 | Health Facility | 0 |
27 | Zambia | Southern | Chikankata | Rural | 2015-02-01 | Community Health Worker | 77 |
28 | Zambia | Southern | Chikankata | Rural | 2015-03-01 | Health Facility | 0 |
ax = malaria_2014.groupby(['date', 'district']).sum().unstack().plot()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5));
ax = malaria_2014.groupby(['date', 'district', 'report']).sum().unstack([-2, -1]).plot()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5));
ax = malaria_2015p.groupby(['date', 'district']).sum().unstack().plot()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5));
ax = malaria_2015p.groupby(['date', 'district', 'report']).sum().unstack([-2, -1]).plot()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5));
Sinazongwe is an outlier, with far more cases than any other district. Let's look at Sinazongwe on its own, and at everything except Sinazongwe.
malaria_sina = malaria_raw[malaria_raw.district == 'Sinazongwe']
malaria_sina.head()
| | country | province | district | ruralurban | date | report | cases |
---|---|---|---|---|---|---|---|
3186 | Zambia | Southern | Sinazongwe | Rural | 2014-01-01 | Health Facility | 0 |
3187 | Zambia | Southern | Sinazongwe | Rural | 2014-01-01 | Community Health Worker | 3 |
3188 | Zambia | Southern | Sinazongwe | Rural | 2014-02-01 | Health Facility | 0 |
3189 | Zambia | Southern | Sinazongwe | Rural | 2014-02-01 | Community Health Worker | 0 |
3190 | Zambia | Southern | Sinazongwe | Rural | 2014-03-01 | Health Facility | 0 |
malaria_not_sina = malaria_raw[malaria_raw.district != 'Sinazongwe']
malaria_not_sina.head()
| | country | province | district | ruralurban | date | report | cases |
---|---|---|---|---|---|---|---|
0 | Zambia | Southern | Chikankata | Rural | 2014-01-01 | Health Facility | 0 |
1 | Zambia | Southern | Chikankata | Rural | 2014-01-01 | Community Health Worker | 288 |
2 | Zambia | Southern | Chikankata | Rural | 2014-02-01 | Health Facility | 0 |
3 | Zambia | Southern | Chikankata | Rural | 2014-02-01 | Community Health Worker | 251 |
4 | Zambia | Southern | Chikankata | Rural | 2014-03-01 | Health Facility | 0 |
malaria_not_sina.groupby(['date', 'report']).sum().unstack().plot()
malaria_sina.groupby(['date', 'report']).sum().unstack().plot()
ax = malaria_not_sina.groupby(['date', 'district', 'report']).sum().unstack([-2, -1]).plot()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5));
district_month = malaria_raw.groupby(['date', 'district']).sum().unstack().T.reset_index().set_index('district').drop('level_0', axis=1).T
district_month.head()
district | Chikankata | Choma | Gwembe | Kalomo | Kazungula | Livingstone | Mazabuka | Monze | Namwala | Pemba | Siavonga | Sinazongwe | Zimba |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | |||||||||||||
2014-01-01 00:00:00 | 865 | 1370 | 1875 | 927 | 462 | 123 | 425 | 646 | 107 | 3064 | 1014 | 3543 | 725 |
2014-02-01 00:00:00 | 778 | 2377 | 2049 | 1688 | 869 | 142 | 280 | 294 | 114 | 4205 | 1168 | 4308 | 1333 |
2014-03-01 00:00:00 | 1046 | 6277 | 4606 | 3298 | 1951 | 283 | 439 | 545 | 289 | 4993 | 2742 | 11058 | 2086 |
2014-04-01 00:00:00 | 1177 | 5192 | 3980 | 4202 | 2165 | 299 | 503 | 783 | 298 | 3965 | 1974 | 9880 | 2086 |
2014-05-01 00:00:00 | 761 | 3161 | 2900 | 1862 | 1569 | 345 | 509 | 795 | 117 | 3605 | 1318 | 8058 | 773 |
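The `groupby(...).unstack().T.reset_index()...` chain that builds `district_month` produces a table with one row per date and one column per district. A shorter route to the same shape is `pivot_table`, sketched here on a made-up frame (`toy`, illustrative values only). One difference to be aware of: this version keeps a proper datetime index, whereas the transposed version displays full timestamps.

```python
import pandas as pd

# Toy stand-in for malaria_raw; the values are illustrative only.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-01', '2014-01-01', '2014-01-01',
                            '2014-02-01', '2014-02-01']),
    'district': ['Choma', 'Choma', 'Monze', 'Choma', 'Monze'],
    'report': ['Health Facility', 'Community Health Worker',
               'Health Facility', 'Health Facility', 'Health Facility'],
    'cases': [5, 7, 3, 2, 4],
})

# One row per date, one column per district, summing over report types.
wide = toy.pivot_table(index='date', columns='district',
                       values='cases', aggfunc='sum')
print(wide)
```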
district_month_sorted = district_month.reindex(district_month.sum().sort_values(ascending=False).index, axis=1)
ax = district_month_sorted.plot.area(figsize=(10, 7))
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], loc='center left', bbox_to_anchor=(1, 0.5), title='District');
Can we simplify this by focusing on just the largest districts?
d2_month = pd.DataFrame()
d2_month['Sinazongwe'] = district_month['Sinazongwe']
d2_month['Gwembe'] = district_month['Gwembe']
d2_month['Siavonga'] = district_month['Siavonga']
d2_month['Others'] = district_month_sorted.drop(['Sinazongwe', 'Gwembe', 'Siavonga'], axis=1).sum(axis=1)
d2_month.head()
date | Sinazongwe | Gwembe | Siavonga | Others |
---|---|---|---|---|
2014-01-01 | 3543 | 1875 | 1014 | 8714 |
2014-02-01 | 4308 | 2049 | 1168 | 12080 |
2014-03-01 | 11058 | 4606 | 2742 | 21207 |
2014-04-01 | 9880 | 3980 | 1974 | 20670 |
2014-05-01 | 8058 | 2900 | 1318 | 13497 |
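The "three largest plus Others" grouping is written out by hand above. If the number of highlighted districts needs to change, a small helper can pick the top columns by total instead. This is only a sketch: `top_n_with_others` is a hypothetical helper name, and the input frame is made up for illustration.

```python
import pandas as pd

def top_n_with_others(wide, n=3, others_label='Others'):
    """Keep the n columns with the largest totals; sum the rest into one column."""
    totals = wide.sum().sort_values(ascending=False)
    top = list(totals.index[:n])
    out = wide[top].copy()
    out[others_label] = wide.drop(columns=top).sum(axis=1)
    return out

# Toy wide table (rows are periods, columns are districts); values are illustrative.
toy_wide = pd.DataFrame({'A': [10, 20], 'B': [1, 2], 'C': [5, 5], 'D': [3, 1]})
result = top_n_with_others(toy_wide, n=2)
print(result)
```

Applied to `district_month` with `n=3`, this should select Sinazongwe, Gwembe and Siavonga, since those have the largest totals in the table above.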
f, ax = plt.subplots(1, 1, sharey=True, figsize=(10, 7), facecolor='lemonchiffon')
plt.suptitle('Incidence of malaria cases in southern Zambia (simulated)\nThree districts with highest caseload separated', fontsize=20)
d2_month.plot.area(figsize=(10, 7), ax=ax, color=['firebrick', 'tomato', 'lightsalmon', 'darkgreen'])
ax.set_facecolor('lemonchiffon')
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], title='District', facecolor='lemonchiffon'); # , loc='center left', bbox_to_anchor=(1, 0.5)
f.savefig('malaria-districts.png', facecolor=f.get_facecolor(), transparent=True)
ax = d2_month.plot(figsize=(10, 7))
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], loc='center left', bbox_to_anchor=(1, 0.5), title='District');
district_year = malaria_raw.groupby([malaria_raw.date.dt.year, 'district']).sum().unstack().T.reset_index().set_index('district').drop('level_0', axis=1).T
district_year
district | Chikankata | Choma | Gwembe | Kalomo | Kazungula | Livingstone | Mazabuka | Monze | Namwala | Pemba | Siavonga | Sinazongwe | Zimba |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | |||||||||||||
2014 | 5852 | 21810 | 29523 | 14410 | 8093 | 1752 | 2919 | 4152 | 1073 | 24984 | 15995 | 70664 | 8326 |
2015 | 2717 | 4030 | 15349 | 4005 | 3623 | 837 | 1297 | 1539 | 695 | 6925 | 6295 | 43057 | 3930 |
2016 | 2178 | 3940 | 4963 | 9690 | 13009 | 1255 | 2328 | 1939 | 2801 | 2019 | 7087 | 16823 | 1306 |
2017 | 2073 | 2396 | 9899 | 6983 | 8627 | 845 | 1322 | 1452 | 1813 | 994 | 10595 | 18761 | 1094 |
2018 | 1097 | 221 | 4325 | 441 | 379 | 101 | 202 | 161 | 57 | 159 | 731 | 9569 | 328 |
# district_year = malaria_raw.groupby([malaria_raw.date.dt.year, 'district']).sum().unstack()
# district_year
ax = district_year.plot()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5));
ax = district_year.plot.area()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5));
district_year_sorted = district_year.reindex(district_year.sum().sort_values(ascending=False).index, axis=1)
ax = district_year_sorted.plot.area(figsize=(10, 7))
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], loc='center left', bbox_to_anchor=(1, 0.5), title='District');
ax = district_year.drop('Sinazongwe', axis=1).plot()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5));
Again, can we simplify this by focusing on just the largest districts?
d2_year = pd.DataFrame()
d2_year['Others'] = district_year.drop(['Sinazongwe', 'Gwembe', 'Siavonga'], axis=1).sum(axis=1)
d2_year['Siavonga'] = district_year['Siavonga']
d2_year['Gwembe'] = district_year['Gwembe']
d2_year['Sinazongwe'] = district_year['Sinazongwe']
d2_year
date | Others | Siavonga | Gwembe | Sinazongwe |
---|---|---|---|---|
2014 | 93371 | 15995 | 29523 | 70664 |
2015 | 29598 | 6295 | 15349 | 43057 |
2016 | 40465 | 7087 | 4963 | 16823 |
2017 | 27599 | 10595 | 9899 | 18761 |
2018 | 3146 | 731 | 4325 | 9569 |
d2_year.plot.bar(stacked=True)
d2_year.plot.area(xticks=[x for x in range(2014, 2019)])
What if we total the cases by calendar month, pooling all five years, to see how they vary through the year?
malaria_raw['month'] = malaria_raw['date'].dt.month
malaria_raw.head()
| | country | province | district | ruralurban | date | report | cases | month |
---|---|---|---|---|---|---|---|---|
0 | Zambia | Southern | Chikankata | Rural | 2014-01-01 | Health Facility | 0 | 1 |
1 | Zambia | Southern | Chikankata | Rural | 2014-01-01 | Community Health Worker | 288 | 1 |
2 | Zambia | Southern | Chikankata | Rural | 2014-02-01 | Health Facility | 0 | 2 |
3 | Zambia | Southern | Chikankata | Rural | 2014-02-01 | Community Health Worker | 251 | 2 |
4 | Zambia | Southern | Chikankata | Rural | 2014-03-01 | Health Facility | 0 | 3 |
malaria_month_year = pd.pivot_table(malaria_raw,index='month',columns='district',values='cases', aggfunc=np.sum)
malaria_month_year
district | Chikankata | Choma | Gwembe | Kalomo | Kazungula | Livingstone | Mazabuka | Monze | Namwala | Pemba | Siavonga | Sinazongwe | Zimba |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
month | |||||||||||||
1 | 2556 | 2766 | 6381 | 2633 | 2618 | 509 | 1074 | 1274 | 765 | 5064 | 2074 | 16561 | 1640 |
2 | 1903 | 3516 | 4958 | 3480 | 4200 | 466 | 784 | 888 | 585 | 5972 | 1998 | 11983 | 2202 |
3 | 2051 | 7604 | 7330 | 5937 | 6528 | 614 | 1070 | 1215 | 1091 | 6593 | 4837 | 18379 | 3508 |
4 | 2267 | 6994 | 8352 | 9419 | 8215 | 799 | 1546 | 1676 | 1194 | 5423 | 5955 | 20070 | 3447 |
5 | 1738 | 5294 | 8341 | 6923 | 7708 | 1081 | 1571 | 1825 | 1668 | 5247 | 7246 | 21220 | 1739 |
6 | 599 | 2209 | 5534 | 2835 | 2215 | 393 | 587 | 817 | 472 | 2686 | 4542 | 17599 | 652 |
7 | 378 | 769 | 7538 | 1193 | 645 | 140 | 255 | 353 | 161 | 1083 | 3694 | 9924 | 243 |
8 | 260 | 422 | 4998 | 605 | 260 | 137 | 223 | 266 | 73 | 553 | 3995 | 8765 | 179 |
9 | 391 | 489 | 4205 | 439 | 156 | 124 | 210 | 234 | 57 | 458 | 3045 | 12220 | 158 |
10 | 515 | 731 | 2716 | 472 | 208 | 95 | 163 | 236 | 49 | 470 | 1667 | 10442 | 285 |
11 | 398 | 661 | 1731 | 540 | 173 | 167 | 218 | 141 | 69 | 570 | 880 | 5352 | 323 |
12 | 861 | 942 | 1975 | 1053 | 805 | 265 | 367 | 318 | 255 | 962 | 770 | 6359 | 608 |
malaria_month_year.plot()
Normalise the data, so that each value is the fraction of that district's total cases that falls in each month.
mmy_norm = malaria_month_year / malaria_month_year.sum()
mmy_norm
district | Chikankata | Choma | Gwembe | Kalomo | Kazungula | Livingstone | Mazabuka | Monze | Namwala | Pemba | Siavonga | Sinazongwe | Zimba |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
month | |||||||||||||
1 | 0.183660 | 0.085378 | 0.099611 | 0.074108 | 0.077614 | 0.106263 | 0.133118 | 0.137834 | 0.118807 | 0.144352 | 0.050954 | 0.104240 | 0.109450 |
2 | 0.136739 | 0.108529 | 0.077397 | 0.097948 | 0.124515 | 0.097286 | 0.097174 | 0.096073 | 0.090853 | 0.170235 | 0.049087 | 0.075425 | 0.146957 |
3 | 0.147374 | 0.234713 | 0.114426 | 0.167103 | 0.193531 | 0.128184 | 0.132623 | 0.131451 | 0.169436 | 0.187936 | 0.118836 | 0.115683 | 0.234116 |
4 | 0.162894 | 0.215884 | 0.130380 | 0.265107 | 0.243545 | 0.166806 | 0.191621 | 0.181326 | 0.185433 | 0.154585 | 0.146304 | 0.126327 | 0.230045 |
5 | 0.124883 | 0.163410 | 0.130208 | 0.194855 | 0.228514 | 0.225678 | 0.194720 | 0.197447 | 0.259046 | 0.149568 | 0.178021 | 0.133565 | 0.116057 |
6 | 0.043041 | 0.068185 | 0.086389 | 0.079794 | 0.065667 | 0.082046 | 0.072757 | 0.088391 | 0.073303 | 0.076566 | 0.111589 | 0.110773 | 0.043513 |
7 | 0.027161 | 0.023737 | 0.117673 | 0.033578 | 0.019122 | 0.029228 | 0.031606 | 0.038191 | 0.025004 | 0.030871 | 0.090755 | 0.062465 | 0.016217 |
8 | 0.018682 | 0.013026 | 0.078022 | 0.017028 | 0.007708 | 0.028601 | 0.027640 | 0.028779 | 0.011337 | 0.015764 | 0.098150 | 0.055170 | 0.011946 |
9 | 0.028095 | 0.015094 | 0.065643 | 0.012356 | 0.004625 | 0.025887 | 0.026029 | 0.025316 | 0.008852 | 0.013056 | 0.074810 | 0.076916 | 0.010545 |
10 | 0.037005 | 0.022564 | 0.042398 | 0.013285 | 0.006166 | 0.019833 | 0.020203 | 0.025533 | 0.007610 | 0.013398 | 0.040955 | 0.065725 | 0.019020 |
11 | 0.028598 | 0.020403 | 0.027022 | 0.015199 | 0.005129 | 0.034864 | 0.027020 | 0.015255 | 0.010716 | 0.016248 | 0.021620 | 0.033687 | 0.021556 |
12 | 0.061867 | 0.029077 | 0.030831 | 0.029638 | 0.023865 | 0.055324 | 0.045488 | 0.034404 | 0.039602 | 0.027422 | 0.018918 | 0.040025 | 0.040577 |
ax = mmy_norm.plot()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5));
A very clear seasonal pattern emerges: in most districts, cases peak between March and May and are lowest in the second half of the year.
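Two quick checks make the normalisation and the seasonal peak concrete: each district's column should sum to 1 after dividing by its total, and `idxmax` gives the month with the largest share. The sketch below uses a made-up three-month table (`monthly`, illustrative values only) rather than `malaria_month_year`.

```python
import pandas as pd

# Toy month-by-district table; the values are illustrative only.
monthly = pd.DataFrame(
    {'A': [10, 30, 60], 'B': [5, 10, 5]},
    index=pd.Index([1, 2, 3], name='month'),
)

# Divide each column by its own total, so each district's shares sum to 1.
norm = monthly / monthly.sum()

# The index label (month) holding the largest share for each district.
peak_month = norm.idxmax()
print(norm.sum())
print(peak_month)
```

On the real data, `mmy_norm.idxmax()` would give each district's peak month directly.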
What can we see if we look at the proportions of reports by different categories?
First, let's look at the rural/urban split and the reporter split (whether the report came from a community health worker or a health facility).
report_month = malaria_raw.pivot_table(index='date', columns=['ruralurban', 'report'], values='cases', aggfunc=sum)
report_month.head()
date | Rural: Community Health Worker | Rural: Health Facility | Urban: Community Health Worker | Urban: Health Facility |
---|---|---|---|---|
2014-01-01 | 3899 | 10595 | 163 | 489 |
2014-02-01 | 6711 | 12303 | 163 | 428 |
2014-03-01 | 9669 | 28824 | 312 | 808 |
2014-04-01 | 9646 | 25533 | 415 | 910 |
2014-05-01 | 5913 | 18763 | 193 | 904 |
ax = report_month.divide(report_month.sum(axis=1), axis=0).plot.area()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
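Note that this proportion plot normalises along rows (each date's categories sum to 1), whereas `mmy_norm` earlier normalised along columns (each district's months sum to 1). The difference is easy to get backwards, so here is a small sketch on a made-up table (`counts`, illustrative values only) showing both.

```python
import pandas as pd

# Toy date-by-category table; the values are illustrative only.
counts = pd.DataFrame(
    {'Rural': [90, 40], 'Urban': [10, 10]},
    index=pd.to_datetime(['2014-01-01', '2014-02-01']),
)

# Row-wise shares: divide each row by its own total (axis=0 aligns on the index).
row_share = counts.divide(counts.sum(axis=1), axis=0)

# Column-wise shares: divide each column by its own total.
col_share = counts / counts.sum()

print(row_share)
print(col_share)
```

Row-wise shares answer "who accounts for the cases in this month?", while column-wise shares answer "when do this category's cases occur?".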
Another look at districts. Which districts have the largest proportion of cases each month?
ax = district_month_sorted.divide(district_month_sorted.sum(axis=1), axis=0).plot.area()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
ax = malaria_month_year.divide(malaria_month_year.sum(axis=1), axis=0).plot.area()
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
Compute the share of all cases that falls in each category.
malaria_raw.groupby('ruralurban').cases.sum() / malaria_raw.groupby('ruralurban').cases.sum().sum()
ruralurban
Rural    0.963603
Urban    0.036397
Name: cases, dtype: float64
malaria_raw.groupby('report').cases.sum() / malaria_raw.groupby('report').cases.sum().sum()
report
Community Health Worker    0.438267
Health Facility            0.561733
Name: cases, dtype: float64
malaria_raw.groupby('district').cases.sum() / malaria_raw.groupby('district').cases.sum().sum()
district
Chikankata     0.030399
Choma          0.070764
Gwembe         0.139923
Kalomo         0.077606
Kazungula      0.073678
Livingstone    0.010463
Mazabuka       0.017623
Monze          0.020189
Namwala        0.014065
Pemba          0.076627
Siavonga       0.088907
Sinazongwe     0.347027
Zimba          0.032729
Name: cases, dtype: float64
While these figures are interesting, they don't tell us much on their own. In particular, what is the base population in each of these categories? For instance, the fact that only 4% of cases are urban could simply reflect that 4% of the population is urban. Similarly, if Sinazongwe has 35% of the population, that would neatly explain its 35% of cases.
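If population figures were available, the comparison could be sketched as below. The case totals for the two districts are taken from the table earlier in this notebook, but the `population` values are invented placeholders, NOT real census figures; they exist only to show the arithmetic. The idea is to compare each district's share of cases with its share of population.

```python
import pandas as pd

# Case totals for two districts, from the district totals table above.
cases = pd.Series({'Sinazongwe': 158874, 'Livingstone': 4790}, name='cases')

# HYPOTHETICAL populations, invented purely for illustration.
population = pd.Series({'Sinazongwe': 100_000, 'Livingstone': 150_000},
                       name='population')

# Cumulative cases per 1,000 people over the whole period.
cases_per_1000 = cases / population * 1000

# Share of cases versus share of population (within these two districts only).
case_share = cases / cases.sum()
pop_share = population / population.sum()

# A ratio above 1 means more cases than population alone would predict.
relative = case_share / pop_share
print(cases_per_1000)
print(relative)
```

With real population data, a ratio close to 1 for every district would mean the case counts mostly mirror where people live.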