The MIMIC II database demo is a subset of 4,000 (of over 32,000) patients from the MIMIC II database. These data are located here: http://physionet.org/mimic2/demo/.
No living patients are included in the demo subset (although many of these patients lived for up to several years followingi their ICU admissions documented in this data set). Although these data are exempt from HIPAA requirements for protecting health information of living individuals, the data have been very carefully deidentified, and we have removed free-text notes and reports as a further measure to reduce the possibility of disclosing information that might be used to identify these patients.
The data can be obtained as a disk image that includes Ubuntu 10.04, PostgreSQL 8.4.8, pgadmin3 1.10.2, and MIMIC II v2.6 with data for 4000 patients. This disk image can then be loaded via Virtual Box. Note: Make sure to select the install option when booting the first time in virtualbox. And the next time you boot, boot from disk.
There is also data available for predicting mortality of ICU patients. It was used as a data challenge in 2012 and is publicly available here. It comes in text files and is already somewhat cleaned up, so it should be easy to get started with. I will not discuss these data in this post, but wanted to make a mention of it.
If you are anything like me, then you prefer to work on your own machine as opposed to a VM. So lets get the data out of the VM and onto our machines.
If you are new to PostgreSQL you can check out this tutorial. The fastest way to query the data is by typing "psql mimic2" in the terminal. You can then query the data. For example you can type "select * from mimic2v26.icustay_detail;" But I like doing my data work in Python, so I will show you how to use python to extract data from a PostgreSQL database.
If you are new to Python I would recommend installing the Anaconda distribution. It comes with almost all the libraries you will need for scientific computing. Once installed, type "pip install psycopg2" to get the library to read the data from PostgreSQL. If you run into an "image not found" error when loading the library into Python, check out this post.
Python has a Pandas library that is great for working with data. This library also has an SQL interface, so you can query our PostgreSQL data and it will load into a Pandas data frame. This is a great format to start working with our data in Python.
In the next few steps, I am going to load in some data and do some quick investigations. I will be roughly following the MIMIC Data Science Primer.
import psycopg2
import pandas.io.sql as psql
import pandas as pd
from matplotlib import pyplot as plt
from __future__ import division #now division always returns a floating point number
import statsmodels.api as sm
import numpy as np
from sklearn.linear_model import LogisticRegression
import brewer2mpl
from sklearn import metrics
%matplotlib inline
As you can see below, we first start by connecting to our database.
db = psycopg2.connect("dbname='mimic2' host='localhost'")
We can then get a list of all the tables in our database
tables = psql.read_sql("""select tablename from pg_catalog.pg_tables
where schemaname='mimic2v26';""", con=db)
tables.head()
tablename | |
---|---|
0 | chartevents |
1 | comorbidity_scores |
2 | d_chartitems_detail |
3 | d_codeditems |
4 | d_parammap_items |
There are 38 tables
tables.shape
(38, 1)
We can also create a table that shows us which tables have which variables. You can see that we have a total of 255 variables.
#lets get information on the tables and the columns
col_info = psql.read_sql("""SELECT table_name, column_name
FROM information_schema.columns
WHERE table_schema = 'mimic2v26';""", con = db)
col_info['value'] = "YES"
col_info.head()
col_info = col_info.pivot(index='table_name', columns='column_name', values = 'value')
col_info.shape
(38, 255)
This is useful because now we can see which tables have certain columns. For example, which tables have a subject_id?
col_info.ix[col_info.subject_id == 'YES']['subject_id']
table_name a_chartdurations YES a_iodurations YES a_meddurations YES additives YES admissions YES censusevents YES chartevents YES comorbidity_scores YES d_patients YES deliveries YES demographic_detail YES demographicevents YES drgevents YES icd9 YES icustay_days YES icustay_detail YES icustayevents YES ioevents YES labevents YES medevents YES microbiologyevents YES noteevents YES poe_order YES procedureevents YES totalbalevents YES Name: subject_id, dtype: object
Or what variables are in a certain table? For example, the icustay_detail table.
col_info.ix['icustay_detail',np.where(col_info.ix['icustay_detail',:] == "YES")]
column_name dob YES dod YES expire_flg YES gender YES hadm_id YES height YES hospital_admit_dt YES hospital_disch_dt YES hospital_expire_flg YES hospital_first_flg YES hospital_last_flg YES hospital_los YES hospital_seq YES hospital_total_num YES icustay_admit_age YES icustay_age_group YES icustay_expire_flg YES icustay_first_careunit YES icustay_first_flg YES icustay_first_service YES icustay_id YES icustay_intime YES icustay_last_careunit YES icustay_last_flg YES icustay_last_service YES icustay_los YES icustay_outtime YES icustay_seq YES icustay_total_num YES matched_waveforms_num YES sapsi_first YES sapsi_max YES sapsi_min YES sofa_first YES sofa_max YES sofa_min YES subject_icustay_seq YES subject_icustay_total_num YES subject_id YES weight_first YES weight_max YES weight_min YES Name: icustay_detail, dtype: object
Now that we have connected to the database, we can query it just as you would in the command line. The great thing, though, is that now our data is returned in a Pandas data frame that we can work with in Python.
data = psql.read_sql("""select subject_id, hadm_id, icustay_admit_age,
weight_min, weight_max, gender,
icustay_first_careunit,
icustay_last_careunit,
hospital_los, expire_flg,
sapsi_max, sapsi_min,
date_part('day',dod - icustay_intime) as survival_days
from mimic2v26.icustay_detail
where icustay_age_group = 'adult'
and icustay_los > 24*60
and subject_icustay_seq = 1;""", con=db)
Here is a look at the first 3 rows of our data. Cool :) And the dimensions
data.head(n=3)
subject_id | hadm_id | icustay_admit_age | weight_min | weight_max | gender | icustay_first_careunit | icustay_last_careunit | hospital_los | expire_flg | sapsi_max | sapsi_min | survival_days | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 26 | 15067 | 72.00637 | NaN | NaN | M | CCU | CCU | 10080 | Y | 6 | 5 | 658 |
1 | 37 | 18052 | 68.93134 | NaN | NaN | M | MICU | MICU | 7200 | Y | 11 | 11 | 501 |
2 | 78 | 15161 | 48.62681 | 72.4 | 72.4 | M | MICU | MICU | 4320 | Y | 14 | 11 | 1082 |
data.shape
(3206, 12)
Remove rows with any missing data
data = data.dropna(axis=0)
data.shape
(2698, 12)
Data visualization is a big part of data science, so I try and make my plots look good. The code below modifies some of the matplotlib defaults to make the plots look a bit cleaner. Source: Harvard CS 109 course.
# Set up some better defaults for matplotlib
from matplotlib import rcParams
#colorbrewer2 Dark2 qualitative color table
dark2_colors = brewer2mpl.get_map('Dark2', 'Qualitative', 7).mpl_colors
rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'white'
rcParams['patch.facecolor'] = dark2_colors[0]
rcParams['font.family'] = 'StixGeneral'
def remove_border(axes=None, top=False, right=False, left=True, bottom=True, ax = None):
"""
Minimize chartjunk by stripping out unnecesasry plot borders and axis ticks
The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
"""
if ax == None:
ax = axes or plt.gca()
ax.spines['top'].set_visible(top)
ax.spines['right'].set_visible(right)
ax.spines['left'].set_visible(left)
ax.spines['bottom'].set_visible(bottom)
#turn off all ticks
ax.yaxis.set_ticks_position('none')
ax.xaxis.set_ticks_position('none')
#now re-enable visibles
if top:
ax.xaxis.tick_top()
if bottom:
ax.xaxis.tick_bottom()
if left:
ax.yaxis.tick_left()
if right:
ax.yaxis.tick_right()
Lets look at the differences between males and females for age and weight
p = data[['gender','icustay_admit_age']].groupby('gender').boxplot(return_type='axes')
p = data[['gender','weight_max']].groupby('gender').boxplot()
Lets look at some different counts across gender
#number of men and women
data.gender.value_counts()
M 1453 F 1245 dtype: int64
#counts by ICU
pd.crosstab(data.icustay_first_careunit, data.gender)
gender | F | M |
---|---|---|
icustay_first_careunit | ||
CCU | 288 | 296 |
CSRU | 309 | 455 |
FICU | 132 | 177 |
MICU | 478 | 479 |
SICU | 38 | 46 |
#percentage by ICU
perct_icu = pd.crosstab(data.icustay_first_careunit,
data.gender).apply(lambda r:
r/r.sum(), axis=0)
perct_icu['diff'] = perct_icu.M - perct_icu.F
perct_icu
gender | F | M | diff |
---|---|---|---|
icustay_first_careunit | |||
CCU | 0.231325 | 0.203716 | -0.027609 |
CSRU | 0.248193 | 0.313145 | 0.064952 |
FICU | 0.106024 | 0.121817 | 0.015793 |
MICU | 0.383936 | 0.329663 | -0.054273 |
SICU | 0.030522 | 0.031659 | 0.001137 |
#max sapsi score difference
data.groupby('gender').sapsi_max.mean()
gender F 17.419561 M 16.942875 Name: sapsi_max, dtype: float64
#average admit age
data.groupby('gender').icustay_admit_age.mean()
gender F 73.220369 M 69.947374 Name: icustay_admit_age, dtype: float64
#average max weight
data.groupby('gender').weight_max.mean()
gender F 74.993463 M 87.143591 Name: weight_max, dtype: float64
There exists a MIMIC cookbook which provides SQL code to calculate various measures such as the Elixhauser scores. Fortunately, they have a github repository so we can easily use this code. If you are new to git and github check out these tutorials or do some google searching. Lets go ahead and try out the PostgreSQL code for Elixhauser scores. Note: The code on github doesn't seem to always work out of the box. I had to make a few changes to the substring and to_number functions.
elixhauser = psql.read_sql("""
WITH icd9list AS (
SELECT adm.subject_id,
adm.hadm_id,
code,
sequence,
substring(code from '^\D') as icd9_alpha,
to_number(substring(code from '\d+$|\d+\.\d+$'), '999999999') as icd9_numeric
FROM mimic2v26.admissions adm,
mimic2v26.icd9 icd
WHERE adm.hadm_id = icd.hadm_id
--AND adm.hadm_id < 100
)
--SELECT * FROM icd9list;
, drglist AS (
SELECT adm.subject_id,
adm.hadm_id,
to_number(ci.code, '999999999') AS codenum,
ci.description
FROM mimic2v26.admissions adm,
mimic2v26.drgevents drg,
mimic2v26.d_codeditems ci
WHERE adm.hadm_id = drg.hadm_id
AND drg.itemid = ci.itemid
AND ci.type='HFCA_DRG'
)
--SELECT * FROM drglist;
, drg_category AS (
SELECT subject_id,
hadm_id,
CASE
WHEN (drglist.codenum >= 103 AND drglist.codenum <= 108)
OR (drglist.codenum >= 110 AND drglist.codenum <= 112)
OR (drglist.codenum >= 115 AND drglist.codenum <= 118)
OR (drglist.codenum >= 120 AND drglist.codenum <= 127)
OR drglist.codenum = 129
OR (drglist.codenum >= 132 AND drglist.codenum <= 133)
OR (drglist.codenum >= 135 AND drglist.codenum <= 143)
THEN 1
ELSE 0
END AS cardiac,
CASE
WHEN (drglist.codenum >= 302 AND drglist.codenum <= 305)
OR (drglist.codenum >= 315 AND drglist.codenum <= 333)
THEN 1
ELSE 0
END AS renal,
CASE
WHEN (drglist.codenum >= 199 AND drglist.codenum <= 202)
OR (drglist.codenum >= 205 AND drglist.codenum <= 208)
THEN 1
ELSE 0
END AS liver,
CASE
WHEN (drglist.codenum >= 400 AND drglist.codenum <= 414)
OR drglist.codenum = 473
OR drglist.codenum = 492
THEN 1
ELSE 0
END AS leukemia_lymphoma,
CASE
WHEN drglist.codenum = 10
OR drglist.codenum = 11
OR drglist.codenum = 64
OR drglist.codenum = 82
OR drglist.codenum = 172
OR drglist.codenum = 173
OR drglist.codenum = 199
OR drglist.codenum = 203
OR drglist.codenum = 239
OR (drglist.codenum >= 257 AND drglist.codenum <= 260)
OR drglist.codenum = 274
OR drglist.codenum = 275
OR drglist.codenum = 303
OR drglist.codenum = 318
OR drglist.codenum = 319
OR drglist.codenum = 338
OR drglist.codenum = 344
OR drglist.codenum = 346
OR drglist.codenum = 347
OR drglist.codenum = 354
OR drglist.codenum = 355
OR drglist.codenum = 357
OR drglist.codenum = 363
OR drglist.codenum = 366
OR drglist.codenum = 367
OR (drglist.codenum >= 406 AND drglist.codenum <= 414)
THEN 1
ELSE 0
END AS cancer,
CASE
WHEN drglist.codenum = 88
THEN 1
ELSE 0
END AS copd,
CASE
WHEN (drglist.codenum >= 130 AND drglist.codenum <= 131)
THEN 1
ELSE 0
END AS peripheral_vascular,
CASE
WHEN drglist.codenum = 134
THEN 1
ELSE 0
END AS hypertension,
CASE
WHEN (drglist.codenum >= 14 AND drglist.codenum <= 17)
OR drglist.codenum=5
THEN 1
ELSE 0
END AS cerebrovascular,
CASE
WHEN (drglist.codenum >= 1 AND drglist.codenum <= 35)
THEN 1
ELSE 0
END AS nervous_system,
CASE
WHEN (drglist.codenum >= 96 AND drglist.codenum <= 98)
THEN 1
ELSE 0
END AS asthma,
CASE
WHEN (drglist.codenum >= 294 AND drglist.codenum <= 295)
THEN 1
ELSE 0
END AS diabetes,
CASE
WHEN drglist.codenum = 290
THEN 1
ELSE 0
END AS thyroid,
CASE
WHEN (drglist.codenum >= 300 AND drglist.codenum <= 301)
THEN 1
ELSE 0
END AS endocrine,
CASE
WHEN drglist.codenum = 302
THEN 1
ELSE 0
END AS kidney_transplant,
CASE
WHEN (drglist.codenum >= 316 AND drglist.codenum <= 317)
THEN 1
ELSE 0
END AS renal_failure_dialysis,
CASE
WHEN (drglist.codenum >= 174 AND drglist.codenum <= 178)
THEN 1
ELSE 0
END AS gi_hemorrhage_ulcer,
CASE
WHEN (drglist.codenum >= 488 AND drglist.codenum <= 490)
THEN 1
ELSE 0
END AS hiv,
CASE
WHEN (drglist.codenum >= 240 AND drglist.codenum <= 241)
THEN 1
ELSE 0
END AS connective_tissue,
CASE
WHEN drglist.codenum = 397
THEN 1
ELSE 0
END AS coagulation,
CASE
WHEN drglist.codenum = 288
THEN 1
ELSE 0
END AS obesity_procedure,
CASE
WHEN (drglist.codenum >= 396 AND drglist.codenum <= 298)
THEN 1
ELSE 0
END AS nutrition_metabolic,
CASE
WHEN (drglist.codenum >= 395 AND drglist.codenum <= 396)
THEN 1
ELSE 0
END AS anemia,
CASE
WHEN (drglist.codenum >= 433 AND drglist.codenum <= 437)
THEN 1
ELSE 0
END AS alcohol_drug,
CASE
WHEN drglist.codenum = 430
THEN 1
ELSE 0
END AS psychoses,
CASE
WHEN drglist.codenum = 426
THEN 1
ELSE 0
END AS depression
FROM drglist
)
--SELECT * FROM drg_category;
, elixhauser AS (
SELECT icd.subject_id,
icd.hadm_id,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND (icd.icd9_numeric = 398.91
OR icd.icd9_numeric = 402.11
OR icd.icd9_numeric = 402.91
OR icd.icd9_numeric = 404.11
OR icd.icd9_numeric = 404.13
OR icd.icd9_numeric = 404.91
OR icd.icd9_numeric = 404.93
OR icd.icd9_numeric BETWEEN 428 AND 428.9)
AND drg.cardiac = 0
THEN 1
ELSE 0
END
) AS congestive_heart_failure,
MAX(
CASE
WHEN ((icd.icd9_alpha IS NULL
AND (icd.icd9_numeric = 426.1
OR icd.icd9_numeric = 426.11
OR icd.icd9_numeric = 426.13
OR icd.icd9_numeric BETWEEN 426.2 AND 426.53
OR icd.icd9_numeric BETWEEN 426.6 AND 426.89
OR icd.icd9_numeric = 427
OR icd.icd9_numeric = 427.2
OR icd.icd9_numeric = 427.31
OR icd.icd9_numeric = 427.6
OR icd.icd9_numeric = 427.9
OR icd.icd9_numeric = 785))
OR (icd.icd9_alpha = 'V'
AND (icd.icd9_numeric = 45
OR icd.icd9_numeric = 53.3)))
AND drg.cardiac = 0
THEN 1
ELSE 0
END
) AS cardiac_arrhythmias,
MAX(
CASE
WHEN ((icd.icd9_alpha IS NULL
AND (icd.icd9_numeric BETWEEN 93.2 AND 93.24
OR icd.icd9_numeric BETWEEN 394 AND 397.1
OR icd.icd9_numeric BETWEEN 424 AND 424.91
OR icd.icd9_numeric BETWEEN 746.3 AND 746.6))
OR (icd.icd9_alpha = 'V'
AND (icd.icd9_numeric = 42.2
OR icd.icd9_numeric = 43.3)))
AND drg.cardiac = 0
THEN 1
ELSE 0
END
) AS valvular_disease,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND (icd.icd9_numeric BETWEEN 416 AND 416.9
OR icd.icd9_numeric = 417.9)
AND (drg.cardiac = 0 AND drg.copd = 0)
THEN 1
ELSE 0
END
) AS pulmonary_circulation,
MAX(
CASE
WHEN ((icd.icd9_alpha IS NULL
AND (icd.icd9_numeric BETWEEN 440 AND 440.9
OR icd.icd9_numeric = 441.2
OR icd.icd9_numeric = 441.4
OR icd.icd9_numeric = 441.7
OR icd.icd9_numeric = 441.9
OR icd.icd9_numeric BETWEEN 443.1 AND 443.9
OR icd.icd9_numeric = 447.1
OR icd.icd9_numeric = 557.1
OR icd.icd9_numeric = 557.9))
OR (icd.icd9_alpha = 'V'
AND icd.icd9_numeric = 43.4))
AND drg.peripheral_vascular = 0
THEN 1
ELSE 0
END
) AS peripheral_vascular,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND (icd.icd9_numeric = 401.1
OR icd.icd9_numeric = 401.9
OR icd.icd9_numeric = 402.1
OR icd.icd9_numeric = 402.9
OR icd.icd9_numeric = 404.1
OR icd.icd9_numeric = 404.9
OR icd.icd9_numeric = 405.11
OR icd.icd9_numeric = 405.19
OR icd.icd9_numeric = 405.91
OR icd.icd9_numeric = 405.99)
AND (drg.hypertension = 0 AND drg.cardiac = 0 AND drg.renal = 0)
THEN 1
ELSE 0
END
) AS hypertension,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND (icd.icd9_numeric BETWEEN 342 AND 342.12
OR icd.icd9_numeric BETWEEN 342.9 AND 344.9)
AND drg.cerebrovascular = 0
THEN 1
ELSE 0
END
) AS paralysis,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND (icd.icd9_numeric = 331.9
OR icd.icd9_numeric = 332
OR icd.icd9_numeric = 333.4
OR icd.icd9_numeric = 333.5
OR icd.icd9_numeric BETWEEN 334 AND 335.9
OR icd.icd9_numeric = 340
OR icd.icd9_numeric BETWEEN 341.1 AND 341.9
OR icd.icd9_numeric BETWEEN 345 AND 345.11
OR icd.icd9_numeric BETWEEN 345.4 AND 345.51
OR icd.icd9_numeric BETWEEN 345.8 AND 345.91
OR icd.icd9_numeric = 348.1
OR icd.icd9_numeric = 348.3
OR icd.icd9_numeric = 780.3
OR icd.icd9_numeric = 784.3)
AND drg.nervous_system = 0
THEN 1
ELSE 0
END
) AS other_neurological,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND (icd.icd9_numeric BETWEEN 490 AND 492.8
OR icd.icd9_numeric BETWEEN 493 AND 493.91
OR icd.icd9_numeric = 494
OR icd.icd9_numeric BETWEEN 495 AND 505
OR icd.icd9_numeric = 506.4)
AND (drg.copd = 0 AND drg.asthma = 0)
THEN 1
ELSE 0
END
) AS chronic_pulmonary,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND icd.icd9_numeric BETWEEN 250 AND 250.33
AND drg.diabetes = 0
THEN 1
ELSE 0
END
) AS diabetes_uncomplicated,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND (icd.icd9_numeric BETWEEN 250.4 AND 250.73
OR icd.icd9_numeric BETWEEN 250.9 AND 250.93)
AND drg.diabetes = 0
THEN 1
ELSE 0
END
) AS diabetes_complicated,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND (icd.icd9_numeric BETWEEN 243 AND 244.2
OR icd.icd9_numeric = 244.8
OR icd.icd9_numeric = 244.9)
AND (drg.thyroid = 0 AND drg.endocrine = 0)
THEN 1
ELSE 0
END
) AS hypothyroidism,
MAX(
CASE
WHEN ((icd.icd9_alpha IS NULL
AND (icd.icd9_numeric = 403.11
OR icd.icd9_numeric = 403.91
OR icd.icd9_numeric = 404.12
OR icd.icd9_numeric = 404.92
OR icd.icd9_numeric = 585
OR icd.icd9_numeric = 586))
OR (icd.icd9_alpha = 'V'
AND (icd.icd9_numeric = 42
OR icd.icd9_numeric = 45.1
OR icd.icd9_numeric = 56
OR icd.icd9_numeric = 56.8)))
AND (drg.kidney_transplant = 0 AND renal_failure_dialysis = 0)
THEN 1
ELSE 0
END
) AS renal_failure,
MAX(
CASE
WHEN ((icd.icd9_alpha IS NULL
AND (icd.icd9_numeric = 70.32
OR icd.icd9_numeric = 70.33
OR icd.icd9_numeric = 70.54
OR icd.icd9_numeric = 456
OR icd.icd9_numeric = 456.1
OR icd.icd9_numeric = 456.2
OR icd.icd9_numeric = 456.21
OR icd.icd9_numeric = 571
OR icd.icd9_numeric = 571.2
OR icd.icd9_numeric = 571.3
OR icd.icd9_numeric BETWEEN 571.4 AND 571.49
OR icd.icd9_numeric = 571.5
OR icd.icd9_numeric = 571.6
OR icd.icd9_numeric = 571.8
OR icd.icd9_numeric = 571.9
OR icd.icd9_numeric = 572.3
OR icd.icd9_numeric = 572.8))
OR (icd.icd9_alpha = 'V'
AND icd.icd9_numeric = 42.7))
AND drg.liver = 0
THEN 1
ELSE 0
END
) AS liver_disease,
MAX(
CASE
WHEN ((icd.icd9_alpha IS NULL
AND (icd.icd9_numeric = 531.7
OR icd.icd9_numeric = 531.9
OR icd.icd9_numeric = 532.7
OR icd.icd9_numeric = 532.9
OR icd.icd9_numeric = 533.7
OR icd.icd9_numeric = 533.9
OR icd.icd9_numeric = 534.7
OR icd.icd9_numeric = 534.9))
OR (icd.icd9_alpha = 'V'
AND icd.icd9_numeric = 12.71))
AND drg.gi_hemorrhage_ulcer = 0
THEN 1
ELSE 0
END
) AS peptic_ulcer,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND icd.icd9_numeric BETWEEN 42 AND 44.9
AND drg.hiv = 0
THEN 1
ELSE 0
END
) AS aids,
MAX(
CASE
WHEN ((icd.icd9_alpha IS NULL
AND (icd.icd9_numeric BETWEEN 200 AND 202.38
OR icd.icd9_numeric BETWEEN 202.5 AND 203.01
OR icd.icd9_numeric BETWEEN 203.8 AND 203.81
OR icd.icd9_numeric = 238.6
OR icd.icd9_numeric = 273.3))
OR (icd.icd9_alpha = 'V'
AND (icd.icd9_numeric = 10.71
OR icd.icd9_numeric = 10.72
OR icd.icd9_numeric = 10.79)))
AND drg.leukemia_lymphoma = 0
THEN 1
ELSE 0
END
) AS lymphoma,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND icd.icd9_numeric BETWEEN 196 AND 199.1
AND drg.cancer = 0
THEN 1
ELSE 0
END
) AS metastatic_cancer,
MAX(
CASE
WHEN ((icd.icd9_alpha IS NULL
AND (icd.icd9_numeric BETWEEN 140 AND 172.9
OR icd.icd9_numeric BETWEEN 174 AND 175.9
OR icd.icd9_numeric BETWEEN 179 AND 195.8))
OR (icd.icd9_alpha = 'V'
AND icd.icd9_numeric BETWEEN 10 AND 10.9))
AND drg.cancer = 0
THEN 1
ELSE 0
END
) AS solid_tumor,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND (icd.icd9_numeric = 701
OR icd.icd9_numeric BETWEEN 710 AND 710.9
OR icd.icd9_numeric BETWEEN 714 AND 714.9
OR icd.icd9_numeric BETWEEN 720 AND 720.9
OR icd.icd9_numeric = 725)
AND drg.connective_tissue = 0
THEN 1
ELSE 0
END
) AS rheumatoid_arthritis,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND (icd.icd9_numeric BETWEEN 2860 AND 2869
OR icd.icd9_numeric = 287.1
OR icd.icd9_numeric BETWEEN 287.3 AND 287.5)
AND drg.coagulation = 0
THEN 1
ELSE 0
END
) AS coagulopathy,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND icd.icd9_numeric = 278
AND (drg.obesity_procedure = 0 AND drg.nutrition_metabolic = 0)
THEN 1
ELSE 0
END
) AS obesity,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND icd.icd9_numeric BETWEEN 260 AND 263.9
AND drg.nutrition_metabolic = 0
THEN 1
ELSE 0
END
) AS weight_loss,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND icd.icd9_numeric BETWEEN 276 AND 276.9
AND drg.nutrition_metabolic = 0
THEN 1
ELSE 0
END
) AS fluid_electrolyte,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND icd.icd9_numeric = 2800
AND drg.anemia = 0
THEN 1
ELSE 0
END
) AS blood_loss_anemia,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND (icd.icd9_numeric BETWEEN 280.1 AND 281.9
OR icd.icd9_numeric = 285.9)
AND drg.anemia = 0
THEN 1
ELSE 0
END
) AS deficiency_anemias,
MAX(
CASE
WHEN ((icd.icd9_alpha IS NULL
AND (icd.icd9_numeric = 291.1
OR icd.icd9_numeric = 291.2
OR icd.icd9_numeric = 291.5
OR icd.icd9_numeric = 291.8
OR icd.icd9_numeric = 291.9
OR icd.icd9_numeric BETWEEN 303.9 AND 303.93
OR icd.icd9_numeric BETWEEN 305 AND 305.03))
OR (icd.icd9_alpha = 'V'
AND icd.icd9_numeric = 113))
AND drg.alcohol_drug = 0
THEN 1
ELSE 0
END
) AS alcohol_abuse,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND (icd.icd9_numeric = 292
OR icd.icd9_numeric BETWEEN 292.82 AND 292.89
OR icd.icd9_numeric = 292.9
OR icd.icd9_numeric BETWEEN 304 AND 304.93
OR icd.icd9_numeric BETWEEN 305.2 AND 305.93)
AND drg.alcohol_drug = 0
THEN 1
ELSE 0
END
) AS drug_abuse,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND (icd.icd9_numeric BETWEEN 295 AND 298.9
OR icd.icd9_numeric BETWEEN 299.1 AND 299.11)
AND drg.psychoses = 0
THEN 1
ELSE 0
END
) AS psychoses,
MAX(
CASE
WHEN icd.icd9_alpha IS NULL
AND (icd.icd9_numeric = 300.4
OR icd.icd9_numeric = 301.12
OR icd.icd9_numeric = 309
OR icd.icd9_numeric = 309.1
OR icd.icd9_numeric = 311)
AND drg.depression = 0
THEN 1
ELSE 0
END
) AS depression
FROM icd9list icd, drg_category drg
WHERE icd.hadm_id = drg.hadm_id
GROUP BY icd.subject_id, icd.hadm_id
)
SELECT
SUBJECT_ID,
HADM_ID,
'ELIXHAUSER',
CONGESTIVE_HEART_FAILURE,
CARDIAC_ARRHYTHMIAS,
VALVULAR_DISEASE,
PULMONARY_CIRCULATION,
PERIPHERAL_VASCULAR,
HYPERTENSION,
PARALYSIS,
OTHER_NEUROLOGICAL,
CHRONIC_PULMONARY,
DIABETES_UNCOMPLICATED,
DIABETES_COMPLICATED,
HYPOTHYROIDISM,
RENAL_FAILURE,
LIVER_DISEASE,
PEPTIC_ULCER,
AIDS,
LYMPHOMA,
METASTATIC_CANCER,
SOLID_TUMOR,
RHEUMATOID_ARTHRITIS,
COAGULOPATHY,
OBESITY,
WEIGHT_LOSS,
FLUID_ELECTROLYTE,
BLOOD_LOSS_ANEMIA,
DEFICIENCY_ANEMIAS,
ALCOHOL_ABUSE,
DRUG_ABUSE,
PSYCHOSES,
DEPRESSION
FROM elixhauser ;""", con=db)
Lets create a column for the total number of elixhauser points and merge it onto our data
elixhauser['total_elix'] = elixhauser.drop(['subject_id', '?column?'], 1).sum(1)
elixhauser = elixhauser.drop_duplicates(subset=['subject_id', 'hadm_id'])
data2 = pd.merge(data, elixhauser, how='left', on=['subject_id', 'hadm_id'])
Lets also pull in some demograpic information for our patients
demo_info = psql.read_sql("""
select subject_id, hadm_id, marital_status_descr,
case
when ethnicity_descr like '%WHITE%' then 'WHITE'
when ethnicity_descr like '%BLACK%' then 'BLACK'
when ethnicity_descr like '%HISPANIC%' then 'HISPANIC'
when ethnicity_descr like '%ASIAN%' then 'ASIAN'
else 'OTHER' end as ethnicity,
admission_type_descr
from mimic2v26.demographic_detail;""", con=db)
demo_info.head()
subject_id | hadm_id | marital_status_descr | ethnicity | admission_type_descr | |
---|---|---|---|---|---|
0 | 26 | 15067 | SINGLE | OTHER | EMERGENCY |
1 | 37 | 18052 | MARRIED | WHITE | EMERGENCY |
2 | 78 | 15161 | SEPARATED | BLACK | EMERGENCY |
3 | 67 | 35878 | SINGLE | WHITE | EMERGENCY |
4 | 56 | 28766 | None | WHITE | EMERGENCY |
And merge it onto our data
data_2v2 = pd.merge(data2, demo_info, how='left', on=['subject_id', 'hadm_id'])
Now that we have some data together, we can do a quick logistic regression to try and classify whether a person will die before a year after admission.
We first need to create some dummy variables
data_2v2 = data_2v2.dropna(axis=0)
dummy_list = ['icustay_first_careunit', 'gender', 'marital_status_descr', 'ethnicity', 'admission_type_descr']
dummy_frames = [pd.get_dummies(data_2v2[x]) for x in dummy_list]
dummy_frames = pd.concat(dummy_frames, axis=1)
data_2v2['year_die'] = (data_2v2.survival_days < 365).astype(int)
data3 = pd.concat([data_2v2, dummy_frames], axis=1)
data3['intercept'] = 1.0
#note: FICU is the same as MICU, so combining
data3['MICU'] = np.where(data3.FICU == 1, 1, data3.MICU)
train_cols = ['intercept','icustay_admit_age', 'weight_max','M', 'CSRU', 'MICU', 'SICU',
'sapsi_max', 'sapsi_min', 'total_elix', 'DIVORCED', 'MARRIED', 'SEPARATED', 'WIDOWED',
'ASIAN', 'BLACK', 'HISPANIC', 'EMERGENCY', 'URGENT']
Use statsmodels to do the estimation becuase we get P values.
logit = sm.Logit(data3.year_die, data3[train_cols])
result = logit.fit()
result.summary()
Optimization terminated successfully. Current function value: 0.533862 Iterations 7
Dep. Variable: | year_die | No. Observations: | 2208 |
---|---|---|---|
Model: | Logit | Df Residuals: | 2189 |
Method: | MLE | Df Model: | 18 |
Date: | Fri, 02 Jan 2015 | Pseudo R-squ.: | 0.1225 |
Time: | 21:22:35 | Log-Likelihood: | -1178.8 |
converged: | True | LL-Null: | -1343.3 |
LLR p-value: | 4.799e-59 |
coef | std err | z | P>|z| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
intercept | -2.0407 | 0.431 | -4.731 | 0.000 | -2.886 -1.195 |
icustay_admit_age | -0.0066 | 0.004 | -1.605 | 0.109 | -0.015 0.001 |
weight_max | -0.0038 | 0.002 | -1.727 | 0.084 | -0.008 0.001 |
M | -0.0920 | 0.111 | -0.832 | 0.406 | -0.309 0.125 |
CSRU | -0.1025 | 0.141 | -0.726 | 0.468 | -0.379 0.174 |
MICU | 0.4103 | 0.130 | 3.162 | 0.002 | 0.156 0.665 |
SICU | 1.3303 | 0.462 | 2.878 | 0.004 | 0.424 2.236 |
sapsi_max | 0.0786 | 0.012 | 6.759 | 0.000 | 0.056 0.101 |
sapsi_min | 0.1032 | 0.016 | 6.533 | 0.000 | 0.072 0.134 |
total_elix | 4.213e-07 | 7.14e-08 | 5.903 | 0.000 | 2.81e-07 5.61e-07 |
DIVORCED | -0.0580 | 0.236 | -0.245 | 0.806 | -0.521 0.405 |
MARRIED | 0.0081 | 0.141 | 0.057 | 0.954 | -0.269 0.285 |
SEPARATED | 0.4553 | 0.592 | 0.770 | 0.442 | -0.704 1.615 |
WIDOWED | -0.2520 | 0.172 | -1.467 | 0.143 | -0.589 0.085 |
ASIAN | 0.5206 | 0.503 | 1.034 | 0.301 | -0.466 1.507 |
BLACK | -0.2541 | 0.229 | -1.112 | 0.266 | -0.702 0.194 |
HISPANIC | -0.4725 | 0.404 | -1.168 | 0.243 | -1.265 0.320 |
EMERGENCY | 0.7361 | 0.164 | 4.490 | 0.000 | 0.415 1.057 |
URGENT | -0.0675 | 0.266 | -0.254 | 0.800 | -0.589 0.454 |
Lets take a look at the SICU, year_die table. You can see that 91% of people in the SICU die before a year as opposed to 70% not in the SICU. Thus, it makes sense that it has a large, significant coefficient.
sicu = pd.crosstab(data3.SICU, data3.year_die)
sicu_pct = sicu.div(sicu.sum(axis=1), axis=0)
sicu_pct
year_die | 0 | 1 |
---|---|---|
SICU | ||
0 | 0.303172 | 0.696828 |
1 | 0.093750 | 0.906250 |
Here are some additional links to help you better understand the MIMIC data