from datascience import *
import numpy as np
import math
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
from scipy import stats
import pandas
sepsis = Table.read_table('sepsis.csv')
import statsmodels.formula.api as sfm
# np.warnings was removed in NumPy 1.24, so use the standard-library warnings module.
import warnings
warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)
def boolean_to_binary(x):
    # Convert a boolean (True/False) to 1/0.
    return int(x)
The data set we call sepsis was downloaded from the UCI Machine Learning Repository and was originally published here. It contains information on whether patients admitted to the hospital with sepsis survived. The variable names are long but self-explanatory.
sepsis.show(3)
age_years | sex_0male_1female | episode_number | hospital_outcome_1alive_0dead |
---|---|---|---|
21 | 1 | 1 | 1 |
20 | 1 | 1 | 1 |
21 | 1 | 1 | 1 |
... (110201 rows omitted)
Question 1 Run a multivariate logistic model to predict whether a person survives (1) or does not survive (0), using all the variables in the sepsis table.
If you don't recall how to do this, review the Logistic Lab or the in-class notebook on logistic regression.
## Prepare the data in the necessary format over the next few lines
Age = sepsis.column("age_years")
Sex = sepsis.column("sex_0male_1female")
Episode = sepsis.column("episode_number")
Survived = sepsis.column("hospital_outcome_1alive_0dead")
## On the next few lines, only edit the names of the arrays and variables
logistic_model_data = pandas.DataFrame({'Age': Age, "Sex": Sex, "Episode": Episode, "Survived": Survived})
## Finish the analysis
...
Ellipsis
Question 2 One of the variables in the so-called "complete model" logistic_model_1 is not significant at the 5% level. Remove it, rerun the model, and call the new model logistic_model_2. Print the summary of this model.
...
Ellipsis
Question 3 Based on this data and this model, which statement is true?
a) An older person with sepsis is more likely to die than a younger person with it.
b) A younger person with sepsis is more likely to die than an older person with it.
(Keep in mind that Survived is coded as 1 in this data set.)
Enter either a) or b) inside the quotes for the variable age_risk.
age_risk = " "
Question 4 Based on this data and this model, which statement is true?
a) A male with sepsis is more likely to die than a female with it.
b) A female with sepsis is more likely to die than a male with it.
(Keep in mind that male is coded as 0 and female is coded as 1 in this data set.)
Enter either a) or b) inside the quotes for the variable sex_risk.
sex_risk = " "
Question 5 Using a threshold of 0.8, construct the confusion matrix for logistic_model_2. To help, we got you started on formatting a table that can be used to create the confusion matrix.
Use the variable confusion_matrix for the confusion matrix.
predictions = Table().with_columns("Actual", Survived, "bool", logistic_model_2.predict() >=0.8)
predictions = predictions.with_column("Prediction", predictions.apply(boolean_to_binary, "bool")).drop("bool")
## keep going to make the confusion matrix
confusion_matrix = predictions.pivot("Prediction", "Actual")
confusion_matrix
Actual | 0 | 1 |
---|---|---|
0 | 102 | 8003 |
1 | 435 | 101664 |
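To make the thresholding step concrete, here is a minimal sketch in plain Python that does not depend on the notebook's tables. The helper `confusion_counts` and the toy data are hypothetical and chosen only for illustration; they are not taken from sepsis.csv.

```python
from collections import Counter

def confusion_counts(actual, probabilities, threshold=0.8):
    """Count (actual, predicted) pairs after thresholding the probabilities."""
    # A probability at or above the threshold is predicted as 1 (alive), else 0.
    predicted = [int(p >= threshold) for p in probabilities]
    return Counter(zip(actual, predicted))

# Hypothetical toy data, NOT from sepsis.csv.
toy_actual = [1, 1, 0, 1, 0, 1]
toy_probabilities = [0.95, 0.70, 0.85, 0.90, 0.40, 0.81]

counts = confusion_counts(toy_actual, toy_probabilities)
for (actual, predicted), n in sorted(counts.items()):
    print(f"Actual={actual}, Prediction={predicted}: {n}")
```

Each (Actual, Prediction) pair corresponds to one cell of a confusion matrix, which is what the pivot call in the notebook tabulates.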
Question 6 Find the sensitivity for this model with a threshold of 0.8.
sensitivity = ...
sensitivity
Ellipsis
Question 7 Find the specificity for this model with a threshold of 0.8.
specificity = ...
specificity
Ellipsis
Question 8 Find the positive predictive value, PPV, for this model with a threshold of 0.8.
ppv = ...
ppv
Question 9 Find the negative predictive value, NPV, for this model with a threshold of 0.8.
npv = ...
npv
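The four measures in Questions 6 through 9 are all ratios of confusion-matrix cells. The sketch below uses the counts from the matrix displayed for Question 5, assuming its rows are the Actual values and its columns are the Prediction values (the orientation that `pivot("Prediction", "Actual")` appears to produce) and treating 1 (alive) as the positive class. The `example_` names are hypothetical and deliberately differ from the notebook's answer variables.

```python
# Counts read from the displayed confusion matrix.
# ASSUMPTION: rows are Actual, columns are Prediction, and 1 (alive) is "positive".
true_negatives = 102       # Actual 0, Prediction 0
false_positives = 8003     # Actual 0, Prediction 1
false_negatives = 435      # Actual 1, Prediction 0
true_positives = 101664    # Actual 1, Prediction 1

# Sensitivity: share of actual positives that were predicted positive.
example_sensitivity = true_positives / (true_positives + false_negatives)
# Specificity: share of actual negatives that were predicted negative.
example_specificity = true_negatives / (true_negatives + false_positives)
# PPV: share of positive predictions that were actually positive.
example_ppv = true_positives / (true_positives + false_positives)
# NPV: share of negative predictions that were actually negative.
example_npv = true_negatives / (true_negatives + false_negatives)

print(example_sensitivity, example_specificity, example_ppv, example_npv)
```

If the matrix is oriented the other way in your run, swap the false positive and false negative counts before computing the ratios.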
Question 10 Which of these four individuals does this model predict has the highest probability of surviving sepsis?
a) A 20 year old female
b) A 20 year old male
c) A 40 year old female
d) A 40 year old male
In the cell below, replace the ellipsis (...) with either a, b, c or d. Leave no spaces between your letter choice and the quotation marks.
best_chance = "..."
print(f"The coefficients of logistic_model_1 are {logistic_model_1._results.params}")
print(f"The coefficients of logistic_model_2 are {logistic_model_2._results.params}")
print(f"My choice for age_risk was {age_risk}")
print(f"My choice for sex_risk was {sex_risk}")
print("My second model's confusion matrix was:")
print(confusion_matrix)
print(f"This model has a sensitivity of {sensitivity}, a specificity of {specificity}, a PPV of {ppv} and an NPV of {npv}")
print(f"My choice for best_chance was {best_chance}")
The coefficients of logistic_model_1 are [ 5.63538281 -0.04438333 0.1779475 -0.0256846 ]
The coefficients of logistic_model_2 are [ 5.59674657 -0.0443472 0.17996867]
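As a rough check on Question 10, the printed coefficients of logistic_model_2 can be turned into predicted survival probabilities with the logistic function. This sketch assumes the coefficients are ordered (intercept, Age, Sex), which depends on how the model formula was written; the function `survival_probability` is a hypothetical helper, not part of the notebook.

```python
import math

# Coefficients copied from the printed output for logistic_model_2.
# ASSUMPTION: the order is (intercept, Age, Sex); check the model summary to confirm.
INTERCEPT, AGE_COEF, SEX_COEF = 5.59674657, -0.0443472, 0.17996867

def survival_probability(age_years, sex):
    """Predicted probability of survival; sex is 0 for male, 1 for female."""
    log_odds = INTERCEPT + AGE_COEF * age_years + SEX_COEF * sex
    return 1 / (1 + math.exp(-log_odds))

profiles = {
    "20 year old female": (20, 1),
    "20 year old male": (20, 0),
    "40 year old female": (40, 1),
    "40 year old male": (40, 0),
}
for label, (age, sex) in profiles.items():
    print(f"{label}: {survival_probability(age, sex):.4f}")
```

Because the logistic function is increasing, comparing the log-odds alone gives the same ranking as comparing the probabilities.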