Logistic Homework

In [2]:

from datascience import *
import numpy as np
import math

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
from scipy import stats

import pandas

sepsis = Table.read_table('sepsis.csv')

import statsmodels.formula.api as sfm

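import warnings

# np.warnings was removed in NumPy 1.24, so use the standard-library warnings module.
# On NumPy 2.0 and later the category lives at np.exceptions.VisibleDeprecationWarning.
warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)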

def boolean_to_binary(x):
    return int(x)

The data set we call sepsis was downloaded from the UCI Machine Learning Repository and was originally published here. It records whether patients admitted to the hospital with sepsis survived. The variable names are long but self-explanatory.

In [3]:

sepsis.show(3)

age_years	sex_0male_1female	episode_number	hospital_outcome_1alive_0dead
21	1	1	1
20	1	1	1
21	1	1	1

... (110201 rows omitted)

Question 1 Run a multivariate logistic model to predict whether a person survives (1) or does not survive (0), using all the variables in the sepsis table. Call this model logistic_model_1.

If you don't recall how to do this, review the Logistic Lab or the in-class notebook over logistic regression.
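As a reminder of the general pattern, here is a minimal sketch on synthetic data, not the sepsis table. The column names Age, Sex and Outcome, the sample size, and the coefficients used to simulate the data are all made up for illustration.

```python
import numpy as np
import pandas
import statsmodels.formula.api as sfm

# Synthetic data for illustration only (hypothetical columns and effects)
rng = np.random.default_rng(0)
n = 2000
age = rng.integers(20, 90, n)
sex = rng.integers(0, 2, n)
true_log_odds = 3.0 - 0.04 * age + 0.3 * sex
outcome = rng.binomial(1, 1 / (1 + np.exp(-true_log_odds)))
demo_data = pandas.DataFrame({"Age": age, "Sex": sex, "Outcome": outcome})

# Fit a logistic regression of Outcome on both predictors
demo_model = sfm.logit("Outcome ~ Age + Sex", data=demo_data).fit(disp=0)

print(demo_model.params)       # estimated intercept and slopes, on the log-odds scale
print(demo_model.pvalues)      # one p-value per coefficient
fitted = demo_model.predict()  # fitted probabilities for the rows used in the fit
```

Calling demo_model.summary() displays the full regression table, including the p-values used to judge significance.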

In [4]:

## Prepare the data in the necessary format over the next few lines

Age = sepsis.column("age_years")
Sex = sepsis.column("sex_0male_1female")
Episode = sepsis.column("episode_number")
Survived = sepsis.column("hospital_outcome_1alive_0dead")


## On the next few lines, only edit the names of the arrays and variables
logistic_model_data = pandas.DataFrame({'Age': Age, "Sex": Sex, "Episode": Episode, "Survived": Survived})

## Finish the analysis

...

Out[4]:

Ellipsis

Question 2 One of the variables in the so-called "complete model" logistic_model_1 is not significant at the 5% level. Remove it, rerun the model, and call the new model logistic_model_2. Print the summary of this model.

In [5]:

...

Out[5]:

Ellipsis

Question 3 Based on this data and this model, which statement is true?

a) An older person with sepsis is more likely to die than a younger person with it.

b) A younger person with sepsis is more likely to die than an older person with it.

(Keep in mind that Survived is coded as 1 in this data set.)

Enter either a) or b) inside the quotes for the variable age_risk.
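When reading a coefficient, it can help to convert it to an odds ratio. The sketch below uses a made-up coefficient (demo_beta is hypothetical, not a value from either sepsis model) for an outcome coded 1.

```python
import math

# Hypothetical coefficient of some predictor; the outcome is coded 1
demo_beta = -0.5

# Each one-unit increase in the predictor multiplies the odds of outcome 1 by exp(beta)
demo_odds_ratio = math.exp(demo_beta)
print(demo_odds_ratio)
```

A negative coefficient gives an odds ratio below 1, so larger values of the predictor lower the odds of the outcome coded 1. A positive coefficient does the opposite.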

In [32]:

age_risk = " "

Question 4 Based on this data and this model, which statement is true?

a) A male with sepsis is more likely to die than a female with it.

b) A female with sepsis is more likely to die than a male with it.

(Keep in mind that male is coded as 0 and female is coded as 1 in this data set.)

Enter either a) or b) inside the quotes for the variable sex_risk.

In [33]:

sex_risk = " "

Question 5 Using a threshold of 0.8, construct the confusion matrix for logistic_model_2. To help, we have started formatting a table that can be used to create the confusion matrix.

Use the variable confusion_matrix for the confusion matrix.

In [36]:

predictions = Table().with_columns("Actual", Survived, "bool", logistic_model_2.predict() >=0.8)
predictions = predictions.with_column("Prediction", predictions.apply(boolean_to_binary, "bool")).drop("bool")

## keep going to make the confusion matrix
confusion_matrix = predictions.pivot("Prediction", "Actual")
confusion_matrix

Out[36]:

Actual	0	1
0	102	8003
1	435	101664

Question 6 Find the sensitivity for this model with a threshold of 0.8.

In [37]:

sensitivity = ...
sensitivity

Out[37]:

Ellipsis

Question 7 Find the specificity for this model with a threshold of 0.8.
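Sensitivity and specificity both condition on the actual outcome. Here is a minimal sketch with made-up counts, assuming the class coded 1 is treated as the positive class (an assumption; the counts are not taken from the sepsis confusion matrix).

```python
# Hypothetical counts for illustration only
demo_true_pos = 80   # actual 1, predicted 1
demo_false_neg = 20  # actual 1, predicted 0
demo_true_neg = 30   # actual 0, predicted 0
demo_false_pos = 10  # actual 0, predicted 1

# Share of actual positives that the model labels positive
demo_sensitivity = demo_true_pos / (demo_true_pos + demo_false_neg)

# Share of actual negatives that the model labels negative
demo_specificity = demo_true_neg / (demo_true_neg + demo_false_pos)

print(demo_sensitivity, demo_specificity)
```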

In [38]:

specificity = ...
specificity

Out[38]:

Ellipsis

Question 8 Find the positive predictive value, PPV, for this model with a threshold of 0.8.

In [ ]:

ppv = ...
ppv

Question 9 Find the negative predictive value, NPV, for this model with a threshold of 0.8.
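The predictive values condition on the prediction rather than on the actual outcome. Here is a minimal sketch with made-up counts, again assuming the class coded 1 is the positive class (an assumption; the counts are hypothetical).

```python
# Hypothetical counts for illustration only
pv_true_pos = 80   # predicted 1, actual 1
pv_false_pos = 10  # predicted 1, actual 0
pv_true_neg = 30   # predicted 0, actual 0
pv_false_neg = 20  # predicted 0, actual 1

# Share of positive predictions that are correct
demo_ppv = pv_true_pos / (pv_true_pos + pv_false_pos)

# Share of negative predictions that are correct
demo_npv = pv_true_neg / (pv_true_neg + pv_false_neg)

print(demo_ppv, demo_npv)
```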

In [ ]:

npv = ...
npv

Question 10 Which of these four individuals does this model predict has the highest probability of surviving sepsis?

a) A 20 year old female
b) A 20 year old male
c) A 40 year old female
d) A 40 year old male

In the cell below, replace the ellipsis (...) with a, b, c, or d. Leave no spaces between your letter choice and the quotation marks.
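A fitted logistic model turns a linear combination of the predictors into a probability through the logistic function. The sketch below shows that calculation with made-up numbers; the helper predicted_probability and every coefficient in it are hypothetical and are not taken from logistic_model_2.

```python
import math

def predicted_probability(intercept, coefficients, values):
    """Apply the logistic function to intercept + sum(coefficient * value)."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical model with two predictors
demo_intercept = 2.0
demo_coefficients = (-0.05, 0.4)

# Hypothetical individual: first predictor 30, second predictor 1
demo_probability = predicted_probability(demo_intercept, demo_coefficients, (30, 1))
print(demo_probability)
```

Because the logistic function is increasing, comparing individuals by their log-odds gives the same ranking as comparing them by predicted probability.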

In [ ]:

best_chance = "..."

Summary

In [18]:

print(f"The coefficients of logistic_model_1 are {logistic_model_1._results.params}")
print(f"The coefficients of logistic_model_2 are {logistic_model_2._results.params}")
print(f"My choice for age_risk was {age_risk}")
print(f"My choice for sex_risk was {sex_risk}")
print("My second model's confusion matrix was:")
print(confusion_matrix)
print(f"This model has a sensitivity of {sensitivity}, a specificity of {specificity}, a PPV of {ppv} and an NPV of {npv}")
print(f"My choice for best_chance was {best_chance}")

The coefficients of logistic_model_1 are [ 5.63538281 -0.04438333  0.1779475  -0.0256846 ]
The coefficients of logistic_model_2 are [ 5.59674657 -0.0443472   0.17996867]