Through modern statistical methods, we can determine survival risk based on a variety of factors. In this tutorial, we will walk through a small example of something you could do with our data to understand what factors relate with survival in various different types of cancer. In this use case, we will be looking at Ovarian Cancer.
import pandas as pd
import cptac
import cptac.utils as ut
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import lifelines
from lifelines import KaplanMeierFitter
from lifelines import CoxPHFitter
from lifelines.statistics import proportional_hazard_test
%matplotlib inline
ov = cptac.Ovarian()
clinical = ov.get_clinical()
proteomics = ov.get_proteomics()
follow_up = ov.get_followup()
The Ovarian cancer dataset contains months of follow-up data, including whether a patient is still alive (Vital Status) at each follow-up period. We will first merge the clinical and follow-up tables together for analysis. Then we will choose a few attributes to focus on, and narrow our dataframe to those attributes. While you could study a wide variety of factors related to survival, we will be focusing on tumor stage, grade and a proteins of interest listed below in omics_genes.
First we will join the clinical and proteomics dataframes to contain protein data for proteins of interest, and clinical data for each patient in the same dataframe. We will then drop the multiindex of that dataframe to allow us to join it with the follow_up dataframe.
cols = list(clinical.columns)
omics_genes = ['RAC2', 'PODXL']
clinical_and_protein = ov.join_metadata_to_omics(metadata_df_name="clinical",
omics_df_name="proteomics",
metadata_cols=cols,
omics_genes=omics_genes,
quiet=True)
clinical_and_protein = ut.reduce_multiindex(clinical_and_protein,
levels_to_drop="Database_ID")
Next, we will rename the foreign key ("PPID" -> "Patient_ID") on the follow_up table to allow us to easily join that data with the dataframe of clinical and protein data
follow_up = follow_up.rename({'PPID' : 'Patient_ID'}, axis='columns')
clin_prot_follow = pd.merge(clinical_and_protein, follow_up, on = "Patient_ID")
#Determine columns to focus on, and create a subset to work with
columns_to_focus_on = ['Vital_Status',
'Days_Between_Collection_And_Last_Contact',
'Days_Between_Collection_And_Death',
'Tumor_Stage_Ovary_FIGO']
#This adds the protein data that we got from the clinical and proteomics join
#so that it will be present in our subset of data to work with
for i in range(len(omics_genes)):
omics_genes[i] += '_proteomics'
columns_to_focus_on.append(omics_genes[i])
focus_group = clin_prot_follow[columns_to_focus_on].copy().drop_duplicates()
Kaplan Meier plots show us the probability of some event occuring over a given length of time, based on some attribute(s). Oftentimes, they are used to plot the probability of death for clinical attributes, however they could also be used in a variety of other contexts.
We will start by finding all patients that have died during the follow-up period and update the column to contain boolean values, where True denotes that the event occurred ('Deceased'), and False denotes that it did not ('Living'). We will then combine the two columns containing timeframe data ('Days_Between_Collection_And_Last_Contact', and 'Days_Between_Collection_And_Death'), to help us with plotting survival curves. These steps are necessary to fit the requirements of the lifelines package.
#Make the Vital Status column boolean
focus_group['Vital_Status'] = focus_group['Vital_Status'].replace('Living', False)
focus_group['Vital_Status'] = focus_group['Vital_Status'].replace('Deceased', True)
focus_group['Vital_Status'] = focus_group['Vital_Status'].astype('bool')
cols = ['Days_Between_Collection_And_Last_Contact', 'Days_Between_Collection_And_Death']
focus_group = focus_group.assign(Days_Until_Last_Contact_Or_Death=focus_group[cols].sum(1)).drop(cols, 1)
<ipython-input-6-e6e65af9fe6e>:3: FutureWarning: In a future version of pandas all arguments of DataFrame.drop except for the argument 'labels' will be keyword-only focus_group = focus_group.assign(Days_Until_Last_Contact_Or_Death=focus_group[cols].sum(1)).drop(cols, 1)
Next, we will create a general Kaplan Meier Plot of overall survival for our cohort, using the KaplanMeierFitter() from the lifelines package.
time = focus_group['Days_Until_Last_Contact_Or_Death']
status = focus_group['Vital_Status']
kmf = KaplanMeierFitter()
kmf.fit(time, event_observed = status)
kmf.plot()
<AxesSubplot:xlabel='timeline'>
We will now group our columns of interest into 3-4 distinct categories each, and assign them numeric values. It is necessary for the requirements of the lifelines package that the categories are assigned numeric values (other data types, including category, are not compatible with the functions we will be using).
df_genes = focus_group.copy()
#Here, we are separating the protein abundance values for each of our proteins
#of interest into 3 groups, based on relative abundance of the protein
for col in omics_genes:
lower_25_filter = df_genes[col] <= df_genes[col].quantile(.25)
upper_25_filter = df_genes[col] >= df_genes[col].quantile(.75)
df_genes[col] = np.where(lower_25_filter, "Lower_25%", df_genes[col])
df_genes[col] = np.where(upper_25_filter, "Upper_25%", df_genes[col])
df_genes[col] = np.where(~lower_25_filter & ~upper_25_filter, "Middle_50%", df_genes[col])
#Here, we map numeric values to correspond with our 3 protein categories
proteomics_map = {"Lower_25%" : 1, "Middle_50%" : 2, "Upper_25%" : 3}
for gene in omics_genes:
df_genes[gene] = df_genes[gene].map(proteomics_map)
#Here we map numeric values to corresponding tumor stages
figo_map = {"IIIC" : 3, "IV" : 4, "IIIB" : 3,
"Not Reported/ Unknown" : np.nan,
"IIIA" : 3, "III" : 3, "IC" : 1, "IIB" : 2}
df_genes['Tumor_Stage_Ovary_FIGO'] = df_genes['Tumor_Stage_Ovary_FIGO'].map(figo_map)
#Then we will drop missing values, as missing values
# will throw an error in the lifelines functions
df_clean = df_genes.dropna(axis=0, how='any').copy()
Verify that your columns are the correct data types. They may appear to be correct up front, but could actually be hidden as slightly different data types. The event of interest, in this case Vital_Status needs to contain boolean values, and all other columns in the table must be of a numeric type (either int64 or float64).
for col in df_clean.columns:
print(col, ":", df_clean[col].dtype)
df_clean.head()
With the CoxPHFitter from the lifelines package we can create covariate survival plots, as shown below. The variables we are interested in exploring are Tumor Stage, RAC2 abundance, and PODXL abundance.
First we will fit our model to the data we have prepared using the CoxPHFitter() class from the lifelines module.
cph = CoxPHFitter()
cph.fit(df_clean, duration_col = "Days_Until_Last_Contact_Or_Death",
event_col = "Vital_Status")
Then we will plot each of the attributes to see how different levels of protein or different tumor stages affect survival outcomes in Ovarian Cancer patients.
attributes = ['Tumor_Stage_Ovary_FIGO', 'PODXL_proteomics', 'RAC2_proteomics']
for attribute in attributes:
plot_title = "Ovarian Cancer Survival Risk: " + attribute
cph.plot_partial_effects_on_outcome(attribute, [1,2,3], cmap='coolwarm',
title=plot_title)
These different analyses tend to follow the baseline survival function, however, there are some differences in varying levels of each attribute. For example, FIGO Stage I tumors tend to have a higher survival rate over time comparatively to Stage III tumors. We can explore these differences with the CoxPHFitter object's print_summary function (which prints out results for multivariate linear regression).
cph.print_summary(model="untransformed variables", decimals=3)
model | lifelines.CoxPHFitter |
---|---|
duration col | 'Days_Until_Last_Contact_Or_Death' |
event col | 'Vital_Status' |
baseline estimation | breslow |
number of observations | 95 |
number of events observed | 7 |
partial log-likelihood | -20.448 |
time fit was run | 2021-09-27 17:02:26 UTC |
model | untransformed variables |
coef | exp(coef) | se(coef) | coef lower 95% | coef upper 95% | exp(coef) lower 95% | exp(coef) upper 95% | z | p | -log2(p) | |
---|---|---|---|---|---|---|---|---|---|---|
Tumor_Stage_Ovary_FIGO | 0.477 | 1.611 | 0.809 | -1.108 | 2.062 | 0.330 | 7.861 | 0.590 | 0.555 | 0.848 |
RAC2_proteomics | -0.944 | 0.389 | 0.715 | -2.346 | 0.457 | 0.096 | 1.580 | -1.321 | 0.187 | 2.422 |
PODXL_proteomics | 0.140 | 1.150 | 0.497 | -0.833 | 1.113 | 0.435 | 3.043 | 0.281 | 0.778 | 0.361 |
Concordance | 0.730 |
---|---|
Partial AIC | 46.896 |
log-likelihood ratio test | 2.143 on 3 df |
-log2(p) of ll-ratio test | 0.881 |
With the proportional_hazard_test function, we can now perform Cox's Proportional Hazard Test on the data to determine how each attribute contributes to our cohort's overall survival. This is shown by the hazard ratio in the column labeled -log2(p) below. In general, a hazard ratio of 1 suggests that an attribute has no effect on overall survival. A ratio less than 1 suggests that an attribute contributes to lower survival risk. A ratio greater than 1 suggests that an attribute contributes to higher survival risk.
results = proportional_hazard_test(cph, df_clean, time_transform='rank')
results.print_summary(decimals=3, model="untransformed variables")
time_transform | rank |
---|---|
null_distribution | chi squared |
degrees_of_freedom | 1 |
model | <lifelines.CoxPHFitter: fitted with 95 total o... |
test_name | proportional_hazard_test |
test_statistic | p | -log2(p) | |
---|---|---|---|
PODXL_proteomics | 1.94 | 0.16 | 2.61 |
RAC2_proteomics | 4.50 | 0.03 | 4.88 |
Tumor_Stage_Ovary_FIGO | 0.34 | 0.56 | 0.84 |
Below, we show confidence intervals for each of the hazard ratios. Since both bars include the log(HR) of 1.0 and both of their p-values were greater than 0.05, there is insufficient evidence to suggest that a specific Histologic Grade or Tumor Stage is connected with negative clinical outcomes of death or the development of a new tumor in our cohort of Ovarian cancer tumors.
cph.plot()
plt.tight_layout()
It is important to note that there are relatively few patients who died in our cohort (7 out of 88), which is good, but with such a small sample size of death events, it is difficult to conclude with certainty that these features are not more or less connected with survival. Perhaps a sample of patients with more deaths might have different results. Alternatively, studying an event with more negative outcomes (such as tumor recurrence) may also provide more data to work with.
df_clean['Vital_Status'].value_counts()
False 88 True 7 Name: Vital_Status, dtype: int64
It is also important to note that the confidence intervals for these ratios are very large, especially since hazard ratios are standardly shown on a log-scale.
cph.confidence_intervals_
95% lower-bound | 95% upper-bound | |
---|---|---|
covariate | ||
Tumor_Stage_Ovary_FIGO | -1.108103 | 2.061942 |
RAC2_proteomics | -2.345647 | 0.457209 |
PODXL_proteomics | -0.833496 | 1.112925 |
This is just one example of how you might use Survival Analysis to learn more about different types of cancer, and how clinical and/or genetic attributes contribute to likelihood of survival. There are many other clinical and genetic attributes, as well as several other cancer types, that can be explored using a similar process to that above. In particular, lung cancer and ovarian cancer have a larger number of negative outcomes per cohort, and would be good to look into further.