This use case will guide you through accessing and manipulating data from the clinical dataframe. We will explore how to retrieve clinical data, select variables for analysis, and represent interactions between those variables in a meaningful way. We will use endometrial cancer data from CPTAC to compare FIGO stage--a common measurement of uterine cancer stage (see the NIH for more information)--with BMI to see if there is any correlation.
We will start by importing the standard data analysis libraries, plus the seaborn library for visualization.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Our cancer data and a variety of accessory functions are provided in the cptac package.
import cptac
cptac.download(dataset="endometrial", version="latest")
en = cptac.Endometrial()
We load the clinical dataframe by calling the en.get_clinical() method. We can then select the variables we are interested in from that dataframe. To see which clinical attributes are available for analysis, we print the column names of the clinical dataframe.
clinical_data = en.get_clinical()
print(clinical_data.columns)
Index(['Sample_ID', 'Sample_Tumor_Normal', 'Proteomics_Tumor_Normal', 'Country', 'Histologic_Grade_FIGO', 'Myometrial_invasion_Specify', 'Histologic_type', 'Treatment_naive', 'Tumor_purity', 'Path_Stage_Primary_Tumor-pT', 'Path_Stage_Reg_Lymph_Nodes-pN', 'Clin_Stage_Dist_Mets-cM', 'Path_Stage_Dist_Mets-pM', 'tumor_Stage-Pathological', 'FIGO_stage', 'LVSI', 'BMI', 'Age', 'Diabetes', 'Race', 'Ethnicity', 'Gender', 'Tumor_Site', 'Tumor_Site_Other', 'Tumor_Focality', 'Tumor_Size_cm', 'Num_full_term_pregnancies'], dtype='object', name='Name')
You can get definitions for terms in the dataset by calling the en.define method. If you need more information, searching for the term online can be helpful. The search function (written as cptac.utils.search in the commented-out line below) will open a webpage with the search results for the specified term.
en.define(term='FIGO_stage')
en.define(term='BMI')
# cptac.utils.search('FIGO_stage') will open a browser search for FIGO_stage
Categorical classification of FIGO state (I being less severe, IV being more severe)
BMI (body mass index) of specified patient.
For this use case, we are only interested in FIGO stage and BMI.
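As a minimal sketch of that selection step, the snippet below picks the two columns out of a small stand-in dataframe. The dataframe name `toy_clinical`, the sample identifiers, and all values are hypothetical and exist only so the example runs without downloading the dataset; with the real data the same indexing would be applied to `clinical_data`.

```python
import pandas as pd

# Hypothetical stand-in for the clinical dataframe; all values are made up.
toy_clinical = pd.DataFrame(
    {
        "FIGO_stage": ["IA", "IB", "II", "IIIC1", None],
        "BMI": [38.9, 39.8, 27.2, 31.0, None],
        "Age": [64, 58, 75, 63, 70],
    },
    index=pd.Index(["S001", "S002", "S003", "S004", "S005"], name="Patient_ID"),
)

# Keep only the two columns of interest and drop rows missing either value.
figo_bmi = toy_clinical[["FIGO_stage", "BMI"]].dropna()
print(figo_bmi)
```

Dropping rows with missing values is an optional choice made here for illustration; the plotting calls later in this use case are applied to the clinical dataframe directly.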
For the purposes of this example, we want to simplify the FIGO stages to remove the subcategories. The following section of code creates a function to sort stages into simple categories and adds a new column to the dataframe with the simplified FIGO stage.
def simplify_FIGO(row):
    if row['FIGO_stage'] == 'IA':
        return 'I'
    elif row['FIGO_stage'] == 'IB':
        return 'I'
    elif row['FIGO_stage'] == 'II':
        return 'II'
    elif row['FIGO_stage'] == 'IIIA':
        return 'III'
    elif row['FIGO_stage'] == 'IIIB':
        return 'III'
    elif row['FIGO_stage'] == 'IIIC1':
        return 'III'
    elif row['FIGO_stage'] == 'IIIC2':
        return 'III'
    elif row['FIGO_stage'] == 'IVB':
        return 'IV'
    else:
        return row['FIGO_stage']
# axis=1 applies the function to each row (axis=0 would apply it to each column)
clinical_data['simple_FIGO'] = clinical_data.apply(simplify_FIGO, axis=1)
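The same simplification can also be written without a row-wise function. The sketch below is one possible alternative, not the method used in this use case: it extracts the leading Roman numeral with a regular expression and leaves any value that does not match unchanged, mirroring the final `else` branch. The sample series `stages` is hypothetical.

```python
import pandas as pd

# Hypothetical sample of FIGO stage labels, including a missing value.
stages = pd.Series(["IA", "IB", "II", "IIIA", "IIIC2", "IVB", None], name="FIGO_stage")

# Capture the leading Roman numeral and discard the trailing subcategory.
# The alternatives are listed so that longer numerals are tried before "I".
numeral = stages.str.extract(r"^(IV|III|II|I)", expand=False)

# Values that did not match (or were missing) fall back to the original entry.
simple = numeral.fillna(stages)
print(simple.tolist())
```

With the real data, the equivalent call would operate on `clinical_data['FIGO_stage']`. The vectorized form avoids listing every subcategory by hand, at the cost of relying on the labels following the Roman-numeral-plus-suffix pattern.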
FIGO stage is categorical and BMI is continuous, so we can use a kernel density estimate (KDE) plot, a smoothed analogue of a histogram, to see the distribution of patients' BMI separated by cancer stage.
sns.displot(clinical_data, x="BMI", hue="simple_FIGO", kind="kde", common_norm=False, bw_adjust=.7)
plt.show()
Another way of viewing the same data is a boxplot. In this example, we overlay a strip plot on the boxplot so that each individual data point is visible.
figoStrip = sns.boxplot(data=clinical_data, x="simple_FIGO", y="BMI", showfliers=False)
figoStrip = sns.stripplot(data=clinical_data, x="simple_FIGO", y="BMI", jitter=True, color=".3")
figoStrip.set(xlabel = "FIGO stage")
plt.show()
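Plots like these can be complemented with a short numeric summary per group. The sketch below is an optional addition rather than part of the original workflow: it groups a small hypothetical dataframe (`toy`, with made-up values) by simplified stage and reports the count, median, and mean BMI.

```python
import pandas as pd

# Hypothetical data with the same column names used in the plots above.
toy = pd.DataFrame(
    {
        "simple_FIGO": ["I", "I", "I", "II", "III", "III"],
        "BMI": [30.0, 40.0, 35.0, 28.0, 33.0, 37.0],
    }
)

# One row per stage, with the number of patients and two measures of center.
summary = toy.groupby("simple_FIGO")["BMI"].agg(["count", "median", "mean"])
print(summary)
```

On the real data, the same `groupby` call could be applied to `clinical_data`. The counts are worth checking first, since stages with very few patients produce unreliable density curves and boxplots.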
The same technique will work for visualizing any other variables in this dataset. As another example, we will create a similar density plot and boxplot to compare the BMI of research subjects with their diabetes diagnosis. Once again, we select the variables of interest from the clinical dataframe, then plot the resulting dataframe using the seaborn package.
diabetes_bmi = clinical_data[['Diabetes', 'BMI']]
# generate density plot (KDE), excluding patients with unknown diabetes status
sns.displot(diabetes_bmi[diabetes_bmi.Diabetes != "Unknown"], x="BMI", hue="Diabetes", kind="kde", common_norm=False)
plt.show()
#generate boxplot
diabetes_plot = sns.boxplot(data=diabetes_bmi, x="Diabetes", y="BMI", showfliers=False)
diabetes_plot = sns.stripplot(data=diabetes_bmi, x="Diabetes", y="BMI", jitter=True, color=".3")
plt.show()
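Note that the density plot above excludes rows where Diabetes is "Unknown", while the boxplot is drawn from the unfiltered dataframe. If the two plots should describe the same patients, one option is to filter once and reuse the result. The sketch below shows this on a hypothetical dataframe; the "Yes" and "No" labels and all BMI values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the diabetes_bmi dataframe; values are made up.
toy_diabetes_bmi = pd.DataFrame(
    {
        "Diabetes": ["Yes", "No", "Unknown", "No", "Yes"],
        "BMI": [41.2, 29.5, 33.0, 26.8, 36.4],
    }
)

# Filter once so that every later plot is drawn from the same set of rows.
known = toy_diabetes_bmi[toy_diabetes_bmi["Diabetes"] != "Unknown"]
print(known["Diabetes"].value_counts())
```

The filtered dataframe could then be passed as the `data` argument to both the density plot and the boxplot calls.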