For this use case, we will be looking at the derived molecular data contained in the Endometrial dataset, and comparing it with protein data. Derived molecular data means that we created new variables based on molecular data. One example of this is the activity of a pathway based on the abundance of phosphorylation sites. A second example is inferred cell type percentages from algorithms like CIBERSORT, which are based on comparing transcriptomics data to known profiles of pure cell types.
We will start by importing the Python packages we will need, including the cptac data package. We will then load the Endometrial dataset, which includes the endometrial patient data as well as accessory functions that we will use to analyze the data.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import cptac
en = cptac.Ucec()
For this use case, we will be using two dataframes contained in the Endometrial dataset: derived_molecular and proteomics. We will load the derived_molecular dataframe and examine the data contained within it.
der_molecular = en.get_derived_molecular('awg')
The derived molecular dataframe contains many different attributes that we can choose from for analysis. To view a list of these attributes, we can print out the column names of the dataframe. Here we print only the first 10 column names. To view the full list of column names without truncation, omit the slice ([:10]) at the end of the call. If your terminal is still abbreviating the list, first use the command pd.set_option('display.max_seq_items', None).
der_molecular.columns.tolist()[:10]
['Estrogen_Receptor', 'Estrogen_Receptor_%', 'Progesterone_Receptor', 'Progesterone_Receptor_%', 'MLH1', 'MLH2', 'MSH6', 'PMS2', 'p53', 'Other_IHC_specify']
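With this many attributes, it can be easier to search the column names than to read the whole list. The following is a minimal sketch using a toy dataframe in place of der_molecular; the column name MSI_score_hypothetical is made up for illustration and is not claimed to exist in the real dataset.

```python
import pandas as pd

# Toy stand-in for der_molecular. Only the column names matter here;
# 'MSI_score_hypothetical' is an invented name used for illustration.
toy_molecular = pd.DataFrame(
    columns=['Estrogen_Receptor', 'MSI_status', 'p53', 'MSI_score_hypothetical']
)

# Keep only the column names that contain the substring "MSI".
msi_columns = [col for col in toy_molecular.columns if 'MSI' in col]
print(msi_columns)
```

The same list comprehension should work on der_molecular.columns once the real dataframe is loaded.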
For this use case, we will compare MSI status with the JAK1 protein abundance. MSI stands for microsatellite instability. The possible values for MSI status are MSI-H (high microsatellite instability) and MSS (microsatellite stable). In this context, "nan" refers to non-tumor samples. To see all of the possible values in any column, you can use the pandas function .unique():
der_molecular['MSI_status'].unique()
array(['MSI-H', 'MSS', nan], dtype=object)
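Besides listing the possible values, it is often useful to know how many samples fall into each category, including the missing ones. The sketch below uses a small made-up series rather than the real MSI_status column, so the counts are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up MSI status values standing in for der_molecular['MSI_status'].
toy_status = pd.Series(
    ['MSI-H', 'MSS', 'MSS', np.nan, 'MSI-H', 'MSS', np.nan],
    name='MSI_status',
)

# dropna=False keeps NaN as its own category instead of silently dropping it.
status_counts = toy_status.value_counts(dropna=False)
print(status_counts)
```

On the real data, the equivalent call would be der_molecular['MSI_status'].value_counts(dropna=False).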
We will use the en.join_metadata_to_omics function to join our desired molecular trait with the proteomics data.
joined_data = en.join_metadata_to_omics(metadata_name="derived_molecular",
                                        metadata_source="awg",
                                        metadata_cols='MSI_status',
                                        omics_name="proteomics",
                                        omics_source="awg")
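To build intuition for what the joined table looks like, here is a rough analogy in plain pandas. It assumes both tables are indexed by sample ID and that omics columns are suffixed with their source and data type (as in JAK1_awg_proteomics); the sample IDs and values are invented, and the actual cptac implementation may differ.

```python
import numpy as np
import pandas as pd

# Invented sample IDs and values, for illustration only.
samples = ['S001', 'S002', 'S003']
toy_metadata = pd.DataFrame({'MSI_status': ['MSI-H', 'MSS', np.nan]}, index=samples)
toy_proteomics = pd.DataFrame(
    {'JAK1': [-0.8, 0.3, 0.5], 'TP53': [0.1, -0.2, 0.4]},
    index=samples,
)

# Suffix the omics columns so their origin stays visible, then align on the index.
toy_joined = toy_metadata[['MSI_status']].join(
    toy_proteomics.add_suffix('_awg_proteomics')
)
print(toy_joined)
```

Each row is one sample, with the selected metadata column first and one column per protein after it.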
Now we will use the seaborn and matplotlib libraries to create a boxplot and a density plot (a smoothed histogram) that will allow us to visualize this data. For more information on using seaborn, see this Seaborn tutorial.
msi_boxplot = sns.boxplot(x='MSI_status', y='JAK1_awg_proteomics', data=joined_data, showfliers=False,
                          order=['MSS', 'MSI-H'])
msi_boxplot = sns.stripplot(x='MSI_status', y='JAK1_awg_proteomics', data=joined_data, color='.3',
                            order=['MSS', 'MSI-H'])
plt.show()
msi_histogram = sns.FacetGrid(joined_data[['MSI_status', 'JAK1_awg_proteomics']], hue="MSI_status",
                              legend_out=False, aspect=3)
msi_histogram = msi_histogram.map(sns.kdeplot, "JAK1_awg_proteomics").add_legend(title="MSI_status")
msi_histogram.set(ylabel='Proportion')
plt.show()
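The plots give a visual impression of the difference between groups; a per-group numeric summary can complement them. The sketch below uses invented values in a toy dataframe with the same two column names as joined_data, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Invented values standing in for joined_data; not real measurements.
toy_joined = pd.DataFrame({
    'MSI_status': ['MSS', 'MSS', 'MSS', 'MSI-H', 'MSI-H', np.nan],
    'JAK1_awg_proteomics': [0.4, 0.2, 0.6, -0.9, -0.5, 0.3],
})

# Drop samples without an MSI status, then summarize JAK1 abundance per group.
summary = (
    toy_joined.dropna(subset=['MSI_status'])
    .groupby('MSI_status')['JAK1_awg_proteomics']
    .agg(['count', 'median'])
)
print(summary)
```

Replacing toy_joined with joined_data should give the sample count and median JAK1 abundance for the MSS and MSI-H groups.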