When looking at data, we often want to identify outliers, extremely high or low data points. In this use case we will show you how to use the Blacksheep package to find these in the CPTAC data. For more detailed information about the Blacksheep package see this repository.
In the CPTAC breast cancer study (here) it was shown that tumors classified as HER-2 enriched are frequently outliers for high abundance of ERBB2 phosphorylation, protein and mRNA (see figure 4 of the manuscript). In this use case we will show that same phenomena in an independent cohort of breast cancer tumors, whose data are included in the cptac package.
Before we begin performing the analysis, we must import the packages we will be using. In this first code block, we import the standard set of data science packages.
We will need an external package called blacksheep. To install it run the following on your command line:
pip install blksheep
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In this next code block we import the blacksheep and cptac packages and grab our proteomic and clinical data.
import blacksheep
import cptac
brca = cptac.Brca()
clinical = brca.get_clinical()
proteomics = brca.get_proteomics()
The Blacksheep package requires that annotations are a binary variable. Our cptac tumors are divided into 4 subtypes: LumA, LumB, Basal, and Her2. We will use the binarize_annotations function to create a binary table of these PAM50 tumor classifications. We will call this table 'annotations'.
annotations = clinical[['PAM50']].copy()
annotations = blacksheep.binarize_annotations(annotations)
annotations.head()
PAM50_LumA | PAM50_Basal | PAM50_LumB | PAM50_Her2 | PAM50_Normal-like | |
---|---|---|---|---|---|
Patient_ID | |||||
CPT000814 | not-LumA | Basal | not-LumB | not-Her2 | not-Normal-like |
CPT001846 | not-LumA | Basal | not-LumB | not-Her2 | not-Normal-like |
X01BR001 | not-LumA | Basal | not-LumB | not-Her2 | not-Normal-like |
X01BR008 | not-LumA | Basal | not-LumB | not-Her2 | not-Normal-like |
X01BR009 | not-LumA | Basal | not-LumB | not-Her2 | not-Normal-like |
Now that our dataframes are correctly formatted, we will start looking for outliers.
We will start by using the deva function found in the blacksheep package. This function takes the proteomics data frame (which we transpose to fit the requirements of the function), and the annotations data frame that includes the binarized columns. We also indicate that we want to look for up regulated genes, and that we do not want to aggregate the data. The function returns two things:
outliers, qvalues = blacksheep.deva(proteomics.transpose(),
annotations,
up_or_down='up',
aggregate=False,
frac_filter=0.3)
07/29/2021 16:11:25:WARNING:No rows tested for fisherFDR_PAM50_LumA_not-LumA 07/29/2021 16:11:25:WARNING:No rows tested for fisherFDR_PAM50_LumA_LumA 07/29/2021 16:11:25:WARNING:No rows tested for fisherFDR_PAM50_Basal_not-Basal 07/29/2021 16:11:26:WARNING:No rows tested for fisherFDR_PAM50_LumB_not-LumB 07/29/2021 16:11:26:WARNING:No rows tested for fisherFDR_PAM50_Her2_not-Her2 07/29/2021 16:11:28:WARNING:No rows tested for fisherFDR_PAM50_Normal-like_not-Normal-like
Because these two tables that are returned are quite complex, we will now look at each of these individually.
The outliers table indicates whether each sample is an outlier for a particular gene. In this use case, we will focus on ERBB2. The first line below simplifies the index for each row by dropping the database id and leaving the gene name. We also only print off a portion of the table for brevity.
outliers.df.index = outliers.df.index.droplevel('Database_ID')
erbb2_outliers = outliers.df[outliers.df.index.str.match('ERBB2')]
erbb2_outliers.iloc[:, :8]
CPT000814_outliers | CPT001846_outliers | X01BR001_outliers | X01BR008_outliers | X01BR009_outliers | X01BR010_outliers | X01BR015_outliers | X01BR017_outliers | |
---|---|---|---|---|---|---|---|---|
Name | ||||||||
ERBB2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
In the chart above you can see that most of the samples have 0, indiciating that the sample is not an outlier for ERBB2 protein abundance. X01BR017, however, has a 1, indicating that particular sample is an outlier.
The Outliers table contains boolean columns for both outlier and notOutliers. The notOutliers columns are redundant so we will remove them.
erbb2_outliers = erbb2_outliers.loc[:,~erbb2_outliers.columns.str.endswith('_notOutliers')]
We can now complile a list of all the samples that were considered to be outliers.
outlier_list = erbb2_outliers.columns[erbb2_outliers.isin([1.0]).all()].tolist()
print(outlier_list)
['X01BR017_outliers', 'X05BR026_outliers', 'X09BR004_outliers', 'X09BR005_outliers', 'X11BR004_outliers', 'X11BR010_outliers', 'X11BR011_outliers', 'X11BR028_outliers', 'X11BR030_outliers', 'X11BR038_outliers', 'X11BR060_outliers', 'X18BR009_outliers', 'X21BR001_outliers', 'X22BR005_outliers']
To understand what this means, we will plot the proteomics data for the ERBB2 gene and label the outlier samples. Before we graph the result we will join the proteomics and clinical data, isolating the PAM50 subtype and ERBB2.
combined_data = brca.join_metadata_to_omics(metadata_df_name="clinical",
omics_df_name="proteomics",
metadata_cols=["PAM50"],
omics_genes=['ERBB2'])
combined_data.columns = combined_data.columns.droplevel("Database_ID")
We will now create the graph.
plt.figure(figsize=(8, 8))
sns.set_palette('colorblind')
ax = sns.boxplot(data=combined_data, showfliers=False, y='ERBB2_proteomics', color='lightgray')
left = False
# This for loop labels all the specific outlier data points.
for sample in outlier_list:
if left:
position = -0.08
left = False
else:
position = 0.01
left = True
sample = sample.split("_")[0]
ax.annotate(sample, (position, combined_data.transpose()[sample].values[1]))
ax = sns.swarmplot(data=combined_data, y='ERBB2_proteomics')
plt.show()
As you can see from this graph, the samples we labeled, which had a 1.0 in the outliers table were all located at the top of the graph, indicating they are very highly expressed.
Let's now take a look at the Qvalues table. Remember that the qvalues table indicates the probability that a gene shows an enrichment in outliers for categories defined in our annotation dataframe.
qvalues.df.head()
fisherFDR_PAM50_Basal_Basal | fisherFDR_PAM50_LumB_LumB | fisherFDR_PAM50_Her2_Her2 | fisherFDR_PAM50_Normal-like_Normal-like | ||
---|---|---|---|---|---|
Name | Database_ID | ||||
A2ML1 | NP_653271.2|NP_001269353.1 | 1.441146e-07 | NaN | NaN | NaN |
ABCC11 | NP_149163.2|NP_660187.1|NP_150229.2 | NaN | NaN | 0.001545 | NaN |
ABCC5 | NP_005679.2|NP_001306961.1|NP_001018881.1 | NaN | NaN | NaN | 0.093281 |
ACACB | NP_001084.3 | NaN | NaN | NaN | 0.068994 |
ACAD8 | NP_055199.1 | NaN | NaN | NaN | 0.093281 |
This table includes all the q-values. Before really analyzing the table we will want to remove any insignificant q-values. For our purposes we will remove any q-values that are greater than 0.05.
for col in qvalues.df.columns:
qvalues.df.loc[qvalues.df[col] > 0.05, col] = np.nan
We will now isolate the ERBB2 gene.
qvalues.df.index = qvalues.df.index.droplevel('Database_ID')
qvalues = qvalues.df[qvalues.df.index.str.match('ERBB2')]
erbb2_qvalues = qvalues.reset_index()['Name'] == 'ERBB2'
qvalues = qvalues.reset_index()[erbb2_qvalues]
qvalues.head()
Name | fisherFDR_PAM50_Basal_Basal | fisherFDR_PAM50_LumB_LumB | fisherFDR_PAM50_Her2_Her2 | fisherFDR_PAM50_Normal-like_Normal-like | |
---|---|---|---|---|---|
0 | ERBB2 | NaN | NaN | 0.000366 | NaN |
Here we see that the only PAM50 subtype that has a significant enrichment is the Her2, which is exactly what is to be expected. To visualize this pattern, we will create a graph similiar to the one above, but with each of the categories in the PAM50 category differentially colored.
plt.figure(figsize=(8, 8))
sns.set_palette('colorblind')
cols = {'Basal': 0, 'Her2':1, 'LumA':2, 'LumB':3, 'Normal-like':4}
ax = sns.boxplot(data=combined_data, y='ERBB2_proteomics', x='PAM50', color='lightgray')
ax = sns.swarmplot(data=combined_data, y='ERBB2_proteomics',x='PAM50', hue='PAM50')
for sample in outlier_list:
sample = sample.split("_")[0]
ax.annotate(sample, (cols[combined_data.transpose()[sample].values[0]], combined_data.transpose()[sample].values[1]))
plt.show()
Looking at the distribution of the graph you can see that distribution of the Her2 category is much different than the distributions of the other catgeories. The median of the proteomic data in the Her2 category is much higher than other categories, with many more data points in the upper portion of the graph.
We have just walked through one example of how you might use the Outlier Analysis. Using this same approach, you can run the outlier analysis on a number of different clinical attributes, cohorts, and omics data. For example, you may look for outliers within the transcriptomics of the Endometrial cancer type using the clinical attribute of Histological_type. You can also look at more than one clinical attribute at a time by appending more attributes to your annotations table, or you can look for downregulated omics by chaning the 'up_or_down' variable of the run_outliers function.