To compare multiple conditions and perform operations on them, we create a 'median reference' condition and compare every condition in the dataset against it. This allows comparative analyses such as heatmaps or general significance analyses (similar to ANOVA analyses on multiple conditions).
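To make the idea concrete, here is a minimal sketch of a median reference on toy data. It is a conceptual illustration only, not AlphaQuant's implementation: the toy values, the names `toy_log2` and `median_reference`, and the use of a simple per-protein median over log2 intensities are all assumptions.

```python
import pandas as pd

# Toy log2 intensities: rows are proteins, columns are conditions (hypothetical values)
toy_log2 = pd.DataFrame(
    {"brain": [10.0, 8.0], "liver": [12.0, 8.5], "heart": [11.0, 6.0]},
    index=["protA", "protB"],
)

# The median over all conditions acts as an artificial reference condition
median_reference = toy_log2.median(axis=1)

# Each condition is expressed as a log2 fold change relative to that reference
toy_log2fc = toy_log2.sub(median_reference, axis=0)
print(toy_log2fc)
```

Because every condition is compared against the same reference, the resulting columns are directly comparable with each other, which is what makes heatmaps across many conditions possible.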
To run differential analyses, you need two types of files: a results file from a proteomics search engine and a sample mapping file.
For the results file, AlphaQuant is compatible with the default output tables of most common proteomics search engines. Detailed specifications on which tables you need can be found in our README.
The sample mapping file has to look as follows:
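As a rough illustration of the layout, the sketch below writes a tiny sample mapping file and reads it back. It assumes a tab-separated table with a `sample` column (matching the sample names in the results file) and a `condition` column. The sample names, conditions and file name are made up, so please check the README for the authoritative specification.

```python
import os
import tempfile
import pandas as pd

# Hypothetical sample names and conditions, for illustration only
toy_samplemap = pd.DataFrame(
    {
        "sample": ["brain_rep1", "brain_rep2", "liver_rep1", "liver_rep2"],
        "condition": ["brain", "brain", "liver", "liver"],
    }
)

# Write as tab-separated text, the same format that is read with sep='\t' in this notebook
toy_samplemap_path = os.path.join(tempfile.mkdtemp(), "toy_samplemap.tsv")
toy_samplemap.to_csv(toy_samplemap_path, sep="\t", index=False)

toy_samplemap_reloaded = pd.read_csv(toy_samplemap_path, sep="\t")
print(toy_samplemap_reloaded)
```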
INPUT_FILE = "./data/mouse_tissues/example_dataset_mouse_sn.tsv"
SAMPLEMAP_FILE = "./data/mouse_tissues/samplemap_200.tsv"
RESULTS_DIR = "./data/mouse_tissues/results_median_comparison"
import pandas as pd
display(pd.read_csv(INPUT_FILE, sep="\t"))
#displaying the samplemap file
import pandas as pd
display(pd.read_csv(SAMPLEMAP_FILE, sep='\t'))
%reload_ext autoreload
%autoreload 2
import alphaquant.run_pipeline as aqrunner
aqrunner.run_pipeline(input_file=INPUT_FILE, samplemap_file=SAMPLEMAP_FILE, multicond_median_analysis=True, results_dir=RESULTS_DIR)
There are four different main results tables written out to the directory:
1. medianref_protein_alphaquant.tsv -> protein quantities derived with AlphaQuant's cluster approach
2. medianref_protein_avg.tsv -> averaged protein quantities ('classic approach')
3. medianref_proteoforms.tsv -> quantities of all potential proteoforms
4. medianref_peptides.tsv -> peptide quantities, for expert users

In the following we will have a look at tables 1 and 3, which are likely to be the most relevant.
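Before loading the tables, it can be useful to confirm that the run wrote all of them. The following is a small sketch using only the standard library. The helper name `find_missing_tables` is hypothetical and not part of AlphaQuant; the file names and the results path are the ones used in this notebook.

```python
from pathlib import Path

# File names as listed above
EXPECTED_TABLES = [
    "medianref_protein_alphaquant.tsv",
    "medianref_protein_avg.tsv",
    "medianref_proteoforms.tsv",
    "medianref_peptides.tsv",
]


def find_missing_tables(results_dir):
    """Return the expected table names that are not present in results_dir."""
    results_path = Path(results_dir)
    return [name for name in EXPECTED_TABLES if not (results_path / name).is_file()]


# An empty list means that all four tables were found
print(find_missing_tables("./data/mouse_tissues/results_median_comparison"))
```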
import pandas as pd
df_protein = pd.read_csv(RESULTS_DIR + "/medianref_protein_alphaquant.tsv", sep='\t')
df_proteoform = pd.read_csv(RESULTS_DIR + "/medianref_proteoforms.tsv", sep='\t')
display(df_protein)
display(df_proteoform)
The protein dataframe contains the protein quantities relative to the reference as well as a p_value for each protein, which tests the null hypothesis that none of the conditions differs substantially from the median. In this example all proteins are significant, which is not surprising because we compare multiple tissues.
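As a small illustration of how the p_value column could be used, the sketch below counts the proteins below a significance threshold in a toy table. The toy values and the 0.05 cutoff are assumptions; only the `protein` and `p_value` column names are taken from the code in this notebook. No multiple testing correction is applied in this sketch.

```python
import pandas as pd

# Toy table in the same shape as the protein dataframe (hypothetical values)
toy_protein_df = pd.DataFrame(
    {
        "protein": ["protA", "protB", "protC"],
        "p_value": [0.001, 0.2, 0.04],
        "brain": [-1.0, 0.1, 0.8],
        "liver": [1.0, -0.1, -0.8],
    }
)

P_VALUE_CUTOFF = 0.05  # assumed threshold, adjust as needed

# Keep only the proteins whose p_value is below the cutoff
toy_significant = toy_protein_df[toy_protein_df["p_value"] < P_VALUE_CUTOFF]
print(f"{len(toy_significant)} of {len(toy_protein_df)} proteins below the cutoff")
```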
The proteoform dataframe contains more information than just protein quantities. A given protein can have multiple proteoform ids. A novel proteoform is defined when, for a given protein, one or more peptides show a regulatory profile that differs significantly from that of the other peptides belonging to the same protein, indicating a potential proteoform.
#display dfs as clustermaps
import seaborn as sns
import numpy as np
# protein x condition matrix without the p_value column
protein_matrix = df_protein.set_index('protein').drop(columns=["p_value"])
nan_mask = protein_matrix.isna()
# missing values are set to 0 for the clustering and masked out in the heatmap
sns.clustermap(protein_matrix.replace(np.nan, 0), cmap="vlag", center=0, figsize=(12, 12), row_cluster=True, col_cluster=True, mask=nan_mask)
We define the proteoform with the most consistent peptides as the 'reference proteoform'. The Pearson correlation coefficient between the reference proteoform and each other proteoform is calculated and reported in the 'corr_to_ref' column. In other words, 'corr_to_ref' indicates how similarly a proteoform behaves compared to the reference proteoform of the same protein. We use one of the AlphaQuant utility functions to filter for the interesting proteoforms, meaning those with low correlation to the reference proteoform.
import alphaquant.multicond.multicond_utils as aq_multicond_utils
df_proteoform_low_corr = aq_multicond_utils.get_low_correlation_proteoform_df(proteoform_df=df_proteoform,
max_correlation=0.7, keep_reference_proteoform=True)
#keep_reference_proteoform=True also keeps the reference proteoform of each protein in the output
display(df_proteoform_low_corr)
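To illustrate what a 'corr_to_ref' value expresses, the sketch below computes the Pearson correlation between a reference profile and two other profiles with NumPy. This is a toy example and not AlphaQuant's internal code: the profiles and the helper `pearson_to_reference` are hypothetical, and whether the boundary value of the threshold counts as low correlation is an assumption.

```python
import numpy as np

# Hypothetical median-referenced profiles across four conditions
reference_profile = np.array([1.0, 0.5, -0.5, -1.0])
similar_profile = np.array([0.9, 0.6, -0.4, -1.1])
divergent_profile = np.array([-1.0, -0.5, 0.5, 1.0])


def pearson_to_reference(profile, reference):
    """Pearson correlation coefficient between a profile and the reference profile."""
    return float(np.corrcoef(profile, reference)[0, 1])


MAX_CORRELATION = 0.7  # same value as max_correlation in the call above

for name, profile in [("similar", similar_profile), ("divergent", divergent_profile)]:
    corr = pearson_to_reference(profile, reference_profile)
    label = "low correlation" if corr <= MAX_CORRELATION else "similar to reference"
    print(name, round(corr, 3), label)
```

A proteoform that follows the reference across conditions ends up close to 1, while one that is regulated in the opposite direction ends up close to -1 and would pass a low-correlation filter.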
Now, we can use this dataframe to plot the profiles of the low correlation proteoforms and their reference proteoforms. For this we make use of the alphaquant plotting functions.
import alphaquant.plotting.multicond as aq_plotting_multicond
fig, ax = aq_plotting_multicond.plot_proteoform_intensity_profiles(proteoform_df=df_proteoform_low_corr)