#!/usr/bin/env python
# coding: utf-8

# # Biological Duplicate PTM Correlations
#
# ## Non-reproducible Systematic Variation in Cell Line Data
# Of the 45 cell lines measured in the PTM experimental data, three are duplicates, leaving 42 unique cell lines. We showed in a different notebook that the PTM distributions (~4000 PTMs measured for each cell line) exhibit systematic variation that is unlikely to be biological in nature. Furthermore, we showed that this systematic variation is not reproducible between biological replicate experiments, which further indicates that it is not biological.
#
# ## Normalization and Gene-Expression Comparison
# We proposed correcting for this variation with cell-line quantile normalization, which enforces that all cell lines have the same distribution. However, this normalization might also introduce other problems and reduce the overall quality of the data. In a different notebook, we showed that cell-line quantile normalization (as well as quantile normalization followed by PTM Z-score normalization) improved the correlation of cell-line/cell-line distances in gene-expression-space and PTM-space. In other words, normalizing the PTM data made the arrangement of cell lines in the two spaces (gene-expression and PTM) more similar, and we expect the cell lines to behave similarly in both spaces.
#
# ## PTM Correlation in Biological Replicates
# We can also use the correlation of PTM values between biological replicates (and non-replicates) to check the quality of our data. We expect biological replicates (3 instances) to be more correlated than non-replicates. We can also check whether our normalization preserves the absolute correlations and the relative correlations (biological-replicate vs non-replicate).
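#
# The replicate check described above can be sketched on toy data. This is only an illustration under stated assumptions: the function name `mean_pairwise_correlation`, the PTMs-as-rows / cell-lines-as-columns layout, and the synthetic values are hypothetical, and the sketch is not necessarily how `bio_duplicate_correlation` computes its results.

```python
import itertools

import numpy as np
import pandas as pd


def mean_pairwise_correlation(df, replicate_pairs):
    """Mean Pearson correlation of replicate vs non-replicate column pairs.

    df: PTMs as rows, cell-line measurements as columns (assumed layout).
    replicate_pairs: iterable of (col_a, col_b) tuples marking replicates.
    """
    # DataFrame.corr uses pairwise-complete observations, so NaNs are tolerated.
    corr = df.corr(method='pearson')
    replicate_set = {frozenset(pair) for pair in replicate_pairs}

    rep_vals, non_rep_vals = [], []
    # Visit every unordered pair of columns exactly once.
    for col_a, col_b in itertools.combinations(df.columns, 2):
        value = corr.loc[col_a, col_b]
        if frozenset((col_a, col_b)) in replicate_set:
            rep_vals.append(value)
        else:
            non_rep_vals.append(value)

    return pd.Series({
        'Biological-Replicates': np.mean(rep_vals),
        'Non-Replicates': np.mean(non_rep_vals),
    })


# Synthetic toy data (NOT the PTM measurements): four independent "cell lines"
# plus one noisy copy of column A standing in for a biological replicate.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(200, 4)), columns=['A', 'B', 'C', 'D'])
toy['A_rep'] = toy['A'] + rng.normal(scale=0.1, size=200)

toy_summary = mean_pairwise_correlation(toy, [('A', 'A_rep')])
print(toy_summary)
```

# With 45 columns the loop would visit 990 unordered pairs; marking 3 of them as replicate pairs leaves 987 non-replicate pairs.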

# In[1]:


import bio_duplicate_correlation


# In[2]:


fig_data = bio_duplicate_correlation.compare_duplicate_non_duplicate_correlation()


# In[3]:


import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib
matplotlib.style.use('ggplot')

# DataFrame.plot returns a matplotlib Axes, so name it accordingly
ax = fig_data.plot(kind='bar', figsize=(10,5))
full_title = 'PTM Correlation of Bio-Replicates and Non-Bio-Replicates'
ax.set_title(full_title)


# ## Cell-Line Quantile Normalization Effect
# The bar graph above shows the Pearson correlation of non-biological-repeats vs biological-repeats for six different conditions (12 bars in sets of 2). The six conditions are:
#
# 1) No normalization (none)
# 2) Cell-line quantile normalization (col-qn)
# 3) Cell-line quantile normalization and row Z-score (col-qn_row-zscore)
# 4) No missing values, no normalization (filter_none)
# 5) No missing values, cell-line quantile normalization (filter_col-qn)
# 6) No missing values, cell-line quantile normalization and row Z-score (filter_col-qn_row-zscore)
#
# The first two bars show that the correlation of biological repeats (3 instances) is greater than that of the non-biological-repeat combinations (987 other cell-line combinations). The next two bars show that cell-line quantile normalization actually increases the correlation of both repeats and non-repeats. The third set of bars shows that, while row Z-score normalization reduces the overall cell-line/cell-line correlations (as we expect it to), there is still a large relative difference between the correlations of repeats and non-repeats. The last six bars repeat the first six, except that PTMs with missing values have first been filtered out.
#
# We see that filtering out PTMs with missing values first gives similar results. Cell-line quantile normalization does not reduce the overall correlations, and biological-replicate correlation is still higher than non-replicate correlation.
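#
# The normalization steps behind the condition labels above (filter, col-qn, row-zscore) can be sketched as follows. This is a minimal illustration, assuming PTMs are rows and cell lines are columns; the helper names `drop_rows_with_missing`, `quantile_normalize_columns`, and `zscore_rows` are hypothetical and may differ from what `bio_duplicate_correlation` does internally.

```python
import numpy as np
import pandas as pd


def drop_rows_with_missing(df):
    """Keep only rows (PTMs) measured in every column (cell line)."""
    return df.dropna(axis=0, how='any')


def quantile_normalize_columns(df):
    """Give every column the same distribution (assumes no missing values)."""
    # Reference distribution: mean of the k-th smallest value across columns.
    sorted_vals = np.sort(df.to_numpy(dtype=float), axis=0)
    reference = sorted_vals.mean(axis=1)
    # Zero-based rank of each value within its column (ties broken by order).
    ranks = df.rank(method='first', axis=0).astype(int) - 1
    # Replace each value with the reference value of the same rank.
    normalized = reference[ranks.to_numpy()]
    return pd.DataFrame(normalized, index=df.index, columns=df.columns)


def zscore_rows(df):
    """Z-score each row across columns (population standard deviation)."""
    mean = df.mean(axis=1)
    std = df.std(axis=1, ddof=0)
    return df.sub(mean, axis=0).div(std, axis=0)


# Synthetic toy data (NOT the PTM measurements): three columns with
# deliberately different scales and offsets, plus one missing value.
rng = np.random.default_rng(1)
toy = pd.DataFrame(
    rng.normal(size=(100, 3)) * [1.0, 2.0, 0.5] + [0.0, 3.0, -1.0],
    columns=['line_1', 'line_2', 'line_3'],
)
toy.iloc[0, 0] = np.nan

toy_filtered = drop_rows_with_missing(toy)            # "filter"
toy_qn = quantile_normalize_columns(toy_filtered)     # "col-qn"
toy_qn_z = zscore_rows(toy_qn)                        # "row-zscore"
print(toy_qn.describe())
```

# After the column step, every column contains the same set of values in a different order, which is why the per-column summary statistics printed above agree.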
#
# # Conclusions
# Based on this, we can conclude that 1) cell-line quantile normalization does not reduce the cell-line/cell-line correlation (in fact, it increases it when no filtering is done), 2) this result does not depend on PTMs with missing values, and 3) row Z-score normalization applied after cell-line quantile normalization maintains the difference in correlation between biological replicates and non-replicates. So, cell-line quantile normalization does not reduce the quality of the data as measured by cell-line/cell-line correlation of PTM data.

# In[4]:


import bio_duplicate_correlation
df_scatter = bio_duplicate_correlation.view_scatter()


# In[5]:


df_scatter.shape
cols = df_scatter.columns.tolist()


# In[6]:


import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])


# In[7]:


# s.plot(kind='bar', figsize=(10,5))
# df_scatter = df_scatter.transpose()
df_scatter.plot(kind='scatter', figsize=(10,5), x=cols[4], y=cols[5])


# In[ ]:





# In[8]:


df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])


# In[9]:


df.plot(kind='scatter', x='a', y='b')


# In[10]:


inst_series = df_scatter[cols[0]]


# In[11]:


print(type(inst_series))


# In[12]:


inst_series.hist(bins=100)


# In[ ]: