Title: Comparing protein expression from different pipelines
Author: Boris Aguilar
Created: 05-23-2021
Purpose: Compare proteomic expression from PDC and other pipelines available in the cptac library (https://github.com/PayneLab/cptac)
Notes: Runs in Google Colab
This notebook uses BigQuery to compare protein expression from the PDC with protein expression derived from other pipelines. We use the cptac library to obtain protein expression produced by pipelines other than the one used by the PDC.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from google.cloud import bigquery
from google.colab import auth
import pandas_gbq
The first step is to authorize access to BigQuery and Google Cloud. For more information, see the 'Quick Start Guide to ISB-CGC'. Alternative authentication methods are also available.
You also need a Google Cloud project to run BigQuery queries.
auth.authenticate_user()
my_project_id = "" # write your project id here
bqclient = bigquery.Client( my_project_id )
try:
    import cptac
except ImportError:
    !pip install cptac --quiet
    import cptac
import cptac.utils as ut
Use the cptac library to download proteomic data for lung adenocarcinoma (LUAD) and load it into a pandas dataframe.
cptac.download(dataset="Luad", version="latest")
luad = cptac.Luad()
df = luad.get_proteomics()
df
Name | A1BG | A2M | AAAS | AACS | AADAC | AADAT | AAED1 | AAGAB | AAMDC | AAMP | AAR2 | AARS | AARS2 | AARSD1 | AASDHPPT | AASS | AATF | AATK | ABAT | ABCA1 | ABCA12 | ABCA13 | ABCA2 | ABCA3 | ABCA6 | ABCA7 | ABCA8 | ABCB1 | ABCB10 | ABCB5 | ABCB6 | ABCB7 | ABCB8 | ABCC1 | ABCC10 | ABCC2 | ABCC3 | ABCC4 | ABCC5 | ... | ZNF778 | ZNF786 | ZNF787 | ZNF789 | ZNF799 | ZNF8 | ZNF800 | ZNF804A | ZNF804B | ZNF806 | ZNF827 | ZNF830 | ZNF831 | ZNF837 | ZNF860 | ZNF92 | ZNF98 | ZNHIT1 | ZNHIT2 | ZNHIT3 | ZNHIT6 | ZNRD1 | ZNRF2 | ZPR1 | ZRANB2 | ZRSR2 | ZSCAN16 | ZSCAN18 | ZSCAN23 | ZSCAN26 | ZSCAN31 | ZSWIM9 | ZW10 | ZWILCH | ZWINT | ZXDC | ZYG11B | ZYX | ZZEF1 | ZZZ3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Database_ID | NP_570602.2 | NP_000005.2|NP_001334353.1|NP_001334354.1|K4JDR8|K4JBA2|K4JB97 | NP_056480.1|NP_001166937.1 | NP_076417.2|NP_001306769.1|NP_001306768.1 | NP_001077.2 | NP_001273611.1|NP_001273612.1 | NP_714542.1 | NP_078942.3|NP_001258814.1 | NP_001303889.1|NP_001350493.1|NP_001303886.1|NP_001303887.1 | NP_001289474.1|NP_001078.2 | NP_001258803.1 | NP_001596.2 | NP_065796.1 | NP_001248363.1|NP_001129514.2|NP_079543.1 | NP_056238.2 | NP_005754.2 | NP_036270.1 | NP_001073864.2|NP_004911.2 | NP_000654.2 | NP_005493.2 | NP_775099.2|NP_056472.2 | NP_689914.3 | NP_997698.1|NP_001597.2 | NP_001080.2 | NP_525023.2 | NP_061985.2 | NP_001275914.1|NP_001275915.1|NP_009099.1 | NP_000918.2|NP_001335874.1 | NP_036221.2 | NP_001157413.1|NP_848654.3|NP_001157414.1|NP_001157465.1 | NP_005680.1|NP_001336757.1 | NP_004290.2|NP_001258625.1|NP_001258627.1 | NP_001269220.1|NP_009119.2|NP_001269222.1|NP_001269221.1 | NP_004987.2 | NP_001185863.1|NP_258261.2|NP_001337447.1 | NP_000383.2 | NP_003777.2|NP_001137542.1 | NP_001288759.1 | NP_005836.2|NP_001288758.1|NP_001098985.1 | NP_005679.2|NP_001306961.1|NP_001018881.1 | ... | NP_001188336.1|NP_872337.2 | NP_689624.2 | NP_001002836.2 | NP_001337928.1|NP_998768.2|NP_001337929.1|NP_001337931.1 | NP_001074290.1|NP_001309426.1|NP_005806.2|NP_660319.1 | NP_066575.2 | NP_789784.2 | NP_919226.1 | NP_857597.1 | NP_001291378.1 | NP_001293144.1|NP_849157.2 | NP_443089.3 | NP_848552.1 | NP_612475.1 | NP_001131146.2 | NP_689839.1|NP_001274461.1|NP_009070.2|NP_001274462.1|NP_001001415.2|NP_001333841.1|NP_001333842.1|NP_001333845.1 | NP_001092096.1|NP_065906.1 | NP_006340.1 | NP_055020.1 | NP_004764.1|NP_001268361.1|NP_001268363.1|NP_001268362.1 | NP_060423.3|NP_001164141.1 | NP_001265714.1 | NP_667339.1 | NP_003895.1|NP_001304015.1 | NP_976225.1|NP_005446.2 | NP_005080.1 | NP_001307484.1|NP_001307485.1|NP_001307486.1|NP_001307487.1 | NP_001139014.1|NP_001139015.1|NP_001139016.1 | NP_001012458.1 | NP_001018854.2|NP_001104509.1|NP_001274350.1|NP_001274351.1 | NP_001128687.1|NP_001230171.1 | NP_955373.3 | NP_004715.1 | NP_060445.3|NP_001274750.1 | NP_008988.2|NP_001005413.1 | NP_079388.3|NP_001035743.1 | NP_078922.1 | NP_001010972.1|NP_001349712.1 | NP_055928.3 | NP_056349.1|NP_001295166.1 |
Patient_ID | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
C3L-00001 | -2.5347 | -3.4057 | 0.1572 | -1.1998 | -1.6826 | NaN | NaN | -0.8179 | -0.8053 | -0.1899 | 0.9872 | 0.9620 | -1.7236 | 0.1699 | 0.1604 | -3.6203 | -0.0132 | NaN | 3.3068 | -1.0451 | -1.1461 | NaN | NaN | -4.3714 | -1.1808 | NaN | -4.2168 | -1.8246 | -0.3982 | -0.0605 | -1.9382 | 0.2204 | 0.2898 | 0.0689 | NaN | 1.1008 | -0.1426 | -2.3611 | -1.0862 | NaN | ... | NaN | 2.2780 | 0.6211 | 0.6022 | NaN | NaN | -2.0645 | -5.6022 | -2.3169 | 1.6342 | NaN | -0.1710 | NaN | 2.4294 | -3.1501 | 0.3403 | 1.6657 | 2.3285 | 0.5044 | NaN | -0.0795 | -0.1931 | 1.0693 | 0.3908 | 0.7316 | 1.9592 | -3.6550 | 1.2744 | NaN | NaN | NaN | NaN | 0.2992 | -1.3607 | NaN | NaN | 0.6527 | -0.9694 | -1.1840 | -2.5284 |
C3L-00009 | -0.5627 | -1.7945 | 1.0054 | -0.3624 | -4.4887 | 0.0079 | 0.2157 | 1.3342 | 0.0645 | 0.6427 | 0.0948 | 0.1628 | -0.6043 | 1.4588 | -0.8877 | -0.7743 | -0.0186 | -0.0828 | -2.5503 | 0.5029 | 0.3668 | -0.9708 | 1.6440 | -2.9886 | -0.5022 | 0.4689 | -2.3613 | -1.6887 | -0.6534 | 1.2472 | 2.4375 | 0.2081 | 0.1666 | 2.3657 | -1.4771 | 9.9418 | 0.6994 | -3.5100 | -2.1951 | 3.3330 | ... | -2.0704 | -2.5578 | -0.3965 | 1.0432 | NaN | NaN | -1.5565 | -0.2944 | -0.6685 | -0.2415 | -1.4280 | -1.1900 | -3.6612 | NaN | NaN | 0.4764 | 1.1263 | -1.0275 | -0.5211 | NaN | 1.3039 | -1.2240 | 1.1679 | 0.0910 | 0.1363 | -0.0602 | 1.0885 | -0.3133 | NaN | 1.1074 | 11.6158 | -0.5098 | -0.1622 | 0.9828 | 0.5633 | -1.4620 | -1.0690 | 0.7674 | 0.5066 | 0.4311 |
C3L-00080 | -1.9422 | -2.3782 | 0.1940 | 0.1920 | -2.2655 | NaN | -1.6626 | 0.2149 | -0.7593 | 0.6113 | -0.0980 | -0.4297 | 0.4757 | 0.9284 | -0.1043 | 0.2984 | 1.1558 | -1.2350 | 0.9513 | -0.8448 | -1.7002 | -4.3892 | 0.7844 | -1.7607 | -1.7252 | NaN | -2.7975 | -1.0764 | -0.7239 | NaN | -1.4290 | 1.2142 | -0.3963 | -1.2350 | NaN | NaN | 0.4590 | -0.9429 | -0.7343 | 0.0751 | ... | NaN | 0.1920 | 1.8650 | 0.1878 | NaN | NaN | -0.2586 | -2.4345 | -0.8177 | -0.4339 | -1.1015 | 0.7302 | -1.7794 | 1.0827 | NaN | -0.3087 | 1.2142 | 0.4047 | 0.5570 | NaN | 1.0264 | 0.8658 | -0.2440 | -0.0542 | 0.0522 | NaN | -0.5027 | -1.5020 | NaN | 2.2405 | NaN | NaN | -0.2795 | 0.6613 | NaN | 0.9659 | -0.3442 | -1.6480 | 1.2872 | -0.7301 |
C3L-00083 | 2.1636 | 3.1227 | -0.3044 | -1.7183 | -3.2851 | -1.8216 | 3.6147 | -0.4863 | -1.2387 | -0.4946 | -0.0068 | 0.3281 | -1.4413 | 0.3777 | 0.0594 | -1.6149 | -0.8873 | -2.6815 | -2.1565 | 0.7250 | -1.6397 | -0.1969 | 0.8986 | -0.8625 | NaN | -0.7757 | -3.9465 | 0.5141 | -2.0862 | NaN | -0.2837 | -2.5906 | -1.3421 | -0.5897 | -0.6186 | NaN | -2.7270 | -1.7224 | -1.8092 | NaN | ... | NaN | NaN | 0.2124 | -1.7886 | NaN | 0.6671 | -1.8630 | 1.6055 | -0.0150 | -2.3425 | NaN | -0.1142 | NaN | -0.8625 | NaN | -0.1514 | -0.0894 | -0.2755 | -0.6889 | 0.2620 | -0.4656 | 0.5886 | -0.1184 | 0.7002 | 0.1049 | -0.5111 | -1.0940 | 0.1379 | -1.4992 | -1.0072 | -3.0742 | -1.6769 | -0.5897 | -0.8129 | NaN | 0.9399 | -0.2465 | 0.3157 | 0.6547 | NaN |
C3L-00093 | -1.0022 | -0.9632 | 0.8190 | 0.2556 | -11.1252 | NaN | -0.1696 | 0.2911 | -0.4459 | -0.1518 | 0.3690 | 0.5533 | -0.5912 | 1.6340 | 0.1564 | -0.5097 | 0.7942 | NaN | -0.9349 | -0.5558 | 0.5639 | -2.2211 | 1.1521 | -7.0045 | -2.7561 | 2.5835 | -3.5888 | -0.2900 | -1.3069 | -1.8668 | -0.4459 | -0.2865 | -0.5841 | -0.1199 | -3.0254 | 0.3832 | 1.2867 | -1.7888 | -2.9333 | -0.3609 | ... | NaN | 0.4434 | -0.9774 | -2.0085 | NaN | -1.3176 | NaN | -4.9494 | -2.7419 | 2.2363 | NaN | 0.5143 | NaN | NaN | 4.8406 | -0.0243 | 0.2982 | -0.9916 | 0.0324 | NaN | 0.0076 | -0.0916 | 2.4383 | 0.4080 | 0.1139 | -1.3176 | 1.2655 | -1.4522 | NaN | NaN | NaN | NaN | 0.6950 | -0.1625 | 1.8536 | -2.2990 | 0.4293 | -0.5876 | -0.4991 | -0.3077 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
C3N-02582.N | 1.8277 | 3.6204 | 0.1783 | -1.6842 | 0.6852 | NaN | 1.5338 | -0.6666 | 2.3787 | -0.0458 | -0.2589 | -1.0964 | -0.6556 | -0.2442 | -0.1964 | -0.6336 | -1.0083 | NaN | -0.6630 | 1.4897 | 3.5248 | 4.4983 | 0.0644 | 0.5420 | 3.1648 | NaN | 4.7444 | 1.5375 | -0.3581 | NaN | 1.4493 | -0.9054 | -0.5013 | -0.0715 | 0.2407 | NaN | -3.5173 | NaN | 0.2407 | -1.1516 | ... | 0.0203 | NaN | -1.5005 | -0.7328 | NaN | NaN | 0.5640 | NaN | -1.4418 | 1.1554 | NaN | 0.2187 | 0.3620 | 2.4448 | NaN | -0.3324 | 0.9313 | NaN | 0.5199 | -0.2074 | -1.1075 | 1.0048 | -0.8613 | -0.4572 | -0.5564 | -0.6924 | -3.3226 | 1.6293 | NaN | 0.9460 | -0.2001 | NaN | -0.0826 | -1.6769 | -0.0017 | -0.1266 | 0.2995 | 2.3934 | 0.7770 | 0.9497 |
C3N-02586.N | 0.8035 | 1.6403 | 0.2300 | -1.8837 | 1.4085 | NaN | 1.3378 | -0.8544 | 0.1946 | -0.0726 | -0.3908 | -0.9801 | -0.3044 | -0.4615 | -0.4065 | 0.4460 | -1.2983 | NaN | -1.2237 | 0.5521 | 1.0000 | 0.4421 | -0.9644 | 2.3004 | 2.7207 | 0.5482 | 1.6128 | 1.8368 | -0.2179 | 0.0924 | -0.5519 | -0.5951 | -0.8151 | -0.0922 | 0.1632 | -0.4144 | -2.2216 | -0.0726 | -0.0136 | -0.2651 | ... | 1.7032 | -0.4458 | -1.4555 | -0.1079 | NaN | -1.4869 | 0.4578 | -1.6872 | 0.7996 | 0.7328 | 0.4775 | 0.0807 | 2.3829 | 1.0628 | 3.3140 | -2.7834 | -0.9722 | NaN | -0.1708 | 1.5578 | -0.8544 | -1.1372 | -0.5911 | -0.0608 | 0.0924 | -0.2336 | -1.1922 | 0.9332 | NaN | NaN | -0.8229 | 0.1750 | -0.0804 | -1.6401 | NaN | 2.4025 | 1.2161 | 1.6443 | 1.1886 | 1.1807 |
C3N-02587.N | 1.7637 | 2.2513 | -0.0532 | -1.4159 | 4.8264 | 0.8151 | 0.4511 | -0.8181 | 2.6187 | -0.3304 | 0.1037 | -1.1587 | -0.6043 | -0.1134 | -0.4707 | -0.7012 | -0.4640 | 2.0142 | -1.8668 | 0.4210 | 1.9540 | NaN | 0.2140 | 2.5452 | 1.1525 | 1.4163 | 3.1998 | 2.0743 | -0.3605 | NaN | 0.1605 | -0.9951 | -0.2369 | -0.6477 | 2.0509 | NaN | -2.1507 | -1.5929 | 0.1605 | -0.2336 | ... | NaN | NaN | -0.1735 | -1.7199 | 0.1338 | -0.2703 | -1.7599 | NaN | NaN | 1.4230 | 0.8886 | 0.2707 | 2.4751 | 2.3348 | -0.1668 | -0.9684 | 0.8252 | -0.7680 | -0.6143 | -0.6644 | -0.0866 | -2.6383 | -1.1988 | -0.2904 | 0.1806 | 1.2560 | NaN | 1.8572 | -13.4196 | 0.0670 | -0.1301 | NaN | -0.0800 | -2.4146 | -2.8354 | NaN | 1.2861 | 2.1244 | 0.7083 | 1.1825 |
C3N-02588.N | 1.0875 | 1.7414 | -0.2270 | -1.7000 | 4.5153 | 0.4875 | NaN | -0.2169 | 0.5044 | -0.3012 | -0.1225 | -1.0831 | 0.4707 | -0.7191 | -0.3585 | 0.1033 | -2.3201 | NaN | -1.2281 | 0.3055 | 1.7313 | 0.1269 | -0.1293 | 2.7492 | 2.4492 | NaN | 4.0232 | 4.9703 | 0.1370 | -0.0686 | -0.5203 | -0.3585 | 0.2752 | -0.3450 | 0.4066 | NaN | -1.7876 | NaN | 0.2347 | 0.1673 | ... | 0.6763 | NaN | -0.4731 | -0.5169 | NaN | NaN | 0.0831 | 0.8718 | NaN | 0.4168 | 0.2448 | -0.2877 | 0.1606 | 0.2887 | -0.1832 | -0.6854 | -1.3730 | -0.7292 | -0.9854 | NaN | -1.5955 | -1.4202 | -0.1192 | -0.8135 | 0.2179 | -1.0933 | NaN | 0.9628 | -7.3995 | -1.3157 | -0.9652 | -0.1293 | -0.4764 | -1.4775 | -2.2999 | 2.1054 | 0.4943 | 1.5459 | 0.6358 | 1.2729 |
C3N-02729.N | 2.6011 | 3.0462 | -0.2924 | -2.1953 | 4.1405 | 1.2990 | NaN | -0.4741 | 2.0892 | -0.5594 | -0.5335 | -0.8970 | -0.5706 | -1.7502 | -0.4074 | 0.2900 | -2.3029 | 1.8295 | -1.0306 | 0.5757 | 3.4320 | 2.6715 | -1.2791 | 1.8517 | 2.1596 | 0.1528 | 0.9503 | 0.3605 | 0.2010 | NaN | -0.4853 | -0.6077 | -0.7783 | -0.6411 | 1.9371 | NaN | -4.8996 | 0.8947 | 0.5534 | -1.2309 | ... | NaN | NaN | -0.2108 | -1.1233 | NaN | 1.4511 | -0.6967 | 2.9423 | -0.4741 | 0.4978 | NaN | 0.3902 | NaN | 2.4972 | 2.3191 | NaN | NaN | -1.2197 | -0.5187 | -2.2547 | -0.7301 | -1.8318 | -1.0306 | -0.2330 | -0.5372 | -1.2606 | -0.0142 | 1.3064 | NaN | NaN | NaN | 3.2948 | -0.7338 | NaN | -1.4238 | -2.0766 | 0.3234 | 1.6588 | 0.6202 | 0.8390 |
211 rows × 10699 columns
The following commands reshape the dataframe into a tidy (long) format, with one row per patient and gene.
tdf = pd.melt(df, var_name="gene_name", value_name="protein_abundance",ignore_index = False)
tdf.reset_index(inplace=True)
tdf[0:10]
| | Patient_ID | gene_name | protein_abundance |
---|---|---|---|
0 | C3L-00001 | A1BG | -2.5347 |
1 | C3L-00009 | A1BG | -0.5627 |
2 | C3L-00080 | A1BG | -1.9422 |
3 | C3L-00083 | A1BG | 2.1636 |
4 | C3L-00093 | A1BG | -1.0022 |
5 | C3L-00094 | A1BG | -1.5576 |
6 | C3L-00095 | A1BG | -1.0718 |
7 | C3L-00140 | A1BG | -1.0799 |
8 | C3L-00144 | A1BG | -1.9159 |
9 | C3L-00263 | A1BG | -1.1384 |
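The wide-to-tidy reshaping above can be tried on a small table first. The following is a minimal sketch with made-up values and hypothetical patient labels (`P1` to `P3`), not real CPTAC data; it uses single-level columns for simplicity, whereas the real dataframe also carries a `Database_ID` column level.

```python
import numpy as np
import pandas as pd

# Toy wide table: one row per patient, one column per gene (values are made up).
wide = pd.DataFrame(
    {"A1BG": [1.0, 2.0, 3.0], "A2M": [4.0, np.nan, 6.0]},
    index=pd.Index(["P1", "P2", "P3"], name="Patient_ID"),
)

# Same call pattern as in the notebook: keep the patient index while melting,
# then turn it into a regular column.
tidy = pd.melt(
    wide,
    var_name="gene_name",
    value_name="protein_abundance",
    ignore_index=False,
).reset_index()

print(tidy)
```

Each (patient, gene) pair becomes one row, and missing measurements stay as NaN in the `protein_abundance` column.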
The following commands send the protein expression data to a BigQuery table.
table_id = 'test_dataset2.luad_cptac_paynelab' # test_dataset2 is dataset and luad_cptac_paynelab is the table name
pandas_gbq.to_gbq(tdf, table_id, project_id=my_project_id)
Here we compare protein expression from the cptac library with protein expression generated from PDC proteomics data. The comparison is made by computing, for each gene, the Pearson correlation between the two sets of values across samples.
The first step is to build a query that retrieves PDC-based protein expression, which is available in BigQuery tables in the public project isb-cgc-bq.
pdc = '''
WITH pdc AS (
    SELECT meta.case_submitter_id, quant.gene_symbol,
           CAST(quant.protein_abundance_log2ratio AS FLOAT64) AS protein_abundance_log2ratio
    FROM `isb-cgc-bq.CPTAC.quant_proteome_CPTAC_LUAD_discovery_study_pdc_current` as quant
    JOIN `isb-cgc-bq.PDC_metadata.aliquot_to_case_mapping_current` as meta
      ON quant.case_id = meta.case_id
     AND quant.aliquot_id = meta.aliquot_id
     AND meta.sample_type = 'Primary Tumor'
)
'''
The following query joins the PDC and cptac data on patient and gene:
# Named cptac_query so the string does not shadow the imported cptac module.
cptac_query = '''
qdata AS (
    SELECT pdc.case_submitter_id, pdc.gene_symbol, pdc.protein_abundance_log2ratio,
           cptac.protein_abundance
    FROM pdc
    JOIN `{0}.{1}` as cptac
      ON pdc.case_submitter_id = cptac.Patient_ID
     AND pdc.gene_symbol = cptac.gene_name
)
'''.format(my_project_id, table_id)
Finally, we compute the Pearson correlation for each gene, keeping only genes with at least 20 paired measurements.
query = (pdc + ',' + cptac_query + '''
SELECT gene_symbol, count(*) as N, corr(protein_abundance_log2ratio,protein_abundance) as Correlations
FROM qdata
WHERE NOT IS_NAN(protein_abundance_log2ratio)
  AND NOT IS_NAN(protein_abundance)
GROUP BY gene_symbol
HAVING N >= 20
ORDER BY Correlations DESC
''' )
df1 = pandas_gbq.read_gbq(query, project_id=my_project_id)
df1
| | gene_symbol | N | Correlations |
---|---|---|---|
0 | GLYATL2 | 21 | 0.990671 |
1 | SLC2A10 | 49 | 0.989670 |
2 | TNC | 108 | 0.983017 |
3 | BCAS1 | 104 | 0.980725 |
4 | SLC27A2 | 108 | 0.980098 |
... | ... | ... | ... |
9645 | SACS | 108 | -0.166687 |
9646 | SLC38A9 | 30 | -0.303329 |
9647 | SASS6 | 27 | -0.329240 |
9648 | SPEF2 | 30 | -0.382587 |
9649 | NUFIP1 | 23 | -0.383124 |
9650 rows × 3 columns
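As a local sanity check of the query logic, the same per-gene computation can be sketched in pandas. This is an illustration on toy data, not part of the original analysis: the function name `per_gene_correlation` and the gene labels `G1` to `G3` are hypothetical, and the column names simply mirror those produced by the `qdata` step.

```python
import numpy as np
import pandas as pd

def per_gene_correlation(paired, min_n=20):
    """Pearson correlation per gene, mirroring the SQL filters (NaN removal, N >= min_n)."""
    value_cols = ["protein_abundance_log2ratio", "protein_abundance"]
    clean = paired.dropna(subset=value_cols)
    rows = []
    for gene, grp in clean.groupby("gene_symbol"):
        n = len(grp)
        if n < min_n:
            continue  # same role as HAVING N >= 20
        r = grp[value_cols[0]].corr(grp[value_cols[1]])  # Pearson by default
        rows.append({"gene_symbol": gene, "N": n, "Correlations": r})
    out = pd.DataFrame(rows, columns=["gene_symbol", "N", "Correlations"])
    return out.sort_values("Correlations", ascending=False).reset_index(drop=True)

# Toy paired data: G1 perfectly correlated, G2 perfectly anti-correlated,
# G3 has too few samples and is dropped.
x = np.arange(25, dtype=float)
toy = pd.concat(
    [
        pd.DataFrame({"gene_symbol": "G1", "protein_abundance_log2ratio": x, "protein_abundance": 2 * x + 1}),
        pd.DataFrame({"gene_symbol": "G2", "protein_abundance_log2ratio": x, "protein_abundance": -x}),
        pd.DataFrame({"gene_symbol": "G3", "protein_abundance_log2ratio": x[:5], "protein_abundance": x[:5]}),
    ],
    ignore_index=True,
)
toy.loc[0, "protein_abundance"] = np.nan  # one missing value in G1

result = per_gene_correlation(toy)
print(result)
```

Because missing values are removed before counting, N can differ between genes, which is consistent with the varying N values in the query output.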
The results above show the correlation between PDC and cptac protein expression for 9,650 genes, each with at least 20 paired samples. Next, we plot a histogram of these correlations.
sns.displot(data=df1, x="Correlations", binwidth=0.1)
plt.xlim(-1.0, 1.1)
The histogram shows that the two pipelines used in this analysis produced similar protein expression for most genes.
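One way to put a number on "most genes" is to summarize the correlation column directly. The sketch below is an assumption-laden illustration: the helper `summarize_correlations` is hypothetical, the 0.8 cutoff is an arbitrary choice rather than a value from this analysis, and the input values are made up.

```python
import pandas as pd

def summarize_correlations(corr, threshold=0.8):
    """Return the gene count, median correlation, and fraction at or above a cutoff."""
    corr = pd.Series(corr, dtype=float).dropna()
    return {
        "n_genes": int(corr.size),
        "median": float(corr.median()),
        "fraction_above": float((corr >= threshold).mean()),
    }

# Made-up correlations standing in for the Correlations column of the query result.
toy_corr = [0.99, 0.98, 0.95, 0.90, 0.60, -0.30]
summary = summarize_correlations(toy_corr, threshold=0.8)
print(summary)
```

With the real results, the same helper could be applied as `summarize_correlations(df1["Correlations"])`.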