Title: Correlations of protein and gene expression in CPTAC
Author: Boris Aguilar
Created: 05-23-2021
Purpose: Compute correlations between proteomic and gene expression available in the PDC
Notes: Runs in Google Colab
This notebook uses BigQuery to compute Pearson correlation between protein and gene expression for all the genes in the BigQuery tables of the PDC dataset. We used CCRCC as example; but this can be changed easily for other cancer types.
from google.cloud import bigquery
from google.colab import auth
import numpy as np
import pandas as pd
import seaborn as sns
import pandas_gbq
The first step is to authorize access to BigQuery and the Google Cloud. For more information see 'Quick Start Guide to ISB-CGC' and alternative authentication methods can be found here.
Moreover you need to create a google cloud project to be able to run BigQuery queries.
auth.authenticate_user()
my_project_id = "" # write your project id here
bqclient = bigquery.Client( my_project_id )
The following query will retrieve protein expression and case IDs from CPTAC table quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current
. Moreover, to label samples as Tumor or Normal samples we join the table with metadata available in the table aliquot_to_case_mapping_pdc_current
prot = '''quant AS (
SELECT meta.sample_submitter_id, meta.sample_type, quant.case_id, quant.aliquot_id, quant.gene_symbol,
CAST(quant.protein_abundance_log2ratio AS FLOAT64) AS protein_abundance_log2ratio
FROM `isb-cgc-bq.CPTAC.quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current` as quant
JOIN `isb-cgc-bq.PDC_metadata.aliquot_to_case_mapping_current` as meta
ON quant.case_id = meta.case_id
AND quant.aliquot_id = meta.aliquot_id
AND meta.sample_type IN ('Primary Tumor','Solid Tissue Normal')
)'''
Next we retrieve gene expression data from the table CPTAC.RNAseq_hg38_gdc_current
which contains RNA-seq data from all tumor types of CPTAC. Moreover we join the data with the metadata table aliquot_to_case_mapping_pdc_current
to label samples to cancer or normal tissue
gexp = '''gexp AS (
SELECT DISTINCT meta.sample_submitter_id, meta.sample_type, rnaseq.gene_name , LOG(rnaseq.HTSeq__FPKM + 1) as HTSeq__FPKM
FROM `isb-cgc-bq.CPTAC.RNAseq_hg38_gdc_current` as rnaseq
JOIN `isb-cgc-bq.PDC_metadata.aliquot_to_case_mapping_current` as meta
ON meta.sample_submitter_id = rnaseq.sample_barcode
)'''
The following query join the protein and gene expression data and compute correlation for each gene and semple type (normal or tumor).
corr = '''correlation AS (
SELECT quant.gene_symbol, gexp.sample_type, COUNT(*) as n, CORR(protein_abundance_log2ratio,HTSeq__FPKM) as corr
FROM quant JOIN gexp
ON quant.sample_submitter_id = gexp.sample_submitter_id
AND gexp.gene_name = quant.gene_symbol
AND gexp.sample_type = quant.sample_type
GROUP BY quant.gene_symbol, gexp.sample_type
)'''
pval = '''SELECT gene_symbol, sample_type, n, corr,
`cgc-05-0042.functions.corr_pvalue`(corr, n) as p
FROM correlation
WHERE ABS(corr) <= 1.0'''
The following commands generate the final query which will be sent to Google to retrieve the final data that include the correlation for each gene. The query also includes a function (BHmultipletests) that adjusts the computed p values with the Benjamini-Hochberg method for multipletest correction.
mysql = '''DECLARE Nrows INT64;
CREATE TEMP TABLE PearsonCorrelation AS
WITH {0},
{1},
{2}
{3}
;
# Adjust pvalues for multiple tests
SET Nrows = ( SELECT COUNT(*) FROM PearsonCorrelation );
CALL `cgc-05-0042.functions.BHmultipletests`( 'PearsonCorrelation', 'p', Nrows )
'''.format(prot, gexp, corr, pval)
job_config = bigquery.QueryJobConfig()
job_config.use_legacy_sql = False
try:
query_job = bqclient.query ( mysql, job_config=job_config )
except:
print ( " FATAL ERROR: query execution failed " )
mydf = query_job.to_dataframe()
The following command displays the results.
mydf
gene_symbol | sample_type | n | corr | p | p_adj | |
---|---|---|---|---|---|---|
0 | ANXA13 | Primary Tumor | 100 | 0.955971 | 1.205495e-53 | 1.767496e-49 |
1 | MYO1B | Primary Tumor | 100 | 0.952969 | 2.759249e-52 | 2.022806e-48 |
2 | CES1 | Primary Tumor | 100 | 0.949441 | 8.494844e-51 | 4.151714e-47 |
3 | PHYHIPL | Primary Tumor | 100 | 0.942866 | 2.748145e-48 | 1.007332e-44 |
4 | LGALS3 | Primary Tumor | 100 | 0.941492 | 8.429648e-48 | 2.471910e-44 |
... | ... | ... | ... | ... | ... | ... |
14657 | AHSA1 | Solid Tissue Normal | 75 | 0.000278 | 9.981109e-01 | 9.983833e-01 |
14658 | CIC | Primary Tumor | 100 | -0.000204 | 9.983915e-01 | 9.985958e-01 |
14659 | SARM1 | Solid Tissue Normal | 75 | 0.000200 | 9.986388e-01 | 9.987750e-01 |
14660 | C7orf26 | Solid Tissue Normal | 75 | 0.000131 | 9.991117e-01 | 9.991798e-01 |
14661 | TFB1M | Solid Tissue Normal | 75 | -0.000006 | 9.999588e-01 | 9.999588e-01 |
14662 rows × 6 columns
The results above show the correlation between protein and gene expression for tumors and normal samples. Next we show two histograms of these correlations, one for tumor and the other for normal samples. Moreover we colored the bars by genes that have significant correlations (significant level = 0.01).
s_level = 0.01
mydf['significant'] = np.where( mydf['p_adj'] <= s_level, True, False)
sns.displot(data=mydf, x="corr", hue="significant", multiple="stack", binwidth=0.1, col='sample_type')
<seaborn.axisgrid.FacetGrid at 0x7f1f91049ed0>