ISB-CGC Community Notebooks¶

Check out more notebooks at our Community Notebooks Repository!

Title:   Comparing gene expression in tumor against gene expression in normal tissue
Author:  Fabian Seidl
Created: 2021-07-09
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_analyze_differential_expression_between_paired_tumor_and_normal_samples.ipynb
Purpose: To perform differential expression analysis comparing tumor to normal tissue
Notes:

Overview¶

This notebook is intended to demonstrate how to run a differential gene expression analysis between tumor and normal samples using GDC cancer data made available in BigQuery. We will be using the TCGA gene expression dataset in isb-cgc-bq, specifically the table isb-cgc-bq.TCGA_versioned.RNAseq_hg38_gdc_r35. This notebook is modeled to somewhat mirror the differential expression analysis found in Peng, et al. Scientific Reports 2015 (https://doi.org/10.1038/srep13413).

First we need to set up the environment and authenticate our session. This notebook will use both python to construct and run our SQL query and the Bioconductor R libraries edgeR and DESeq2.

In [ ]:

from google.colab import auth
import pandas_gbq
import pandas as pd
import numpy
import seaborn
from google.cloud import bigquery
!pip install rpy2==3.5.1
%load_ext rpy2.ipython

bq_project='isb-project-zero'
auth.authenticate_user()
client = bigquery.Client(project=bq_project)
print('Authenticated')

Requirement already satisfied: rpy2==3.5.1 in /usr/local/lib/python3.10/dist-packages (3.5.1)
Requirement already satisfied: cffi>=1.10.0 in /usr/local/lib/python3.10/dist-packages (from rpy2==3.5.1) (1.15.1)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from rpy2==3.5.1) (3.1.2)
Requirement already satisfied: pytz in /usr/local/lib/python3.10/dist-packages (from rpy2==3.5.1) (2022.7.1)
Requirement already satisfied: tzlocal in /usr/local/lib/python3.10/dist-packages (from rpy2==3.5.1) (5.0.1)
Requirement already satisfied: pycparser in /usr/local/lib/python3.10/dist-packages (from cffi>=1.10.0->rpy2==3.5.1) (2.21)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->rpy2==3.5.1) (2.1.3)
Authenticated

Finding all cases with data for both normal and tumor aliquots¶

Our first step is to join sample_type data to our RNA seq table and identify the cases in the RNA seq BigQuery table of choice that have data for both tumor and normal aliquots. To perform this operation we can write an SQL query which groups rows by case barcode and then use the array_agg() function to concatenate the sample_type_name field for each case.

In [ ]:

sub_project = "TCGA-BRCA"
rna_table = "isb-cgc-bq.TCGA_versioned.RNAseq_hg38_gdc_r35"

case_query = """
WITH rna as (
    SELECT
       case_barcode,
       sample_barcode,
       aliquot_barcode,
       Ensembl_gene_id_v,
       gene_name,
       unstranded,
       fpkm_uq_unstranded,
       sample_type_name,
    FROM `isb-cgc-bq.TCGA_versioned.RNAseq_hg38_gdc_r35`
    WHERE gene_type = 'protein_coding'
    AND project_short_name = '{0}'
),
cases as (
  SELECT case_barcode, agg FROM
      (SELECT
        case_barcode,
        array_agg(distinct sample_type_name) agg
      FROM rna
      GROUP BY case_barcode)
  WHERE array_length(agg) > 1
  AND ("Solid Tissue Normal" in UNNEST(agg))
  )""".format(sub_project)

select_cases = """
SELECT * FROM cases
"""

query_job = client.query( case_query + select_cases )
# print( case_query + select_cases )
joined_data = query_job.result().to_dataframe()
joined_data.head(5)

Out[ ]:

	case_barcode	agg
0	TCGA-GI-A2C9	[Solid Tissue Normal, Primary Tumor]
1	TCGA-BH-A0DH	[Primary Tumor, Solid Tissue Normal]
2	TCGA-BH-A0DL	[Primary Tumor, Solid Tissue Normal]
3	TCGA-BH-A1EV	[Primary Tumor, Solid Tissue Normal]
4	TCGA-BH-A0BS	[Solid Tissue Normal, Primary Tumor]

Subsetting the gene set¶

Performing a comparison across all genes is outside of the scope of this notebook, so we will reduce our gene set to a more manageable size. A common metric to subset by is the variance of expression level. We can calculate this metric in an SQL query by grouping by the Ensembl_gene_id_v field and calculating the VARIANCE() of the counts field. We then order our results by this value and select the top 2000 genes.

In [ ]:

expression_query = """,
mean_expr as (
  SELECT * FROM (
    SELECT
      rna.Ensembl_gene_id_v,
      VARIANCE(rna.fpkm_uq_unstranded) var_fpkm
    FROM rna
    JOIN cases ON rna.case_barcode = cases.case_barcode
    WHERE rna.sample_type_name = 'Solid Tissue Normal'
    GROUP BY rna.Ensembl_gene_id_v)
  ORDER BY var_fpkm DESC
  LIMIT 2000)""".format(rna_table)

select_genes = """
SELECT * FROM mean_expr
"""

query_job = client.query( case_query + expression_query + select_genes )
# print( case_query + expression_query + select_genes )
joined_data = query_job.result().to_dataframe()
joined_data.head(5)

Out[ ]:

	Ensembl_gene_id_v	var_fpkm
0	ENSG00000198886.2	4.219112e+07
1	ENSG00000198938.2	3.523055e+07
2	ENSG00000198804.2	2.799827e+07
3	ENSG00000143632.14	2.615222e+07
4	ENSG00000198840.2	2.575547e+07

Joins are used to generate the table for analysis¶

To generate the final table we join the full RNA expression table back to the tables housing the list of cases as well as the list of genes. As both edgeR and DESeq2 require counts as input the important fields we need are case_barcode, aliquot_barcode, gene_name, sample_type_name, and HTSeq__Counts.

In [ ]:

final_join = """
SELECT rna.case_barcode,
       rna.sample_barcode,
       rna.aliquot_barcode,
       rna.Ensembl_gene_id_v,
       rna.gene_name,
       rna.unstranded,
       rna.sample_type_name,
FROM rna
  JOIN cases ON rna.case_barcode = cases.case_barcode
  JOIN mean_expr ON rna.Ensembl_gene_id_v = mean_expr.Ensembl_gene_id_v
WHERE rna.sample_type_name = 'Solid Tissue Normal'
  OR rna.sample_type_name = 'Primary Tumor'
ORDER BY Ensembl_gene_id_v, case_barcode, aliquot_barcode
""".format(rna_table)

query_job = client.query( case_query + expression_query + final_join )
# print( case_query + expression_query + final_join )
joined_data = query_job.result().to_dataframe()
joined_data.head(10)

Out[ ]:

	case_barcode	sample_barcode	aliquot_barcode	Ensembl_gene_id_v	gene_name	unstranded	sample_type_name
0	TCGA-A7-A0CE	TCGA-A7-A0CE-01A	TCGA-A7-A0CE-01A-11R-A00Z-07	ENSG00000000005.6	TNMD	54	Primary Tumor
1	TCGA-A7-A0CE	TCGA-A7-A0CE-11A	TCGA-A7-A0CE-11A-21R-A089-07	ENSG00000000005.6	TNMD	319	Solid Tissue Normal
2	TCGA-A7-A0CH	TCGA-A7-A0CH-01A	TCGA-A7-A0CH-01A-21R-A00Z-07	ENSG00000000005.6	TNMD	0	Primary Tumor
3	TCGA-A7-A0CH	TCGA-A7-A0CH-11A	TCGA-A7-A0CH-11A-32R-A089-07	ENSG00000000005.6	TNMD	2114	Solid Tissue Normal
4	TCGA-A7-A0D9	TCGA-A7-A0D9-01A	TCGA-A7-A0D9-01A-31R-A056-07	ENSG00000000005.6	TNMD	72	Primary Tumor
5	TCGA-A7-A0D9	TCGA-A7-A0D9-11A	TCGA-A7-A0D9-11A-53R-A089-07	ENSG00000000005.6	TNMD	3612	Solid Tissue Normal
6	TCGA-A7-A0DB	TCGA-A7-A0DB-01A	TCGA-A7-A0DB-01A-11R-A00Z-07	ENSG00000000005.6	TNMD	68	Primary Tumor
7	TCGA-A7-A0DB	TCGA-A7-A0DB-01A	TCGA-A7-A0DB-01A-11R-A277-07	ENSG00000000005.6	TNMD	47	Primary Tumor
8	TCGA-A7-A0DB	TCGA-A7-A0DB-01C	TCGA-A7-A0DB-01C-02R-A277-07	ENSG00000000005.6	TNMD	29	Primary Tumor
9	TCGA-A7-A0DB	TCGA-A7-A0DB-11A	TCGA-A7-A0DB-11A-33R-A089-07	ENSG00000000005.6	TNMD	303	Solid Tissue Normal

Moving the analysis to R¶

At this juncture we will move the analysis to R to take advantage of the standard Bioconductor differential expression packages. During the initial setup we loaded rpy2 which allows us to execute cells in R using cell magic.

We will first need to install our R packages. Be aware that unfortunately this step can take upwards of 10 minutes and will request user input on whether to update dependencies, enter "a" for updating all. This behaviour can be common working in the cloud, which can be dependent on server availability and speed of interconnections. Executing this code in a Google Virtual Machine, especially one with the libraries pre-installed, is substantially faster.

In [ ]:

%%R
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("edgeR", "DESeq2"))

We can import the data frame into the R environment by with the -i flag in our %%R cell magic call and load the libraries we just installed.

In [ ]:

%%R -i joined_data
library(DESeq2)
library(edgeR)
library(scales)

Converting our longform data table into a count matrix¶

Both edgeR and DESeq2 expect a data matrix of counts as an input where each column corresponds to a sample and each row to a gene. In our query we sorted our data frame by the gene_name, case_barcode, and aliquot_barcode. This allows us to split our data frame on gene_name and then rbind() the resulting count vectors into a matrix.

In [ ]:

%%R
split_df <- split(joined_data, f=joined_data$gene_name)
str(split_df[c(1,2)]) # showing two data frames from our list output
expression_list <- lapply(split_df, function(df){df$unstranded})
exp_matrix <- as.data.frame(Reduce(rbind, expression_list))

List of 2
 $ A2M :'data.frame':	231 obs. of  7 variables:
  ..$ case_barcode     : chr [1:231] "TCGA-A7-A0CE" "TCGA-A7-A0CE" "TCGA-A7-A0CH" "TCGA-A7-A0CH" ...
  ..$ sample_barcode   : chr [1:231] "TCGA-A7-A0CE-01A" "TCGA-A7-A0CE-11A" "TCGA-A7-A0CH-01A" "TCGA-A7-A0CH-11A" ...
  ..$ aliquot_barcode  : chr [1:231] "TCGA-A7-A0CE-01A-11R-A00Z-07" "TCGA-A7-A0CE-11A-21R-A089-07" "TCGA-A7-A0CH-01A-21R-A00Z-07" "TCGA-A7-A0CH-11A-32R-A089-07" ...
  ..$ Ensembl_gene_id_v: chr [1:231] "ENSG00000175899.15" "ENSG00000175899.15" "ENSG00000175899.15" "ENSG00000175899.15" ...
  ..$ gene_name        : chr [1:231] "A2M" "A2M" "A2M" "A2M" ...
  ..$ unstranded       : int [1:231] 28180 122749 12099 85733 28978 93660 31811 22960 2606 84430 ...
  ..$ sample_type_name : chr [1:231] "Primary Tumor" "Solid Tissue Normal" "Primary Tumor" "Solid Tissue Normal" ...
 $ AAMP:'data.frame':	231 obs. of  7 variables:
  ..$ case_barcode     : chr [1:231] "TCGA-A7-A0CE" "TCGA-A7-A0CE" "TCGA-A7-A0CH" "TCGA-A7-A0CH" ...
  ..$ sample_barcode   : chr [1:231] "TCGA-A7-A0CE-01A" "TCGA-A7-A0CE-11A" "TCGA-A7-A0CH-01A" "TCGA-A7-A0CH-11A" ...
  ..$ aliquot_barcode  : chr [1:231] "TCGA-A7-A0CE-01A-11R-A00Z-07" "TCGA-A7-A0CE-11A-21R-A089-07" "TCGA-A7-A0CH-01A-21R-A00Z-07" "TCGA-A7-A0CH-11A-32R-A089-07" ...
  ..$ Ensembl_gene_id_v: chr [1:231] "ENSG00000127837.10" "ENSG00000127837.10" "ENSG00000127837.10" "ENSG00000127837.10" ...
  ..$ gene_name        : chr [1:231] "AAMP" "AAMP" "AAMP" "AAMP" ...
  ..$ unstranded       : int [1:231] 3062 9829 6161 7860 7086 4722 7269 2941 214 4417 ...
  ..$ sample_type_name : chr [1:231] "Primary Tumor" "Solid Tissue Normal" "Primary Tumor" "Solid Tissue Normal" ...

We load our genes as rownames and the aliquot_barcodes as column names

In [ ]:

%%R
colnames(exp_matrix) <- split_df[[1]]$aliquot_barcode
genes <- names(split_df)
rownames(exp_matrix) <- genes
str(exp_matrix, list.len=5)

'data.frame':	2000 obs. of  231 variables:
 $ TCGA-A7-A0CE-01A-11R-A00Z-07: int  28180 3062 1306 8 1660 1029 168 16859 299 1 ...
 $ TCGA-A7-A0CE-11A-21R-A089-07: int  122749 9829 2846 1676 6656 1275 4077 12451 4387 6 ...
 $ TCGA-A7-A0CH-01A-21R-A00Z-07: int  12099 6161 32 2992 3193 978 343 5454 2849 0 ...
 $ TCGA-A7-A0CH-11A-32R-A089-07: int  85733 7860 5980 8475 8724 2671 6010 17694 6122 6 ...
 $ TCGA-A7-A0D9-01A-31R-A056-07: int  28978 7086 455 211 3870 1412 463 3870 3917 1 ...
  [list output truncated]

We also need to generate vectors of which case each aliquot corresponds to as well as whether the aliquot is a tumor or normal sample. We can easily retrieve this information from our split data frame.

As a quick sanity check we can generate a heatmap to look at how our aliquots cluster and generate a histogram to compare the distributions of normal and tumor expression values.

In [ ]:

%%R
colnames(exp_matrix) <- split_df[[1]]$id
cases <- gsub('-', '.', split_df[[1]]$case_barcode)
str(cases)
sample_type <- gsub(' ', '.', split_df[[1]]$sample_type_name)
str(sample_type)
heatmap(as.matrix(exp_matrix), labCol=sample_type, rowv=NA, scale="row")

 chr [1:231] "TCGA.A7.A0CE" "TCGA.A7.A0CE" "TCGA.A7.A0CH" "TCGA.A7.A0CH" ...
 chr [1:231] "Primary.Tumor" "Solid.Tissue.Normal" "Primary.Tumor" ...

In [ ]:

%%R
norm <- log10(unlist(exp_matrix[,which(sample_type == 'Solid.Tissue.Normal')]))
tumor <- log10(unlist(exp_matrix[,which(sample_type == 'Primary.Tumor')]))

p_norm <- hist(norm, breaks=seq(0,8,0.1), plot=0)
p_tumor <- hist(tumor, breaks=seq(0,8,0.1), plot=0)
plot( p_norm, col=rgb(0,0,1,1/8), xlim=c(0,8), xlab='log10(count)', main='')
plot( p_tumor, col=rgb(1,0,0,1/8), xlim=c(0,8), add=T)
legend("topright", legend=c("Normal", "Tumor"), fill=c(rgb(0,0,1,1/8), rgb(1,0,0,1/8)), bty="n")

Calculating differential expression using edgeR¶

We can now feed these data to edgeR, generate significance values of differential expression, and visualize the results using a scatterplot. Genes that fall below a 0.01 significance threshold are color coded red.

In [ ]:

%%R
dge_obj <- DGEList(
             counts=exp_matrix,
             lib.size=colSums(exp_matrix),
             samples=cases,
             group=sample_type)
disp <- estimateDisp(dge_obj, method="pooled")
edgeR_test <- exactTest(disp)
results_df <- topTags(edgeR_test, nrow(exp_matrix))@.Data[[1]]
str(results_df)

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Using classic mode.

'data.frame':	2000 obs. of  4 variables:
 $ logFC : num  2.89 3.38 -3.46 2.34 3.4 ...
 $ logCPM: num  9.64 6.54 12.97 11.38 6.31 ...
 $ PValue: num  4.08e-121 3.55e-115 4.38e-115 1.47e-108 3.14e-108 ...
 $ FDR   : num  8.17e-118 2.92e-112 2.92e-112 7.37e-106 1.26e-105 ...

In [ ]:

%%R
colors <- c(alpha('red', 0.3), alpha('black', 0.6))[(results_df$FDR > 0.01)+1]
plot(results_df$logCPM, results_df$logFC,
     pch=19, cex=1.7, col=colors,
     xlab='log(fold change)', ylab='log(counts per million)')
sum(results_df$FDR <= 0.01)

[1] 1587

Or perform the same calcuations using DESeq2¶

As above we can perform comparisons using DESeq2 instead. We generate the expected input matrix and sample information data frame and run the model on our data.

In [ ]:

%%R
col_info <- as.data.frame(cbind(cases, sample_type), stringAsFactors=TRUE)
deseq_obj <- DESeqDataSetFromMatrix( countData = exp_matrix, colData= col_info, design= ~ cases + sample_type)
deseq_obj <- estimateSizeFactors(deseq_obj)
deseq_obj <- estimateDispersions(deseq_obj)
deseq_obj <- nbinomWaldTest(deseq_obj)
deseq_output <- results(deseq_obj)
summary(deseq_output)

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: gene-wise dispersion estimates

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: mean-dispersion relationship

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: final dispersion estimates

WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: 20 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest

out of 2000 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 911, 46%
LFC < 0 (down)     : 828, 41%
outliers [1]       : 0, 0%
low counts [2]     : 0, 0%
(mean count < 28)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

We can once again visualize our results plotting mean expression against fold change.

In [ ]:

%%R
deseq_results <- deseq_output@listData
colors <- c(alpha('red', 0.3), alpha('black', 0.6))[(deseq_results$padj > 0.01)+1]
plot(log(deseq_results$baseMean), deseq_results$log2FoldChange,
     pch=19, cex=1.7, col=colors,
     xlab='log(mean expression)', ylab='log2(fold change)')
sum(deseq_results$padj <= 0.01)

[1] 1600

Summary¶

It is quite simple to retrieve Genomic Data Commons gene expression data through the ISB-CGC BQ ecosystem. These data can be subset and summarized easily via SQL queries and the results loaded into either R or Python for further analysis and visualization.