Check out more notebooks at our Community Notebooks Repository!
Title: Comparing gene expression in tumor against gene expression in normal tissue
Author: Fabian Seidl
Created: 2021-07-09
URL: https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_analyze_differential_expression_between_paired_tumor_and_normal_samples.ipynb
Purpose: To perform differential expression analysis comparing tumor to normal tissue
Notes:
This notebook is intended to demonstrate how to run a differential gene expression analysis between tumor and normal samples using GDC cancer data made available in BigQuery. We will be using the TCGA gene expression dataset in isb-cgc-bq
, specifically the table isb-cgc-bq.TCGA_versioned.RNAseq_hg38_gdc_r35
. This notebook is modeled to somewhat mirror the differential expression analysis found in Peng, et al. Scientific Reports 2015 (https://doi.org/10.1038/srep13413).
First we need to set up the environment and authenticate our session. This notebook will use both python to construct and run our SQL query and the Bioconductor R libraries edgeR
and DESeq2
.
from google.colab import auth
import pandas_gbq
import pandas as pd
import numpy
import seaborn
from google.cloud import bigquery
!pip install rpy2==3.5.1
%load_ext rpy2.ipython
bq_project='isb-project-zero'
auth.authenticate_user()
client = bigquery.Client(project=bq_project)
print('Authenticated')
Requirement already satisfied: rpy2==3.5.1 in /usr/local/lib/python3.10/dist-packages (3.5.1) Requirement already satisfied: cffi>=1.10.0 in /usr/local/lib/python3.10/dist-packages (from rpy2==3.5.1) (1.15.1) Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from rpy2==3.5.1) (3.1.2) Requirement already satisfied: pytz in /usr/local/lib/python3.10/dist-packages (from rpy2==3.5.1) (2022.7.1) Requirement already satisfied: tzlocal in /usr/local/lib/python3.10/dist-packages (from rpy2==3.5.1) (5.0.1) Requirement already satisfied: pycparser in /usr/local/lib/python3.10/dist-packages (from cffi>=1.10.0->rpy2==3.5.1) (2.21) Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->rpy2==3.5.1) (2.1.3) Authenticated
Our first step is to join sample_type data to our RNA seq table and identify the cases in the RNA seq BigQuery table of choice that have data for both tumor and normal aliquots. To perform this operation we can write an SQL query which groups rows by case barcode and then use the array_agg()
function to concatenate the sample_type_name
field for each case.
sub_project = "TCGA-BRCA"
rna_table = "isb-cgc-bq.TCGA_versioned.RNAseq_hg38_gdc_r35"
case_query = """
WITH rna as (
SELECT
case_barcode,
sample_barcode,
aliquot_barcode,
Ensembl_gene_id_v,
gene_name,
unstranded,
fpkm_uq_unstranded,
sample_type_name,
FROM `isb-cgc-bq.TCGA_versioned.RNAseq_hg38_gdc_r35`
WHERE gene_type = 'protein_coding'
AND project_short_name = '{0}'
),
cases as (
SELECT case_barcode, agg FROM
(SELECT
case_barcode,
array_agg(distinct sample_type_name) agg
FROM rna
GROUP BY case_barcode)
WHERE array_length(agg) > 1
AND ("Solid Tissue Normal" in UNNEST(agg))
)""".format(sub_project)
select_cases = """
SELECT * FROM cases
"""
query_job = client.query( case_query + select_cases )
# print( case_query + select_cases )
joined_data = query_job.result().to_dataframe()
joined_data.head(5)
case_barcode | agg | |
---|---|---|
0 | TCGA-GI-A2C9 | [Solid Tissue Normal, Primary Tumor] |
1 | TCGA-BH-A0DH | [Primary Tumor, Solid Tissue Normal] |
2 | TCGA-BH-A0DL | [Primary Tumor, Solid Tissue Normal] |
3 | TCGA-BH-A1EV | [Primary Tumor, Solid Tissue Normal] |
4 | TCGA-BH-A0BS | [Solid Tissue Normal, Primary Tumor] |
Performing a comparison across all genes is outside of the scope of this notebook, so we will reduce our gene set to a more manageable size. A common metric to subset by is the variance of expression level. We can calculate this metric in an SQL query by grouping by the Ensembl_gene_id_v
field and calculating the VARIANCE()
of the counts field. We then order our results by this value and select the top 2000 genes.
expression_query = """,
mean_expr as (
SELECT * FROM (
SELECT
rna.Ensembl_gene_id_v,
VARIANCE(rna.fpkm_uq_unstranded) var_fpkm
FROM rna
JOIN cases ON rna.case_barcode = cases.case_barcode
WHERE rna.sample_type_name = 'Solid Tissue Normal'
GROUP BY rna.Ensembl_gene_id_v)
ORDER BY var_fpkm DESC
LIMIT 2000)""".format(rna_table)
select_genes = """
SELECT * FROM mean_expr
"""
query_job = client.query( case_query + expression_query + select_genes )
# print( case_query + expression_query + select_genes )
joined_data = query_job.result().to_dataframe()
joined_data.head(5)
Ensembl_gene_id_v | var_fpkm | |
---|---|---|
0 | ENSG00000198886.2 | 4.219112e+07 |
1 | ENSG00000198938.2 | 3.523055e+07 |
2 | ENSG00000198804.2 | 2.799827e+07 |
3 | ENSG00000143632.14 | 2.615222e+07 |
4 | ENSG00000198840.2 | 2.575547e+07 |
To generate the final table we join the full RNA expression table back to the tables housing the list of cases as well as the list of genes. As both edgeR
and DESeq2
require counts as input the important fields we need are case_barcode
, aliquot_barcode
, gene_name
, sample_type_name
, and HTSeq__Counts
.
final_join = """
SELECT rna.case_barcode,
rna.sample_barcode,
rna.aliquot_barcode,
rna.Ensembl_gene_id_v,
rna.gene_name,
rna.unstranded,
rna.sample_type_name,
FROM rna
JOIN cases ON rna.case_barcode = cases.case_barcode
JOIN mean_expr ON rna.Ensembl_gene_id_v = mean_expr.Ensembl_gene_id_v
WHERE rna.sample_type_name = 'Solid Tissue Normal'
OR rna.sample_type_name = 'Primary Tumor'
ORDER BY Ensembl_gene_id_v, case_barcode, aliquot_barcode
""".format(rna_table)
query_job = client.query( case_query + expression_query + final_join )
# print( case_query + expression_query + final_join )
joined_data = query_job.result().to_dataframe()
joined_data.head(10)
case_barcode | sample_barcode | aliquot_barcode | Ensembl_gene_id_v | gene_name | unstranded | sample_type_name | |
---|---|---|---|---|---|---|---|
0 | TCGA-A7-A0CE | TCGA-A7-A0CE-01A | TCGA-A7-A0CE-01A-11R-A00Z-07 | ENSG00000000005.6 | TNMD | 54 | Primary Tumor |
1 | TCGA-A7-A0CE | TCGA-A7-A0CE-11A | TCGA-A7-A0CE-11A-21R-A089-07 | ENSG00000000005.6 | TNMD | 319 | Solid Tissue Normal |
2 | TCGA-A7-A0CH | TCGA-A7-A0CH-01A | TCGA-A7-A0CH-01A-21R-A00Z-07 | ENSG00000000005.6 | TNMD | 0 | Primary Tumor |
3 | TCGA-A7-A0CH | TCGA-A7-A0CH-11A | TCGA-A7-A0CH-11A-32R-A089-07 | ENSG00000000005.6 | TNMD | 2114 | Solid Tissue Normal |
4 | TCGA-A7-A0D9 | TCGA-A7-A0D9-01A | TCGA-A7-A0D9-01A-31R-A056-07 | ENSG00000000005.6 | TNMD | 72 | Primary Tumor |
5 | TCGA-A7-A0D9 | TCGA-A7-A0D9-11A | TCGA-A7-A0D9-11A-53R-A089-07 | ENSG00000000005.6 | TNMD | 3612 | Solid Tissue Normal |
6 | TCGA-A7-A0DB | TCGA-A7-A0DB-01A | TCGA-A7-A0DB-01A-11R-A00Z-07 | ENSG00000000005.6 | TNMD | 68 | Primary Tumor |
7 | TCGA-A7-A0DB | TCGA-A7-A0DB-01A | TCGA-A7-A0DB-01A-11R-A277-07 | ENSG00000000005.6 | TNMD | 47 | Primary Tumor |
8 | TCGA-A7-A0DB | TCGA-A7-A0DB-01C | TCGA-A7-A0DB-01C-02R-A277-07 | ENSG00000000005.6 | TNMD | 29 | Primary Tumor |
9 | TCGA-A7-A0DB | TCGA-A7-A0DB-11A | TCGA-A7-A0DB-11A-33R-A089-07 | ENSG00000000005.6 | TNMD | 303 | Solid Tissue Normal |
At this juncture we will move the analysis to R to take advantage of the standard Bioconductor differential expression packages. During the initial setup we loaded rpy2
which allows us to execute cells in R using cell magic.
We will first need to install our R packages. Be aware that unfortunately this step can take upwards of 10 minutes and will request user input on whether to update dependencies, enter "a" for updating all. This behaviour can be common working in the cloud, which can be dependent on server availability and speed of interconnections. Executing this code in a Google Virtual Machine, especially one with the libraries pre-installed, is substantially faster.
%%R
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("edgeR", "DESeq2"))
We can import the data frame into the R environment by with the -i
flag in our %%R
cell magic call and load the libraries we just installed.
%%R -i joined_data
library(DESeq2)
library(edgeR)
library(scales)
Both edgeR
and DESeq2
expect a data matrix of counts as an input where each column corresponds to a sample and each row to a gene. In our query we sorted our data frame by the gene_name
, case_barcode
, and aliquot_barcode
. This allows us to split our data frame on gene_name and then rbind()
the resulting count vectors into a matrix.
%%R
split_df <- split(joined_data, f=joined_data$gene_name)
str(split_df[c(1,2)]) # showing two data frames from our list output
expression_list <- lapply(split_df, function(df){df$unstranded})
exp_matrix <- as.data.frame(Reduce(rbind, expression_list))
List of 2 $ A2M :'data.frame': 231 obs. of 7 variables: ..$ case_barcode : chr [1:231] "TCGA-A7-A0CE" "TCGA-A7-A0CE" "TCGA-A7-A0CH" "TCGA-A7-A0CH" ... ..$ sample_barcode : chr [1:231] "TCGA-A7-A0CE-01A" "TCGA-A7-A0CE-11A" "TCGA-A7-A0CH-01A" "TCGA-A7-A0CH-11A" ... ..$ aliquot_barcode : chr [1:231] "TCGA-A7-A0CE-01A-11R-A00Z-07" "TCGA-A7-A0CE-11A-21R-A089-07" "TCGA-A7-A0CH-01A-21R-A00Z-07" "TCGA-A7-A0CH-11A-32R-A089-07" ... ..$ Ensembl_gene_id_v: chr [1:231] "ENSG00000175899.15" "ENSG00000175899.15" "ENSG00000175899.15" "ENSG00000175899.15" ... ..$ gene_name : chr [1:231] "A2M" "A2M" "A2M" "A2M" ... ..$ unstranded : int [1:231] 28180 122749 12099 85733 28978 93660 31811 22960 2606 84430 ... ..$ sample_type_name : chr [1:231] "Primary Tumor" "Solid Tissue Normal" "Primary Tumor" "Solid Tissue Normal" ... $ AAMP:'data.frame': 231 obs. of 7 variables: ..$ case_barcode : chr [1:231] "TCGA-A7-A0CE" "TCGA-A7-A0CE" "TCGA-A7-A0CH" "TCGA-A7-A0CH" ... ..$ sample_barcode : chr [1:231] "TCGA-A7-A0CE-01A" "TCGA-A7-A0CE-11A" "TCGA-A7-A0CH-01A" "TCGA-A7-A0CH-11A" ... ..$ aliquot_barcode : chr [1:231] "TCGA-A7-A0CE-01A-11R-A00Z-07" "TCGA-A7-A0CE-11A-21R-A089-07" "TCGA-A7-A0CH-01A-21R-A00Z-07" "TCGA-A7-A0CH-11A-32R-A089-07" ... ..$ Ensembl_gene_id_v: chr [1:231] "ENSG00000127837.10" "ENSG00000127837.10" "ENSG00000127837.10" "ENSG00000127837.10" ... ..$ gene_name : chr [1:231] "AAMP" "AAMP" "AAMP" "AAMP" ... ..$ unstranded : int [1:231] 3062 9829 6161 7860 7086 4722 7269 2941 214 4417 ... ..$ sample_type_name : chr [1:231] "Primary Tumor" "Solid Tissue Normal" "Primary Tumor" "Solid Tissue Normal" ...
We load our genes as rownames and the aliquot_barcodes as column names
%%R
colnames(exp_matrix) <- split_df[[1]]$aliquot_barcode
genes <- names(split_df)
rownames(exp_matrix) <- genes
str(exp_matrix, list.len=5)
'data.frame': 2000 obs. of 231 variables: $ TCGA-A7-A0CE-01A-11R-A00Z-07: int 28180 3062 1306 8 1660 1029 168 16859 299 1 ... $ TCGA-A7-A0CE-11A-21R-A089-07: int 122749 9829 2846 1676 6656 1275 4077 12451 4387 6 ... $ TCGA-A7-A0CH-01A-21R-A00Z-07: int 12099 6161 32 2992 3193 978 343 5454 2849 0 ... $ TCGA-A7-A0CH-11A-32R-A089-07: int 85733 7860 5980 8475 8724 2671 6010 17694 6122 6 ... $ TCGA-A7-A0D9-01A-31R-A056-07: int 28978 7086 455 211 3870 1412 463 3870 3917 1 ... [list output truncated]
We also need to generate vectors of which case each aliquot corresponds to as well as whether the aliquot is a tumor or normal sample. We can easily retrieve this information from our split data frame.
As a quick sanity check we can generate a heatmap to look at how our aliquots cluster and generate a histogram to compare the distributions of normal and tumor expression values.
%%R
colnames(exp_matrix) <- split_df[[1]]$id
cases <- gsub('-', '.', split_df[[1]]$case_barcode)
str(cases)
sample_type <- gsub(' ', '.', split_df[[1]]$sample_type_name)
str(sample_type)
heatmap(as.matrix(exp_matrix), labCol=sample_type, rowv=NA, scale="row")
chr [1:231] "TCGA.A7.A0CE" "TCGA.A7.A0CE" "TCGA.A7.A0CH" "TCGA.A7.A0CH" ... chr [1:231] "Primary.Tumor" "Solid.Tissue.Normal" "Primary.Tumor" ...
%%R
norm <- log10(unlist(exp_matrix[,which(sample_type == 'Solid.Tissue.Normal')]))
tumor <- log10(unlist(exp_matrix[,which(sample_type == 'Primary.Tumor')]))
p_norm <- hist(norm, breaks=seq(0,8,0.1), plot=0)
p_tumor <- hist(tumor, breaks=seq(0,8,0.1), plot=0)
plot( p_norm, col=rgb(0,0,1,1/8), xlim=c(0,8), xlab='log10(count)', main='')
plot( p_tumor, col=rgb(1,0,0,1/8), xlim=c(0,8), add=T)
legend("topright", legend=c("Normal", "Tumor"), fill=c(rgb(0,0,1,1/8), rgb(1,0,0,1/8)), bty="n")
We can now feed these data to edgeR, generate significance values of differential expression, and visualize the results using a scatterplot. Genes that fall below a 0.01 significance threshold are color coded red.
%%R
dge_obj <- DGEList(
counts=exp_matrix,
lib.size=colSums(exp_matrix),
samples=cases,
group=sample_type)
disp <- estimateDisp(dge_obj, method="pooled")
edgeR_test <- exactTest(disp)
results_df <- topTags(edgeR_test, nrow(exp_matrix))@.Data[[1]]
str(results_df)
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: Using classic mode.
'data.frame': 2000 obs. of 4 variables: $ logFC : num 2.89 3.38 -3.46 2.34 3.4 ... $ logCPM: num 9.64 6.54 12.97 11.38 6.31 ... $ PValue: num 4.08e-121 3.55e-115 4.38e-115 1.47e-108 3.14e-108 ... $ FDR : num 8.17e-118 2.92e-112 2.92e-112 7.37e-106 1.26e-105 ...
%%R
colors <- c(alpha('red', 0.3), alpha('black', 0.6))[(results_df$FDR > 0.01)+1]
plot(results_df$logCPM, results_df$logFC,
pch=19, cex=1.7, col=colors,
xlab='log(fold change)', ylab='log(counts per million)')
sum(results_df$FDR <= 0.01)
[1] 1587
As above we can perform comparisons using DESeq2 instead. We generate the expected input matrix and sample information data frame and run the model on our data.
%%R
col_info <- as.data.frame(cbind(cases, sample_type), stringAsFactors=TRUE)
deseq_obj <- DESeqDataSetFromMatrix( countData = exp_matrix, colData= col_info, design= ~ cases + sample_type)
deseq_obj <- estimateSizeFactors(deseq_obj)
deseq_obj <- estimateDispersions(deseq_obj)
deseq_obj <- nbinomWaldTest(deseq_obj)
deseq_output <- results(deseq_obj)
summary(deseq_output)
WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: gene-wise dispersion estimates WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: mean-dispersion relationship WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: final dispersion estimates WARNING:rpy2.rinterface_lib.callbacks:R[write to console]: 20 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest
out of 2000 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 911, 46% LFC < 0 (down) : 828, 41% outliers [1] : 0, 0% low counts [2] : 0, 0% (mean count < 28) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results
We can once again visualize our results plotting mean expression against fold change.
%%R
deseq_results <- deseq_output@listData
colors <- c(alpha('red', 0.3), alpha('black', 0.6))[(deseq_results$padj > 0.01)+1]
plot(log(deseq_results$baseMean), deseq_results$log2FoldChange,
pch=19, cex=1.7, col=colors,
xlab='log(mean expression)', ylab='log2(fold change)')
sum(deseq_results$padj <= 0.01)
[1] 1600
It is quite simple to retrieve Genomic Data Commons gene expression data through the ISB-CGC BQ ecosystem. These data can be subset and summarized easily via SQL queries and the results loaded into either R or Python for further analysis and visualization.