Check out other notebooks at our Community Notebooks Repository!
This notebook demonstrates basic usage of the Reactome BigQuery dataset. Analysis of this dataset can provide a powerful tool for identifying pathways related to cancer biomarkers.
The Reactome is an open-source, manually curated, and peer-reviewed pathway database. More information can be found here: https://reactome.org/.
Before running the analysis, we need to load dependencies, authenticate to BigQuery, and customize notebook parameters.
# GCP Libraries
from google.cloud import bigquery
from google.colab import auth
# Data Analytics
import numpy as np
from scipy import stats
Before using BigQuery, we need to get authorization for access to BigQuery and the Google Cloud. For more information see 'Quick Start Guide to ISB-CGC'. Alternative authentication methods can be found here.
# if you're using Google Colab, authenticate to gcloud with the following
auth.authenticate_user()
# alternatively, use the gcloud SDK
#!gcloud auth application-default login
Customize the following parameters based on your notebook, execution environment, or project.
# set the google project that will be billed for this notebook's computations
google_project = 'google-project' ## change me
Create the BigQuery client.
# Create a client to access the data within BigQuery
client = bigquery.Client(google_project)
We can join tables from the Reactome BigQuery dataset to identify all pathways related to our genes of interest. We first have to map the gene names to Uniprot IDs in the Reactome "physical entities" table. These can then be mapped to Reactome pathways. We further filter the physical entity to pathway evidence codes to retain only interactions that have "Traceable Author Statements" (TAS) rather than just "Inferred from Electronic Annotation" (IEA) to avoid evidence that have not been manually curated.
We use the following genes to identify related pathways. These genes were identified in an ovarian cancer chemo-response study by Bosquet et al.
# set parameters for query
genes = "'RHOT1','MYO7A','ZBTB10','MATK','ST18','RPS23','GCNT1','DROSHA','NUAK1','CCPG1',\
'PDGFD','KLRAP1','MTAP','RNF13','THBS1','MLX','FAP','TIMP3','PRSS1','SLC7A11',\
'OLFML3','RPS20','MCM5','POLE','STEAP4','LRRC8D','WBP1L','ENTPD5','SYNE1','DPT',\
'COPZ2','TRIO','PDPR'"
# run query and put results in data frame
pathways = client.query(("""
SELECT
DISTINCT pathway.*
FROM
`isb-cgc-bq.reactome_versioned.pathway_v77` as pathway
INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` as pe2pathway
-- link pathways to physical entities via intermediate table
ON pathway.stable_id = pe2pathway.pathway_stable_id
INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe
-- link pathways to physical entities
ON pe2pathway.pe_stable_id = pe.stable_id
WHERE
-- filter by stronger evidence: "Traceable Author Statement"
pe2pathway.evidence_code = 'TAS'
-- filter by pathways that are related to genes in list
AND pe.name IN ({genes})
ORDER BY pathway.name ASC
""").format(
genes=genes
)).result().to_dataframe()
# Display the data frame
pathways
stable_id | url | name | species | lowest_level | |
---|---|---|---|---|---|
0 | R-HSA-176187 | https://reactome.org/PathwayBrowser/#/R-HSA-17... | Activation of ATR in response to replication s... | Homo sapiens | True |
1 | R-HSA-72662 | https://reactome.org/PathwayBrowser/#/R-HSA-72662 | Activation of the mRNA upon binding of the cap... | Homo sapiens | False |
2 | R-HSA-68962 | https://reactome.org/PathwayBrowser/#/R-HSA-68962 | Activation of the pre-replicative complex | Homo sapiens | True |
3 | R-HSA-352230 | https://reactome.org/PathwayBrowser/#/R-HSA-35... | Amino acid transport across the plasma membrane | Homo sapiens | True |
4 | R-HSA-446203 | https://reactome.org/PathwayBrowser/#/R-HSA-44... | Asparagine N-linked glycosylation | Homo sapiens | False |
... | ... | ... | ... | ... | ... |
148 | R-HSA-192823 | https://reactome.org/PathwayBrowser/#/R-HSA-19... | Viral mRNA Translation | Homo sapiens | True |
149 | R-HSA-2187338 | https://reactome.org/PathwayBrowser/#/R-HSA-21... | Visual phototransduction | Homo sapiens | False |
150 | R-HSA-193704 | https://reactome.org/PathwayBrowser/#/R-HSA-19... | p75 NTR receptor-mediated signalling | Homo sapiens | False |
151 | R-HSA-72312 | https://reactome.org/PathwayBrowser/#/R-HSA-72312 | rRNA processing | Homo sapiens | False |
152 | R-HSA-8868773 | https://reactome.org/PathwayBrowser/#/R-HSA-88... | rRNA processing in the nucleus and cytosol | Homo sapiens | False |
153 rows × 5 columns
If we're only interested in the lowest level pathways, i.e., pathways that are not parents of other pathways in the hierarchy, we can filter by the "lowest_level" field in the pathways table.
# run query and put results in data frame
lowest_pathways = client.query(("""
SELECT
DISTINCT pathway.*
FROM
`isb-cgc-bq.reactome_versioned.pathway_v77` as pathway
INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` as pe2pathway
-- link pathways to physical entities via intermediate table
ON pathway.stable_id = pe2pathway.pathway_stable_id
INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe
-- link pathways to physical entities
ON pe2pathway.pe_stable_id = pe.stable_id
WHERE
-- filter by stronger evidence: "Traceable Author Statement"
pe2pathway.evidence_code = 'TAS'
-- filter by pathways that are related to genes in list
AND pe.name IN ({genes})
-- filter to include just lowest level pathways
AND pathway.lowest_level = TRUE
ORDER BY pathway.name ASC
""").format(
genes=genes
)).result().to_dataframe()
# display data frame
lowest_pathways
stable_id | url | name | species | lowest_level | |
---|---|---|---|---|---|
0 | R-HSA-176187 | https://reactome.org/PathwayBrowser/#/R-HSA-17... | Activation of ATR in response to replication s... | Homo sapiens | True |
1 | R-HSA-68962 | https://reactome.org/PathwayBrowser/#/R-HSA-68962 | Activation of the pre-replicative complex | Homo sapiens | True |
2 | R-HSA-352230 | https://reactome.org/PathwayBrowser/#/R-HSA-35... | Amino acid transport across the plasma membrane | Homo sapiens | True |
3 | R-HSA-210991 | https://reactome.org/PathwayBrowser/#/R-HSA-21... | Basigin interactions | Homo sapiens | True |
4 | R-HSA-9013148 | https://reactome.org/PathwayBrowser/#/R-HSA-90... | CDC42 GTPase cycle | Homo sapiens | True |
5 | R-HSA-6811434 | https://reactome.org/PathwayBrowser/#/R-HSA-68... | COPI-dependent Golgi-to-ER retrograde traffic | Homo sapiens | True |
6 | R-HSA-6807878 | https://reactome.org/PathwayBrowser/#/R-HSA-68... | COPI-mediated anterograde transport | Homo sapiens | True |
7 | R-HSA-163765 | https://reactome.org/PathwayBrowser/#/R-HSA-16... | ChREBP activates metabolic gene expression | Homo sapiens | True |
8 | R-HSA-418885 | https://reactome.org/PathwayBrowser/#/R-HSA-41... | DCC mediated attractive signaling | Homo sapiens | True |
9 | R-HSA-68952 | https://reactome.org/PathwayBrowser/#/R-HSA-68952 | DNA replication initiation | Homo sapiens | True |
10 | R-HSA-5696400 | https://reactome.org/PathwayBrowser/#/R-HSA-56... | Dual Incision in GG-NER | Homo sapiens | True |
11 | R-HSA-6782135 | https://reactome.org/PathwayBrowser/#/R-HSA-67... | Dual incision in TC-NER | Homo sapiens | True |
12 | R-HSA-72689 | https://reactome.org/PathwayBrowser/#/R-HSA-72689 | Formation of a pool of free 40S subunits | Homo sapiens | True |
13 | R-HSA-72695 | https://reactome.org/PathwayBrowser/#/R-HSA-72695 | Formation of the ternary complex, and subseque... | Homo sapiens | True |
14 | R-HSA-416482 | https://reactome.org/PathwayBrowser/#/R-HSA-41... | G alpha (12/13) signalling events | Homo sapiens | True |
15 | R-HSA-72706 | https://reactome.org/PathwayBrowser/#/R-HSA-72706 | GTP hydrolysis and joining of the 60S ribosoma... | Homo sapiens | True |
16 | R-HSA-5696397 | https://reactome.org/PathwayBrowser/#/R-HSA-56... | Gap-filling DNA repair synthesis and ligation ... | Homo sapiens | True |
17 | R-HSA-6782210 | https://reactome.org/PathwayBrowser/#/R-HSA-67... | Gap-filling DNA repair synthesis and ligation ... | Homo sapiens | True |
18 | R-HSA-8950505 | https://reactome.org/PathwayBrowser/#/R-HSA-89... | Gene and protein expression by JAK-STAT signal... | Homo sapiens | True |
19 | R-HSA-216083 | https://reactome.org/PathwayBrowser/#/R-HSA-21... | Integrin cell surface interactions | Homo sapiens | True |
20 | R-HSA-156827 | https://reactome.org/PathwayBrowser/#/R-HSA-15... | L13a-mediated translational silencing of Cerul... | Homo sapiens | True |
21 | R-HSA-6791226 | https://reactome.org/PathwayBrowser/#/R-HSA-67... | Major pathway of rRNA processing in the nucleo... | Homo sapiens | True |
22 | R-HSA-1237112 | https://reactome.org/PathwayBrowser/#/R-HSA-12... | Methionine salvage pathway | Homo sapiens | True |
23 | R-HSA-203927 | https://reactome.org/PathwayBrowser/#/R-HSA-20... | MicroRNA (miRNA) biogenesis | Homo sapiens | True |
24 | R-HSA-5223345 | https://reactome.org/PathwayBrowser/#/R-HSA-52... | Miscellaneous transport and binding events | Homo sapiens | True |
25 | R-HSA-193648 | https://reactome.org/PathwayBrowser/#/R-HSA-19... | NRAGE signals death through JNK | Homo sapiens | True |
26 | R-HSA-975957 | https://reactome.org/PathwayBrowser/#/R-HSA-97... | Nonsense Mediated Decay (NMD) enhanced by the ... | Homo sapiens | True |
27 | R-HSA-975956 | https://reactome.org/PathwayBrowser/#/R-HSA-97... | Nonsense Mediated Decay (NMD) independent of t... | Homo sapiens | True |
28 | R-HSA-5173214 | https://reactome.org/PathwayBrowser/#/R-HSA-51... | O-glycosylation of TSR domain-containing proteins | Homo sapiens | True |
29 | R-HSA-68949 | https://reactome.org/PathwayBrowser/#/R-HSA-68949 | Orc1 removal from chromatin | Homo sapiens | True |
30 | R-HSA-5651801 | https://reactome.org/PathwayBrowser/#/R-HSA-56... | PCNA-Dependent Long Patch Base Excision Repair | Homo sapiens | True |
31 | R-HSA-8850843 | https://reactome.org/PathwayBrowser/#/R-HSA-88... | Phosphate bond hydrolysis by NTPDase proteins | Homo sapiens | True |
32 | R-HSA-114608 | https://reactome.org/PathwayBrowser/#/R-HSA-11... | Platelet degranulation | Homo sapiens | True |
33 | R-HSA-9660826 | https://reactome.org/PathwayBrowser/#/R-HSA-96... | Purinergic signaling in leishmaniasis infection | Homo sapiens | True |
34 | R-HSA-9013149 | https://reactome.org/PathwayBrowser/#/R-HSA-90... | RAC1 GTPase cycle | Homo sapiens | True |
35 | R-HSA-9013404 | https://reactome.org/PathwayBrowser/#/R-HSA-90... | RAC2 GTPase cycle | Homo sapiens | True |
36 | R-HSA-9013423 | https://reactome.org/PathwayBrowser/#/R-HSA-90... | RAC3 GTPase cycle | Homo sapiens | True |
37 | R-HSA-8980692 | https://reactome.org/PathwayBrowser/#/R-HSA-89... | RHOA GTPase cycle | Homo sapiens | True |
38 | R-HSA-9013408 | https://reactome.org/PathwayBrowser/#/R-HSA-90... | RHOG GTPase cycle | Homo sapiens | True |
39 | R-HSA-9013409 | https://reactome.org/PathwayBrowser/#/R-HSA-90... | RHOJ GTPase cycle | Homo sapiens | True |
40 | R-HSA-9013425 | https://reactome.org/PathwayBrowser/#/R-HSA-90... | RHOT1 GTPase cycle | Homo sapiens | True |
41 | R-HSA-8936459 | https://reactome.org/PathwayBrowser/#/R-HSA-89... | RUNX1 regulates genes involved in megakaryocyt... | Homo sapiens | True |
42 | R-HSA-110314 | https://reactome.org/PathwayBrowser/#/R-HSA-11... | Recognition of DNA damage by PCNA-containing r... | Homo sapiens | True |
43 | R-HSA-6804756 | https://reactome.org/PathwayBrowser/#/R-HSA-68... | Regulation of TP53 Activity through Phosphoryl... | Homo sapiens | True |
44 | R-HSA-204174 | https://reactome.org/PathwayBrowser/#/R-HSA-20... | Regulation of pyruvate dehydrogenase (PDH) com... | Homo sapiens | True |
45 | R-HSA-72702 | https://reactome.org/PathwayBrowser/#/R-HSA-72702 | Ribosomal scanning and start codon recognition | Homo sapiens | True |
46 | R-HSA-1799339 | https://reactome.org/PathwayBrowser/#/R-HSA-17... | SRP-dependent cotranslational protein targetin... | Homo sapiens | True |
47 | R-HSA-5656169 | https://reactome.org/PathwayBrowser/#/R-HSA-56... | Termination of translesion DNA synthesis | Homo sapiens | True |
48 | R-HSA-72649 | https://reactome.org/PathwayBrowser/#/R-HSA-72649 | Translation initiation complex formation | Homo sapiens | True |
49 | R-HSA-5689880 | https://reactome.org/PathwayBrowser/#/R-HSA-56... | Ub-specific processing proteases | Homo sapiens | True |
50 | R-HSA-192823 | https://reactome.org/PathwayBrowser/#/R-HSA-19... | Viral mRNA Translation | Homo sapiens | True |
We can identify pathways that are "enriched" with the genes of interest. In other words, we can answer the question: given a set of interesting genes, which pathways contain those genes at a frequency higher than random chance? By calculating the probability that a number of target genes are contained in each pathway, we can identify pathways most likely to be related.
To do this, we can use a chi-squared test to determine if there is a statistically significant difference between the expected frequency of genes in a pathway compared to the observed frequency.
# set query parameters
lowest_level = True # only show pathways at the lowest level
A single query can be used to calculate the chi-squared statistic for all pathways. This query is rather lengthy, but can be broken up into a series of named sub-queries. We step through each query below.
First, we write a query that simply gets a list of all genes in the Reactome physical entity table:
gene_list_query = """
-- Table that contains a list of all distinct genes that map to Reactome
-- physical entities
SELECT
DISTINCT pe.uniprot_id
FROM
`isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe
WHERE
-- filter by pathways that are related to genes in list
pe.name IN ({genes})
""".format(
genes=genes
).strip("\n")
print(gene_list_query)
-- Table that contains a list of all distinct genes that map to Reactome -- physical entities SELECT DISTINCT pe.uniprot_id FROM `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe WHERE -- filter by pathways that are related to genes in list pe.name IN ('RHOT1','MYO7A','ZBTB10','MATK','ST18','RPS23','GCNT1','DROSHA','NUAK1','CCPG1','PDGFD','KLRAP1','MTAP','RNF13','THBS1','MLX','FAP','TIMP3','PRSS1','SLC7A11','OLFML3','RPS20','MCM5','POLE','STEAP4','LRRC8D','WBP1L','ENTPD5','SYNE1','DPT','COPZ2','TRIO','PDPR')
We then create a query that counts all targets associated with each Reactome pathway. This query depends on the previous gene_list_query
sub-query.
gene_pp_query = """
-- Table that maps pathways to the total number of interesting genes within
-- that pathway
SELECT
COUNT(DISTINCT gene_list_query.uniprot_id) as num_genes,
pathway.stable_id,
pathway.name
FROM
gene_list_query
INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe
-- filter for interactions with genes that match a reactome
-- physical entity
ON gene_list_query.uniprot_id = pe.uniprot_id
INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` AS pe2pathway
-- link physical entities to pathways via intermediate table
ON pe.stable_id = pe2pathway.pe_stable_id
INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway
-- link physical entities to pathways
ON pe2pathway.pathway_stable_id = pathway.stable_id
WHERE
-- filter by stronger evidence: "Traceable Author Statement"
pe2pathway.evidence_code = 'TAS'
GROUP BY pathway.stable_id, pathway.name
ORDER BY num_genes DESC
""".strip("\n")
print(gene_pp_query)
-- Table that maps pathways to the total number of interesting genes within -- that pathway SELECT COUNT(DISTINCT gene_list_query.uniprot_id) as num_genes, pathway.stable_id, pathway.name FROM gene_list_query INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe -- filter for interactions with genes that match a reactome -- physical entity ON gene_list_query.uniprot_id = pe.uniprot_id INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` AS pe2pathway -- link physical entities to pathways via intermediate table ON pe.stable_id = pe2pathway.pe_stable_id INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway -- link physical entities to pathways ON pe2pathway.pathway_stable_id = pathway.stable_id WHERE -- filter by stronger evidence: "Traceable Author Statement" pe2pathway.evidence_code = 'TAS' GROUP BY pathway.stable_id, pathway.name ORDER BY num_genes DESC
Now we construct the same queries for genes that are NOT in the interesting gene list. These are prefixed with "not_gene". This query depends on the previous gene_list_query
sub-query.
not_gene_list_query = """
-- Table that contains a list of all genes that are NOT in the interest list
-- This query depends on the previous "gene_list_query" sub-query.
SELECT
DISTINCT pe.uniprot_id
FROM `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe
WHERE
pe.uniprot_id NOT IN (
SELECT uniprot_id FROM gene_list_query
)
""".strip("\n")
print(not_gene_list_query)
-- Table that contains a list of all genes that are NOT in the interest list -- This query depends on the previous "gene_list_query" sub-query. SELECT DISTINCT pe.uniprot_id FROM `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe WHERE pe.uniprot_id NOT IN ( SELECT uniprot_id FROM gene_list_query )
not_gene_pp_query = """
-- Table that maps pathways to the number of proteins that are NOT drug
-- targets in that pathway.
SELECT
COUNT(DISTINCT not_gene_list_query.uniprot_id) AS num_not_genes,
pathway.stable_id,
pathway.name
FROM not_gene_list_query
INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe
ON not_gene_list_query.uniprot_id = pe.uniprot_id
INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` AS pe2pathway
ON pe.stable_id = pe2pathway.pe_stable_id
INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway
ON pe2pathway.pathway_stable_id = pathway.stable_id
WHERE
-- filter by stronger evidence: "Traceable Author Statement"
pe2pathway.evidence_code = 'TAS'
GROUP BY pathway.stable_id, pathway.name
ORDER BY num_not_genes DESC
""".strip("\n")
print(not_gene_pp_query)
-- Table that maps pathways to the number of proteins that are NOT drug -- targets in that pathway. SELECT COUNT(DISTINCT not_gene_list_query.uniprot_id) AS num_not_genes, pathway.stable_id, pathway.name FROM not_gene_list_query INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe ON not_gene_list_query.uniprot_id = pe.uniprot_id INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` AS pe2pathway ON pe.stable_id = pe2pathway.pe_stable_id INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway ON pe2pathway.pathway_stable_id = pathway.stable_id WHERE -- filter by stronger evidence: "Traceable Author Statement" pe2pathway.evidence_code = 'TAS' GROUP BY pathway.stable_id, pathway.name ORDER BY num_not_genes DESC
For convenience, we create a sub-query to get the counts of the number of genes that are in the list and the number of genes that are not in the list.
gene_count_query = """
-- Table that contains the counts of # of genes that are/are not targets
SELECT
gene_count,
not_gene_count,
gene_count + not_gene_count AS total_count
FROM
(SELECT COUNT(*) AS gene_count FROM gene_list_query),
(SELECT COUNT(*) AS not_gene_count FROM not_gene_list_query)
""".strip("\n")
Now we can create more interesting queries for the contingency matrices that contain the observed and expected values.
First, the observed contingency table counts for each pathway:
observed_query = """
-- Table with observed values per pathway in the contingency matrix
SELECT
gene_pp_query.num_genes AS in_gene_in_pathway,
not_gene_pp_query.num_not_genes AS not_gene_in_pathway,
gene_count_query.gene_count - gene_pp_query.num_genes AS in_gene_not_pathway,
gene_count_query.not_gene_count - not_gene_pp_query.num_not_genes AS not_gene_not_pathway,
gene_pp_query.stable_id,
gene_pp_query.name
FROM
gene_pp_query,
gene_count_query
INNER JOIN not_gene_pp_query
ON gene_pp_query.stable_id = not_gene_pp_query.stable_id
""".strip("\n")
Then the observed row and column sums of the contingency table:
sum_query = """
-- Table with summed observed values per pathway in the contingency matrix
SELECT
observed_query.in_gene_in_pathway + observed_query.not_gene_in_pathway AS pathway_total,
observed_query.in_gene_not_pathway + observed_query.not_gene_not_pathway AS not_pathway_total,
observed_query.in_gene_in_pathway + observed_query.in_gene_not_pathway AS gene_total,
observed_query.not_gene_in_pathway + observed_query.not_gene_not_pathway AS not_gene_total,
observed_query.stable_id,
observed_query.name
FROM
observed_query
""".strip("\n")
And the expected contingency table values for each pathway:
expected_query = """
-- Table with the expected values per pathway in the contingency matrix
SELECT
sum_query.gene_total * sum_query.pathway_total / gene_count_query.total_count AS exp_in_gene_in_pathway,
sum_query.not_gene_total * sum_query.pathway_total / gene_count_query.total_count AS exp_not_gene_in_pathway,
sum_query.gene_total * sum_query.not_pathway_total / gene_count_query.total_count AS exp_in_gene_not_pathway,
sum_query.not_gene_total * sum_query.not_pathway_total / gene_count_query.total_count AS exp_not_gene_not_pathway,
sum_query.stable_id,
sum_query.name
FROM
sum_query, gene_count_query
""".strip("\n")
Finally, we can calculate the chi-squared statistic for each pathway:
chi_squared_query = """
-- Table with the chi-squared statistic for each pathway
SELECT
-- Chi squared statistic with Yates' correction
POW(ABS(observed_query.in_gene_in_pathway - expected_query.exp_in_gene_in_pathway) - 0.5, 2) / expected_query.exp_in_gene_in_pathway
+ POW(ABS(observed_query.not_gene_in_pathway - expected_query.exp_not_gene_in_pathway) - 0.5, 2) / expected_query.exp_not_gene_in_pathway
+ POW(ABS(observed_query.in_gene_not_pathway - expected_query.exp_in_gene_not_pathway) - 0.5, 2) / expected_query.exp_in_gene_not_pathway
+ POW(ABS(observed_query.not_gene_not_pathway - expected_query.exp_not_gene_not_pathway) - 0.5, 2) / expected_query.exp_not_gene_not_pathway
AS chi_squared_stat,
observed_query.stable_id,
observed_query.name
FROM observed_query
INNER JOIN expected_query
ON observed_query.stable_id = expected_query.stable_id
""".strip("\n")
The final piece of the query optionally adds a filter that removes all pathways that are not at the lowest level of the hierarchy. This helps remove non-specific "pathways" such as the generic "Disease" pathway.
lowest_level_filter = """
INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway
ON chi_squared_query.stable_id = pathway.stable_id
WHERE pathway.lowest_level = TRUE
""".strip("\n")
print(lowest_level_filter)
INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway ON chi_squared_query.stable_id = pathway.stable_id WHERE pathway.lowest_level = TRUE
Now we can combine all sub-queries to create the final query:
final_query = """
WITH
gene_list_query AS (
{gene_list_query}
),
gene_pp_query AS (
{gene_pp_query}
),
not_gene_list_query AS (
{not_gene_list_query}
),
not_gene_pp_query AS (
{not_gene_pp_query}
),
gene_count_query AS (
{gene_count_query}
),
observed_query AS (
{observed_query}
),
sum_query AS (
{sum_query}
),
expected_query AS (
{expected_query}
),
chi_squared_query AS (
{chi_squared_query}
)
SELECT
observed_query.in_gene_in_pathway,
observed_query.in_gene_not_pathway,
observed_query.not_gene_in_pathway,
observed_query.not_gene_not_pathway,
chi_squared_query.chi_squared_stat,
chi_squared_query.stable_id,
chi_squared_query.name
FROM chi_squared_query
INNER JOIN observed_query
ON chi_squared_query.stable_id = observed_query.stable_id
{lowest_level_filter}
ORDER BY chi_squared_stat DESC
""".format(
# make final query a little easier to read by removing/adding some white space
gene_list_query="\n ".join(gene_list_query.strip().splitlines()),
gene_pp_query="\n ".join(gene_pp_query.strip().splitlines()),
not_gene_list_query="\n ".join(not_gene_list_query.strip().splitlines()),
not_gene_pp_query="\n ".join(not_gene_pp_query.strip().splitlines()),
gene_count_query="\n ".join(gene_count_query.strip().splitlines()),
observed_query="\n ".join(observed_query.strip().splitlines()),
sum_query="\n ".join(sum_query.strip().splitlines()),
expected_query="\n ".join(expected_query.strip().splitlines()),
chi_squared_query="\n ".join(chi_squared_query.strip().splitlines()),
lowest_level_filter=(
"\n "+"\n".join(lowest_level_filter.strip().splitlines())+"\n" if lowest_level else ""
)
).strip("\n")
# print the formatted final query
print(final_query)
WITH gene_list_query AS ( -- Table that contains a list of all distinct genes that map to Reactome -- physical entities SELECT DISTINCT pe.uniprot_id FROM `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe WHERE -- filter by pathways that are related to genes in list pe.name IN ('RHOT1','MYO7A','ZBTB10','MATK','ST18','RPS23','GCNT1','DROSHA','NUAK1','CCPG1','PDGFD','KLRAP1','MTAP','RNF13','THBS1','MLX','FAP','TIMP3','PRSS1','SLC7A11','OLFML3','RPS20','MCM5','POLE','STEAP4','LRRC8D','WBP1L','ENTPD5','SYNE1','DPT','COPZ2','TRIO','PDPR') ), gene_pp_query AS ( -- Table that maps pathways to the total number of interesting genes within -- that pathway SELECT COUNT(DISTINCT gene_list_query.uniprot_id) as num_genes, pathway.stable_id, pathway.name FROM gene_list_query INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe -- filter for interactions with genes that match a reactome -- physical entity ON gene_list_query.uniprot_id = pe.uniprot_id INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` AS pe2pathway -- link physical entities to pathways via intermediate table ON pe.stable_id = pe2pathway.pe_stable_id INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway -- link physical entities to pathways ON pe2pathway.pathway_stable_id = pathway.stable_id WHERE -- filter by stronger evidence: "Traceable Author Statement" pe2pathway.evidence_code = 'TAS' GROUP BY pathway.stable_id, pathway.name ORDER BY num_genes DESC ), not_gene_list_query AS ( -- Table that contains a list of all genes that are NOT in the interest list -- This query depends on the previous "gene_list_query" sub-query. SELECT DISTINCT pe.uniprot_id FROM `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe WHERE pe.uniprot_id NOT IN ( SELECT uniprot_id FROM gene_list_query ) ), not_gene_pp_query AS ( -- Table that maps pathways to the number of proteins that are NOT drug -- targets in that pathway. SELECT COUNT(DISTINCT not_gene_list_query.uniprot_id) AS num_not_genes, pathway.stable_id, pathway.name FROM not_gene_list_query INNER JOIN `isb-cgc-bq.reactome_versioned.physical_entity_v77` AS pe ON not_gene_list_query.uniprot_id = pe.uniprot_id INNER JOIN `isb-cgc-bq.reactome_versioned.pe_to_pathway_v77` AS pe2pathway ON pe.stable_id = pe2pathway.pe_stable_id INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway ON pe2pathway.pathway_stable_id = pathway.stable_id WHERE -- filter by stronger evidence: "Traceable Author Statement" pe2pathway.evidence_code = 'TAS' GROUP BY pathway.stable_id, pathway.name ORDER BY num_not_genes DESC ), gene_count_query AS ( -- Table that contains the counts of # of genes that are/are not targets SELECT gene_count, not_gene_count, gene_count + not_gene_count AS total_count FROM (SELECT COUNT(*) AS gene_count FROM gene_list_query), (SELECT COUNT(*) AS not_gene_count FROM not_gene_list_query) ), observed_query AS ( -- Table with observed values per pathway in the contingency matrix SELECT gene_pp_query.num_genes AS in_gene_in_pathway, not_gene_pp_query.num_not_genes AS not_gene_in_pathway, gene_count_query.gene_count - gene_pp_query.num_genes AS in_gene_not_pathway, gene_count_query.not_gene_count - not_gene_pp_query.num_not_genes AS not_gene_not_pathway, gene_pp_query.stable_id, gene_pp_query.name FROM gene_pp_query, gene_count_query INNER JOIN not_gene_pp_query ON gene_pp_query.stable_id = not_gene_pp_query.stable_id ), sum_query AS ( -- Table with summed observed values per pathway in the contingency matrix SELECT observed_query.in_gene_in_pathway + observed_query.not_gene_in_pathway AS pathway_total, observed_query.in_gene_not_pathway + observed_query.not_gene_not_pathway AS not_pathway_total, observed_query.in_gene_in_pathway + observed_query.in_gene_not_pathway AS gene_total, observed_query.not_gene_in_pathway + observed_query.not_gene_not_pathway AS not_gene_total, observed_query.stable_id, observed_query.name FROM observed_query ), expected_query AS ( -- Table with the expected values per pathway in the contingency matrix SELECT sum_query.gene_total * sum_query.pathway_total / gene_count_query.total_count AS exp_in_gene_in_pathway, sum_query.not_gene_total * sum_query.pathway_total / gene_count_query.total_count AS exp_not_gene_in_pathway, sum_query.gene_total * sum_query.not_pathway_total / gene_count_query.total_count AS exp_in_gene_not_pathway, sum_query.not_gene_total * sum_query.not_pathway_total / gene_count_query.total_count AS exp_not_gene_not_pathway, sum_query.stable_id, sum_query.name FROM sum_query, gene_count_query ), chi_squared_query AS ( -- Table with the chi-squared statistic for each pathway SELECT -- Chi squared statistic with Yates' correction POW(ABS(observed_query.in_gene_in_pathway - expected_query.exp_in_gene_in_pathway) - 0.5, 2) / expected_query.exp_in_gene_in_pathway + POW(ABS(observed_query.not_gene_in_pathway - expected_query.exp_not_gene_in_pathway) - 0.5, 2) / expected_query.exp_not_gene_in_pathway + POW(ABS(observed_query.in_gene_not_pathway - expected_query.exp_in_gene_not_pathway) - 0.5, 2) / expected_query.exp_in_gene_not_pathway + POW(ABS(observed_query.not_gene_not_pathway - expected_query.exp_not_gene_not_pathway) - 0.5, 2) / expected_query.exp_not_gene_not_pathway AS chi_squared_stat, observed_query.stable_id, observed_query.name FROM observed_query INNER JOIN expected_query ON observed_query.stable_id = expected_query.stable_id ) SELECT observed_query.in_gene_in_pathway, observed_query.in_gene_not_pathway, observed_query.not_gene_in_pathway, observed_query.not_gene_not_pathway, chi_squared_query.chi_squared_stat, chi_squared_query.stable_id, chi_squared_query.name FROM chi_squared_query INNER JOIN observed_query ON chi_squared_query.stable_id = observed_query.stable_id INNER JOIN `isb-cgc-bq.reactome_versioned.pathway_v77` AS pathway ON chi_squared_query.stable_id = pathway.stable_id WHERE pathway.lowest_level = TRUE ORDER BY chi_squared_stat DESC
Now execute the query to calculate a chi-squared statistic for each pathway:
# run query and put results in data frame
chi_squared_pathways = client.query(final_query).result().to_dataframe()
# display the data frame
chi_squared_pathways
in_gene_in_pathway | in_gene_not_pathway | not_gene_in_pathway | not_gene_not_pathway | chi_squared_stat | stable_id | name | |
---|---|---|---|---|---|---|---|
0 | 2 | 19 | 31 | 11114 | 33.477023 | R-HSA-68962 | Activation of the pre-replicative complex |
1 | 1 | 20 | 3 | 11142 | 32.311989 | R-HSA-1237112 | Methionine salvage pathway |
2 | 1 | 20 | 3 | 11142 | 32.311989 | R-HSA-9013425 | RHOT1 GTPase cycle |
3 | 2 | 19 | 50 | 11095 | 20.236790 | R-HSA-72695 | Formation of the ternary complex, and subseque... |
4 | 1 | 20 | 6 | 11139 | 18.048197 | R-HSA-163765 | ChREBP activates metabolic gene expression |
5 | 2 | 19 | 57 | 11088 | 17.513505 | R-HSA-72649 | Translation initiation complex formation |
6 | 2 | 19 | 57 | 11088 | 17.513505 | R-HSA-72702 | Ribosomal scanning and start codon recognition |
7 | 1 | 20 | 7 | 11138 | 15.671798 | R-HSA-8850843 | Phosphate bond hydrolysis by NTPDase proteins |
8 | 1 | 20 | 7 | 11138 | 15.671798 | R-HSA-68952 | DNA replication initiation |
9 | 1 | 20 | 10 | 11135 | 11.137000 | R-HSA-418885 | DCC mediated attractive signaling |
10 | 2 | 19 | 88 | 11057 | 10.567006 | R-HSA-192823 | Viral mRNA Translation |
11 | 2 | 19 | 92 | 11053 | 10.006894 | R-HSA-1799339 | SRP-dependent cotranslational protein targetin... |
12 | 2 | 19 | 94 | 11051 | 9.744550 | R-HSA-975956 | Nonsense Mediated Decay (NMD) independent of t... |
13 | 2 | 19 | 100 | 11045 | 9.020030 | R-HSA-72689 | Formation of a pool of free 40S subunits |
14 | 2 | 19 | 110 | 11035 | 7.987388 | R-HSA-156827 | L13a-mediated translational silencing of Cerul... |
15 | 2 | 19 | 111 | 11034 | 7.894339 | R-HSA-72706 | GTP hydrolysis and joining of the 60S ribosoma... |
16 | 2 | 19 | 115 | 11030 | 7.538335 | R-HSA-975957 | Nonsense Mediated Decay (NMD) enhanced by the ... |
17 | 1 | 20 | 15 | 11130 | 7.362504 | R-HSA-204174 | Regulation of pyruvate dehydrogenase (PDH) com... |
18 | 2 | 19 | 119 | 11026 | 7.206312 | R-HSA-114608 | Platelet degranulation |
19 | 1 | 20 | 20 | 11125 | 5.389681 | R-HSA-5651801 | PCNA-Dependent Long Patch Base Excision Repair |
20 | 1 | 20 | 23 | 11122 | 4.602357 | R-HSA-203927 | MicroRNA (miRNA) biogenesis |
21 | 1 | 20 | 24 | 11121 | 4.382195 | R-HSA-9660826 | Purinergic signaling in leishmaniasis infection |
22 | 1 | 20 | 24 | 11121 | 4.382195 | R-HSA-210991 | Basigin interactions |
23 | 1 | 20 | 24 | 11121 | 4.382195 | R-HSA-5223345 | Miscellaneous transport and binding events |
24 | 1 | 20 | 24 | 11121 | 4.382195 | R-HSA-5696397 | Gap-filling DNA repair synthesis and ligation ... |
25 | 2 | 19 | 181 | 10964 | 3.953753 | R-HSA-6791226 | Major pathway of rRNA processing in the nucleo... |
26 | 1 | 20 | 29 | 11116 | 3.503357 | R-HSA-110314 | Recognition of DNA damage by PCNA-containing r... |
27 | 1 | 20 | 31 | 11114 | 3.229512 | R-HSA-5656169 | Termination of translesion DNA synthesis |
28 | 1 | 20 | 33 | 11112 | 2.988310 | R-HSA-352230 | Amino acid transport across the plasma membrane |
29 | 1 | 20 | 36 | 11109 | 2.676134 | R-HSA-176187 | Activation of ATR in response to replication s... |
30 | 1 | 20 | 36 | 11109 | 2.676134 | R-HSA-8950505 | Gene and protein expression by JAK-STAT signal... |
31 | 1 | 20 | 37 | 11108 | 2.583220 | R-HSA-5083635 | Defective B3GALTL causes Peters-plus syndrome ... |
32 | 1 | 20 | 38 | 11107 | 2.495164 | R-HSA-5173214 | O-glycosylation of TSR domain-containing proteins |
33 | 1 | 20 | 40 | 11105 | 2.332203 | R-HSA-5696400 | Dual Incision in GG-NER |
34 | 1 | 20 | 54 | 11091 | 1.530727 | R-HSA-9013409 | RHOJ GTPase cycle |
35 | 1 | 20 | 54 | 11091 | 1.530727 | R-HSA-193648 | NRAGE signals death through JNK |
36 | 1 | 20 | 63 | 11082 | 1.206546 | R-HSA-6782210 | Gap-filling DNA repair synthesis and ligation ... |
37 | 1 | 20 | 64 | 11081 | 1.176348 | R-HSA-6782135 | Dual incision in TC-NER |
38 | 1 | 20 | 65 | 11080 | 1.147121 | R-HSA-8936459 | RUNX1 regulates genes involved in megakaryocyt... |
39 | 1 | 20 | 70 | 11075 | 1.014107 | R-HSA-68949 | Orc1 removal from chromatin |
40 | 1 | 20 | 73 | 11072 | 0.943520 | R-HSA-9013408 | RHOG GTPase cycle |
41 | 1 | 20 | 79 | 11066 | 0.819465 | R-HSA-416482 | G alpha (12/13) signalling events |
42 | 1 | 20 | 80 | 11065 | 0.800735 | R-HSA-216083 | Integrin cell surface interactions |
43 | 1 | 20 | 87 | 11058 | 0.682709 | R-HSA-9013404 | RAC2 GTPase cycle |
44 | 1 | 20 | 91 | 11054 | 0.624208 | R-HSA-6804756 | Regulation of TP53 Activity through Phosphoryl... |
45 | 1 | 20 | 93 | 11052 | 0.597060 | R-HSA-9013423 | RAC3 GTPase cycle |
46 | 1 | 20 | 99 | 11046 | 0.523015 | R-HSA-6811434 | COPI-dependent Golgi-to-ER retrograde traffic |
47 | 1 | 20 | 101 | 11044 | 0.500559 | R-HSA-6807878 | COPI-mediated anterograde transport |
48 | 1 | 20 | 149 | 10996 | 0.170910 | R-HSA-8980692 | RHOA GTPase cycle |
49 | 1 | 20 | 158 | 10987 | 0.137271 | R-HSA-9013148 | CDC42 GTPase cycle |
50 | 1 | 20 | 184 | 10961 | 0.067711 | R-HSA-9013149 | RAC1 GTPase cycle |
51 | 1 | 20 | 204 | 10941 | 0.034678 | R-HSA-5689880 | Ub-specific processing proteases |
BigQuery does not have statistical functions to calculate p-values, so we use the SciPy stats library and update the data frame with a new p-value column:
chi_squared_pathways['p_value'] = 1-stats.chi2.cdf(chi_squared_pathways['chi_squared_stat'], 1)
chi_squared_pathways
in_gene_in_pathway | in_gene_not_pathway | not_gene_in_pathway | not_gene_not_pathway | chi_squared_stat | stable_id | name | p_value | |
---|---|---|---|---|---|---|---|---|
0 | 2 | 19 | 31 | 11114 | 33.477023 | R-HSA-68962 | Activation of the pre-replicative complex | 7.211088e-09 |
1 | 1 | 20 | 3 | 11142 | 32.311989 | R-HSA-1237112 | Methionine salvage pathway | 1.313007e-08 |
2 | 1 | 20 | 3 | 11142 | 32.311989 | R-HSA-9013425 | RHOT1 GTPase cycle | 1.313007e-08 |
3 | 2 | 19 | 50 | 11095 | 20.236790 | R-HSA-72695 | Formation of the ternary complex, and subseque... | 6.842431e-06 |
4 | 1 | 20 | 6 | 11139 | 18.048197 | R-HSA-163765 | ChREBP activates metabolic gene expression | 2.153825e-05 |
5 | 2 | 19 | 57 | 11088 | 17.513505 | R-HSA-72649 | Translation initiation complex formation | 2.852741e-05 |
6 | 2 | 19 | 57 | 11088 | 17.513505 | R-HSA-72702 | Ribosomal scanning and start codon recognition | 2.852741e-05 |
7 | 1 | 20 | 7 | 11138 | 15.671798 | R-HSA-8850843 | Phosphate bond hydrolysis by NTPDase proteins | 7.533920e-05 |
8 | 1 | 20 | 7 | 11138 | 15.671798 | R-HSA-68952 | DNA replication initiation | 7.533920e-05 |
9 | 1 | 20 | 10 | 11135 | 11.137000 | R-HSA-418885 | DCC mediated attractive signaling | 8.462266e-04 |
10 | 2 | 19 | 88 | 11057 | 10.567006 | R-HSA-192823 | Viral mRNA Translation | 1.151240e-03 |
11 | 2 | 19 | 92 | 11053 | 10.006894 | R-HSA-1799339 | SRP-dependent cotranslational protein targetin... | 1.559553e-03 |
12 | 2 | 19 | 94 | 11051 | 9.744550 | R-HSA-975956 | Nonsense Mediated Decay (NMD) independent of t... | 1.798552e-03 |
13 | 2 | 19 | 100 | 11045 | 9.020030 | R-HSA-72689 | Formation of a pool of free 40S subunits | 2.670370e-03 |
14 | 2 | 19 | 110 | 11035 | 7.987388 | R-HSA-156827 | L13a-mediated translational silencing of Cerul... | 4.710432e-03 |
15 | 2 | 19 | 111 | 11034 | 7.894339 | R-HSA-72706 | GTP hydrolysis and joining of the 60S ribosoma... | 4.958976e-03 |
16 | 2 | 19 | 115 | 11030 | 7.538335 | R-HSA-975957 | Nonsense Mediated Decay (NMD) enhanced by the ... | 6.039983e-03 |
17 | 1 | 20 | 15 | 11130 | 7.362504 | R-HSA-204174 | Regulation of pyruvate dehydrogenase (PDH) com... | 6.659799e-03 |
18 | 2 | 19 | 119 | 11026 | 7.206312 | R-HSA-114608 | Platelet degranulation | 7.264762e-03 |
19 | 1 | 20 | 20 | 11125 | 5.389681 | R-HSA-5651801 | PCNA-Dependent Long Patch Base Excision Repair | 2.025618e-02 |
20 | 1 | 20 | 23 | 11122 | 4.602357 | R-HSA-203927 | MicroRNA (miRNA) biogenesis | 3.192804e-02 |
21 | 1 | 20 | 24 | 11121 | 4.382195 | R-HSA-9660826 | Purinergic signaling in leishmaniasis infection | 3.631620e-02 |
22 | 1 | 20 | 24 | 11121 | 4.382195 | R-HSA-210991 | Basigin interactions | 3.631620e-02 |
23 | 1 | 20 | 24 | 11121 | 4.382195 | R-HSA-5223345 | Miscellaneous transport and binding events | 3.631620e-02 |
24 | 1 | 20 | 24 | 11121 | 4.382195 | R-HSA-5696397 | Gap-filling DNA repair synthesis and ligation ... | 3.631620e-02 |
25 | 2 | 19 | 181 | 10964 | 3.953753 | R-HSA-6791226 | Major pathway of rRNA processing in the nucleo... | 4.676696e-02 |
26 | 1 | 20 | 29 | 11116 | 3.503357 | R-HSA-110314 | Recognition of DNA damage by PCNA-containing r... | 6.124455e-02 |
27 | 1 | 20 | 31 | 11114 | 3.229512 | R-HSA-5656169 | Termination of translesion DNA synthesis | 7.232223e-02 |
28 | 1 | 20 | 33 | 11112 | 2.988310 | R-HSA-352230 | Amino acid transport across the plasma membrane | 8.386764e-02 |
29 | 1 | 20 | 36 | 11109 | 2.676134 | R-HSA-176187 | Activation of ATR in response to replication s... | 1.018627e-01 |
30 | 1 | 20 | 36 | 11109 | 2.676134 | R-HSA-8950505 | Gene and protein expression by JAK-STAT signal... | 1.018627e-01 |
31 | 1 | 20 | 37 | 11108 | 2.583220 | R-HSA-5083635 | Defective B3GALTL causes Peters-plus syndrome ... | 1.080017e-01 |
32 | 1 | 20 | 38 | 11107 | 2.495164 | R-HSA-5173214 | O-glycosylation of TSR domain-containing proteins | 1.141965e-01 |
33 | 1 | 20 | 40 | 11105 | 2.332203 | R-HSA-5696400 | Dual Incision in GG-NER | 1.267224e-01 |
34 | 1 | 20 | 54 | 11091 | 1.530727 | R-HSA-9013409 | RHOJ GTPase cycle | 2.160034e-01 |
35 | 1 | 20 | 54 | 11091 | 1.530727 | R-HSA-193648 | NRAGE signals death through JNK | 2.160034e-01 |
36 | 1 | 20 | 63 | 11082 | 1.206546 | R-HSA-6782210 | Gap-filling DNA repair synthesis and ligation ... | 2.720173e-01 |
37 | 1 | 20 | 64 | 11081 | 1.176348 | R-HSA-6782135 | Dual incision in TC-NER | 2.781007e-01 |
38 | 1 | 20 | 65 | 11080 | 1.147121 | R-HSA-8936459 | RUNX1 regulates genes involved in megakaryocyt... | 2.841526e-01 |
39 | 1 | 20 | 70 | 11075 | 1.014107 | R-HSA-68949 | Orc1 removal from chromatin | 3.139209e-01 |
40 | 1 | 20 | 73 | 11072 | 0.943520 | R-HSA-9013408 | RHOG GTPase cycle | 3.313742e-01 |
41 | 1 | 20 | 79 | 11066 | 0.819465 | R-HSA-416482 | G alpha (12/13) signalling events | 3.653366e-01 |
42 | 1 | 20 | 80 | 11065 | 0.800735 | R-HSA-216083 | Integrin cell surface interactions | 3.708738e-01 |
43 | 1 | 20 | 87 | 11058 | 0.682709 | R-HSA-9013404 | RAC2 GTPase cycle | 4.086556e-01 |
44 | 1 | 20 | 91 | 11054 | 0.624208 | R-HSA-6804756 | Regulation of TP53 Activity through Phosphoryl... | 4.294877e-01 |
45 | 1 | 20 | 93 | 11052 | 0.597060 | R-HSA-9013423 | RAC3 GTPase cycle | 4.397019e-01 |
46 | 1 | 20 | 99 | 11046 | 0.523015 | R-HSA-6811434 | COPI-dependent Golgi-to-ER retrograde traffic | 4.695582e-01 |
47 | 1 | 20 | 101 | 11044 | 0.500559 | R-HSA-6807878 | COPI-mediated anterograde transport | 4.792546e-01 |
48 | 1 | 20 | 149 | 10996 | 0.170910 | R-HSA-8980692 | RHOA GTPase cycle | 6.793044e-01 |
49 | 1 | 20 | 158 | 10987 | 0.137271 | R-HSA-9013148 | CDC42 GTPase cycle | 7.110095e-01 |
50 | 1 | 20 | 184 | 10961 | 0.067711 | R-HSA-9013149 | RAC1 GTPase cycle | 7.946990e-01 |
51 | 1 | 20 | 204 | 10941 | 0.034678 | R-HSA-5689880 | Ub-specific processing proteases | 8.522717e-01 |
Since we're testing multiple pathways, we need to adjust the p-value threshold for significance accordingly. The number of pathways tested depends on whether or not we're considering all pathways, or just the lowest level pathways. That count can be obtained with the following query.
# run query and put results in data frame
num_pathways_result = client.query(("""
SELECT
COUNT (*) AS num_pathways
FROM
`isb-cgc-bq.reactome_versioned.pathway_v77` as pathway
{lowest_level_filter}
""").format(
lowest_level_filter=("WHERE lowest_level = TRUE" if lowest_level else "")
)).result().to_dataframe()
# display data frame
num_pathways_result
num_pathways | |
---|---|
0 | 1773 |
# adjust significance threshold for multiple testing, using a p-value of 0.01
num_pathways = num_pathways_result['num_pathways'][0]
significance_threshold = 0.01/num_pathways
print('Significance Threshold: {}'.format(significance_threshold))
# find all pathways that meet the significance criterion after adjusting for
# multiple testing
significant_pathway_index = chi_squared_pathways['p_value']<significance_threshold
# list of significant pathways
significant_pathways = chi_squared_pathways[significant_pathway_index]
Significance Threshold: 5.640157924421884e-06
The final result is a list of all pathways in which the targeted proteins are enriched, or over-represented, at a rate higher than random chance.
# display the final data frame
significant_pathways
in_gene_in_pathway | in_gene_not_pathway | not_gene_in_pathway | not_gene_not_pathway | chi_squared_stat | stable_id | name | p_value | |
---|---|---|---|---|---|---|---|---|
0 | 2 | 19 | 31 | 11114 | 33.477023 | R-HSA-68962 | Activation of the pre-replicative complex | 7.211088e-09 |
1 | 1 | 20 | 3 | 11142 | 32.311989 | R-HSA-1237112 | Methionine salvage pathway | 1.313007e-08 |
2 | 1 | 20 | 3 | 11142 | 32.311989 | R-HSA-9013425 | RHOT1 GTPase cycle | 1.313007e-08 |
The results of this analysis suggest that at least three pathways may be related to the genes identified.
We verify these BigQuery results by calculating the same chi-squared statistic using the SciPy package. Comparing the SciPy p-values and statistics to those of the BigQuery-derived results confirms that they are identical.
# extract observed values from bigquery result
observed = chi_squared_pathways[[
'in_gene_in_pathway',
'in_gene_not_pathway',
'not_gene_in_pathway',
'not_gene_not_pathway'
]]
# calculate the chi-squared statistic using the scipy stats package
chi2_stat = []
chi2_pvalue = []
for index, row in observed.iterrows():
stat, pvalue, _, _ = stats.chi2_contingency(
np.reshape(np.matrix(row), (2,2)), correction=True
)
chi2_stat.append(stat)
chi2_pvalue.append(pvalue)
# add columns to the original data frame
chi_squared_pathways['scipy_stat'] = chi2_stat
chi_squared_pathways['scipy_p_value'] = chi2_pvalue
# display the updated data frame
chi_squared_pathways
in_gene_in_pathway | in_gene_not_pathway | not_gene_in_pathway | not_gene_not_pathway | chi_squared_stat | stable_id | name | p_value | scipy_stat | scipy_p_value | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 19 | 31 | 11114 | 33.477023 | R-HSA-68962 | Activation of the pre-replicative complex | 7.211088e-09 | 33.477023 | 7.211088e-09 |
1 | 1 | 20 | 3 | 11142 | 32.311989 | R-HSA-1237112 | Methionine salvage pathway | 1.313007e-08 | 32.311989 | 1.313007e-08 |
2 | 1 | 20 | 3 | 11142 | 32.311989 | R-HSA-9013425 | RHOT1 GTPase cycle | 1.313007e-08 | 32.311989 | 1.313007e-08 |
3 | 2 | 19 | 50 | 11095 | 20.236790 | R-HSA-72695 | Formation of the ternary complex, and subseque... | 6.842431e-06 | 20.236790 | 6.842431e-06 |
4 | 1 | 20 | 6 | 11139 | 18.048197 | R-HSA-163765 | ChREBP activates metabolic gene expression | 2.153825e-05 | 18.048197 | 2.153825e-05 |
5 | 2 | 19 | 57 | 11088 | 17.513505 | R-HSA-72649 | Translation initiation complex formation | 2.852741e-05 | 17.513505 | 2.852741e-05 |
6 | 2 | 19 | 57 | 11088 | 17.513505 | R-HSA-72702 | Ribosomal scanning and start codon recognition | 2.852741e-05 | 17.513505 | 2.852741e-05 |
7 | 1 | 20 | 7 | 11138 | 15.671798 | R-HSA-8850843 | Phosphate bond hydrolysis by NTPDase proteins | 7.533920e-05 | 15.671798 | 7.533920e-05 |
8 | 1 | 20 | 7 | 11138 | 15.671798 | R-HSA-68952 | DNA replication initiation | 7.533920e-05 | 15.671798 | 7.533920e-05 |
9 | 1 | 20 | 10 | 11135 | 11.137000 | R-HSA-418885 | DCC mediated attractive signaling | 8.462266e-04 | 11.137000 | 8.462266e-04 |
10 | 2 | 19 | 88 | 11057 | 10.567006 | R-HSA-192823 | Viral mRNA Translation | 1.151240e-03 | 10.567006 | 1.151240e-03 |
11 | 2 | 19 | 92 | 11053 | 10.006894 | R-HSA-1799339 | SRP-dependent cotranslational protein targetin... | 1.559553e-03 | 10.006894 | 1.559553e-03 |
12 | 2 | 19 | 94 | 11051 | 9.744550 | R-HSA-975956 | Nonsense Mediated Decay (NMD) independent of t... | 1.798552e-03 | 9.744550 | 1.798552e-03 |
13 | 2 | 19 | 100 | 11045 | 9.020030 | R-HSA-72689 | Formation of a pool of free 40S subunits | 2.670370e-03 | 9.020030 | 2.670370e-03 |
14 | 2 | 19 | 110 | 11035 | 7.987388 | R-HSA-156827 | L13a-mediated translational silencing of Cerul... | 4.710432e-03 | 7.987388 | 4.710432e-03 |
15 | 2 | 19 | 111 | 11034 | 7.894339 | R-HSA-72706 | GTP hydrolysis and joining of the 60S ribosoma... | 4.958976e-03 | 7.894339 | 4.958976e-03 |
16 | 2 | 19 | 115 | 11030 | 7.538335 | R-HSA-975957 | Nonsense Mediated Decay (NMD) enhanced by the ... | 6.039983e-03 | 7.538335 | 6.039983e-03 |
17 | 1 | 20 | 15 | 11130 | 7.362504 | R-HSA-204174 | Regulation of pyruvate dehydrogenase (PDH) com... | 6.659799e-03 | 7.362504 | 6.659799e-03 |
18 | 2 | 19 | 119 | 11026 | 7.206312 | R-HSA-114608 | Platelet degranulation | 7.264762e-03 | 7.206312 | 7.264762e-03 |
19 | 1 | 20 | 20 | 11125 | 5.389681 | R-HSA-5651801 | PCNA-Dependent Long Patch Base Excision Repair | 2.025618e-02 | 5.389681 | 2.025618e-02 |
20 | 1 | 20 | 23 | 11122 | 4.602357 | R-HSA-203927 | MicroRNA (miRNA) biogenesis | 3.192804e-02 | 4.602357 | 3.192804e-02 |
21 | 1 | 20 | 24 | 11121 | 4.382195 | R-HSA-9660826 | Purinergic signaling in leishmaniasis infection | 3.631620e-02 | 4.382195 | 3.631620e-02 |
22 | 1 | 20 | 24 | 11121 | 4.382195 | R-HSA-210991 | Basigin interactions | 3.631620e-02 | 4.382195 | 3.631620e-02 |
23 | 1 | 20 | 24 | 11121 | 4.382195 | R-HSA-5223345 | Miscellaneous transport and binding events | 3.631620e-02 | 4.382195 | 3.631620e-02 |
24 | 1 | 20 | 24 | 11121 | 4.382195 | R-HSA-5696397 | Gap-filling DNA repair synthesis and ligation ... | 3.631620e-02 | 4.382195 | 3.631620e-02 |
25 | 2 | 19 | 181 | 10964 | 3.953753 | R-HSA-6791226 | Major pathway of rRNA processing in the nucleo... | 4.676696e-02 | 3.953753 | 4.676696e-02 |
26 | 1 | 20 | 29 | 11116 | 3.503357 | R-HSA-110314 | Recognition of DNA damage by PCNA-containing r... | 6.124455e-02 | 3.503357 | 6.124455e-02 |
27 | 1 | 20 | 31 | 11114 | 3.229512 | R-HSA-5656169 | Termination of translesion DNA synthesis | 7.232223e-02 | 3.229512 | 7.232223e-02 |
28 | 1 | 20 | 33 | 11112 | 2.988310 | R-HSA-352230 | Amino acid transport across the plasma membrane | 8.386764e-02 | 2.988310 | 8.386764e-02 |
29 | 1 | 20 | 36 | 11109 | 2.676134 | R-HSA-176187 | Activation of ATR in response to replication s... | 1.018627e-01 | 2.676134 | 1.018627e-01 |
30 | 1 | 20 | 36 | 11109 | 2.676134 | R-HSA-8950505 | Gene and protein expression by JAK-STAT signal... | 1.018627e-01 | 2.676134 | 1.018627e-01 |
31 | 1 | 20 | 37 | 11108 | 2.583220 | R-HSA-5083635 | Defective B3GALTL causes Peters-plus syndrome ... | 1.080017e-01 | 2.583220 | 1.080017e-01 |
32 | 1 | 20 | 38 | 11107 | 2.495164 | R-HSA-5173214 | O-glycosylation of TSR domain-containing proteins | 1.141965e-01 | 2.495164 | 1.141965e-01 |
33 | 1 | 20 | 40 | 11105 | 2.332203 | R-HSA-5696400 | Dual Incision in GG-NER | 1.267224e-01 | 2.332203 | 1.267224e-01 |
34 | 1 | 20 | 54 | 11091 | 1.530727 | R-HSA-9013409 | RHOJ GTPase cycle | 2.160034e-01 | 1.530727 | 2.160034e-01 |
35 | 1 | 20 | 54 | 11091 | 1.530727 | R-HSA-193648 | NRAGE signals death through JNK | 2.160034e-01 | 1.530727 | 2.160034e-01 |
36 | 1 | 20 | 63 | 11082 | 1.206546 | R-HSA-6782210 | Gap-filling DNA repair synthesis and ligation ... | 2.720173e-01 | 1.206546 | 2.720173e-01 |
37 | 1 | 20 | 64 | 11081 | 1.176348 | R-HSA-6782135 | Dual incision in TC-NER | 2.781007e-01 | 1.176348 | 2.781007e-01 |
38 | 1 | 20 | 65 | 11080 | 1.147121 | R-HSA-8936459 | RUNX1 regulates genes involved in megakaryocyt... | 2.841526e-01 | 1.147121 | 2.841526e-01 |
39 | 1 | 20 | 70 | 11075 | 1.014107 | R-HSA-68949 | Orc1 removal from chromatin | 3.139209e-01 | 1.014107 | 3.139209e-01 |
40 | 1 | 20 | 73 | 11072 | 0.943520 | R-HSA-9013408 | RHOG GTPase cycle | 3.313742e-01 | 0.943520 | 3.313742e-01 |
41 | 1 | 20 | 79 | 11066 | 0.819465 | R-HSA-416482 | G alpha (12/13) signalling events | 3.653366e-01 | 0.819465 | 3.653366e-01 |
42 | 1 | 20 | 80 | 11065 | 0.800735 | R-HSA-216083 | Integrin cell surface interactions | 3.708738e-01 | 0.800735 | 3.708738e-01 |
43 | 1 | 20 | 87 | 11058 | 0.682709 | R-HSA-9013404 | RAC2 GTPase cycle | 4.086556e-01 | 0.682709 | 4.086556e-01 |
44 | 1 | 20 | 91 | 11054 | 0.624208 | R-HSA-6804756 | Regulation of TP53 Activity through Phosphoryl... | 4.294877e-01 | 0.624208 | 4.294877e-01 |
45 | 1 | 20 | 93 | 11052 | 0.597060 | R-HSA-9013423 | RAC3 GTPase cycle | 4.397019e-01 | 0.597060 | 4.397019e-01 |
46 | 1 | 20 | 99 | 11046 | 0.523015 | R-HSA-6811434 | COPI-dependent Golgi-to-ER retrograde traffic | 4.695582e-01 | 0.523015 | 4.695582e-01 |
47 | 1 | 20 | 101 | 11044 | 0.500559 | R-HSA-6807878 | COPI-mediated anterograde transport | 4.792546e-01 | 0.500559 | 4.792546e-01 |
48 | 1 | 20 | 149 | 10996 | 0.170910 | R-HSA-8980692 | RHOA GTPase cycle | 6.793044e-01 | 0.170910 | 6.793044e-01 |
49 | 1 | 20 | 158 | 10987 | 0.137271 | R-HSA-9013148 | CDC42 GTPase cycle | 7.110095e-01 | 0.137271 | 7.110095e-01 |
50 | 1 | 20 | 184 | 10961 | 0.067711 | R-HSA-9013149 | RAC1 GTPase cycle | 7.946990e-01 | 0.067711 | 7.946990e-01 |
51 | 1 | 20 | 204 | 10941 | 0.034678 | R-HSA-5689880 | Ub-specific processing proteases | 8.522717e-01 | 0.034678 | 8.522717e-01 |
This notebook demonstrated usage of the Reactome BigQuery dataset for basic cancer pathway identification from a gene set, as well as a more complex pathway enrichment analysis using a chi-squared statistic.