Check out more notebooks at our Community Notebooks Repository!
Title: How to perform complex joins
Author: David L Gibbs
Created: 2020-04-21
Purpose: Basic overview of complex joins in BigQuery
URL: https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_perform_complex_joins.ipynb
Notes: This covers joining multiple tables based on a sample identifier.
This notebook will show you how to perform a complex join using the publicly available TCGA BigQuery tables that the ISB-CGC project has produced based on the open-access TCGA data available at the Data Portal. You will need to have access to a Google Cloud Platform (GCP) project in order to use BigQuery. If you don't already have one, you can sign up for a free-trial. You can also explore the available tables and data sets before commiting to creating a GCP project though the ISB-CGC BigQuery Table Searcher.
We are not attempting to provide a thorough BigQuery or IPython tutorial here, as a wealth of such information already exists. Here are some links to some resources that you might find useful:
There are also many tutorials and samples available on github (see, in particular, the datalab repo, the Google Genomics project), and our own Community Notebooks.
OK then, let's get started! In order to work with BigQuery, the first thing you need to do is import the bigquery module:
!pip3 install google
!pip3 install google.cloud.bigquery
import google.cloud.bigquery as bigquery
Next, we need to Authorize ourselves. For more information see 'Quick Start Guide to ISB-CGC' and alternative authentication methods can be found here.
!gcloud auth application-default login
# If you're in a notebook, this might not allow for a reponse to be entered.
# in that case, in Jupyter Lab you can open a terminal and do it there. (same command)
# Create a variable for which client to use with BigQuery
project_num = 'your_project_number' # Update with your Google Project number
if project_num == 'your_project_number':
print('Please update the project number with your Google Cloud Project')
else:
client = bigquery.Client(project_num) # Replace your_project_number with your project ID
Please update the project number with your Google Cloud Project
%load_ext google.cloud.bigquery
BigQuery tables are organized into datasets, and datasets are owned by a specific GCP project. The tables we will be working with in this notebook are found in datasets TCGA_bioclin_v0
, platform_reference
, and TCGA_hg19_data_v0
, all owned by the isb-cgc
** project. A full table identifier is of the form <project_id>.<dataset_id>.<table_id>
.
Our strategy is going to based on linking up a series of subtables, rather than trying to do all the joins in one go. It's a little like rolling up a snowball.
Let's start by looking at the clinical data table, and we'll build up the query, table by table.
Notice that in the 'group by', we use integers that reference the selected terms.
%%bigquery --project project_num
WITH
clinTab AS (
select
project_short_name, case_barcode, age_at_diagnosis, pathologic_stage
from
`isb-cgc.TCGA_bioclin_v0.Clinical` as Clin
where
project_short_name = "TCGA-BRCA" AND
Clin.age_at_diagnosis < 50
group by
1,2,3,4
)
select * from clinTab limit 5
Here, each row of the clinical table is going to get joined to each protein-and-barcode combo.
%%bigquery pdf1 --project project_num
WITH
clinTab AS (
select
project_short_name, case_barcode, age_at_diagnosis, pathologic_stage
from
`isb-cgc.TCGA_bioclin_v0.Clinical` as Clin
where
project_short_name = "TCGA-BRCA" AND
Clin.age_at_diagnosis < 50
group by
1,2,3,4
),
protTab AS (
select
Prot.case_barcode, Prot.project_short_name, Prot.protein_expression,
Prot.gene_name as gene_name, Prot.protein_name,
clinTab.age_at_diagnosis, clinTab.pathologic_stage
from
`isb-cgc.TCGA_hg19_data_v0.Protein_Expression` Prot
join
clinTab
on clinTab.case_barcode = Prot.case_barcode AND clinTab.project_short_name = Prot.project_short_name
where
regexp_contains(Prot.gene_name, '^P')
group by
1,2,3,4,5,6,7
)
select * from protTab
pdf1.head()
pdf1.shape
Now we'll bring in variant data. Notice in the 'mutTab', the features that were originally from the clinical table, are now part of Prot. We are also going to join only deletions which are called 'DEL's.
%%bigquery mdf1 --project project_num
WITH
clinTab AS (
select
project_short_name, case_barcode, age_at_diagnosis, pathologic_stage
from
`isb-cgc.TCGA_bioclin_v0.Clinical` as Clin
where
project_short_name = "TCGA-BRCA" AND
Clin.age_at_diagnosis < 50
group by
1,2,3,4
),
protTab AS (
select
clinTab.age_at_diagnosis, clinTab.pathologic_stage,
Prot.case_barcode, Prot.project_short_name, Prot.protein_expression,
Prot.gene_name as gene_name, Prot.protein_name
from
`isb-cgc.TCGA_hg19_data_v0.Protein_Expression` Prot
join
clinTab
on clinTab.case_barcode = Prot.case_barcode AND clinTab.project_short_name = Prot.project_short_name
where
regexp_contains(Prot.gene_name, '^P')
group by
1,2,3,4,5,6,7
),
mutTab AS (
select
prot.age_at_diagnosis, prot.pathologic_stage,
Prot.case_barcode, Prot.project_short_name, Prot.protein_expression,
Prot.gene_name as gene_name, Prot.protein_name,
mut.Variant_Type, mut.Genome_Change, mut.Mutation_Status
from
`isb-cgc.TCGA_hg19_data_v0.Somatic_Mutation_DCC` mut
join
protTab as prot
on prot.case_barcode = mut.case_barcode
and prot.project_short_name = mut.project_short_name
and prot.gene_name = mut.Hugo_Symbol
where
mut.Variant_Type = 'DEL'
group by 1,2,3,4,5,6,7,8,9,10
)
select * from mutTab
mdf1.head()
mdf1.shape
Now to bring in the methylation data. This data, from Illumina methylation arrays, the Human Methylation 27k and 450k platforms, is indexed by probe IDs. That means we first need to do ane extra join to bring in gene symbols. Also, observe that the methylation annotation is a nested table. In the table schema, you will notice the UCSC field is a record with three components, RefGene_Name, RefGene_Group, and RefGene_Accession. To make our work easier, we want to take this nested table and make it a 'flat' table. For some documentation on doing that see this doc.
%%bigquery prdf1 --project project_num
SELECT
IlmnID,
RefGene_Name,
RefGene_Group,
Infinium_Design_Type
FROM
`isb-cgc.platform_reference.methylation_annotation`,
UNNEST(UCSC)
limit 10
prdf1.head()
Since we made our annotation table flat, we can join it in using the regular method.
%%bigquery ardf1 --project project_num
WITH
probes AS (
SELECT
IlmnID,
RefGene_Name,
RefGene_Group,
Infinium_Design_Type
FROM
`isb-cgc.platform_reference.methylation_annotation`,
UNNEST(UCSC) ),
methAnnot as (
select
Methyl.probe_id,Methyl.beta_value,Methyl.platform,
Methyl.case_barcode, Methyl.project_short_name,
probes.RefGene_Name, probes.RefGene_Group, probes.Infinium_Design_Type
from
(select * from `isb-cgc.TCGA_hg19_data_v0.DNA_Methylation_chr22` limit 100) Methyl -- SUB-TABLE HERE!
join
probes
on probes.IlmnID = Methyl.probe_id
where
project_short_name = "TCGA-BRCA"
)
select * from methAnnot limit 10
ardf1.head()
ardf1.shape
Now that we've mapped probe ids to gene symbols, we'll bring in methylation data.
%%bigquery bigdf --project project_num
WITH
clinTab AS (
select
project_short_name, case_barcode, age_at_diagnosis, pathologic_stage
from
`isb-cgc.TCGA_bioclin_v0.Clinical` as Clin
where
project_short_name = "TCGA-BRCA" AND
Clin.age_at_diagnosis < 50
group by
1,2,3,4
),
protTab AS (
select
clinTab.age_at_diagnosis, clinTab.pathologic_stage,
Prot.case_barcode, Prot.project_short_name, Prot.protein_expression,
Prot.gene_name as gene_name, Prot.protein_name
from
`isb-cgc.TCGA_hg19_data_v0.Protein_Expression` Prot
join
clinTab
on clinTab.case_barcode = Prot.case_barcode AND clinTab.project_short_name = Prot.project_short_name
where
regexp_contains(Prot.gene_name, '^P')
group by
1,2,3,4,5,6,7
),
mutTab AS (
select
prot.age_at_diagnosis, prot.pathologic_stage,
Prot.case_barcode, Prot.project_short_name, Prot.protein_expression,
Prot.gene_name as gene_name, Prot.protein_name,
mut.Variant_Type, mut.Genome_Change, mut.Mutation_Status
from
`isb-cgc.TCGA_hg19_data_v0.Somatic_Mutation_DCC` mut
join
protTab as prot
on prot.case_barcode = mut.case_barcode
and prot.project_short_name = mut.project_short_name
and prot.gene_name = mut.Hugo_Symbol
where
mut.Variant_Type = 'DEL'
group by 1,2,3,4,5,6,7,8,9,10
),
probes AS (
SELECT
IlmnID,
RefGene_Name,
RefGene_Group,
Infinium_Design_Type
FROM
`isb-cgc.platform_reference.methylation_annotation`,
UNNEST(UCSC)
),
methAnnot as (
select
Methyl.probe_id,Methyl.beta_value,Methyl.platform,
Methyl.case_barcode, Methyl.project_short_name,
probes.RefGene_Name, probes.RefGene_Group, probes.Infinium_Design_Type
from
`isb-cgc.TCGA_hg19_data_v0.DNA_Methylation_chr22` Methyl
join
probes
on probes.IlmnID = Methyl.probe_id
where
project_short_name = "TCGA-BRCA"
),
methTab as (
select
mutTab.age_at_diagnosis, mutTab.pathologic_stage,
mutTab.case_barcode, mutTab.project_short_name, mutTab.protein_expression,
mutTab.gene_name as gene_name, mutTab.protein_name,
mutTab.Variant_Type, mutTab.Genome_Change, mutTab.Mutation_Status,
Methyl.probe_id, Methyl.beta_value, Methyl.platform,
Methyl.RefGene_Group, Methyl.Infinium_Design_Type
from
methAnnot Methyl
join
mutTab
on
mutTab.case_barcode = Methyl.case_barcode
and mutTab.project_short_name = Methyl.project_short_name
and mutTab.gene_name = Methyl.RefGene_Name
group by
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
)
select * from mutTab
bigdf.head()
bigdf.shape
We've created a big wide table that's the result of joining 5 separate tables. Whew! Send us an email, how did it go?
feedback@isb-cgc.org