Check out more notebooks at our Community Notebooks Repository!
Title: How to use PyPika to build a query
Author: Lauren Hagen
Created: 2020-02-13
URL: https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How to use PyPika_to_create_a_BigQuery_SQL_query.ipynb
Purpose: Demonstrate query creation using PyPika with BigQuery
Notes:
In this notebook, we are going to use the PyPika package to build a query and use that query with the ISB-CGC BigQuery tables. First we'll create a cohort by selecting clinical features, then we'll use the cohort to filter the RNA and miRNA expression tables from the TARGET data set.
PyPika is a Python package designed to construct SQL queries. This package is useful for people who want to integrate BigQuery into their python code.
Before we get started, we need to load the BigQuery module, authenticate ourselves, create a client variable, and load the necessary libraries.
# Load the BigQuery Module
from google.cloud import bigquery
# Authenticate ourselves
!gcloud auth application-default login
# Create the client object
project_num = 'your_project_number' # Replace with your project ID
if project_num == 'your_project_number':
print('Please update the project number with your Google Cloud Project')
else:
client = bigquery.Client(project_num)
# If needed, install PyPika
!pip install pypika
# Import from PyPika
from pypika import Query, Table, Field, Order
In this notebook, we will query each table individually first and then join them for a final query. We'll use a selection of clinical data with molecular data from the TARGET data set.
We want to write a query that filters the TARGET data set for AML with columns for the case barcode and the remission status of the patient for our cohort.
# Pass the Clincal table name to the Table class to create a variable
clin_table = Table('`isb-cgc.TARGET_bioclin_v0.Clinical`')
# Create query with PyPika
clin_query = Query.from_(clin_table) \
.select('case_barcode, CR_status_at_end_of_course_1, CR_status_at_end_of_course_2') \
.where(clin_table.disease_code=='AML')
# Print returned query
print(clin_query)
SELECT "case_barcode, CR_status_at_end_of_course_1, CR_status_at_end_of_course_2" FROM "`isb-cgc.TARGET_bioclin_v0.Clinical`" WHERE "disease_code"='AML'
PyPika returns the query with quotation marks around some of the inputs. These need to be removed before the querying BigQuery.
# Remove " from the string
clin_query_clean = str(clin_query).replace('"', "")
# Print the query that PyPika created to check that it looks good.
# We can also copy the query into the BigQuery UI to check for errors
print(clin_query_clean)
SELECT case_barcode, CR_status_at_end_of_course_1, CR_status_at_end_of_course_2 FROM `isb-cgc.TARGET_bioclin_v0.Clinical` WHERE disease_code='AML'
We can query BigQuery, then create a data frame with the results though this isn't necessary for creating the final query.
# Query BigQuery and create a data frame with the results
clin = client.query(clin_query_clean).to_dataframe()
# View the first several lines of the returned table
clin.head()
case_barcode | CR_status_at_end_of_course_1 | CR_status_at_end_of_course_2 | |
---|---|---|---|
0 | TARGET-20-PATDMY | CR | CR |
1 | TARGET-20-PARHSA | CR | CR |
2 | TARGET-20-PARLSW | CR | CR |
3 | TARGET-20-PASLHH | CR | CR |
4 | TARGET-20-PATKUG | CR | CR |
Now that we have a list of cases with some clinical information, we can join that table to one of the molecular data sets, such as the TARGET gene expression data.
We will now build the query for the molecular data set. We are not going to query BigQuery at this point, but it is good to make sure the query looks correct before joining it with another table.
# Pass the Clincal data table name to the Table class to create a variable
expr_table = Table('`isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`')
# Create a query with PyPika
expr_query = Query.from_(expr_table) \
.select('case_barcode, HTSeq__FPKM_UQ, Ensembl_gene_id, gene_name') \
.orderby('HTSeq__FPKM_UQ')
# Remove " from the string
expr_query_clean = str(expr_query).replace('"', "")
# View the created query
print(expr_query_clean)
SELECT case_barcode, HTSeq__FPKM_UQ, Ensembl_gene_id, gene_name FROM `isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression` ORDER BY HTSeq__FPKM_UQ
Finally, we will create a query to join the two tables. BigQuery and PyPika support all join types, though, for this query, we are using the standard inner join. This query returns a large number of lines and can be slow to bring into Collaboratory, so we will limit the number of lines returned to 100.
# Create a query with PyPika
join_query = Query.from_(expr_table) \
.join(clin_table).on(expr_table.case_barcode == clin_table.case_barcode) \
.select('case_barcode', 'HTSeq__FPKM_UQ', 'Ensembl_gene_id', 'gene_name') \
.select(clin_table.CR_status_at_end_of_course_1, clin_table.CR_status_at_end_of_course_2) \
.where(clin_table.disease_code=='AML') \
.limit(100)
# Replace the " in the query
join_query_clean = str(join_query).replace('"', "")
# View the created query
print(join_query_clean)
SELECT `isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.case_barcode,`isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.HTSeq__FPKM_UQ,`isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.Ensembl_gene_id,`isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.gene_name,`isb-cgc.TARGET_bioclin_v0.Clinical`.CR_status_at_end_of_course_1,`isb-cgc.TARGET_bioclin_v0.Clinical`.CR_status_at_end_of_course_2 FROM `isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression` JOIN `isb-cgc.TARGET_bioclin_v0.Clinical` ON `isb-cgc.TARGET_hg38_data_v0.RNAseq_Gene_Expression`.case_barcode=`isb-cgc.TARGET_bioclin_v0.Clinical`.case_barcode WHERE `isb-cgc.TARGET_bioclin_v0.Clinical`.disease_code='AML' LIMIT 100
There! We now have a query that joins the two tables. We can now query BigQuery and view the results.
# Query BigQuery and create a data frame
final = client.query(join_query_clean).to_dataframe()
# View the first several lines of the data frame
final.head()
case_barcode | HTSeq__FPKM_UQ | Ensembl_gene_id | gene_name | CR_status_at_end_of_course_1 | CR_status_at_end_of_course_2 | |
---|---|---|---|---|---|---|
0 | TARGET-20-PAPWHS | 392731.987867 | ENSG00000173598 | NUDT4 | CR | CR |
1 | TARGET-20-PAPWHS | 13617.578556 | ENSG00000204745 | AC083899.3 | CR | CR |
2 | TARGET-20-PAPWHS | 87978.648259 | ENSG00000235183 | RP11-613C6.4 | CR | CR |
3 | TARGET-20-PAPWHS | 258701.117197 | ENSG00000177917 | ARL6IP6 | CR | CR |
4 | TARGET-20-PAPWHS | 8689.796007 | ENSG00000181085 | MAPK15 | CR | CR |
It's that simple! Please let us know if you have any questions at feedback@isb-cgc.org.