Check out more notebooks at our Community Notebooks Repository!
Title: How to create cohorts
Author: Lauren Hagen
Created: 2019-06-20
Updated: 2023-12
Purpose: Basic overview of creating cohorts with IDB-CGC BigQuery
URL: https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_create_cohorts.ipynb
Notes: This notebook was adapted from work by Sheila Reynolds, 'How to Create TCGA Cohorts part 1' https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%201.ipynb.
This notebook demonstrates how to create a cohort (list) of patients from the Genomic Data Commons (GDC) using ISB-CGC public BigQuery tables. We will use the clinical and file tables to create a curated list of patients to discover associated files and data. More information on the tables and data that this notebook explores can be found in our Documentation at:
Before beginning, we first need to load dependencies and authenticate to BigQuery. You will need to have access to a Google Cloud Platform (GCP) project in order to use BigQuery.
# GCP libraries
from google.cloud import bigquery
from google.colab import auth
In order to utilize BigQuery, we must obtain authorization to BigQuery and Google Cloud.
# if you're using Google Colab, authenticate to gcloud with the following
auth.authenticate_user()
# alternatively, use the gcloud SDK
#!gcloud auth application-default login
Set your own Google project ID for use with this notebook.
# Create a variable for which client to use with BigQuery
project_id = 'YOUR_PROJECT_ID_CHANGE_ME' # Update with your Google Project number
if project_id == 'YOUR_PROJECT_ID_CHANGE_ME': # checking that project id was changed
print('Please update the project number with your Google Cloud Project')
else: client = bigquery.Client(project_id)
To create this cohort, we are going to explore TCGA data to find projects with patients (cases) that have a history of smoking and the number of years they smoked. We then will explore the data types available in the top three projects.
We will use the TCGA clinical table to find projects with the most data for the number of years smoked. This query can be used to look at any feature in the clinical table. The query uses COUNT and GROUP BY to calculate the number of patients with data.
# Create query string
query_clinical = """
SELECT
proj__project_id,
COUNT(proj__project_id) AS n
FROM
`isb-cgc-bq.TCGA_versioned.clinical_gdc_r37`
WHERE
exp__years_smoked is not Null
GROUP BY
proj__project_id
ORDER BY
n DESC
"""
# Query BigQuery with an output to a dataframe
result_clinical_projects = client.query(query_clinical).to_dataframe()
# Create a variable of the three most populated projects
project_ids = "', '".join(result_clinical_projects['proj__project_id'][0:3])
# Print results
print(result_clinical_projects)
print(f"The number of total rows: {len(result_clinical_projects)}")
The TCGA-LUSC (Lung Squamous Cell Carcinoma), TCGA-LUAD (Lung Adenocarcinoma), and TCGA-HNSC (Head and Neck Squamous Cell Carcinoma) projects are the top three projects with patients that have data for number of years smoking. The patients in these projects will form the base of our cohort. A cohort can be created by a number of different filters such as patients that smoked more than 10 years or the primary disease type.
# Create query string
query_case = f"""
SELECT
proj__project_id,
case_id,
exp__years_smoked
FROM
`isb-cgc-bq.TCGA_versioned.clinical_gdc_r37`
WHERE
proj__project_id IN ('{project_ids}') AND exp__years_smoked is not Null
"""
# Query BigQuery with an output to a dataframe
clinical_case_ids = client.query(query_case).to_dataframe()
# Create a variable of the case ids
case_ids = "', '".join(clinical_case_ids['case_id'])
# Print results
print(clinical_case_ids.head(5))
print(f"The number of total rows: {len(clinical_case_ids)}")
Now that we have a cohort of case ids, we can use that list to discover the available data types with the file table.
query_file_types = f"""
SELECT
file.project_short_name,
file.data_category,
file.data_type,
file.data_format,
COUNT(file.file_gdc_id) as n
FROM
`isb-cgc-bq.GDC_case_file_metadata.fileData_active_current` AS file
WHERE
associated_entities__case_gdc_id IN ('{case_ids}')
GROUP BY
file.project_short_name,
file.data_category,
file.data_type,
file.data_format
ORDER BY file.data_type, n DESC
"""
result_file_types = client.query(query_file_types).to_dataframe().head(5)
print(result_file_types)
We can then use the cohort of case ids to create a table with RNA seq data for the associated aliquots.
Note: We used LIMIT 50
in the query to limit the amount of data that is returned by the query
query_rna = f"""
SELECT
aliquot_gdc_id,
fpkm_unstranded
FROM
`isb-cgc-bq.TCGA_versioned.RNAseq_hg38_gdc_r35`
WHERE
case_gdc_id IN ('{case_ids}')
LIMIT 50
"""
results_rna_seq = client.query(query_rna).to_dataframe()
print(results_rna_seq.head(5))
The final query to create the table of RNA seq data can also be created with one query with a join between the clinical table and the RNA seq table.
SELECT
c.case_id,
c.exp__years_smoked,
r.fpkm_unstranded
FROM
`isb-cgc-bq.TCGA_versioned.clinical_gdc_r37` AS c
JOIN
`isb-cgc-bq.TCGA_versioned.RNAseq_hg38_gdc_r35` AS r
ON
c.case_id = r.case_gdc_id
WHERE
proj__project_id IN ('TCGA-LUSC',
'TCGA-LUAD',
'TCGA-HNSC')
AND exp__years_smoked IS NOT NULL
Thank you for working through this notebook. We hope that you found this exercise to be helpful in finding relevant cohorts for your studies. Please explore the ISB-CGC ecosystem at isb-cgc.org.
For questions, comments, or troubleshooting, please contact us at feedback@isb-cgc.org. We are especially keen on learning about your particular use-cases, and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions. Also, check out our virtual Office Hours on Tuesdays and Thursdays.