Notebook

ISB-CGC Community Notebooks¶

Check out more notebooks at our Community Notebooks Repository!

Title:   How to create cohorts
Author:  Lauren Hagen
Created: 2019-06-20
Updated: 2023-12
Purpose: Basic overview of creating cohorts with IDB-CGC BigQuery
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_create_cohorts.ipynb
Notes:   This notebook was adapted from work by Sheila Reynolds, 'How to Create TCGA Cohorts part 1' https://github.com/isb-cgc/examples-Python/blob/master/notebooks/Creating%20TCGA%20cohorts%20--%20part%201.ipynb.

How to Create Cohorts¶

This notebook demonstrates how to create a cohort (list) of patients from the Genomic Data Commons (GDC) using ISB-CGC public BigQuery tables. We will use the clinical and file tables to create a curated list of patients to discover associated files and data. More information on the tables and data that this notebook explores can be found in our Documentation at:

Initialize Notebook Environment¶

Before beginning, we first need to load dependencies and authenticate to BigQuery. You will need to have access to a Google Cloud Platform (GCP) project in order to use BigQuery.

Install Dependencies¶

In [ ]:

# GCP libraries
from google.cloud import bigquery
from google.colab import auth

Authenticate¶

In order to utilize BigQuery, we must obtain authorization to BigQuery and Google Cloud.

Alternative authentication methods

In [ ]:

# if you're using Google Colab, authenticate to gcloud with the following
auth.authenticate_user()

# alternatively, use the gcloud SDK
#!gcloud auth application-default login

Google project ID¶

Set your own Google project ID for use with this notebook.

In [ ]:

# Create a variable for which client to use with BigQuery
project_id = 'YOUR_PROJECT_ID_CHANGE_ME' # Update with your Google Project number

BigQuery Client¶

In [ ]:

if project_id == 'YOUR_PROJECT_ID_CHANGE_ME': # checking that project id was changed
  print('Please update the project number with your Google Cloud Project')
else: client = bigquery.Client(project_id)

Create a Simple Cohort¶

To create this cohort, we are going to explore TCGA data to find projects with patients (cases) that have a history of smoking and the number of years they smoked. We then will explore the data types available in the top three projects.

Determine which Projects to Use¶

We will use the TCGA clinical table to find projects with the most data for the number of years smoked. This query can be used to look at any feature in the clinical table. The query uses COUNT and GROUP BY to calculate the number of patients with data.

In [ ]:

# Create query string
query_clinical = """
  SELECT
    proj__project_id,
    COUNT(proj__project_id) AS n
  FROM
    `isb-cgc-bq.TCGA_versioned.clinical_gdc_r37`
  WHERE
    exp__years_smoked is not Null
  GROUP BY
    proj__project_id
  ORDER BY
    n DESC
 """

# Query BigQuery with an output to a dataframe
result_clinical_projects = client.query(query_clinical).to_dataframe()

# Create a variable of the three most populated projects
project_ids = "', '".join(result_clinical_projects['proj__project_id'][0:3])

# Print results
print(result_clinical_projects)
print(f"The number of total rows: {len(result_clinical_projects)}")

Create a List of Cases with Selected Projects¶

The TCGA-LUSC (Lung Squamous Cell Carcinoma), TCGA-LUAD (Lung Adenocarcinoma), and TCGA-HNSC (Head and Neck Squamous Cell Carcinoma) projects are the top three projects with patients that have data for number of years smoking. The patients in these projects will form the base of our cohort. A cohort can be created by a number of different filters such as patients that smoked more than 10 years or the primary disease type.

In [ ]:

# Create query string
query_case = f"""
  SELECT
    proj__project_id,
    case_id,
    exp__years_smoked
  FROM
    `isb-cgc-bq.TCGA_versioned.clinical_gdc_r37`
  WHERE
    proj__project_id IN ('{project_ids}') AND exp__years_smoked is not Null
  """

# Query BigQuery with an output to a dataframe
clinical_case_ids = client.query(query_case).to_dataframe()

# Create a variable of the case ids
case_ids = "', '".join(clinical_case_ids['case_id'])

# Print results
print(clinical_case_ids.head(5))
print(f"The number of total rows: {len(clinical_case_ids)}")

View the Data Types Available¶

Now that we have a cohort of case ids, we can use that list to discover the available data types with the file table.

In [ ]:

query_file_types = f"""
  SELECT
    file.project_short_name,
    file.data_category,
    file.data_type,
    file.data_format,
    COUNT(file.file_gdc_id) as n
  FROM
    `isb-cgc-bq.GDC_case_file_metadata.fileData_active_current` AS file
  WHERE
    associated_entities__case_gdc_id IN ('{case_ids}')
  GROUP BY
    file.project_short_name,
    file.data_category,
    file.data_type,
    file.data_format
  ORDER BY file.data_type, n DESC
  """
result_file_types = client.query(query_file_types).to_dataframe().head(5)
print(result_file_types)

We can then use the cohort of case ids to create a table with RNA seq data for the associated aliquots.

Note: We used LIMIT 50 in the query to limit the amount of data that is returned by the query

In [ ]:

query_rna = f"""
  SELECT
    aliquot_gdc_id,
    fpkm_unstranded
  FROM
    `isb-cgc-bq.TCGA_versioned.RNAseq_hg38_gdc_r35`
  WHERE
    	case_gdc_id IN ('{case_ids}')
  LIMIT 50
"""
results_rna_seq = client.query(query_rna).to_dataframe()
print(results_rna_seq.head(5))

The final query to create the table of RNA seq data can also be created with one query with a join between the clinical table and the RNA seq table.

  SELECT
    c.case_id,
    c.exp__years_smoked,
    r.fpkm_unstranded
  FROM
    `isb-cgc-bq.TCGA_versioned.clinical_gdc_r37` AS c
  JOIN
    `isb-cgc-bq.TCGA_versioned.RNAseq_hg38_gdc_r35` AS r
  ON
    c.case_id = r.case_gdc_id
  WHERE
    proj__project_id IN ('TCGA-LUSC',
      'TCGA-LUAD',
      'TCGA-HNSC')
    AND exp__years_smoked IS NOT NULL

Closing¶

Thank you for working through this notebook. We hope that you found this exercise to be helpful in finding relevant cohorts for your studies. Please explore the ISB-CGC ecosystem at isb-cgc.org.

For questions, comments, or troubleshooting, please contact us at feedback@isb-cgc.org. We are especially keen on learning about your particular use-cases, and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions. Also, check out our virtual Office Hours on Tuesdays and Thursdays.