Check out more notebooks at our Community Notebooks Repository!
Title: Quick Start Guide to ISB-CGC
Author: Lauren Hagen
Created: 2019-06-20
Updated: 2023-08
Purpose: Painless intro to working with ISB-CGC in the cloud
URL: https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/Quick_Start_Guide_to_ISB_CGC.ipynb
Notes: This Quick Start Guide gives an overview of the data available in ISB-CGC and walks through a basic example in Python.
To run this notebook, you will need to have your Google Cloud Account set up. If you need to set up a Google Cloud Account, follow the "Obtain a Google identity" and "Set up a Google Cloud Project" steps on our Quick-Start Guide documentation page.
This notebook requires the BigQuery API to be loaded, which allows access to BigQuery programmatically.
# GCP libraries
from google.cloud import bigquery
from google.colab import auth
The ISB-CGC provides interactive and programmatic access to data hosted by institutes such as the Genomic Data Commons (GDC) and Proteomic Data Commons (PDC) from the National Cancer Institute (NCI) while leveraging many aspects of the Google Cloud Platform. You can also import your own data, analyze it side by side with the hosted datasets, and share it when you see fit. The ISB-CGC hosts carefully curated high-level clinical, biospecimen, and molecular datasets and tables in Google BigQuery, including data from programs such as The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and Clinical Proteomic Tumor Analysis Consortium (CPTAC). More information can be found on our Programs and Data Sets page. This data can be explored via Python, the Google Cloud Console, and/or our BigQuery Table Search tool.
Steps to authenticate yourself:
# if you're using Google Colab, authenticate to gcloud with the following
auth.authenticate_user()
# alternatively, use the gcloud SDK
#!gcloud auth application-default login
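The two authentication routes above can be folded into one helper. This is a minimal sketch, not part of the original notebook: the function names `running_in_colab` and `authenticate` are hypothetical, and it assumes that outside Colab you have already run `gcloud auth application-default login` in a terminal.

```python
import sys


def running_in_colab() -> bool:
    """Return True when the google.colab package is already loaded (as it is in a Colab kernel)."""
    return "google.colab" in sys.modules


def authenticate() -> str:
    """Authenticate inside Colab; elsewhere, rely on Application Default Credentials."""
    if running_in_colab():
        from google.colab import auth  # only importable inside Colab

        auth.authenticate_user()
        return "colab"
    # Assumption: `gcloud auth application-default login` was run beforehand.
    return "application-default"


print(authenticate())
```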
To access BigQuery, you will need a Google Cloud Project for queries to be billed to. If you need to create a Project, instructions on how to create one can be found on our Quick-Start Guide page.
A BigQuery Client object with the billing Project needs to be created to interface with BigQuery.
Note: Any costs that you incur are charged under your current project, so you will want to make sure you are on the correct one if you are part of multiple projects.
# Create a variable for which client to use with BigQuery
project_id = 'YOUR_PROJECT_ID_CHANGE_ME' # Update with your Google Project Id
# Create a BigQuery Client
if project_id == 'YOUR_PROJECT_ID_CHANGE_ME': # checking that project id was changed
    print('Please update the project ID with your Google Cloud Project')
else:
    client = bigquery.Client(project_id)
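If you belong to several projects, it can help to make the choice of billing project explicit and fail loudly when none is set. The sketch below is one possible approach rather than the notebook's own method: `resolve_project_id` is a hypothetical helper, and it assumes you may have exported the `GOOGLE_CLOUD_PROJECT` environment variable as a fallback.

```python
import os

PLACEHOLDER = "YOUR_PROJECT_ID_CHANGE_ME"


def resolve_project_id(explicit=PLACEHOLDER):
    """Pick the billing project: an explicit value first, then the environment."""
    if explicit and explicit != PLACEHOLDER:
        return explicit
    env_value = os.environ.get("GOOGLE_CLOUD_PROJECT")  # assumed to be set by the user
    if env_value:
        return env_value
    raise ValueError("Set project_id or the GOOGLE_CLOUD_PROJECT environment variable.")


# Usage (hypothetical project name):
# client = bigquery.Client(resolve_project_id("my-billing-project"))
```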
Let us look at the datasets available through ISB-CGC that are in BigQuery.
# Which project to view datasets
project_with_data = 'isb-cgc-bq'
# Create a variable of datasets
datasets = list(client.list_datasets(project_with_data))
# If there are datasets available then print their names,
# else print that there are no datasets available
if datasets:
    print(f"Datasets in project {project_with_data}:")
    for dataset in datasets: # API request(s)
        print("\t{}".format(dataset.dataset_id))
else:
    print(f"{project_with_data} project does not contain any datasets.")
The ISB-CGC has two datasets for each Program or source. One dataset contains the most current data, and the other contains versioned tables, which serve as an archive for reproducibility. The current tables are labeled with "_current" and are updated when new data is released. For more information, visit our ISB-CGC BigQuery Projects page. Let's see which tables are in the versioned TCGA dataset.
dataset_with_data = 'TCGA_versioned'
print("Tables:")
# Create a variable with the list of tables in the dataset
tables = list(client.list_tables(f'{project_with_data}.{dataset_with_data}'))
# If there are tables then print their names,
# else print that there are no tables
if tables:
    for table in tables:
        print("\t{}".format(table.table_id))
else:
    print("\tThis dataset does not contain any tables.")
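Because the versioned dataset keeps one table per release, you may want to pick the newest release of a given table programmatically. The following is a rough sketch under the assumption that versioned table names end in `_r<number>`, as `clinical_gdc_r37` does; `latest_release` is a hypothetical helper and the example IDs are illustrative, not a listing of the real dataset.

```python
import re


def latest_release(table_ids, prefix):
    """Return the table id with the highest release number for a prefix, or None."""
    pattern = re.compile(rf"^{re.escape(prefix)}_r(\d+)$")
    best_id, best_release = None, -1
    for table_id in table_ids:
        match = pattern.match(table_id)
        if match and int(match.group(1)) > best_release:
            best_id, best_release = table_id, int(match.group(1))
    return best_id


# Illustrative ids only; with the notebook's variables you could pass
# [table.table_id for table in tables] instead.
example_ids = ["clinical_gdc_r36", "clinical_gdc_r37", "biospecimen_gdc_r37"]
print(latest_release(example_ids, "clinical_gdc"))
```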
In this section, we will store our SQL in a string variable, send the query to BigQuery, and save the result to a dataframe.
SELECT # Select a few columns to view
proj__project_id, # GDC project
submitter_id, # case barcode
proj__name # GDC project name
FROM # Which table in BigQuery in the format of `project.dataset.table`
`project_name.dataset_name.table_name` # From the GDC TCGA Clinical Dataset
LIMIT
5 # Limit to 5 rows as the dataset is very large and we only want to see a few results
Note: LIMIT only limits the number of rows returned, not the number of rows that the query looks at.
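Since LIMIT does not reduce the amount of data scanned, it can be useful to check how much a query would process before running it. The sketch below uses BigQuery's dry-run option for this; `estimate_bytes` and `human_readable` are hypothetical helper names, and the function assumes a `client` created as shown earlier in this notebook.

```python
def estimate_bytes(client, sql):
    """Ask BigQuery how many bytes a query would process, without running it."""
    from google.cloud import bigquery  # imported lazily; needs google-cloud-bigquery

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)  # API request; a dry run returns no rows
    return job.total_bytes_processed


def human_readable(num_bytes):
    """Format a byte count with binary units (KiB, MiB, ...)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024


# Usage, assuming `client` and a SQL string `query` exist:
# print(human_readable(estimate_bytes(client, query)))
```

Comparing the estimate with and without the LIMIT clause is a quick way to see that the clause leaves the number of bytes processed unchanged.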
query = ("""
SELECT
proj__project_id,
submitter_id,
proj__name
FROM
`isb-cgc-bq.TCGA_versioned.clinical_gdc_r37`
LIMIT
5""")
result = client.query(query).to_dataframe() # API request
print(result)
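To preview other tables in the same way, the query string can be assembled from its parts. This is only a sketch: `preview_query` is a hypothetical helper, it performs basic shape checks rather than full input sanitization, and it should only be given table and column names you trust.

```python
def preview_query(table, columns, limit=5):
    """Build a SELECT ... LIMIT statement for a `project.dataset.table` id."""
    if table.count(".") != 2:
        raise ValueError("table must look like project.dataset.table")
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    column_list = ",\n  ".join(columns)
    return f"SELECT\n  {column_list}\nFROM\n  `{table}`\nLIMIT\n  {limit}"


sql = preview_query(
    "isb-cgc-bq.TCGA_versioned.clinical_gdc_r37",
    ["proj__project_id", "submitter_id", "proj__name"],
)
print(sql)

# With a client available, the result can be loaded as before:
# result = client.query(sql).to_dataframe()
```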
There are several ways to access and explore the data hosted by ISB-CGC.
ISB-CGC
Google Cloud
Suggested Programming Languages and Programs to use
SQL
Command Line Interfaces
Getting Started for Free:
Useful ISB-CGC Links:
Useful Google Tutorials: