ISB-CGC Community Notebooks

Check out more notebooks at our Community Notebooks Repository!

Title:   Quick Start Guide to ISB-CGC
Author:  Lauren Hagen
Created: 2019-06-20
Updated: 2021-07-27
Purpose: Painless intro to working in the cloud
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/Quick_Start_Guide_to_ISB_CGC.ipynb
Notes:

Quick Start Guide to ISB-CGC

ISB-CGC

This Quick Start Guide gives an overview of the available data, outlines account set-up, and walks through a basic example in Python. If you have read the R version, you can skip to the Example section.

Access Requirements

Access Suggestions

  • Favored Programming Language (R or Python)
  • Favored IDE (RStudio or Jupyter)
  • Some knowledge of SQL

Outline for this Notebook

  • Libraries Needed for this Notebook
  • Overview of ISB-CGC
  • Overview of How to Access Data
  • Example of Accessing BigQuery Data with Python
  • Where to go next

Libraries Needed for this Notebook

This notebook requires the BigQuery client library for Python (click here for more information). Importing it lets you access BigQuery programmatically.

In [ ]:
# Load BigQuery API
from google.cloud import bigquery

Overview of ISB-CGC

The ISB-CGC provides interactive and programmatic access to data hosted by institutes such as the Genomic Data Commons (GDC) and Proteomic Data Commons (PDC) from the National Cancer Institute (NCI) while leveraging many aspects of the Google Cloud Platform. You can also import your data, analyze it side by side with the datasets, and share your data when you see fit.

About the ISB-CGC Data in the Cloud

ISB-CGC hosts carefully curated, high-level clinical, biospecimen, and molecular datasets and tables in Google BigQuery, including data from programs such as The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and Clinical Proteomic Tumor Analysis Consortium (CPTAC). For more information about hosted data, please visit: Programs and DataSets

Overview of How to Access Data

There are several ways to access and explore the data hosted by ISB-CGC. In this notebook, we cover using Python and SQL to access the data.

Account Set-up

To run this notebook, you will need to have your Google Cloud Account set up. If you need to set up a Google Cloud Account, follow the "Obtain a Google identity" and "Set up a Google Cloud Project" steps on our Quick-Start Guide documentation page.

ISB-CGC Web Interface

The ISB-CGC Web Interface is an interactive web-based application to access and explore the rich TCGA, TARGET, and CCLE datasets, with more datasets added regularly. Through the WebApp, you can create Cohorts and lists of Favorite Genes, miRNA, and Variables. The Cohorts and Variables can be used in Workbooks to allow you to quickly analyze and export datasets by mixing and matching the selections.

Google Cloud Platform and BigQuery Overview

The Google Cloud Platform Console is the web-based interface to your GCP Project. From the Console, you can check the overall status of your project, create and delete Cloud Storage buckets, upload and download files, spin up and shut down VMs, add members to your project, access the Cloud Shell command line, etc. You'll want to remember that any costs that you incur are charged under your current project, so you will want to make sure you are on the correct one if you are part of multiple projects.
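
Because query costs are billed to your project, it can help to check how much data a query would scan before running it. The sketch below shows one possible way to do that with a BigQuery dry run, which plans the query and reports the bytes it would process without executing it. It assumes you have already authenticated as described in the example section later in this notebook. `YOUR_PROJECT_ID` is a placeholder, and the helper names `format_bytes` and `estimate_bytes_scanned` are our own, not part of the BigQuery library.

```python
def format_bytes(num_bytes):
    """Render a byte count in binary units, e.g. 1536 -> '1.50 KiB'."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        # Stop at the first unit that fits, or at the largest unit we know
        if size < 1024 or unit == "TiB":
            return "{:.2f} {}".format(size, unit)
        size /= 1024


def estimate_bytes_scanned(sql, billing_project):
    """Dry-run a query and return the number of bytes it would process."""
    # Imported here so format_bytes can be used without the library installed
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    # dry_run=True asks BigQuery to plan the query without running it;
    # use_query_cache=False avoids a cached result reporting zero bytes
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed


# Example usage (requires authentication and your own project ID):
# sql = "SELECT submitter_id FROM `isb-cgc-bq.TCGA.clinical_gdc_current`"
# print(format_bytes(estimate_bytes_scanned(sql, "YOUR_PROJECT_ID")))
```

The dry run itself does not execute the query, so it is a low-risk way to confirm that you are pointed at the intended project and table before running anything larger.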

ISB-CGC has loaded multiple open-access cancer genomic and proteomic datasets into BigQuery tables, such as TCGA and TARGET clinical, biospecimen, and molecular data, along with case and file data. This data can be accessed from the Google Cloud Platform Console user interface (UI), programmatically with R and Python, or explored with our BigQuery Table Search tool.

Example of Accessing BigQuery Data with Python

Log into Google Cloud and Authenticate

  1. Run the command in the cell below to authenticate with your Google Cloud login
  2. Open the link provided, or switch to the second tab if one opens
  3. Follow the prompts to authorize your account to use the Google Cloud SDK
  4. Copy the code provided and paste it into the box under the command
  5. Press Enter

Alternative authentication methods

In [ ]:
!gcloud auth application-default login

View ISB-CGC Datasets and Tables in BigQuery

Let us look at the datasets available through ISB-CGC that are in BigQuery.

In [ ]:
# Create a client to access the data within BigQuery
# Note: you cannot use the project below as a billing project,
# it can only be used to view the tables and table schema
client = bigquery.Client('isb-cgc-bq')

# Create a variable of datasets 
datasets = list(client.list_datasets())
# Create a variable for the name of the project
project = client.project

# If there are datasets available then print their names,
# else print that there are no datasets available
if datasets:
    print("Datasets in project {}:".format(project))
    for dataset in datasets:  # API request(s)
        print("\t{}".format(dataset.dataset_id))
else:
    print("{} project does not contain any datasets.".format(project))
/usr/local/lib/python3.7/dist-packages/google/auth/_default.py:70: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK without a quota project. You might receive a "quota exceeded" or "API not enabled" error. We recommend you rerun `gcloud auth application-default login` and make sure a quota project is added. Or you can use service accounts instead. For more information about service accounts, see https://cloud.google.com/docs/authentication/
  warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
Datasets in project isb-cgc-bq:
	0_README
	BEATAML1_0
	BEATAML1_0_versioned
	CBTTC
	CBTTC_versioned
	CCLE
	CCLE_versioned
	CGCI
	CGCI_versioned
	CMI
	CMI_versioned
	CPTAC
	CPTAC_versioned
	CTSP
	CTSP_versioned
	FM
	FM_versioned
	GDC_case_file_metadata
	GDC_case_file_metadata_versioned
	GENCODE
	GENCODE_versioned
	GENIE
	GENIE_versioned
	GPRP
	GPRP_versioned
	HCMI
	HCMI_versioned
	ICPC
	ICPC_versioned
	MMRF
	MMRF_versioned
	NCICCR
	NCICCR_versioned
	OHSU
	OHSU_versioned
	ORGANOID
	ORGANOID_versioned
	PDC_metadata
	PDC_metadata_versioned
	Quant_Maps_Tissue_Biopsies
	Quant_Maps_Tissue_Biopsies_versioned
	TARGET
	TARGET_versioned
	TCGA
	TCGA_versioned
	VAREPOP
	VAREPOP_versioned
	WCDT
	WCDT_versioned
	annotations
	annotations_versioned
	functions
	pancancer_atlas
	supplementary_tables

The ISB-CGC has two datasets for each Program. One dataset contains the most current data, and the other contains versioned tables, which serve as an archive for reproducibility. The current tables are labeled with "_current" and are updated when new data is released. For more information, visit our ISB-CGC BigQuery Projects page.
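
To make the current/versioned pairing concrete, here is a small sketch that groups dataset names by their base name. It assumes that the `_versioned` suffix seen in the listing above marks the archive dataset. The sample names are copied from that listing, and `pair_datasets` is a hypothetical helper of ours rather than anything provided by ISB-CGC or BigQuery.

```python
def pair_datasets(dataset_ids):
    """Map each base dataset name to its versioned counterpart, or None."""
    suffix = "_versioned"
    ids = set(dataset_ids)
    pairs = {}
    for name in sorted(ids):
        if name.endswith(suffix):
            continue  # archive datasets are reported under their base name
        versioned = name + suffix
        pairs[name] = versioned if versioned in ids else None
    return pairs


# A few dataset IDs copied from the listing above
sample = ["TCGA", "TCGA_versioned", "TARGET", "TARGET_versioned", "functions"]
for current, versioned in pair_datasets(sample).items():
    print("{}: {}".format(current, versioned or "no versioned dataset"))
```

In a live session, the same helper could be fed the IDs from the earlier cell, for example `pair_datasets(d.dataset_id for d in datasets)`.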

Now, let us see which tables are under the TCGA dataset.

In [ ]:
print("Tables:")
# Create a variable with the list of tables in the dataset
tables = list(client.list_tables('isb-cgc-bq.TCGA'))

# If there are tables then print their names,
# else print that there are no tables
if tables:
    for table in tables:
        print("\t{}".format(table.table_id))
else:
    print("\tThis dataset does not contain any tables.")
Tables:
	DNA_methylation_chr10_hg19_gdc_current
	DNA_methylation_chr10_hg38_gdc_current
	DNA_methylation_chr11_hg19_gdc_current
	DNA_methylation_chr11_hg38_gdc_current
	DNA_methylation_chr12_hg19_gdc_current
	DNA_methylation_chr12_hg38_gdc_current
	DNA_methylation_chr13_hg19_gdc_current
	DNA_methylation_chr13_hg38_gdc_current
	DNA_methylation_chr14_hg19_gdc_current
	DNA_methylation_chr14_hg38_gdc_current
	DNA_methylation_chr15_hg19_gdc_current
	DNA_methylation_chr15_hg38_gdc_current
	DNA_methylation_chr16_hg19_gdc_current
	DNA_methylation_chr16_hg38_gdc_current
	DNA_methylation_chr17_hg19_gdc_current
	DNA_methylation_chr17_hg38_gdc_current
	DNA_methylation_chr18_hg19_gdc_current
	DNA_methylation_chr18_hg38_gdc_current
	DNA_methylation_chr19_hg19_gdc_current
	DNA_methylation_chr19_hg38_gdc_current
	DNA_methylation_chr1_hg19_gdc_current
	DNA_methylation_chr1_hg38_gdc_current
	DNA_methylation_chr20_hg19_gdc_current
	DNA_methylation_chr20_hg38_gdc_current
	DNA_methylation_chr21_hg19_gdc_current
	DNA_methylation_chr21_hg38_gdc_current
	DNA_methylation_chr22_hg19_gdc_current
	DNA_methylation_chr22_hg38_gdc_current
	DNA_methylation_chr2_hg19_gdc_current
	DNA_methylation_chr2_hg38_gdc_current
	DNA_methylation_chr3_hg19_gdc_current
	DNA_methylation_chr3_hg38_gdc_current
	DNA_methylation_chr4_hg19_gdc_current
	DNA_methylation_chr4_hg38_gdc_current
	DNA_methylation_chr5_hg19_gdc_current
	DNA_methylation_chr5_hg38_gdc_current
	DNA_methylation_chr6_hg19_gdc_current
	DNA_methylation_chr6_hg38_gdc_current
	DNA_methylation_chr7_hg19_gdc_current
	DNA_methylation_chr7_hg38_gdc_current
	DNA_methylation_chr8_hg19_gdc_current
	DNA_methylation_chr8_hg38_gdc_current
	DNA_methylation_chr9_hg19_gdc_current
	DNA_methylation_chr9_hg38_gdc_current
	DNA_methylation_chrX_hg19_gdc_current
	DNA_methylation_chrX_hg38_gdc_current
	DNA_methylation_chrY_hg19_gdc_current
	DNA_methylation_chrY_hg38_gdc_current
	DNA_methylation_hg19_gdc_current
	DNA_methylation_hg38_gdc_current
	RNAseq_hg19_gdc_current
	RNAseq_hg38_gdc_current
	annotations_gdc_current
	biospecimen_gdc_current
	clinical_CPTAC_TCGA_pdc_current
	clinical_diagnoses_treatments_gdc_current
	clinical_gdc_current
	copy_number_segment_masked_hg19_gdc_current
	copy_number_segment_masked_hg38_gdc_current
	miRNAseq_hg19_gdc_current
	miRNAseq_hg38_gdc_current
	miRNAseq_isoform_hg19_gdc_current
	miRNAseq_isoform_hg38_gdc_current
	per_sample_file_metadata_CPTAC_TCGA_pdc_current
	per_sample_file_metadata_hg19_gdc_current
	per_sample_file_metadata_hg38_gdc_current
	protein_expression_hg19_gdc_current
	protein_expression_hg38_gdc_current
	quant_phosphoproteome_TCGA_breast_cancer_pdc_current
	quant_phosphoproteome_TCGA_ovarian_PNNL_velos_qexactive_pdc_current
	quant_proteome_TCGA_breast_cancer_pdc_current
	quant_proteome_TCGA_ovarian_JHU_pdc_current
	quant_proteome_TCGA_ovarian_PNNL_pdc_current
	radiology_images_tcia_current
	slide_images_gdc_current
	somatic_mutation_hg38_gdc_current
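
The table names in this listing appear to follow a pattern: a data type, an optional genome build (`hg19` or `hg38`), a source tag (`gdc`, `pdc`, or `tcia`), and the `_current` suffix. The sketch below splits a table ID along those lines. The pattern is inferred from the listing rather than taken from ISB-CGC documentation, so treat it as an assumption, and note that `describe_table` is a hypothetical helper.

```python
import re

# Assumed naming pattern, inferred from the listing above:
#   <data type>[_<genome build>]_<source>_current
TABLE_NAME = re.compile(
    r"^(?P<data_type>.+?)"
    r"(?:_(?P<build>hg19|hg38))?"
    r"_(?P<source>gdc|pdc|tcia)_current$"
)


def describe_table(table_id):
    """Split a table ID into data type, genome build, and source.

    Returns None if the ID does not follow the assumed pattern.
    """
    match = TABLE_NAME.match(table_id)
    return match.groupdict() if match else None


# Table IDs copied from the listing above
for table_id in [
    "RNAseq_hg38_gdc_current",
    "clinical_gdc_current",
    "radiology_images_tcia_current",
]:
    print(table_id, "->", describe_table(table_id))
```

A helper like this can be handy for narrowing the list, for example keeping only the tables whose `build` is `hg38`.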

Query ISB-CGC BigQuery Tables

First, use a magic command to call BigQuery. Then write the query in Standard SQL. Click here for more on IPython Magic Commands for BigQuery. The result will be a pandas DataFrame.

Note: you will need to update PROJECT_ID in the next cell to your Google Cloud Project ID.

In [ ]:
%%bigquery --project PROJECT_ID
# The %%bigquery magic command must be the first line of the cell.
# Replace PROJECT_ID above with your own Google Cloud Project ID.
SELECT # Select a few columns to view
  proj__project_id, # GDC project
  submitter_id, # case barcode
  proj__name # GDC project name
FROM # From the GDC TCGA clinical table
  `isb-cgc-bq.TCGA.clinical_gdc_current`
LIMIT # Limit to 5 rows, as the table is very large and we only want to see a few results
  5

# General syntax of the query above:
# SELECT column_1, column_2
# FROM `project_name.dataset_name.table_name`
# LIMIT number_of_rows
#
# To list the columns of the tables in a dataset instead, query
# `project_name.dataset_name.INFORMATION_SCHEMA.COLUMNS`
Out[ ]:
   proj__project_id  submitter_id                             proj__name
0         TCGA-HNSC  TCGA-CN-5363  Head and Neck Squamous Cell Carcinoma
1         TCGA-HNSC  TCGA-CN-5365  Head and Neck Squamous Cell Carcinoma
2         TCGA-HNSC  TCGA-CN-A642  Head and Neck Squamous Cell Carcinoma
3         TCGA-HNSC  TCGA-CR-7380  Head and Neck Squamous Cell Carcinoma
4         TCGA-HNSC  TCGA-CV-5978  Head and Neck Squamous Cell Carcinoma
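
The same query can also be run without the magic command, through the Python client imported at the top of this notebook. The sketch below is one way to do it, under a few assumptions: `build_preview_query` and `run_query` are hypothetical helpers of ours, and `YOUR_PROJECT_ID` stands in for your own billing project. A new client is needed because, as noted earlier, the `isb-cgc-bq` project can be used to view tables but not for billing.

```python
def build_preview_query(table, columns, limit=5):
    """Build a SELECT statement for a few columns of a fully qualified table."""
    if not columns:
        raise ValueError("columns must not be empty")
    column_list = ",\n  ".join(columns)
    return "SELECT\n  {}\nFROM\n  `{}`\nLIMIT\n  {}".format(
        column_list, table, int(limit)
    )


def run_query(sql, billing_project):
    """Run a query billed to billing_project and return a pandas DataFrame."""
    # Imported here so the query builder works without the client installed
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    return client.query(sql).to_dataframe()


sql = build_preview_query(
    "isb-cgc-bq.TCGA.clinical_gdc_current",
    ["proj__project_id", "submitter_id", "proj__name"],
)
print(sql)

# Requires authentication; replace YOUR_PROJECT_ID with your own project ID
# df = run_query(sql, "YOUR_PROJECT_ID")
# df.head()
```

Keeping the query text in a Python string makes it easier to reuse the same statement across tables or to pass the resulting DataFrame straight into further analysis.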

Now that wasn't so difficult! Have fun exploring and analyzing the ISB-CGC Data!