#!/usr/bin/env python
# coding: utf-8

# # Tutorial 7: Pancan data access
#
# This tutorial shows how to use the `cptac.pancan` submodule to access data from the harmonized pipelines for all cancer types.
#
# Before the harmonized pipelines, the team working on each cancer type had its own pipeline for each data type. For example, the ccRCC team ran the ccRCC data through their own transcriptomics pipeline, and the HNSCC team ran the HNSCC data through a different transcriptomics pipeline. This made it hard to study trends across multiple cancer types, since each cancer type's data had been processed differently.
#
# To fix this problem, the data for all cancer types were run through the same pipeline for each data type. These are the harmonized pipelines. Now, for example, you can get transcriptomics data for both ccRCC and HNSCC (and all other cancer types) that came from the same pipeline.
#
# For some data types, multiple harmonized pipelines were available. In these cases, all cancers were run through each pipeline, and you can choose which one to use. For example, you can get transcriptomics data from the BCM, Broad, or WashU pipeline. Whichever pipeline you choose, you can get transcriptomics data for all cancer types through that one pipeline.
#
# First, we'll import the package.

# In[1]:


import cptac.pancan as pc


# We can list which cancers we have data for.

# In[2]:


pc.list_datasets()


# ## Download
#
# Authentication through your Box account is required when you download files. Pass the name of the dataset you want, as listed by `list_datasets`. Capitalization does not matter.
#
# See the end of this tutorial for how to download files on a remote computer that doesn't have a web browser for logging in to Box.

# In[3]:


pc.download("pancanbrca")


# ## Load the BRCA dataset

# In[4]:


br = pc.PancanBrca()


# We can list which data types are available from which sources.

# In[5]:


br.list_data_sources()


# Let's get some data tables.

# In[6]:


br.get_clinical(source="mssm")


# In[7]:


br.get_somatic_mutation(source="washu")


# In[8]:


br.get_proteomics(source="umich")


# ## Box authentication for remote downloads
#
# Normally, when you download the `cptac.pancan` data files, you're required to log in to your Box account, as these files are not released publicly. However, the computer you're running your analysis on may not have a web browser you can use to log in to Box. For example, you may be running your code in a remotely hosted notebook (e.g. Google Colab), or on a computer cluster that you access using ssh.
#
# In these situations, follow these steps to take care of Box authentication:
# 1. On a computer where you do have access to a web browser to log in to Box, load the `cptac.pancan` module.
# 2. Call the `cptac.pancan.get_box_token` function. This will return a temporary access token that gives permission to download files from Box with your credentials. The token expires 1 hour after it's created.
# 3. On the remote computer, when you call the `cptac.pancan.download` function, copy and paste the access token you generated on your local machine into the `box_token` parameter of the function. The program will then be able to download the data files.
#
# Below is all the code you would need to run for this process on each machine. For security, we will not actually run it in this notebook.
#
# On your local machine:
# ```
# import cptac.pancan as pc
# pc.get_box_token()
# ```
#
# On the remote machine:
# ```
# import cptac.pancan as pc
# pc.download("pancanbrca", box_token=[INSERT TOKEN HERE])
# ```
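#
# As a convenience for the remote workflow above, here is a minimal sketch of one way to avoid pasting the token directly into your code: read it from an environment variable and pass it on as `box_token`. The helper name `download_with_box_token` and the variable name `BOX_TOKEN` are hypothetical and not part of `cptac`; only the `box_token` parameter of `pc.download` comes from the steps above. The download function is passed in as an argument, so the helper itself does not import `cptac`.

```python
import os


def download_with_box_token(download_func, dataset, env_var="BOX_TOKEN"):
    """Call download_func(dataset, box_token=...) with a token read from the environment.

    download_func is expected to accept a `box_token` keyword argument, as
    `cptac.pancan.download` does in the steps above. The helper name and the
    default environment variable name are illustrative assumptions, not part
    of cptac.
    """
    # Strip whitespace that often sneaks in when a token is copied and pasted.
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"No Box token found in the {env_var} environment variable. "
            "Generate one on a machine with a web browser and set it first."
        )
    return download_func(dataset, box_token=token)


# Intended usage on the remote machine (not run here, since it needs a real token):
#   import cptac.pancan as pc
#   download_with_box_token(pc.download, "pancanbrca")
```

# Because the token expires 1 hour after it is created, it is best to generate it shortly before setting the environment variable and starting the download.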