The National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC generates comprehensive proteomics and genomics data from clinical cohorts, typically with ~100 samples per tumor type. The graphic below summarizes the structure of each CPTAC dataset. For more information, visit the NIH website.
This Python package makes accessing CPTAC data easy with Python code and Jupyter notebooks. The package contains several tutorials which demonstrate data access and usage. This first tutorial serves as an introduction to the data to help users become familiar with what is included and how it is presented.
Our package provides data access in a Python programming environment. If you have not installed Python or have not installed the package, see our installation documentation here.
Once we have the package installed and we're in our Python environment, we begin by importing the package with a standard Python import statement:
import cptac
To view the available datasets, call the cptac.list_datasets() function:
cptac.list_datasets()
Dataset name | Description | Data reuse status | Publication link |
---|---|---|---|
Brca | breast cancer | no restrictions | https://pubmed.ncbi.nlm.nih.gov/33212010/ |
Ccrcc | clear cell renal cell carcinoma (kidney) | no restrictions | https://pubmed.ncbi.nlm.nih.gov/31675502/ |
Colon | colorectal cancer | no restrictions | https://pubmed.ncbi.nlm.nih.gov/31031003/ |
Endometrial | endometrial carcinoma (uterine) | no restrictions | https://pubmed.ncbi.nlm.nih.gov/32059776/ |
Gbm | glioblastoma | no restrictions | https://pubmed.ncbi.nlm.nih.gov/33577785/ |
Hnscc | head and neck squamous cell carcinoma | no restrictions | https://pubmed.ncbi.nlm.nih.gov/33417831/ |
Lscc | lung squamous cell carcinoma | no restrictions | https://pubmed.ncbi.nlm.nih.gov/34358469/ |
Luad | lung adenocarcinoma | no restrictions | https://pubmed.ncbi.nlm.nih.gov/32649874/ |
Ovarian | high grade serous ovarian cancer | no restrictions | https://pubmed.ncbi.nlm.nih.gov/27372738/ |
Pdac | pancreatic ductal adenocarcinoma | no restrictions | https://pubmed.ncbi.nlm.nih.gov/34534465/ |
UcecConf | endometrial confirmatory carcinoma | password access only | unpublished |
GbmConf | glioblastoma confirmatory | password access only | unpublished |
The goals of CPTAC as a consortium include the broad and open dissemination of cancer proteogenomic data. The timing of a dataset's public release generally follows three stages: internal release to CPTAC investigators, public release with a publication embargo, and full public release. Each cancer type may be at a different data availability stage, depending on the date of data creation. In the Python cptac package, these three stages are handled as follows:
Internally released data requires a password to download.
Embargoed release data is publicly available, but prints an embargo statement every time you interact with the data.
Public data is fully released without restrictions.
The cptac package stores the data files for each dataset on a remote server. When you first install cptac, you will have no data files. To install the latest version of the data files for a particular dataset, call the cptac.download function, passing the name of your desired dataset as the dataset parameter:
cptac.download(dataset="endometrial")
True
Once you've downloaded a dataset, cptac allows you to load the dataset into a Python variable, and you can use that variable to access and work with the data. To load a particular dataset into a variable, type the name you want to give the variable, followed by =, and then cptac. followed by the name of the dataset in UpperCamelCase and a pair of parentheses, e.g. cptac.Endometrial() or cptac.Ccrcc():
en = cptac.Endometrial()
To see what data is available, use the en.list_data() function. This displays the different types of data included in the dataset for this particular cancer type, each stored in a pandas dataframe. It also prints the dimensions of each dataframe.
en.list_data()
Below are the dataframes contained in this dataset and their dimensions:

acetylproteomics 144 rows 10862 columns
circular_RNA 109 rows 4945 columns
clinical 144 rows 27 columns
CNV 95 rows 28057 columns
derived_molecular 144 rows 125 columns
experimental_design 144 rows 26 columns
followup 396 rows 49 columns
miRNA 99 rows 2337 columns
phosphoproteomics 144 rows 73212 columns
proteomics 144 rows 10999 columns
somatic_mutation 52560 rows 3 columns
somatic_mutation_binary 95 rows 51559 columns
transcriptomics 109 rows 28057 columns
Data can be accessed through several "get" functions. For example, we can look at the proteomics data by using en.get_proteomics(). This returns a pandas dataframe containing the proteomic data. Each column in the proteomics dataframe is the quantitative measurement for a particular protein. Each row in the proteomics dataframe is a sample, either tumor or non-tumor, from a cancer patient.
proteomics = en.get_proteomics()
samples = proteomics.index
proteins = proteomics.columns
print("Samples:",samples[0:20].tolist()) #the first twenty samples
print("Proteins:",proteins[0:20].tolist()) #the first twenty proteins
Samples: ['C3L-00006', 'C3L-00008', 'C3L-00032', 'C3L-00090', 'C3L-00098', 'C3L-00136', 'C3L-00137', 'C3L-00139', 'C3L-00143', 'C3L-00145', 'C3L-00156', 'C3L-00161', 'C3L-00358', 'C3L-00361', 'C3L-00362', 'C3L-00413', 'C3L-00449', 'C3L-00563', 'C3L-00586', 'C3L-00601']
Proteins: ['A1BG', 'A2M', 'A2ML1', 'A4GALT', 'AAAS', 'AACS', 'AADAT', 'AAED1', 'AAGAB', 'AAK1', 'AAMDC', 'AAMP', 'AAR2', 'AARS', 'AARS2', 'AARSD1', 'AASDHPPT', 'AASS', 'AATF', 'ABAT']
Values in the dataframe are protein abundance values. A value of "NaN" means that the sample had no data for that particular protein. For the endometrial CPTAC proteomics data, a TMT-reference channel strategy was used. A detailed description of this strategy can be found at Nature Protocols and also at PubMed Central. In this strategy, each sample's abundance is expressed as a ratio to a pooled reference, and the ratio is then log transformed. Positive values therefore indicate a measurement higher than the pooled reference, and negative values a measurement lower than the pooled reference.
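To make the sign convention concrete, here is a minimal sketch of the ratio-then-log calculation. It is not part of the cptac package: the intensities are made-up numbers, the helper name log_ratio is hypothetical, and base 2 is an assumption, since the text above does not state which logarithm base was used.

```python
import math


def log_ratio(sample_abundance, reference_abundance):
    """Return log2(sample / reference). Base 2 is an assumption for this sketch."""
    return math.log2(sample_abundance / reference_abundance)


reference = 1000.0  # hypothetical pooled-reference intensity

higher = log_ratio(2000.0, reference)  # sample above the reference
lower = log_ratio(500.0, reference)    # sample below the reference
equal = log_ratio(1000.0, reference)   # sample equal to the reference

print(higher, lower, equal)  # 1.0 -1.0 0.0
```

Under these assumptions, a value of 1.0 means twice the pooled reference, -1.0 means half of it, and 0.0 means no difference.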
proteomics.head()
Patient_ID | A1BG | A2M | A2ML1 | A4GALT | AAAS | AACS | AADAT | AAED1 | AAGAB | AAK1 | ... | ZSWIM8 | ZSWIM9 | ZW10 | ZWILCH | ZWINT | ZXDC | ZYG11B | ZYX | ZZEF1 | ZZZ3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
C3L-00006 | -1.180 | -0.8630 | -0.802 | 0.222 | 0.2560 | 0.6650 | 1.2800 | -0.3390 | 0.412 | -0.664 | ... | -0.08770 | NaN | 0.0229 | 0.1090 | NaN | -0.332 | -0.43300 | -1.020 | -0.1230 | -0.0859 |
C3L-00008 | -0.685 | -1.0700 | -0.684 | 0.984 | 0.1350 | 0.3340 | 1.3000 | 0.1390 | 1.330 | -0.367 | ... | -0.03560 | NaN | 0.3630 | 1.0700 | 0.737 | -0.564 | -0.00461 | -1.130 | -0.0757 | -0.4730 |
C3L-00032 | -0.528 | -1.3200 | 0.435 | NaN | -0.2400 | 1.0400 | -0.0213 | -0.0479 | 0.419 | -0.500 | ... | 0.00112 | -0.1450 | 0.0105 | -0.1160 | NaN | 0.151 | -0.07400 | -0.540 | 0.3200 | -0.4190 |
C3L-00090 | -1.670 | -1.1900 | -0.443 | 0.243 | -0.0993 | 0.7570 | 0.7400 | -0.9290 | 0.229 | -0.223 | ... | 0.07250 | -0.0552 | -0.0714 | 0.0933 | 0.156 | -0.398 | -0.07520 | -0.797 | -0.0301 | -0.4670 |
C3L-00098 | -0.374 | -0.0206 | -0.537 | 0.311 | 0.3750 | 0.0131 | -1.1000 | NaN | 0.565 | -0.101 | ... | -0.17600 | NaN | -1.2200 | -0.5620 | 0.937 | -0.646 | 0.20700 | -1.850 | -0.1760 | 0.0513 |
5 rows × 10999 columns
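As a small illustration of the row and column layout, the sketch below rebuilds a three-sample, three-protein corner of the table above by hand, so it runs without downloading anything. It then uses standard pandas calls to pull out one protein, pull out one sample, and count missing values. The name proteomics_demo is ours and not something the package provides. With the real data, the same calls should apply to the dataframe returned by en.get_proteomics().

```python
import pandas as pd

# Hand-copied corner of the proteomics table shown above (illustration only).
proteomics_demo = pd.DataFrame(
    {
        "A1BG": [-1.180, -0.685, -0.528],
        "A2M": [-0.8630, -1.0700, -1.3200],
        "A4GALT": [0.222, 0.984, float("nan")],
    },
    index=pd.Index(["C3L-00006", "C3L-00008", "C3L-00032"], name="Patient_ID"),
)

# One protein across all samples (a column).
a1bg = proteomics_demo["A1BG"]

# One sample across all proteins (a row).
sample = proteomics_demo.loc["C3L-00032"]

# Number of samples with no measurement, per protein.
missing_per_protein = proteomics_demo.isna().sum()

# Keep only proteins measured in every sample.
complete_proteins = proteomics_demo.dropna(axis="columns")

print(missing_per_protein)
print(complete_proteins.columns.tolist())
```

Whether to drop proteins with missing values or keep them depends on the downstream analysis. The dropna call here only shows one option.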
As seen in en.list_data(), other omics data are also available (e.g. transcriptomics, copy number variation, phosphoproteomics).
The transcriptomics data looks almost identical to the proteomics data and is available in a pandas dataframe with the same convention. Sample identifiers are consistent across dataframes, meaning a sample found in the endometrial proteomics data refers to the same sample in all other endometrial dataframes.
transcriptomics = en.get_transcriptomics()
transcriptomics.head()
Patient_ID | A1BG | A1BG-AS1 | A1CF | A2M | A2M-AS1 | A2ML1 | A2MP1 | A3GALT2 | A4GALT | A4GNT | ... | ZWILCH | ZWINT | ZXDA | ZXDB | ZXDC | ZYG11A | ZYG11B | ZYX | ZZEF1 | ZZZ3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
C3L-00006 | 4.02 | 2.16 | 3.27 | 13.39 | 5.88 | 6.79 | 1.55 | 0.97 | 10.34 | 1.96 | ... | 11.06 | 10.73 | 8.40 | 9.78 | 10.88 | 5.93 | 11.52 | 10.23 | 11.50 | 11.47 |
C3L-00008 | 4.81 | 2.21 | 4.86 | 13.24 | 5.93 | 6.33 | 0.93 | 0.00 | 10.83 | 0.00 | ... | 10.87 | 11.43 | 8.39 | 9.14 | 10.38 | 7.25 | 11.64 | 10.64 | 11.26 | 11.57 |
C3L-00032 | 6.24 | 6.43 | 3.68 | 14.32 | 6.53 | 9.42 | 2.79 | 0.00 | 10.98 | 2.13 | ... | 10.06 | 10.13 | 8.35 | 9.27 | 10.46 | 6.85 | 11.60 | 10.21 | 11.51 | 11.09 |
C3L-00090 | 5.31 | 4.87 | 5.59 | 13.77 | 6.35 | 4.22 | 2.97 | 0.00 | 8.68 | 1.98 | ... | 10.29 | 10.41 | 9.10 | 9.59 | 10.15 | 7.89 | 11.90 | 10.21 | 11.34 | 11.51 |
C3L-00098 | 9.84 | 8.83 | 7.00 | 13.12 | 6.49 | 6.83 | 1.80 | 0.00 | 11.42 | 3.28 | ... | 10.36 | 11.24 | 8.60 | 9.44 | 11.80 | 9.32 | 11.97 | 9.77 | 11.37 | 12.35 |
5 rows × 28057 columns
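Because samples are identified the same way in every dataframe, we can check which samples two dataframes have in common by comparing their indexes. The en.list_data() output above shows that the dataframes do not all have the same number of rows (144 for proteomics, 109 for transcriptomics), so this check can be useful before combining data. The sketch below uses two tiny hand-made dataframes with values copied from the tables above. The names ending in _demo are ours, and C3L-00032 is left out of the transcriptomics demo on purpose to show the effect.

```python
import pandas as pd

proteomics_demo = pd.DataFrame(
    {"A1BG": [-1.180, -0.685, -0.528]},
    index=pd.Index(["C3L-00006", "C3L-00008", "C3L-00032"], name="Patient_ID"),
)

# C3L-00032 is deliberately omitted here for illustration.
transcriptomics_demo = pd.DataFrame(
    {"A1BG": [4.02, 4.81]},
    index=pd.Index(["C3L-00006", "C3L-00008"], name="Patient_ID"),
)

# Samples present in both dataframes.
shared = proteomics_demo.index.intersection(transcriptomics_demo.index)

# Samples present in the proteomics demo only.
only_in_proteomics = proteomics_demo.index.difference(transcriptomics_demo.index)

print("Shared:", shared.tolist())
print("Proteomics only:", only_in_proteomics.tolist())
```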
The clinical dataframe lists clinical information for the patient associated with each sample (e.g. age, race, diabetes status, tumor size).
clinical = en.get_clinical()
clinical.head()
Patient_ID | Sample_ID | Sample_Tumor_Normal | Proteomics_Tumor_Normal | Country | Histologic_Grade_FIGO | Myometrial_invasion_Specify | Histologic_type | Treatment_naive | Tumor_purity | Path_Stage_Primary_Tumor-pT | ... | Age | Diabetes | Race | Ethnicity | Gender | Tumor_Site | Tumor_Site_Other | Tumor_Focality | Tumor_Size_cm | Num_full_term_pregnancies |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
C3L-00006 | S001 | Tumor | Tumor | United States | FIGO grade 1 | under 50 % | Endometrioid | YES | Normal | pT1a (FIGO IA) | ... | 64.0 | No | White | Not-Hispanic or Latino | Female | Anterior endometrium | NaN | Unifocal | 2.9 | 1 |
C3L-00008 | S002 | Tumor | Tumor | United States | FIGO grade 1 | under 50 % | Endometrioid | YES | Normal | pT1a (FIGO IA) | ... | 58.0 | No | White | Not-Hispanic or Latino | Female | Posterior endometrium | NaN | Unifocal | 3.5 | 1 |
C3L-00032 | S003 | Tumor | Tumor | United States | FIGO grade 2 | under 50 % | Endometrioid | YES | Normal | pT1a (FIGO IA) | ... | 50.0 | Yes | White | Not-Hispanic or Latino | Female | Other, specify | Anterior and Posterior endometrium | Unifocal | 4.5 | 4 or more |
C3L-00090 | S005 | Tumor | Tumor | United States | FIGO grade 2 | under 50 % | Endometrioid | YES | Normal | pT1a (FIGO IA) | ... | 75.0 | No | White | Not-Hispanic or Latino | Female | Other, specify | Anterior and Posterior endometrium | Unifocal | 3.5 | 4 or more |
C3L-00098 | S006 | Tumor | Tumor | United States | NaN | under 50 % | Serous | YES | Normal | pT1a (FIGO IA) | ... | 63.0 | No | White | Not-Hispanic or Latino | Female | Other, specify | Anterior and Posterior endometrium | Unifocal | 6.0 | 2 |
5 rows × 27 columns
In addition to donating a tumor sample, some patients also had a normal sample taken for control and comparison. We can identify these samples by looking for samples marked "Normal" in the "Sample_Tumor_Normal" column, and whose Patient IDs are the same as the Patient IDs of tumor samples, but with a ".N" appended to the ID. For example, patient C3L-00006 provided both a tumor sample (marked C3L-00006) and a normal sample (marked C3L-00006.N). Note that the normal samples do not have many values in the clinical columns, because much of the information does not apply to non-tumor samples. Additionally, in cases where a column would have identical values for tumor and normal samples from the same patient (e.g., patient age and gender), the information is recorded only for the tumor sample.
clinical.loc[["C3L-00006","C3L-00361","C3L-01246", "C3L-00006.N","C3L-00361.N","C3L-01246.N"]]
Patient_ID | Sample_ID | Sample_Tumor_Normal | Proteomics_Tumor_Normal | Country | Histologic_Grade_FIGO | Myometrial_invasion_Specify | Histologic_type | Treatment_naive | Tumor_purity | Path_Stage_Primary_Tumor-pT | ... | Age | Diabetes | Race | Ethnicity | Gender | Tumor_Site | Tumor_Site_Other | Tumor_Focality | Tumor_Size_cm | Num_full_term_pregnancies |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
C3L-00006 | S001 | Tumor | Tumor | United States | FIGO grade 1 | under 50 % | Endometrioid | YES | Normal | pT1a (FIGO IA) | ... | 64.0 | No | White | Not-Hispanic or Latino | Female | Anterior endometrium | NaN | Unifocal | 2.9 | 1 |
C3L-00361 | S017 | Tumor | Tumor | United States | FIGO grade 1 | Not identified | Endometrioid | YES | Normal | pT1a (FIGO IA) | ... | 64.0 | Yes | White | Not-Hispanic or Latino | Female | Anterior endometrium | NaN | Unifocal | 2.7 | None |
C3L-01246 | S042 | Tumor | Tumor | Other_specify | NaN | under 50 % | Serous | YES | Normal | pT1a (FIGO IA) | ... | 62.0 | No | White | Not reported | Female | Posterior endometrium | NaN | Unifocal | 2.3 | 1 |
C3L-00006.N | S105 | Normal | Adjacent_normal | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
C3L-00361.N | S106 | Normal | Adjacent_normal | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
C3L-01246.N | S114 | Normal | Adjacent_normal | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6 rows × 27 columns
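The ".N" naming convention makes it possible to match each normal sample to its tumor counterpart with plain string handling. The sketch below is one way to do that. The helper split_sample_id is a hypothetical name of ours, and the ID list is a small illustrative subset taken from the tables above, so a patient missing from paired_patients here says nothing about the real dataset.

```python
NORMAL_SUFFIX = ".N"


def split_sample_id(sample_id):
    """Return (patient_id, tissue) for an ID such as 'C3L-00006.N'."""
    if sample_id.endswith(NORMAL_SUFFIX):
        return sample_id[: -len(NORMAL_SUFFIX)], "Normal"
    return sample_id, "Tumor"


# Illustrative subset of sample IDs; not the full cohort.
sample_ids = [
    "C3L-00006", "C3L-00032", "C3L-00361", "C3L-01246",
    "C3L-00006.N", "C3L-00361.N", "C3L-01246.N",
]

# Collect the tissue types seen for each patient.
tissues_by_patient = {}
for sample_id in sample_ids:
    patient_id, tissue = split_sample_id(sample_id)
    tissues_by_patient.setdefault(patient_id, set()).add(tissue)

# Patients in this subset with both a tumor and a normal sample.
paired_patients = sorted(
    patient_id
    for patient_id, tissues in tissues_by_patient.items()
    if tissues == {"Tumor", "Normal"}
)

print(paired_patients)
```

On the real clinical dataframe, the equivalent starting point would be the standard pandas filter clinical[clinical["Sample_Tumor_Normal"] == "Normal"].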
Each cancer dataset contains mutation data for the cohort. The data consists of all somatic mutations found for each sample (meaning there will be many lines for each sample). Each row lists the specific gene that was mutated, the type of mutation, and the location of the mutation. This data is a direct import of a MAF file.
somatic_mutations = en.get_somatic_mutation()
somatic_mutations.head()
Patient_ID | Gene | Mutation | Location |
---|---|---|---|
C3L-00006 | AAK1 | Missense_Mutation | p.A592V |
C3L-00006 | AANAT | Missense_Mutation | p.R176W |
C3L-00006 | ABCA12 | Frame_Shift_Del | p.N1671Ifs*4 |
C3L-00006 | ABCC4 | Missense_Mutation | p.R691H |
C3L-00006 | ABL1 | Missense_Mutation | p.G273R |
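Since each row is one mutation and a sample appears on many rows, a common first step is to count rows per sample or per mutation type. The sketch below does this with standard pandas calls on a hand-made copy of the five rows shown above. The name mutations_demo is ours, and with the real data the same calls should apply to the dataframe returned by en.get_somatic_mutation().

```python
import pandas as pd

# Hand-copied from the head() output above (illustration only).
mutations_demo = pd.DataFrame(
    {
        "Gene": ["AAK1", "AANAT", "ABCA12", "ABCC4", "ABL1"],
        "Mutation": [
            "Missense_Mutation",
            "Missense_Mutation",
            "Frame_Shift_Del",
            "Missense_Mutation",
            "Missense_Mutation",
        ],
        "Location": ["p.A592V", "p.R176W", "p.N1671Ifs*4", "p.R691H", "p.G273R"],
    },
    index=pd.Index(["C3L-00006"] * 5, name="Patient_ID"),
)

# Number of mutation rows for each sample.
mutations_per_sample = mutations_demo.groupby(level="Patient_ID").size()

# Number of rows for each mutation type.
type_counts = mutations_demo["Mutation"].value_counts()

print(mutations_per_sample)
print(type_counts)
```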
If you wish to export a dataframe to a file, simply call the dataframe's to_csv method, passing the path you wish to save the file to, and the value separator you want:
clinical = en.get_clinical()
clinical.to_csv(path_or_buf="clinical_dataframe.tsv", sep='\t')
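If you later want to load the exported file back into pandas, a sketch like the following should work. It writes a tiny hand-made dataframe (clinical_demo, a name of ours, with two columns copied from the clinical table above) to a temporary directory and reads it back with pandas.read_csv. Passing index_col tells pandas to restore Patient_ID as the index rather than as an ordinary column.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the clinical dataframe (illustration only).
clinical_demo = pd.DataFrame(
    {"Sample_ID": ["S001", "S002"], "Age": [64.0, 58.0]},
    index=pd.Index(["C3L-00006", "C3L-00008"], name="Patient_ID"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clinical_dataframe.tsv")

    # Write a tab-separated file, as in the cell above.
    clinical_demo.to_csv(path_or_buf=path, sep="\t")

    # Read it back, restoring Patient_ID as the index.
    reloaded = pd.read_csv(path, sep="\t", index_col="Patient_ID")

print(reloaded)
```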
To view the documentation for a dataset, pass it to the Python help function, e.g. help(en). You can also view the documentation for just a specific function: help(en.join_omics_to_omics).
help(en.join_omics_to_omics)
Help on method join_omics_to_omics in module cptac.dataset:

join_omics_to_omics(df1_name, df2_name, genes1=None, genes2=None, how='outer', quiet=False, tissue_type='both') method of cptac.endometrial.Endometrial instance
    Take specified column(s) from one omics dataframe, and join to specified columns(s) from another omics dataframe. Intersection (inner join) of indices is used.

    Parameters:
    df1_name (str): Name of first omics dataframe to select columns from.
    df2_name (str): Name of second omics dataframe to select columns from.
    genes1 (str, or list or array-like of str, optional): Gene(s) for column(s) to select from df1_name. str if one key, list or array-like of str if multiple. Default of None will select entire dataframe.
    genes2 (str, or list or array-like of str, optional): Gene(s) for Column(s) to select from df2_name. str if one key, list or array-like of str if multiple. Default of None will select entire dataframe.
    how (str, optional): How to perform the join, acceptable values are from ['outer', 'inner', 'left', 'right']. Defaults to 'outer'.
    quiet (bool, optional): Whether to warn when inserting NaNs. Defaults to False.
    tissue_type (str): Acceptable values in ["tumor","normal","both"]. Specifies the desired tissue type desired in the dataframe. Defaults to "both".

    Returns:
    pandas.DataFrame: The selected columns from the two omics dataframes, joined into one dataframe.
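To get a feel for what the how parameter changes, the sketch below performs an analogous join with plain pandas on two tiny hand-made dataframes. This is not the package's implementation: the helper join_columns, the _demo names, and the column suffixes are all assumptions made for illustration, and C3L-00008 is left out of the transcriptomics demo on purpose.

```python
import pandas as pd

proteomics_demo = pd.DataFrame(
    {"A1BG": [-1.180, -0.685], "A2M": [-0.8630, -1.0700]},
    index=pd.Index(["C3L-00006", "C3L-00008"], name="Patient_ID"),
)

# C3L-00008 is deliberately omitted here for illustration.
transcriptomics_demo = pd.DataFrame(
    {"A1BG": [4.02, 6.24], "A2M": [13.39, 14.32]},
    index=pd.Index(["C3L-00006", "C3L-00032"], name="Patient_ID"),
)


def join_columns(df1, df2, genes1, genes2, name1, name2, how="outer"):
    """Select gene columns from two dataframes and join them on the index.

    The suffixes keep same-named genes from the two sources apart;
    the naming scheme is an assumption of this sketch.
    """
    left = df1[genes1].add_suffix("_" + name1)
    right = df2[genes2].add_suffix("_" + name2)
    return left.join(right, how=how)


# Outer join: every sample from either dataframe, NaN where data is missing.
outer = join_columns(
    proteomics_demo, transcriptomics_demo,
    ["A1BG"], ["A1BG"], "proteomics", "transcriptomics", how="outer",
)

# Inner join: only samples present in both dataframes.
inner = join_columns(
    proteomics_demo, transcriptomics_demo,
    ["A1BG"], ["A1BG"], "proteomics", "transcriptomics", how="inner",
)

print(outer)
print(inner)
```

Based on the signature in the help output, the corresponding call on the real dataset would presumably look like en.join_omics_to_omics(df1_name="proteomics", df2_name="transcriptomics", genes1="A1BG", genes2="A1BG").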