Tutorial 1: CPTAC Data Introduction

The National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC generates comprehensive proteomics and genomics data from clinical cohorts, typically with ~100 samples per tumor type. The graphic below summarizes the structure of each CPTAC dataset. For more information, visit the NIH website.

This Python package makes accessing CPTAC data easy with Python code and Jupyter notebooks. The package contains several tutorials which demonstrate data access and usage. This first tutorial serves as an introduction to the data to help users become familiar with what is included and how it is presented.

Data Overview

Our package provides data access in a Python programming environment. If you have not installed Python or have not installed the package, see our installation documentation here.

Once we have the package installed and we're in our Python environment, we begin by importing the package with a standard Python import statement:

In [1]:

import cptac

To view the available datasets, call the cptac.list_datasets() function:

In [2]:

cptac.list_datasets()

Out[2]:

	Description	Data reuse status	Publication link
Dataset name
Brca	breast cancer	no restrictions	https://pubmed.ncbi.nlm.nih.gov/33212010/
Ccrcc	clear cell renal cell carcinoma (kidney)	no restrictions	https://pubmed.ncbi.nlm.nih.gov/31675502/
Colon	colorectal cancer	no restrictions	https://pubmed.ncbi.nlm.nih.gov/31031003/
Endometrial	endometrial carcinoma (uterine)	no restrictions	https://pubmed.ncbi.nlm.nih.gov/32059776/
Gbm	glioblastoma	no restrictions	https://pubmed.ncbi.nlm.nih.gov/33577785/
Hnscc	head and neck squamous cell carcinoma	no restrictions	https://pubmed.ncbi.nlm.nih.gov/33417831/
Lscc	lung squamous cell carcinoma	no restrictions	https://pubmed.ncbi.nlm.nih.gov/34358469/
Luad	lung adenocarcinoma	no restrictions	https://pubmed.ncbi.nlm.nih.gov/32649874/
Ovarian	high grade serous ovarian cancer	no restrictions	https://pubmed.ncbi.nlm.nih.gov/27372738/
Pdac	pancreatic ductal adenocarcinoma	no restrictions	https://pubmed.ncbi.nlm.nih.gov/34534465/
UcecConf	endometrial confirmatory carcinoma	password access only	unpublished
GbmConf	glioblastoma confirmatory	password access only	unpublished

Data Availability

The goals of CPTAC as a consortium include the broad and open dissemination of cancer proteogenomic data. The timing of a dataset's public release generally follows three stages: internal release to CPTAC investigators, public release with a publication embargo, and full public release. Each cancer type may be at a different data availability stage, depending on when its data were created. In the Python cptac package, these three stages are handled as follows:

Internally released data requires a password to download.

Embargoed data is publicly available, but the package prints an embargo statement every time you interact with the data.

Public data is fully released without restrictions.

Downloading data

The cptac package stores the data files for each dataset on a remote server. When you first install cptac, you will have no data files. To install the latest version of the data files for a particular dataset, simply call the cptac.download function, passing the name of your desired dataset for the dataset parameter:

In [3]:

cptac.download(dataset="endometrial")

Out[3]:

True

Exploring the data

Once you've downloaded a dataset, cptac lets you load it into a Python variable, which you then use to access and work with the data. To load a particular dataset into a variable, type the name you want to give the variable, followed by =, and then type cptac. and the name of the dataset in UpperCamelCase followed by a pair of parentheses, e.g. cptac.Endometrial() or cptac.Ccrcc():

In [4]:

en = cptac.Endometrial()

To see what data is available, use the en.list_data() function. This displays the different types of data included in the dataset for this particular cancer type, each stored in a pandas dataframe. It also prints the dimensions of each dataframe.

In [5]:

en.list_data()

Below are the dataframes contained in this dataset and their dimensions:

acetylproteomics
	144 rows
	10862 columns
circular_RNA
	109 rows
	4945 columns
clinical
	144 rows
	27 columns
CNV
	95 rows
	28057 columns
derived_molecular
	144 rows
	125 columns
experimental_design
	144 rows
	26 columns
followup
	396 rows
	49 columns
miRNA
	99 rows
	2337 columns
phosphoproteomics
	144 rows
	73212 columns
proteomics
	144 rows
	10999 columns
somatic_mutation
	52560 rows
	3 columns
somatic_mutation_binary
	95 rows
	51559 columns
transcriptomics
	109 rows
	28057 columns

Molecular Omics

Data can be accessed through several "get" functions. For example, we can look at the proteomics data by using en.get_proteomics(). This returns a pandas dataframe containing the proteomic data. Each column in the proteomics dataframe holds the quantitative measurements for a particular protein. Each row is a tumor or non-tumor sample from a cancer patient.

In [6]:

proteomics = en.get_proteomics()
samples = proteomics.index  # row labels: one per sample
proteins = proteomics.columns  # column labels: one per protein
print("Samples:", samples[0:20].tolist())  # the first twenty samples
print("Proteins:", proteins[0:20].tolist())  # the first twenty proteins

Samples: ['C3L-00006', 'C3L-00008', 'C3L-00032', 'C3L-00090', 'C3L-00098', 'C3L-00136', 'C3L-00137', 'C3L-00139', 'C3L-00143', 'C3L-00145', 'C3L-00156', 'C3L-00161', 'C3L-00358', 'C3L-00361', 'C3L-00362', 'C3L-00413', 'C3L-00449', 'C3L-00563', 'C3L-00586', 'C3L-00601']
Proteins: ['A1BG', 'A2M', 'A2ML1', 'A4GALT', 'AAAS', 'AACS', 'AADAT', 'AAED1', 'AAGAB', 'AAK1', 'AAMDC', 'AAMP', 'AAR2', 'AARS', 'AARS2', 'AARSD1', 'AASDHPPT', 'AASS', 'AATF', 'ABAT']

Dataframe values

Values in the dataframe are protein abundance values. A value of "NaN" means that the sample had no measurement for that protein. For the endometrial CPTAC proteomics data, a TMT-reference channel strategy was used. A detailed description of this strategy can be found at Nature Protocols and also at PubMed Central. This strategy expresses each sample's abundance as a ratio to a pooled reference, and the ratio is then log transformed. Positive values therefore indicate a measurement higher than the pooled reference, and negative values indicate a measurement lower than the pooled reference.
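
To make the sign convention concrete, here is a minimal sketch of a log-transformed ratio to a pooled reference. The text above only says the ratio is "log transformed", so the use of base 2 is an assumption, and the raw abundances and the log_ratio helper are hypothetical values and names chosen for illustration; none of this is part of the cptac package.

```python
import math


def log_ratio(sample_abundance, reference_abundance):
    """Return log2(sample / pooled reference). Base 2 is an assumption."""
    return math.log2(sample_abundance / reference_abundance)


# Hypothetical raw abundances, chosen only to show the sign convention.
reference = 1000.0
higher = log_ratio(2000.0, reference)  # sample is twice the reference
equal = log_ratio(1000.0, reference)   # sample matches the reference
lower = log_ratio(500.0, reference)    # sample is half the reference

print(higher, equal, lower)  # 1.0 0.0 -1.0
```

Under this assumption, a value of 1 in the dataframe would mean twice the pooled reference, and a value of -1 would mean half of it.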

In [7]:

proteomics.head()

Out[7]:

Name	A1BG	A2M	A2ML1	A4GALT	AAAS	AACS	AADAT	AAED1	AAGAB	AAK1	...	ZSWIM8	ZSWIM9	ZW10	ZWILCH	ZWINT	ZXDC	ZYG11B	ZYX	ZZEF1	ZZZ3
Patient_ID
C3L-00006	-1.180	-0.8630	-0.802	0.222	0.2560	0.6650	1.2800	-0.3390	0.412	-0.664	...	-0.08770	NaN	0.0229	0.1090	NaN	-0.332	-0.43300	-1.020	-0.1230	-0.0859
C3L-00008	-0.685	-1.0700	-0.684	0.984	0.1350	0.3340	1.3000	0.1390	1.330	-0.367	...	-0.03560	NaN	0.3630	1.0700	0.737	-0.564	-0.00461	-1.130	-0.0757	-0.4730
C3L-00032	-0.528	-1.3200	0.435	NaN	-0.2400	1.0400	-0.0213	-0.0479	0.419	-0.500	...	0.00112	-0.1450	0.0105	-0.1160	NaN	0.151	-0.07400	-0.540	0.3200	-0.4190
C3L-00090	-1.670	-1.1900	-0.443	0.243	-0.0993	0.7570	0.7400	-0.9290	0.229	-0.223	...	0.07250	-0.0552	-0.0714	0.0933	0.156	-0.398	-0.07520	-0.797	-0.0301	-0.4670
C3L-00098	-0.374	-0.0206	-0.537	0.311	0.3750	0.0131	-1.1000	NaN	0.565	-0.101	...	-0.17600	NaN	-1.2200	-0.5620	0.937	-0.646	0.20700	-1.850	-0.1760	0.0513

5 rows × 10999 columns
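
Because missing measurements appear as NaN, it is often useful to count them before any analysis. The sketch below rebuilds four columns of the output above as a small standalone dataframe (so it runs without downloading any data) and counts the missing values per protein using standard pandas calls. The variable names are our own.

```python
import pandas as pd

nan = float("nan")

# Four columns copied from the proteomics.head() output above.
toy_proteomics = pd.DataFrame(
    {
        "A1BG": [-1.180, -0.685, -0.528, -1.670, -0.374],
        "A4GALT": [0.222, 0.984, nan, 0.243, 0.311],
        "AAED1": [-0.339, 0.139, -0.0479, -0.929, nan],
        "ZSWIM9": [nan, nan, -0.145, -0.0552, nan],
    },
    index=pd.Index(
        ["C3L-00006", "C3L-00008", "C3L-00032", "C3L-00090", "C3L-00098"],
        name="Patient_ID",
    ),
)

# Number of samples with no measurement, for each protein.
missing_per_protein = toy_proteomics.isna().sum()

# Keep only the proteins that were measured in every sample.
complete_proteins = toy_proteomics.dropna(axis="columns")

print(missing_per_protein)
print(list(complete_proteins.columns))
```

The same two calls work on the full dataframe returned by en.get_proteomics().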

As seen in en.list_data(), other omics data are also available (e.g. transcriptomics, copy number variation, phosphoproteomics).

The transcriptomics data looks almost identical to the proteomics data and is available in a pandas dataframe with the same convention. Sample identifiers are consistent across dataframes: a Patient_ID in the endometrial proteomics data refers to the same sample in every other endometrial dataframe. Not every dataframe covers every sample, however, as the differing row counts in the en.list_data() output show.

In [8]:

transcriptomics = en.get_transcriptomics()
transcriptomics.head()

Out[8]:

Name	A1BG	A1BG-AS1	A1CF	A2M	A2M-AS1	A2ML1	A2MP1	A3GALT2	A4GALT	A4GNT	...	ZWILCH	ZWINT	ZXDA	ZXDB	ZXDC	ZYG11A	ZYG11B	ZYX	ZZEF1	ZZZ3
Patient_ID
C3L-00006	4.02	2.16	3.27	13.39	5.88	6.79	1.55	0.97	10.34	1.96	...	11.06	10.73	8.40	9.78	10.88	5.93	11.52	10.23	11.50	11.47
C3L-00008	4.81	2.21	4.86	13.24	5.93	6.33	0.93	0.00	10.83	0.00	...	10.87	11.43	8.39	9.14	10.38	7.25	11.64	10.64	11.26	11.57
C3L-00032	6.24	6.43	3.68	14.32	6.53	9.42	2.79	0.00	10.98	2.13	...	10.06	10.13	8.35	9.27	10.46	6.85	11.60	10.21	11.51	11.09
C3L-00090	5.31	4.87	5.59	13.77	6.35	4.22	2.97	0.00	8.68	1.98	...	10.29	10.41	9.10	9.59	10.15	7.89	11.90	10.21	11.34	11.51
C3L-00098	9.84	8.83	7.00	13.12	6.49	6.83	1.80	0.00	11.42	3.28	...	10.36	11.24	8.60	9.44	11.80	9.32	11.97	9.77	11.37	12.35

5 rows × 28057 columns
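
Since the dataframes share the Patient_ID index but can differ in row count (144 rows for proteomics versus 109 for transcriptomics in the en.list_data() output), a quick way to find the samples two dataframes have in common is to intersect their indices. The sketch below uses small standalone dataframes. The A1BG values for the first two samples come from the outputs above, while the third sample, TOY-00003, is hypothetical and exists only to create a mismatch.

```python
import pandas as pd

toy_proteomics = pd.DataFrame(
    {"A1BG": [-1.180, -0.685, 0.250]},
    index=pd.Index(["C3L-00006", "C3L-00008", "TOY-00003"], name="Patient_ID"),
)
toy_transcriptomics = pd.DataFrame(
    {"A1BG": [4.02, 4.81]},
    index=pd.Index(["C3L-00006", "C3L-00008"], name="Patient_ID"),
)

# Samples present in both dataframes.
shared = toy_proteomics.index.intersection(toy_transcriptomics.index)

# Samples with proteomics but no transcriptomics.
proteomics_only = toy_proteomics.index.difference(toy_transcriptomics.index)

print(list(shared))
print(list(proteomics_only))
```

With real data, the same calls apply to the indices of en.get_proteomics() and en.get_transcriptomics().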

Clinical Data

The clinical dataframe lists clinical information for the patient associated with each sample (e.g. age, race, diabetes status, tumor size).

In [9]:

clinical = en.get_clinical()
clinical.head()

Out[9]:

Name	Sample_ID	Sample_Tumor_Normal	Proteomics_Tumor_Normal	Country	Histologic_Grade_FIGO	Myometrial_invasion_Specify	Histologic_type	Treatment_naive	Tumor_purity	Path_Stage_Primary_Tumor-pT	...	Age	Diabetes	Race	Ethnicity	Gender	Tumor_Site	Tumor_Site_Other	Tumor_Focality	Tumor_Size_cm	Num_full_term_pregnancies
Patient_ID
C3L-00006	S001	Tumor	Tumor	United States	FIGO grade 1	under 50 %	Endometrioid	YES	Normal	pT1a (FIGO IA)	...	64.0	No	White	Not-Hispanic or Latino	Female	Anterior endometrium	NaN	Unifocal	2.9	1
C3L-00008	S002	Tumor	Tumor	United States	FIGO grade 1	under 50 %	Endometrioid	YES	Normal	pT1a (FIGO IA)	...	58.0	No	White	Not-Hispanic or Latino	Female	Posterior endometrium	NaN	Unifocal	3.5	1
C3L-00032	S003	Tumor	Tumor	United States	FIGO grade 2	under 50 %	Endometrioid	YES	Normal	pT1a (FIGO IA)	...	50.0	Yes	White	Not-Hispanic or Latino	Female	Other, specify	Anterior and Posterior endometrium	Unifocal	4.5	4 or more
C3L-00090	S005	Tumor	Tumor	United States	FIGO grade 2	under 50 %	Endometrioid	YES	Normal	pT1a (FIGO IA)	...	75.0	No	White	Not-Hispanic or Latino	Female	Other, specify	Anterior and Posterior endometrium	Unifocal	3.5	4 or more
C3L-00098	S006	Tumor	Tumor	United States	NaN	under 50 %	Serous	YES	Normal	pT1a (FIGO IA)	...	63.0	No	White	Not-Hispanic or Latino	Female	Other, specify	Anterior and Posterior endometrium	Unifocal	6.0	2

5 rows × 27 columns

In addition to donating a tumor sample, some patients also had a normal sample taken for control and comparison. We can identify these samples by looking for rows marked "Normal" in the "Sample_Tumor_Normal" column. Their Patient IDs match those of the corresponding tumor samples, but with ".N" appended. For example, patient C3L-00006 provided both a tumor sample (marked C3L-00006) and a normal sample (marked C3L-00006.N). Note that the normal samples do not have many values in the clinical columns, because much of the information does not apply to non-tumor samples. Additionally, in cases where a column would have identical values for tumor and normal samples from the same patient (e.g., patient age and gender), the information is recorded only for the tumor sample.

In [10]:

clinical.loc[["C3L-00006", "C3L-00361", "C3L-01246", "C3L-00006.N", "C3L-00361.N", "C3L-01246.N"]]

Out[10]:

Name	Sample_ID	Sample_Tumor_Normal	Proteomics_Tumor_Normal	Country	Histologic_Grade_FIGO	Myometrial_invasion_Specify	Histologic_type	Treatment_naive	Tumor_purity	Path_Stage_Primary_Tumor-pT	...	Age	Diabetes	Race	Ethnicity	Gender	Tumor_Site	Tumor_Site_Other	Tumor_Focality	Tumor_Size_cm	Num_full_term_pregnancies
Patient_ID
C3L-00006	S001	Tumor	Tumor	United States	FIGO grade 1	under 50 %	Endometrioid	YES	Normal	pT1a (FIGO IA)	...	64.0	No	White	Not-Hispanic or Latino	Female	Anterior endometrium	NaN	Unifocal	2.9	1
C3L-00361	S017	Tumor	Tumor	United States	FIGO grade 1	Not identified	Endometrioid	YES	Normal	pT1a (FIGO IA)	...	64.0	Yes	White	Not-Hispanic or Latino	Female	Anterior endometrium	NaN	Unifocal	2.7	None
C3L-01246	S042	Tumor	Tumor	Other_specify	NaN	under 50 %	Serous	YES	Normal	pT1a (FIGO IA)	...	62.0	No	White	Not reported	Female	Posterior endometrium	NaN	Unifocal	2.3	1
C3L-00006.N	S105	Normal	Adjacent_normal	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
C3L-00361.N	S106	Normal	Adjacent_normal	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
C3L-01246.N	S114	Normal	Adjacent_normal	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

6 rows × 27 columns
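
The selection described above can be sketched with ordinary pandas filtering. The example rebuilds a few rows and columns of the clinical output as a standalone dataframe, selects the normal samples, and then looks up each patient's age from the matching tumor row, where that information is recorded. Stripping the ".N" suffix with a regular expression is our own approach rather than a cptac feature.

```python
import pandas as pd

nan = float("nan")

# A few rows and columns copied from the clinical output above.
toy_clinical = pd.DataFrame(
    {
        "Sample_ID": ["S001", "S017", "S105", "S106"],
        "Sample_Tumor_Normal": ["Tumor", "Tumor", "Normal", "Normal"],
        "Age": [64.0, 64.0, nan, nan],
    },
    index=pd.Index(
        ["C3L-00006", "C3L-00361", "C3L-00006.N", "C3L-00361.N"],
        name="Patient_ID",
    ),
)

# Rows for normal samples only.
normal = toy_clinical[toy_clinical["Sample_Tumor_Normal"] == "Normal"]

# Drop the trailing ".N" to recover the ID of the matching tumor sample.
tumor_ids = normal.index.str.replace(r"\.N$", "", regex=True)

# Age is recorded on the tumor row, so look it up there.
tumor_ages = toy_clinical.loc[tumor_ids, "Age"]

print(list(normal.index))
print(tumor_ages.tolist())
```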

Mutation data

Each cancer dataset contains mutation data for the cohort. The data consists of all somatic mutations found for each sample, so each sample typically spans many rows. Each row lists the gene that was mutated, the type of mutation, and the location of the mutation. This data is a direct import of a Mutation Annotation Format (MAF) file.

In [11]:

somatic_mutations = en.get_somatic_mutation()
somatic_mutations.head()

Out[11]:

Name	Gene	Mutation	Location
Patient_ID
C3L-00006	AAK1	Missense_Mutation	p.A592V
C3L-00006	AANAT	Missense_Mutation	p.R176W
C3L-00006	ABCA12	Frame_Shift_Del	p.N1671Ifs*4
C3L-00006	ABCC4	Missense_Mutation	p.R691H
C3L-00006	ABL1	Missense_Mutation	p.G273R
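
Because the mutation table has one row per mutation rather than one row per sample, summaries usually involve grouping or filtering. The sketch below rebuilds the five rows shown above as a standalone dataframe and demonstrates three common questions using standard pandas calls: how many mutations a sample has, how often each mutation type occurs, and which rows involve a given gene.

```python
import pandas as pd

# The five rows from the somatic_mutations.head() output above.
toy_mutations = pd.DataFrame(
    {
        "Gene": ["AAK1", "AANAT", "ABCA12", "ABCC4", "ABL1"],
        "Mutation": [
            "Missense_Mutation",
            "Missense_Mutation",
            "Frame_Shift_Del",
            "Missense_Mutation",
            "Missense_Mutation",
        ],
        "Location": ["p.A592V", "p.R176W", "p.N1671Ifs*4", "p.R691H", "p.G273R"],
    },
    index=pd.Index(["C3L-00006"] * 5, name="Patient_ID"),
)

# Number of mutation rows per sample.
mutations_per_sample = toy_mutations.groupby(level="Patient_ID").size()

# How often each mutation type occurs.
mutation_type_counts = toy_mutations["Mutation"].value_counts()

# All rows for one gene of interest.
abl1_rows = toy_mutations[toy_mutations["Gene"] == "ABL1"]

print(mutations_per_sample)
print(mutation_type_counts)
print(abl1_rows)
```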

Exporting dataframes

If you wish to export a dataframe to a file, simply call the dataframe's to_csv method, passing the path you wish to save the file to, and the value separator you want:

In [12]:

clinical = en.get_clinical()
clinical.to_csv(path_or_buf="clinical_dataframe.tsv", sep='\t')
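
To load an exported file back into pandas later, pass the same separator to pandas.read_csv and tell it which column holds the index. The round trip below is a sketch on a small standalone dataframe (two rows copied from the clinical output above), written to a temporary directory so that it does not leave files behind. The variable names are our own.

```python
import os
import tempfile

import pandas as pd

toy_clinical = pd.DataFrame(
    {"Sample_ID": ["S001", "S002"], "Age": [64.0, 58.0]},
    index=pd.Index(["C3L-00006", "C3L-00008"], name="Patient_ID"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clinical_dataframe.tsv")

    # Write a tab-separated file, as in the cell above.
    toy_clinical.to_csv(path_or_buf=path, sep="\t")

    # Read it back, restoring Patient_ID as the index.
    reloaded = pd.read_csv(path, sep="\t", index_col="Patient_ID")

print(reloaded)
```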

Getting help with a dataset or function

To view the documentation for a dataset, pass it to the Python help function, e.g. help(en). You can also view the documentation for just a specific function: help(en.join_omics_to_omics).

In [13]:

help(en.join_omics_to_omics)

Help on method join_omics_to_omics in module cptac.dataset:

join_omics_to_omics(df1_name, df2_name, genes1=None, genes2=None, how='outer', quiet=False, tissue_type='both') method of cptac.endometrial.Endometrial instance
    Take specified column(s) from one omics dataframe, and join to specified columns(s) from another omics dataframe. Intersection (inner join) of indices is used.
    
    Parameters:
    df1_name (str): Name of first omics dataframe to select columns from.
    df2_name (str): Name of second omics dataframe to select columns from.
    genes1 (str, or list or array-like of str, optional): Gene(s) for column(s) to select from df1_name. str if one key, list or array-like of str if multiple. Default of None will select entire dataframe.
    genes2 (str, or list or array-like of str, optional): Gene(s) for Column(s) to select from df2_name. str if one key, list or array-like of str if multiple. Default of None will select entire dataframe.
    how (str, optional): How to perform the join, acceptable values are from ['outer', 'inner', 'left', 'right']. Defaults to 'outer'.
    quiet (bool, optional): Whether to warn when inserting NaNs. Defaults to False.
    tissue_type (str): Acceptable values in ["tumor","normal","both"]. Specifies the desired tissue type desired in the dataframe. Defaults to "both".
    
    Returns:
    pandas.DataFrame: The selected columns from the two omics dataframes, joined into one dataframe.
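
The how parameter follows the usual join vocabulary. Note that the help text above mentions an inner join in its summary while the signature defaults to how='outer', so it may be safest to pass how explicitly. The sketch below uses plain pandas on small standalone dataframes to show what the two choices mean. It is an illustration of join semantics only, not the cptac implementation, and the column suffixes are our own naming. C3L-00032 is deliberately left out of the toy transcriptomics table to create a mismatch.

```python
import pandas as pd

toy_proteomics = pd.DataFrame(
    {"A1BG": [-1.180, -0.685, -0.528]},
    index=pd.Index(["C3L-00006", "C3L-00008", "C3L-00032"], name="Patient_ID"),
)
toy_transcriptomics = pd.DataFrame(
    {"A1BG": [4.02, 4.81]},
    index=pd.Index(["C3L-00006", "C3L-00008"], name="Patient_ID"),
)

# Tag each column with its source so the gene names do not collide.
left = toy_proteomics.add_suffix("_proteomics")
right = toy_transcriptomics.add_suffix("_transcriptomics")

# Outer join: keep every sample, filling gaps with NaN.
outer = left.join(right, how="outer")

# Inner join: keep only samples present in both dataframes.
inner = left.join(right, how="inner")

print(outer)
print(inner)
```

The outer join keeps all three samples and inserts a NaN where transcriptomics is missing, while the inner join keeps only the two samples found in both tables.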