cptac
In this tutorial, we provide several examples of how to use the built-in cptac functions for joining different dataframes.
import cptac
cptac.download(dataset="endometrial", version="latest")
en = cptac.Endometrial()
In all of the join functions, you specify the dataframes you want to join by passing their names to the appropriate parameters in the function call. The function will automatically check that the dataframes whose names you provided are valid for the join function, and print an error message if they aren't.
Whenever a column from an -omics dataframe is included in a joined table, the name of the -omics dataframe it came from is appended to the column header (for example, A1BG_proteomics), to avoid confusion.
If you wish to only include particular columns in the join, pass them to the appropriate parameters in the join function. All such parameters will accept either a single column name as a string, or a list of column name strings. In this use case, we will usually only select specific columns for readability, but you could select the whole dataframe in all these cases, except for the mutations dataframe.
The join functions use logic analogous to an SQL INNER JOIN.
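The column-selection and naming conventions described above can be sketched in plain Python. This is only an illustration, not cptac's implementation: `select_and_label` is a hypothetical helper, and the tiny table is a stand-in that reuses a few proteomics values from the output shown further down.

```python
from typing import Dict, List, Optional, Sequence, Union


def select_and_label(
    table: Dict[str, List[float]],
    table_name: str,
    genes: Optional[Union[str, Sequence[str]]] = None,
) -> Dict[str, List[float]]:
    """Hypothetical helper (not part of cptac) mimicking the conventions above."""
    if genes is None:
        chosen = list(table)      # no selection given: keep every column
    elif isinstance(genes, str):
        chosen = [genes]          # a single name behaves like a one-item list
    else:
        chosen = list(genes)
    # Append the source table's name to each header so its origin stays clear.
    return {f"{gene}_{table_name}": table[gene] for gene in chosen}


# Stand-in table: gene name -> values for two samples.
proteomics = {"A1BG": [-1.180, -0.685], "A2M": [-0.8630, -1.0700]}

single = select_and_label(proteomics, "proteomics", genes="A1BG")
several = select_and_label(proteomics, "proteomics", genes=["A1BG", "A2M"])
everything = select_and_label(proteomics, "proteomics")

print(list(single))
print(list(several))
```

Passing nothing for `genes` keeps the whole table, which mirrors how the join functions select an entire dataframe when no columns are named.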
join_omics_to_omics
The join_omics_to_omics function joins two -omics dataframes to each other. Types of -omics data valid for use with this function are acetylproteomics, CNV, phosphoproteomics, phosphoproteomics_gene, proteomics, and transcriptomics.
prot_and_phos = en.join_omics_to_omics(df1_name="proteomics", df2_name="phosphoproteomics")
prot_and_phos.head()
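Name | A1BG_proteomics | A2M_proteomics | A2ML1_proteomics | A4GALT_proteomics | AAAS_proteomics | AACS_proteomics | AADAT_proteomics | AAED1_proteomics | AAGAB_proteomics | AAK1_proteomics | ... | ZZZ3_phosphoproteomics | ZZZ3_phosphoproteomics | ZZZ3_phosphoproteomics | ZZZ3_phosphoproteomics | ZZZ3_phosphoproteomics | ZZZ3_phosphoproteomics | ZZZ3_phosphoproteomics | ZZZ3_phosphoproteomics | ZZZ3_phosphoproteomics | ZZZ3_phosphoproteomics |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Site | | | | | | | | | | | ... | S397 | S411 | S420 | S424 | S426 | S468 | S89 | T415 | T418 | Y399 |
Patient_ID | | | | | | | | | | | | | | | | | | | | | |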
C3L-00006 | -1.180 | -0.8630 | -0.802 | 0.222 | 0.2560 | 0.6650 | 1.2800 | -0.3390 | 0.412 | -0.664 | ... | 0.18400 | NaN | NaN | NaN | -0.20500 | NaN | NaN | NaN | NaN | NaN |
C3L-00008 | -0.685 | -1.0700 | -0.684 | 0.984 | 0.1350 | 0.3340 | 1.3000 | 0.1390 | 1.330 | -0.367 | ... | -0.17100 | NaN | NaN | -0.393 | -0.17100 | NaN | 0.29 | NaN | 0.1605 | -0.0635 |
C3L-00032 | -0.528 | -1.3200 | 0.435 | NaN | -0.2400 | 1.0400 | -0.0213 | -0.0479 | 0.419 | -0.500 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
C3L-00090 | -1.670 | -1.1900 | -0.443 | 0.243 | -0.0993 | 0.7570 | 0.7400 | -0.9290 | 0.229 | -0.223 | ... | 0.13970 | NaN | NaN | NaN | -0.55900 | NaN | NaN | NaN | NaN | 0.2980 |
C3L-00098 | -0.374 | -0.0206 | -0.537 | 0.311 | 0.3750 | 0.0131 | -1.1000 | NaN | 0.565 | -0.101 | ... | -0.15875 | NaN | NaN | 0.196 | 0.06175 | NaN | NaN | NaN | NaN | -0.2900 |
5 rows × 84211 columns
Joining only specific columns. (Note that when a gene is selected from the phosphoproteomics dataframe, data for all sites of the gene are selected. The same is done for acetylproteomics data.)
prot_and_phos_selected = en.join_omics_to_omics(
df1_name="proteomics",
df2_name="phosphoproteomics",
genes1="A1BG",
genes2="PIK3CA")
prot_and_phos_selected.head()
Name | A1BG_proteomics | PIK3CA_phosphoproteomics | PIK3CA_phosphoproteomics |
---|---|---|---|
Site | | S312 | T313 |
Patient_ID | | | |
C3L-00006 | -1.180 | -0.00615 | 0.0731 |
C3L-00008 | -0.685 | -0.02220 | NaN |
C3L-00032 | -0.528 | NaN | 0.0830 |
C3L-00090 | -1.670 | NaN | -0.8460 |
C3L-00098 | -0.374 | 0.43600 | NaN |
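Joined tables that include phosphoproteomics data carry a two-level column header (Name and Site), as in the output above. The following pandas-only sketch shows two common ways to work with such a header. It builds a small stand-in table by hand rather than calling cptac, and it assumes the Site level is an empty string for columns without sites; the real cptac output may represent that gap differently (for example as NaN).

```python
import pandas as pd

# Hand-built stand-in for a joined table with a (Name, Site) column header.
columns = pd.MultiIndex.from_tuples(
    [
        ("A1BG_proteomics", ""),
        ("PIK3CA_phosphoproteomics", "S312"),
        ("PIK3CA_phosphoproteomics", "T313"),
    ],
    names=["Name", "Site"],
)
joined = pd.DataFrame(
    [[-1.180, -0.00615, 0.0731], [-0.685, -0.02220, None]],
    index=pd.Index(["C3L-00006", "C3L-00008"], name="Patient_ID"),
    columns=columns,
)

# Selecting by the top-level name returns every site of that gene.
pik3ca_sites = joined["PIK3CA_phosphoproteomics"]

# Flatten the header into single strings, skipping empty levels.
flat = joined.copy()
flat.columns = ["_".join(part for part in col if part) for col in joined.columns]

print(list(pik3ca_sites.columns))
print(list(flat.columns))
```

Flattening is convenient before exporting or plotting, while keeping the two levels makes it easy to pull all sites of one gene at once.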
join_metadata_to_omics
The join_metadata_to_omics function joins a metadata dataframe (e.g. clinical or derived_molecular) with an -omics dataframe:
clin_and_tran = en.join_metadata_to_omics(metadata_df_name="clinical", omics_df_name="transcriptomics")
clin_and_tran.head()
cptac warning: transcriptomics data was not found for the following samples, so transcriptomics data columns were filled with NaN for these samples: C3L-00563.N, C3L-00605.N, C3L-00769.N, C3L-00770.N, C3L-00771.N, C3L-00930.N, C3L-00947.N, C3L-00963.N, C3L-01246.N, C3L-01249.N, C3L-01252.N, C3L-01256.N, C3L-01257.N, C3L-01744.N, C3N-00200.N, C3N-00729.N, C3N-01211.N, NX1.N, NX10.N, NX11.N, NX12.N, NX13.N, NX14.N, NX15.N, NX16.N, NX17.N, NX18.N, NX2.N, NX3.N, NX4.N, NX5.N, NX6.N, NX7.N, NX8.N, NX9.N (<ipython-input-4-d1e2c09aae48>, line 1)
Name | Sample_ID | Sample_Tumor_Normal | Proteomics_Tumor_Normal | Country | Histologic_Grade_FIGO | Myometrial_invasion_Specify | Histologic_type | Treatment_naive | Tumor_purity | Path_Stage_Primary_Tumor-pT | ... | ZWILCH_transcriptomics | ZWINT_transcriptomics | ZXDA_transcriptomics | ZXDB_transcriptomics | ZXDC_transcriptomics | ZYG11A_transcriptomics | ZYG11B_transcriptomics | ZYX_transcriptomics | ZZEF1_transcriptomics | ZZZ3_transcriptomics |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Patient_ID | |||||||||||||||||||||
C3L-00006 | S001 | Tumor | Tumor | United States | FIGO grade 1 | under 50 % | Endometrioid | YES | Normal | pT1a (FIGO IA) | ... | 11.06 | 10.73 | 8.40 | 9.78 | 10.88 | 5.93 | 11.52 | 10.23 | 11.50 | 11.47 |
C3L-00008 | S002 | Tumor | Tumor | United States | FIGO grade 1 | under 50 % | Endometrioid | YES | Normal | pT1a (FIGO IA) | ... | 10.87 | 11.43 | 8.39 | 9.14 | 10.38 | 7.25 | 11.64 | 10.64 | 11.26 | 11.57 |
C3L-00032 | S003 | Tumor | Tumor | United States | FIGO grade 2 | under 50 % | Endometrioid | YES | Normal | pT1a (FIGO IA) | ... | 10.06 | 10.13 | 8.35 | 9.27 | 10.46 | 6.85 | 11.60 | 10.21 | 11.51 | 11.09 |
C3L-00090 | S005 | Tumor | Tumor | United States | FIGO grade 2 | under 50 % | Endometrioid | YES | Normal | pT1a (FIGO IA) | ... | 10.29 | 10.41 | 9.10 | 9.59 | 10.15 | 7.89 | 11.90 | 10.21 | 11.34 | 11.51 |
C3L-00098 | S006 | Tumor | Tumor | United States | NaN | under 50 % | Serous | YES | Normal | pT1a (FIGO IA) | ... | 10.36 | 11.24 | 8.60 | 9.44 | 11.80 | 9.32 | 11.97 | 9.77 | 11.37 | 12.35 |
5 rows × 28084 columns
Joining only specific columns:
clin_and_tran = en.join_metadata_to_omics(
metadata_df_name="clinical",
omics_df_name="transcriptomics",
metadata_cols = ["Age", "Histologic_type"],
omics_genes="ZZZ3")
clin_and_tran.head()
cptac warning: transcriptomics data was not found for the following samples, so transcriptomics data columns were filled with NaN for these samples: C3L-00563.N, C3L-00605.N, C3L-00769.N, C3L-00770.N, C3L-00771.N, C3L-00930.N, C3L-00947.N, C3L-00963.N, C3L-01246.N, C3L-01249.N, C3L-01252.N, C3L-01256.N, C3L-01257.N, C3L-01744.N, C3N-00200.N, C3N-00729.N, C3N-01211.N, NX1.N, NX10.N, NX11.N, NX12.N, NX13.N, NX14.N, NX15.N, NX16.N, NX17.N, NX18.N, NX2.N, NX3.N, NX4.N, NX5.N, NX6.N, NX7.N, NX8.N, NX9.N (<ipython-input-5-14d24d9173cd>, line 1)
Name | Age | Histologic_type | ZZZ3_transcriptomics |
---|---|---|---|
Patient_ID | |||
C3L-00006 | 64.0 | Endometrioid | 11.47 |
C3L-00008 | 58.0 | Endometrioid | 11.57 |
C3L-00032 | 50.0 | Endometrioid | 11.09 |
C3L-00090 | 75.0 | Endometrioid | 11.51 |
C3L-00098 | 63.0 | Serous | 12.35 |
join_metadata_to_metadata
The join_metadata_to_metadata function joins two metadata dataframes (e.g. clinical or derived_molecular) to each other. In the example below, we pass a column name to select from the clinical dataframe and omit the cols2 parameter. Because cols2 defaults to None, the entire derived_molecular dataframe is selected.
hist_and_derived_molecular = en.join_metadata_to_metadata(
df1_name="clinical",
df2_name="derived_molecular",
cols1="Histologic_type") # Note that we can omit the cols2 parameter, and it will by default select all of df2.
# We could have also omitted cols1, if we wanted to select all of df1.
hist_and_derived_molecular.head()
Name | Histologic_type | Estrogen_Receptor | Estrogen_Receptor_% | Progesterone_Receptor | Progesterone_Receptor_% | MLH1 | MLH2 | MSH6 | PMS2 | p53 | ... | Log2_variant_total | Log2_SNP_total | Log2_INDEL_total | Genomics_subtype | Mutation_signature_C>A | Mutation_signature_C>G | Mutation_signature_C>T | Mutation_signature_T>C | Mutation_signature_T>A | Mutation_signature_T>G |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Patient_ID | |||||||||||||||||||||
C3L-00006 | Endometrioid | Cannot be determined | NaN | Cannot be determined | NaN | Intact nuclear expression | Intact nuclear expression | Loss of nuclear expression | Intact nuclear expression | Cannot be determined | ... | 10.062046 | 9.984418 | 5.832890 | MSI-H | 8.300395 | 1.482213 | 72.529644 | 14.426877 | 1.383399 | 1.877470 |
C3L-00008 | Endometrioid | Cannot be determined | NaN | Cannot be determined | NaN | Intact nuclear expression | Intact nuclear expression | Intact nuclear expression | Loss of nuclear expression | Cannot be determined | ... | 8.861087 | 8.330917 | 7.169925 | MSI-H | 14.641745 | 2.803738 | 64.485981 | 15.264798 | 0.934579 | 1.869159 |
C3L-00032 | Endometrioid | Cannot be determined | NaN | Cannot be determined | NaN | Intact nuclear expression | Intact nuclear expression | Intact nuclear expression | Intact nuclear expression | Cannot be determined | ... | 5.321928 | 5.000000 | 3.169925 | CNV_low | 16.129032 | 3.225806 | 70.967742 | 3.225806 | 3.225806 | 3.225806 |
C3L-00090 | Endometrioid | Cannot be determined | NaN | Cannot be determined | NaN | Intact nuclear expression | Intact nuclear expression | Intact nuclear expression | Intact nuclear expression | Cannot be determined | ... | 5.672425 | 5.523562 | 2.584963 | CNV_low | 17.777778 | 8.888889 | 62.222222 | 8.888889 | 2.222222 | 0.000000 |
C3L-00098 | Serous | Cannot be determined | NaN | Cannot be determined | NaN | Intact nuclear expression | Intact nuclear expression | Intact nuclear expression | Intact nuclear expression | Normal | ... | 6.108524 | 5.954196 | 3.000000 | CNV_high | 9.836066 | 13.114754 | 62.295082 | 3.278689 | 8.196721 | 3.278689 |
5 rows × 126 columns
join_omics_to_mutations
The join_omics_to_mutations function joins an -omics dataframe with the mutation data for a specified gene or genes. Because there may be multiple mutations for one gene in a single sample, the mutation type and location data are returned in lists by default, even if there is only one mutation. If there is no mutation for the gene in a particular sample, the list contains either "Wildtype_Tumor" or "Wildtype_Normal", depending on whether it's a tumor or normal sample. The mutation status column contains either "Single_mutation", "Multiple_mutation", "Wildtype_Tumor", or "Wildtype_Normal", for help with parsing.
(Note: You can hide the Location columns by passing False to the optional show_location parameter.)
selected_prot_and_PTEN_mut = en.join_omics_to_mutations(
omics_df_name="proteomics",
mutations_genes="PTEN",
omics_genes=["AURKA", "TP53"])
selected_prot_and_PTEN_mut.head(10)
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 69 samples for the PTEN gene (<ipython-input-7-1af2d743858d>, line 1)
Name | AURKA_proteomics | TP53_proteomics | PTEN_Mutation | PTEN_Location | PTEN_Mutation_Status | Sample_Status |
---|---|---|---|---|---|---|
Patient_ID | ||||||
C3L-00006 | NaN | 0.2950 | [Missense_Mutation, Nonsense_Mutation] | [p.R130Q, p.R233*] | Multiple_mutation | Tumor |
C3L-00008 | 0.31100 | 0.2770 | [Missense_Mutation] | [p.G127R] | Single_mutation | Tumor |
C3L-00032 | NaN | -0.8710 | [Nonsense_Mutation] | [p.W111*] | Single_mutation | Tumor |
C3L-00090 | -0.79800 | -0.3430 | [Missense_Mutation] | [p.R130G] | Single_mutation | Tumor |
C3L-00098 | 3.11000 | 3.0100 | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | Tumor |
C3L-00136 | -1.65000 | -0.1480 | [Missense_Mutation, Missense_Mutation] | [p.Y68C, p.R130G] | Multiple_mutation | Tumor |
C3L-00137 | NaN | 0.4410 | [Frame_Shift_Ins, Nonsense_Mutation] | [p.H118Qfs*8, p.Y180*] | Multiple_mutation | Tumor |
C3L-00139 | 0.84800 | -1.2200 | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | Tumor |
C3L-00143 | -1.73000 | -0.0825 | [Missense_Mutation] | [p.R130G] | Single_mutation | Tumor |
C3L-00145 | -0.00513 | -0.1810 | [Missense_Mutation, Frame_Shift_Ins] | [p.H93R, p.E242*] | Multiple_mutation | Tumor |
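The way the three mutation columns relate to each other can be sketched in plain Python. `summarize_mutations` is a hypothetical helper written for illustration, not a cptac function; it simply reproduces the pattern visible in the table above, including the "No_mutation" placeholder in the location column.

```python
from typing import List, Tuple

Mutation = Tuple[str, str]  # (mutation type, protein location)


def summarize_mutations(mutations: List[Mutation], is_tumor: bool = True):
    """Hypothetical helper: build the three mutation cells for one sample."""
    if not mutations:
        # No mutation recorded: fill with the wildtype label for this sample type.
        wildtype = "Wildtype_Tumor" if is_tumor else "Wildtype_Normal"
        return [wildtype], ["No_mutation"], wildtype
    types = [mutation_type for mutation_type, _ in mutations]
    locations = [location for _, location in mutations]
    # Types and locations stay in lists even when there is only one mutation.
    status = "Single_mutation" if len(mutations) == 1 else "Multiple_mutation"
    return types, locations, status


two_hits = summarize_mutations(
    [("Missense_Mutation", "p.R130Q"), ("Nonsense_Mutation", "p.R233*")]
)
one_hit = summarize_mutations([("Missense_Mutation", "p.G127R")])
no_hit = summarize_mutations([])

print(two_hits)
print(one_hit)
print(no_hit)
```

The status value is what makes the list columns easy to parse: you can branch on it before deciding whether to look at one list entry or several.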
The function has the ability to filter multiple mutations down to just one mutation. It allows you to specify particular mutation types or locations to prioritize, and also provides a default sorting hierarchy for all other mutations. The default hierarchy chooses truncation mutations over missense mutations, and silent mutations last of all. If there are multiple mutations of the same type, it chooses the mutation occurring earlier in the sequence.
To filter all mutations based on this default hierarchy, simply pass an empty list to the optional mutations_filter parameter. Notice how in sample C3L-00006, the nonsense mutation was chosen over the missense mutation, because it's a type of truncation mutation, even though the missense mutation occurs earlier in the peptide sequence. In sample C3L-00137, both mutations were types of truncation mutations, so the function just chose the earlier one.
PTEN_default_filter = en.join_omics_to_mutations(omics_df_name="proteomics", mutations_genes="PTEN",
omics_genes=["AURKA", "TP53"],
mutations_filter=[])
PTEN_default_filter.loc[["C3L-00006", "C3L-00137"]]
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 69 samples for the PTEN gene (<ipython-input-8-0692bb2a0dd6>, line 1)
Name | AURKA_proteomics | TP53_proteomics | PTEN_Mutation | PTEN_Location | PTEN_Mutation_Status | Sample_Status |
---|---|---|---|---|---|---|
Patient_ID | ||||||
C3L-00006 | NaN | 0.295 | Nonsense_Mutation | p.R233* | Multiple_mutation | Tumor |
C3L-00137 | NaN | 0.441 | Frame_Shift_Ins | p.H118Qfs*8 | Multiple_mutation | Tumor |
To prioritize a particular type of mutation, or a particular location, include it in the mutations_filter list. Below, we tell the function to prioritize nonsense mutations over all other mutations. Notice how in sample C3L-00137, the nonsense mutation is now selected instead of the frameshift insertion, even though the nonsense mutation occurs later in the peptide sequence.
PTEN_simple_filter = en.join_omics_to_mutations(omics_df_name="proteomics", mutations_genes="PTEN",
omics_genes=["AURKA", "TP53"],
mutations_filter=["Nonsense_Mutation"])
PTEN_simple_filter.loc[["C3L-00006", "C3L-00137"]]
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 69 samples for the PTEN gene (<ipython-input-9-04fd0c2b203d>, line 1)
Name | AURKA_proteomics | TP53_proteomics | PTEN_Mutation | PTEN_Location | PTEN_Mutation_Status | Sample_Status |
---|---|---|---|---|---|---|
Patient_ID | ||||||
C3L-00006 | NaN | 0.295 | Nonsense_Mutation | p.R233* | Multiple_mutation | Tumor |
C3L-00137 | NaN | 0.441 | Nonsense_Mutation | p.Y180* | Multiple_mutation | Tumor |
You can include multiple mutation types and/or locations in the mutations_filter list. Values earlier in the list are prioritized over values later in the list. For example, with the filter we specify below, the function selects sample C3L-00006's missense mutation over its nonsense mutation, because the location of that missense mutation is the first value in our filter list. Nonsense_Mutation is still in the filter list, but it comes after that location, so the missense mutation wins for C3L-00006. For all other samples, unless they also have a mutation at that same location, the function continues to prioritize nonsense mutations, as we see in sample C3L-00137.
PTEN_complex_filter = en.join_omics_to_mutations(omics_df_name="proteomics", mutations_genes="PTEN",
omics_genes=["AURKA", "TP53"],
mutations_filter=["p.R130Q", "Nonsense_Mutation"])
PTEN_complex_filter.loc[["C3L-00006", "C3L-00137"]]
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 69 samples for the PTEN gene (<ipython-input-10-9d937b1d89b9>, line 1)
Name | AURKA_proteomics | TP53_proteomics | PTEN_Mutation | PTEN_Location | PTEN_Mutation_Status | Sample_Status |
---|---|---|---|---|---|---|
Patient_ID | ||||||
C3L-00006 | NaN | 0.295 | Missense_Mutation | p.R130Q | Multiple_mutation | Tumor |
C3L-00137 | NaN | 0.441 | Nonsense_Mutation | p.Y180* | Multiple_mutation | Tumor |
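The prioritization rules walked through above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not cptac's actual implementation: `pick_mutation`, `type_rank`, and `position` are hypothetical helpers, the groupings of mutation types are assumed for the example (cptac's real lists may contain more types), and placing unlisted types between missense and silent is also an assumption.

```python
import re
from typing import List, Sequence, Tuple

Mutation = Tuple[str, str]  # (mutation type, protein location)

# Assumed groupings for illustration only; cptac's real lists may differ.
TRUNCATIONS = {"Nonsense_Mutation", "Frame_Shift_Ins", "Frame_Shift_Del"}
MISSENSE = {"Missense_Mutation"}
SILENT = {"Silent"}


def type_rank(mutation_type: str) -> int:
    """Lower rank wins: truncation, missense, anything else, silent last."""
    if mutation_type in TRUNCATIONS:
        return 0
    if mutation_type in MISSENSE:
        return 1
    if mutation_type in SILENT:
        return 3
    return 2


def position(location: str) -> float:
    """First number in a location string such as 'p.R233*' (here 233)."""
    match = re.search(r"\d+", location)
    return int(match.group()) if match else float("inf")


def pick_mutation(
    mutations: List[Mutation], mutations_filter: Sequence[str] = ()
) -> Mutation:
    # Values earlier in the filter take priority over later ones.
    for wanted in mutations_filter:
        matches = [m for m in mutations if wanted in m]
        if matches:
            return min(matches, key=lambda m: position(m[1]))
    # Nothing in the filter matched: fall back to the default hierarchy,
    # breaking ties by the earlier position in the sequence.
    return min(mutations, key=lambda m: (type_rank(m[0]), position(m[1])))


c3l_00006 = [("Missense_Mutation", "p.R130Q"), ("Nonsense_Mutation", "p.R233*")]
c3l_00137 = [("Frame_Shift_Ins", "p.H118Qfs*8"), ("Nonsense_Mutation", "p.Y180*")]

print(pick_mutation(c3l_00006))
print(pick_mutation(c3l_00137, ["Nonsense_Mutation"]))
print(pick_mutation(c3l_00006, ["p.R130Q", "Nonsense_Mutation"]))
```

With these assumptions, the sketch picks the same mutation as each of the three filtered tables above for samples C3L-00006 and C3L-00137.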
join_metadata_to_mutations
The join_metadata_to_mutations function works exactly like join_omics_to_mutations, except that it works with metadata dataframes (e.g. clinical and derived_molecular) instead of -omics dataframes. It can also filter multiple mutations, which you control through the mutations_filter parameter, and it can hide the location columns.
hist_and_PTEN = en.join_metadata_to_mutations(
metadata_df_name="clinical",
mutations_genes="PTEN",
metadata_cols="Histologic_type")
hist_and_PTEN.head()
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 69 samples for the PTEN gene (<ipython-input-11-bbbd4b25af77>, line 1)
Name | Histologic_type | PTEN_Mutation | PTEN_Location | PTEN_Mutation_Status | Sample_Status |
---|---|---|---|---|---|
Patient_ID | |||||
C3L-00006 | Endometrioid | [Missense_Mutation, Nonsense_Mutation] | [p.R130Q, p.R233*] | Multiple_mutation | Tumor |
C3L-00008 | Endometrioid | [Missense_Mutation] | [p.G127R] | Single_mutation | Tumor |
C3L-00032 | Endometrioid | [Nonsense_Mutation] | [p.W111*] | Single_mutation | Tumor |
C3L-00090 | Endometrioid | [Missense_Mutation] | [p.R130G] | Single_mutation | Tumor |
C3L-00098 | Serous | [Wildtype_Tumor] | [No_mutation] | Wildtype_Tumor | Tumor |
With multiple mutations filtered:
hist_and_PTEN = en.join_metadata_to_mutations(
metadata_df_name="clinical",
mutations_genes="PTEN",
metadata_cols="Histologic_type",
mutations_filter=["Nonsense_Mutation"])
hist_and_PTEN.head()
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 69 samples for the PTEN gene (<ipython-input-12-274c163a987c>, line 1)
Name | Histologic_type | PTEN_Mutation | PTEN_Location | PTEN_Mutation_Status | Sample_Status |
---|---|---|---|---|---|
Patient_ID | |||||
C3L-00006 | Endometrioid | Nonsense_Mutation | p.R233* | Multiple_mutation | Tumor |
C3L-00008 | Endometrioid | Missense_Mutation | p.G127R | Single_mutation | Tumor |
C3L-00032 | Endometrioid | Nonsense_Mutation | p.W111* | Single_mutation | Tumor |
C3L-00090 | Endometrioid | Missense_Mutation | p.R130G | Single_mutation | Tumor |
C3L-00098 | Serous | Wildtype_Tumor | No_mutation | Wildtype_Tumor | Tumor |
If you wish to export a dataframe to a file, simply call the dataframe's to_csv method, passing the path you wish to save the file to, and the value separator you want:
hist_and_PTEN.to_csv(path_or_buf="histologic_type_and_PTEN_mutation.tsv", sep='\t')
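To load an exported table back into pandas, read it with the same separator and tell pandas which column holds the index. The sketch below uses a small hand-built stand-in table and an in-memory buffer instead of a file on disk, so it runs without cptac; with a real file you would pass the file path in place of the buffer.

```python
import io

import pandas as pd

# Stand-in for a filtered joined table, using values from the output above.
table = pd.DataFrame(
    {
        "Histologic_type": ["Endometrioid", "Serous"],
        "PTEN_Mutation": ["Nonsense_Mutation", "Wildtype_Tumor"],
    },
    index=pd.Index(["C3L-00006", "C3L-00098"], name="Patient_ID"),
)

# Write tab-separated text to an in-memory buffer (a file path works the same way).
buffer = io.StringIO()
table.to_csv(buffer, sep="\t")

# Read it back, restoring the first column as the index.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t", index_col=0)

print(restored)
```

Keep in mind that list-valued cells, such as the unfiltered mutation columns, are written out as their text representation, so they come back as plain strings rather than lists.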