Tutorial 3: Joining dataframes with `cptac`¶

In this tutorial, we provide several examples of how to use the built-in cptac functions for joining different dataframes.

We will use the endometrial carcinoma (UCEC) dataset. First we import the package and create an endometrial data object, which we call 'en'. Calling list_data_sources then shows which sources are available for each data type.

In [1]:

import cptac
en = cptac.Ucec()
en.list_data_sources()

Out[1]:

	Data type	Available sources
0	CNV	awg, washu
1	CNV_gistic	awgconf
2	CNV_log2ratio	awgconf
3	acetylproteomics	awg, awgconf, pdc
4	acetylproteomics_gene	awgconf
5	circular_RNA	awg, awgconf, bcm
6	clinical	awg, awgconf, mssm, pdc
7	deconvolution_cibersort	washu
8	deconvolution_xcell	washu
9	derived_molecular	awg
10	experimental_design	awg
11	followup	awg
12	gene_fusion	awgconf
13	methylation	awgconf
14	miRNA	awg, awgconf, washu
15	phosphoproteomics	awg, awgconf, pdc, umich
16	phosphoproteomics_gene	awgconf
17	proteomics	awg, awgconf, pdc, umich
18	somatic_mutation	awg, awgconf, harmonized, washu
19	somatic_mutation_binary	awg, awgconf
20	targeted_phosphoproteomics	awgconf
21	targeted_proteomics	awgconf
22	transcriptomics	awg, awgconf, bcm, broad, washu
23	tumor_purity	washu

General format¶

cptac has a helpful function called multi_join. It allows data from several different cptac dataframes to be joined at the same time.

To use multi_join, you specify the dataframes you want to join by passing a dictionary of their names to the function call. The function will automatically check that the dataframes whose names you provided are valid for the join function, and print an error message if they aren't.

Whenever a column from an -omics dataframe is included in a joined table, the source and name of the -omics dataframe it came from are appended to the column header (for example, A1BG_awg_proteomics), to avoid confusion.

If you wish to include only particular columns in the join, pass them as values in the dictionary. Each value can be either a single column name as a string or a list of column name strings. In this tutorial we usually select specific columns for readability, but you could select the whole dataframe in all of these cases, except for the mutations dataframe.

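The join functions keep every sample that appears in any of the joined dataframes and fill the missing values with NaN (see sample C3L-00084 in Out[4] below, which has clinical data but no transcriptomics data). This is analogous to an SQL FULL OUTER JOIN, not an INNER JOIN.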
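
For readers less familiar with SQL join terminology, the sketch below contrasts an inner and an outer join on two toy tables keyed by Patient_ID. It is plain Python and not cptac's implementation; the join_tables helper and the dictionary layout are illustrative assumptions, and the values are copied from outputs shown later in this tutorial.

```python
# Toy stand-ins for two cptac tables: {Patient_ID: {column: value}}.
clinical = {
    "C3L-00006": {"Histologic_type": "Endometrioid"},
    "C3L-00084": {"Histologic_type": "Carcinosarcoma"},
}
transcriptomics = {
    "C3L-00006": {"ZZZ3_awg_transcriptomics": 11.47},
}


def join_tables(left, right, how="outer"):
    """Join two {sample: {column: value}} tables on their sample IDs."""
    if how == "inner":
        # Keep only samples present in both tables.
        samples = [s for s in left if s in right]
    elif how == "outer":
        # Keep samples present in either table.
        samples = list(left) + [s for s in right if s not in left]
    else:
        raise ValueError(f"unsupported join type: {how!r}")

    # Collect every column name once, preserving order.
    columns = []
    for table in (left, right):
        for row in table.values():
            for col in row:
                if col not in columns:
                    columns.append(col)

    nan = float("nan")
    joined = {}
    for s in samples:
        merged = {**left.get(s, {}), **right.get(s, {})}
        # Columns a sample lacks are filled with NaN.
        joined[s] = {col: merged.get(col, nan) for col in columns}
    return joined


inner = join_tables(clinical, transcriptomics, how="inner")
outer = join_tables(clinical, transcriptomics, how="outer")
print(sorted(inner))  # ['C3L-00006']
print(sorted(outer))  # ['C3L-00006', 'C3L-00084']
```

In the outer result, C3L-00084 keeps its clinical value and gets NaN for the transcriptomics column, which is the pattern that sample shows in the joined tables below.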

Join dictionary¶

The main parameter for the multi_join function is a dictionary whose keys identify a source and datatype, and whose values name the columns to select. Because a datatype can be available from multiple sources, the desired source must be included in the key. This can be done in two ways: as a string that contains the source, a space, and then the datatype, or as a tuple formatted (source, datatype). For example, using:

{('awg', 'proteomics'): ''}

or

{"awg proteomics": ''}

as the join dictionary would each result in multi_join returning a dataframe containing only awg proteomics data.

You'll notice the value in the key:value pair is an empty string. Because a dictionary needs a value for each key, an empty string or an empty list means we want everything from the specified dataframe. If a string or list of strings is specified, the joined dataframe will only contain the specified columns. See below for more examples.
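
To make the accepted key and value forms concrete, here is a minimal sketch of how such a dictionary could be normalized. The normalize_join_dict function is hypothetical and is not part of cptac; it only mirrors the rules described above (string or tuple keys, and an empty string or empty list meaning "all columns").

```python
def normalize_join_dict(join_dict):
    """Return a list of (source, datatype, columns) triples.

    columns is None when the whole dataframe is requested.
    Hypothetical helper for illustration, not a cptac function.
    """
    normalized = []
    for key, value in join_dict.items():
        if isinstance(key, tuple):
            # Tuple form: (source, datatype)
            source, datatype = key
        else:
            # String form: "source datatype"
            source, datatype = key.split(" ", 1)

        if isinstance(value, str):
            # A single column name, or '' for everything.
            columns = [value] if value else None
        else:
            # A list of column names, or [] for everything.
            columns = list(value) or None

        normalized.append((source, datatype, columns))
    return normalized


print(normalize_join_dict({("awg", "proteomics"): ""}))
# [('awg', 'proteomics', None)]
print(normalize_join_dict({"awg clinical": ["Age", "Histologic_type"]}))
# [('awg', 'clinical', ['Age', 'Histologic_type'])]
```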

Join omics to omics¶

multi_join can join two -omics dataframes to each other. Types of -omics data valid for use with this function are acetylproteomics, CNV, phosphoproteomics, phosphoproteomics_gene, proteomics, and transcriptomics.

In [2]:

prot_and_phos = en.multi_join({"awg proteomics":'', "awg phosphoproteomics":''})
prot_and_phos.head()

Out[2]:

Name	A1BG_awg_proteomics	A2M_awg_proteomics	A2ML1_awg_proteomics	A4GALT_awg_proteomics	AAAS_awg_proteomics	AACS_awg_proteomics	AADAT_awg_proteomics	AAED1_awg_proteomics	AAGAB_awg_proteomics	AAK1_awg_proteomics	...	ZZZ3_awg_phosphoproteomics
Site											...	S397	S411	S420	S424	S426	S468	S89	T415	T418	Y399
Patient_ID
C3L-00006	-1.180	-0.8630	-0.802	0.222	0.2560	0.6650	1.2800	-0.3390	0.412	-0.664	...	0.18400	NaN	NaN	NaN	-0.20500	NaN	NaN	NaN	NaN	NaN
C3L-00008	-0.685	-1.0700	-0.684	0.984	0.1350	0.3340	1.3000	0.1390	1.330	-0.367	...	-0.17100	NaN	NaN	-0.393	-0.17100	NaN	0.29	NaN	0.1605	-0.0635
C3L-00032	-0.528	-1.3200	0.435	NaN	-0.2400	1.0400	-0.0213	-0.0479	0.419	-0.500	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
C3L-00090	-1.670	-1.1900	-0.443	0.243	-0.0993	0.7570	0.7400	-0.9290	0.229	-0.223	...	0.13970	NaN	NaN	NaN	-0.55900	NaN	NaN	NaN	NaN	0.2980
C3L-00098	-0.374	-0.0206	-0.537	0.311	0.3750	0.0131	-1.1000	NaN	0.565	-0.101	...	-0.15875	NaN	NaN	0.196	0.06175	NaN	NaN	NaN	NaN	-0.2900

5 rows × 84211 columns

To join only specific columns, pass their names as the dictionary values. Note that when a gene is selected from the phosphoproteomics dataframe, data for all sites of the gene are selected. The same is done for acetylproteomics data.

In [3]:

prot_and_phos_selected = en.multi_join({"awg proteomics":'A1BG', "awg phosphoproteomics":'PIK3CA'})
prot_and_phos_selected.head()

Out[3]:

Name	A1BG_awg_proteomics	PIK3CA_awg_phosphoproteomics
Site		S312	T313
Patient_ID
C3L-00006	-1.180	-0.00615	0.0731
C3L-00008	-0.685	-0.02220	NaN
C3L-00032	-0.528	NaN	0.0830
C3L-00090	-1.670	NaN	-0.8460
C3L-00098	-0.374	0.43600	NaN
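
The site expansion seen above, where one gene name yields several columns, can be pictured with two-level column labels. This is a sketch on toy labels only; columns_for_genes is a hypothetical helper and does not reflect how cptac stores or selects its columns internally.

```python
# Toy two-level column labels (Name, Site), mirroring the phosphoproteomics header.
phospho_columns = [
    ("PIK3CA", "S312"),
    ("PIK3CA", "T313"),
    ("ZZZ3", "S397"),
    ("ZZZ3", "S411"),
]


def columns_for_genes(columns, genes):
    """Return every (Name, Site) label whose gene name was requested."""
    if isinstance(genes, str):
        genes = [genes]  # accept a single gene name as well as a list
    return [(name, site) for name, site in columns if name in genes]


print(columns_for_genes(phospho_columns, "PIK3CA"))
# [('PIK3CA', 'S312'), ('PIK3CA', 'T313')]
```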

Join metadata to omics¶

The multi_join function can also join a metadata dataframe (e.g. clinical or derived_molecular) with an -omics dataframe:

In [4]:

clin_and_tran = en.multi_join({"awg clinical":'', "awg transcriptomics":''})
clin_and_tran.head()

Out[4]:

Name	Sample_ID	Sample_Tumor_Normal	Proteomics_Tumor_Normal	Country	Histologic_Grade_FIGO	Myometrial_invasion_Specify	Histologic_type	Treatment_naive	Tumor_purity	Path_Stage_Primary_Tumor-pT	...	ZWILCH_awg_transcriptomics	ZWINT_awg_transcriptomics	ZXDA_awg_transcriptomics	ZXDB_awg_transcriptomics	ZXDC_awg_transcriptomics	ZYG11A_awg_transcriptomics	ZYG11B_awg_transcriptomics	ZYX_awg_transcriptomics	ZZEF1_awg_transcriptomics	ZZZ3_awg_transcriptomics
Patient_ID
C3L-00006	S001	Tumor	Tumor	United States	FIGO grade 1	under 50 %	Endometrioid	YES	Normal	pT1a (FIGO IA)	...	11.06	10.73	8.40	9.78	10.88	5.93	11.52	10.23	11.50	11.47
C3L-00008	S002	Tumor	Tumor	United States	FIGO grade 1	under 50 %	Endometrioid	YES	Normal	pT1a (FIGO IA)	...	10.87	11.43	8.39	9.14	10.38	7.25	11.64	10.64	11.26	11.57
C3L-00032	S003	Tumor	Tumor	United States	FIGO grade 2	under 50 %	Endometrioid	YES	Normal	pT1a (FIGO IA)	...	10.06	10.13	8.35	9.27	10.46	6.85	11.60	10.21	11.51	11.09
C3L-00084	S004	Tumor	Tumor	NaN	NaN	NaN	Carcinosarcoma	YES	Normal	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
C3L-00090	S005	Tumor	Tumor	United States	FIGO grade 2	under 50 %	Endometrioid	YES	Normal	pT1a (FIGO IA)	...	10.29	10.41	9.10	9.59	10.15	7.89	11.90	10.21	11.34	11.51

5 rows × 28084 columns

Joining only specific columns:

In [5]:

clin_and_tran = en.multi_join({"awg clinical": ["Age", "Histologic_type"], "awg transcriptomics": "ZZZ3"})
clin_and_tran.head()

Out[5]:

Name	Age	Histologic_type	ZZZ3_awg_transcriptomics
Patient_ID
C3L-00006	64.0	Endometrioid	11.47
C3L-00008	58.0	Endometrioid	11.57
C3L-00032	50.0	Endometrioid	11.09
C3L-00084	NaN	Carcinosarcoma	NaN
C3L-00090	75.0	Endometrioid	11.51

Join metadata to metadata¶

Two metadata dataframes (e.g. clinical and derived_molecular) can also be joined together. In the example below, we pass a column name to select from the clinical dataframe, while passing an empty string '' (or an empty list []) as the value for the derived_molecular dataframe selects the entire dataframe.

In [6]:

hist_and_derived_molecular = en.multi_join({
    "awg clinical": "Histologic_type",
    "awg derived_molecular": '' # Note that by using an empty string or list as the value, we join the entire dataframe
})

hist_and_derived_molecular.head()

Out[6]:

Name	Histologic_type	Estrogen_Receptor	Estrogen_Receptor_%	Progesterone_Receptor	Progesterone_Receptor_%	MLH1	MLH2	MSH6	PMS2	p53	...	Log2_variant_total	Log2_SNP_total	Log2_INDEL_total	Genomics_subtype	Mutation_signature_C>A	Mutation_signature_C>G	Mutation_signature_C>T	Mutation_signature_T>C	Mutation_signature_T>A	Mutation_signature_T>G
Patient_ID
C3L-00006	Endometrioid	Cannot be determined	NaN	Cannot be determined	NaN	Intact nuclear expression	Intact nuclear expression	Loss of nuclear expression	Intact nuclear expression	Cannot be determined	...	10.062046	9.984418	5.832890	MSI-H	8.300395	1.482213	72.529644	14.426877	1.383399	1.877470
C3L-00008	Endometrioid	Cannot be determined	NaN	Cannot be determined	NaN	Intact nuclear expression	Intact nuclear expression	Intact nuclear expression	Loss of nuclear expression	Cannot be determined	...	8.861087	8.330917	7.169925	MSI-H	14.641745	2.803738	64.485981	15.264798	0.934579	1.869159
C3L-00032	Endometrioid	Cannot be determined	NaN	Cannot be determined	NaN	Intact nuclear expression	Intact nuclear expression	Intact nuclear expression	Intact nuclear expression	Cannot be determined	...	5.321928	5.000000	3.169925	CNV_low	16.129032	3.225806	70.967742	3.225806	3.225806	3.225806
C3L-00084	Carcinosarcoma	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
C3L-00090	Endometrioid	Cannot be determined	NaN	Cannot be determined	NaN	Intact nuclear expression	Intact nuclear expression	Intact nuclear expression	Intact nuclear expression	Cannot be determined	...	5.672425	5.523562	2.584963	CNV_low	17.777778	8.888889	62.222222	8.888889	2.222222	0.000000

5 rows × 126 columns

Join many datatypes together¶

If you need data from three or more dataframes, simply add them all to the joining dictionary; multi_join places no fixed limit on the number of entries.

In [7]:

joining_dictionary = {"awg proteomics": ["AURKA", "TP53"], "awg phosphoproteomics": ["AURKA", "TP53"], "awg clinical": [], "awg somatic_mutation": "PTEN"}
en.multi_join(joining_dictionary).head()

cptac warning: The following columns were not found in the awg phosphoproteomics dataframe, so they were inserted into joined table, but filled with NaN: AURKA (<ipython-input-7-8c248f83a0d2>, line 2)

cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 78 samples for the PTEN gene (<ipython-input-7-8c248f83a0d2>, line 2)

Out[7]:

Name	AURKA_awg_proteomics	TP53_awg_proteomics	AURKA_awg_phosphoproteomics	TP53_awg_phosphoproteomics		Sample_ID	Sample_Tumor_Normal	Proteomics_Tumor_Normal	Country	Histologic_Grade_FIGO	...	Gender	Tumor_Site	Tumor_Site_Other	Tumor_Focality	Tumor_Size_cm	Num_full_term_pregnancies	PTEN_Mutation	PTEN_Location	PTEN_Mutation_Status	Sample_Status
Site			NaN	S315	T150						...
Patient_ID
C3L-00006	NaN	0.295	NaN	NaN	NaN	S001	Tumor	Tumor	United States	FIGO grade 1	...	Female	Anterior endometrium	NaN	Unifocal	2.9	1	[Missense_Mutation, Nonsense_Mutation]	[p.R130Q, p.R233*]	Multiple_mutation	Tumor
C3L-00008	0.311	0.277	NaN	0.646	NaN	S002	Tumor	Tumor	United States	FIGO grade 1	...	Female	Posterior endometrium	NaN	Unifocal	3.5	1	[Missense_Mutation]	[p.G127R]	Single_mutation	Tumor
C3L-00032	NaN	-0.871	NaN	-0.800	NaN	S003	Tumor	Tumor	United States	FIGO grade 2	...	Female	Other, specify	Anterior and Posterior endometrium	Unifocal	4.5	4 or more	[Nonsense_Mutation]	[p.W111*]	Single_mutation	Tumor
C3L-00084	NaN	NaN	NaN	NaN	NaN	S004	Tumor	Tumor	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	[Wildtype_Tumor]	[No_mutation]	Wildtype_Tumor	Tumor
C3L-00090	-0.798	-0.343	NaN	NaN	NaN	S005	Tumor	Tumor	United States	FIGO grade 2	...	Female	Other, specify	Anterior and Posterior endometrium	Unifocal	3.5	4 or more	[Missense_Mutation]	[p.R130G]	Single_mutation	Tumor

5 rows × 36 columns
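
The first warning above reports that a requested column (AURKA) was not found in the phosphoproteomics dataframe and was inserted as NaN. The sketch below shows that padding behavior on a toy table. select_columns is a hypothetical helper, the table ignores phosphorylation sites for simplicity, and the warning text is only an approximation of cptac's.

```python
import warnings


def select_columns(table, requested):
    """Pick requested columns from {sample: {column: value}}, padding absent ones with NaN."""
    available = {col for row in table.values() for col in row}
    missing = [col for col in requested if col not in available]
    if missing:
        warnings.warn("Columns not found, filled with NaN: " + ", ".join(missing))
    nan = float("nan")
    return {
        sample: {col: row.get(col, nan) for col in requested}
        for sample, row in table.items()
    }


# Toy phosphoproteomics values for TP53 (sites omitted for simplicity).
phospho = {"C3L-00008": {"TP53": 0.646}, "C3L-00032": {"TP53": -0.800}}
selected = select_columns(phospho, ["AURKA", "TP53"])
print(selected["C3L-00008"])
```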

multi_join does not necessarily need to join different dataframes. If you just want a small amount of information from a dataframe, this function is useful for that as well.

In [8]:

histologic_type_and_grade = en.multi_join({"awg clinical": ['Histologic_type', 'Histologic_Grade_FIGO']})
histologic_type_and_grade.head()

Out[8]:

Name	Histologic_type	Histologic_Grade_FIGO
Patient_ID
C3L-00006	Endometrioid	FIGO grade 1
C3L-00008	Endometrioid	FIGO grade 1
C3L-00032	Endometrioid	FIGO grade 2
C3L-00084	Carcinosarcoma	NaN
C3L-00090	Endometrioid	FIGO grade 2

Join omics to mutations¶

Joining an -omics dataframe with the mutation data for a specified gene or genes is slightly different from other types of joins using multi_join. Because there may be multiple mutations for one gene in a single sample, the mutation type and location data are returned in lists by default, even if there is only one mutation. If there is no mutation for the gene in a particular sample, the list contains either "Wildtype_Tumor" or "Wildtype_Normal", depending on whether it is a tumor or normal sample. To help with parsing, the mutation status column contains one of "Single_mutation", "Multiple_mutation", "Wildtype_Tumor", or "Wildtype_Normal".

In [9]:

# Join two proteomics columns with the PTEN mutation data
selected_prot_and_PTEN_mut_mult = en.multi_join({"awg proteomics": ["AURKA", "TP53"], "awg somatic_mutation": "PTEN"})
selected_prot_and_PTEN_mut_mult.head(10)

cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 69 samples for the PTEN gene (<ipython-input-9-0ffecace8e23>, line 1)

Out[9]:

Name	AURKA_awg_proteomics	TP53_awg_proteomics	PTEN_Mutation	PTEN_Location	PTEN_Mutation_Status	Sample_Status
Patient_ID
C3L-00006	NaN	0.2950	[Missense_Mutation, Nonsense_Mutation]	[p.R130Q, p.R233*]	Multiple_mutation	Tumor
C3L-00008	0.31100	0.2770	[Missense_Mutation]	[p.G127R]	Single_mutation	Tumor
C3L-00032	NaN	-0.8710	[Nonsense_Mutation]	[p.W111*]	Single_mutation	Tumor
C3L-00090	-0.79800	-0.3430	[Missense_Mutation]	[p.R130G]	Single_mutation	Tumor
C3L-00098	3.11000	3.0100	[Wildtype_Tumor]	[No_mutation]	Wildtype_Tumor	Tumor
C3L-00136	-1.65000	-0.1480	[Missense_Mutation, Missense_Mutation]	[p.Y68C, p.R130G]	Multiple_mutation	Tumor
C3L-00137	NaN	0.4410	[Frame_Shift_Ins, Nonsense_Mutation]	[p.H118Qfs*8, p.Y180*]	Multiple_mutation	Tumor
C3L-00139	0.84800	-1.2200	[Wildtype_Tumor]	[No_mutation]	Wildtype_Tumor	Tumor
C3L-00143	-1.73000	-0.0825	[Missense_Mutation]	[p.R130G]	Single_mutation	Tumor
C3L-00145	-0.00513	-0.1810	[Missense_Mutation, Frame_Shift_Ins]	[p.H93R, p.E242*]	Multiple_mutation	Tumor

The same table can be produced with the join_omics_to_mutations function, which takes the -omics dataframe name and the genes as separate arguments. As the warnings below show, when no source is specified a source is chosen for you (here awg); pass the omics_source and mutations_source parameters to prevent these warnings.

In [10]:

selected_prot_and_PTEN_mut = en.join_omics_to_mutations(
    omics_name="proteomics",
    mutations_genes="PTEN",
    omics_genes=["AURKA", "TP53"])

selected_prot_and_PTEN_mut.head(10)

/Users/robertoldroyd/opt/anaconda3/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3437: UserWarning: No source specified for proteomics data. Source awg used, pass a source to the omics_source parameter to prevent this warning
  exec(code_obj, self.user_global_ns, self.user_ns)
/Users/robertoldroyd/opt/anaconda3/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3437: UserWarning: No source specified for mutations data. Source awg used, pass a source to the mutations_source parameter to prevent this warning
  exec(code_obj, self.user_global_ns, self.user_ns)
cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 69 samples for the PTEN gene (/Users/robertoldroyd/opt/anaconda3/lib/python3.8/site-packages/cptac/cancers/cancer.py, line 387)

Out[10]:

Name	AURKA_awg_proteomics	TP53_awg_proteomics	PTEN_Mutation	PTEN_Location	PTEN_Mutation_Status	Sample_Status
Patient_ID
C3L-00006	NaN	0.2950	[Missense_Mutation, Nonsense_Mutation]	[p.R130Q, p.R233*]	Multiple_mutation	Tumor
C3L-00008	0.31100	0.2770	[Missense_Mutation]	[p.G127R]	Single_mutation	Tumor
C3L-00032	NaN	-0.8710	[Nonsense_Mutation]	[p.W111*]	Single_mutation	Tumor
C3L-00090	-0.79800	-0.3430	[Missense_Mutation]	[p.R130G]	Single_mutation	Tumor
C3L-00098	3.11000	3.0100	[Wildtype_Tumor]	[No_mutation]	Wildtype_Tumor	Tumor
C3L-00136	-1.65000	-0.1480	[Missense_Mutation, Missense_Mutation]	[p.Y68C, p.R130G]	Multiple_mutation	Tumor
C3L-00137	NaN	0.4410	[Frame_Shift_Ins, Nonsense_Mutation]	[p.H118Qfs*8, p.Y180*]	Multiple_mutation	Tumor
C3L-00139	0.84800	-1.2200	[Wildtype_Tumor]	[No_mutation]	Wildtype_Tumor	Tumor
C3L-00143	-1.73000	-0.0825	[Missense_Mutation]	[p.R130G]	Single_mutation	Tumor
C3L-00145	-0.00513	-0.1810	[Missense_Mutation, Frame_Shift_Ins]	[p.H93R, p.E242*]	Multiple_mutation	Tumor
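
The relationship between the list-valued mutation column and the mutation status column can be summarized in a few lines. This is a sketch of the rule as described above, not cptac's code. mutation_status is a hypothetical helper, and it assumes an empty list stands for "no mutation found", whereas cptac itself stores a wildtype label in the list.

```python
def mutation_status(mutations, sample_status):
    """Summarize one sample's list of mutation types as a single status label.

    mutations: list of mutation type strings (empty if none were found).
    sample_status: "Tumor" or "Normal".
    """
    if not mutations:
        return "Wildtype_" + sample_status
    if len(mutations) == 1:
        return "Single_mutation"
    return "Multiple_mutation"


print(mutation_status(["Missense_Mutation", "Nonsense_Mutation"], "Tumor"))  # Multiple_mutation
print(mutation_status(["Missense_Mutation"], "Tumor"))                       # Single_mutation
print(mutation_status([], "Tumor"))                                          # Wildtype_Tumor
```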

Filtering multiple mutations¶

multi_join can filter multiple mutations down to just one mutation per sample and gene. It allows you to specify particular mutation types or locations to prioritize, and also provides a default sorting hierarchy for all other mutations. The default hierarchy chooses truncation mutations over missense mutations, and silent mutations last of all. If there are multiple mutations of the same type, it chooses the mutation occurring earlier in the sequence.

To filter all mutations based on this default hierarchy, simply pass an empty list to the optional mutations_filter parameter. Notice how in sample C3L-00006 (Sample_ID S001), the nonsense mutation was chosen over the missense mutation because it is a type of truncation mutation, even though the missense mutation occurs earlier in the peptide sequence. In sample C3L-00137 (Sample_ID S008), both mutations were truncation mutations, so the function chose the earlier one.

In [11]:

PTEN_default_filter = en.multi_join({"awg proteomics": ["AURKA", "TP53"],
                                     "awg somatic_mutation": "PTEN"},
                                    mutations_filter=[])
PTEN_default_filter.loc[["C3L-00006", "C3L-00137"]]

cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 69 samples for the PTEN gene (<ipython-input-11-520a06c702d8>, line 1)

Out[11]:

Name	AURKA_awg_proteomics	TP53_awg_proteomics	PTEN_Mutation	PTEN_Location	PTEN_Mutation_Status	Sample_Status
Patient_ID
C3L-00006	NaN	0.295	Nonsense_Mutation	p.R233*	Multiple_mutation	Tumor
C3L-00137	NaN	0.441	Frame_Shift_Ins	p.H118Qfs*8	Multiple_mutation	Tumor
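
The default hierarchy can be sketched as a sort key: first by mutation class, then by position in the sequence. This is an illustration of the rule as described in this tutorial, not cptac's implementation. The membership of the TRUNCATIONS set, the "Silent" label, and the rank given to other mutation types are assumptions.

```python
import re

# Assumption: which mutation types count as truncations.
TRUNCATIONS = {"Nonsense_Mutation", "Frame_Shift_Ins", "Frame_Shift_Del"}


def type_rank(mutation_type):
    """Lower rank wins: truncations first, silent mutations last."""
    if mutation_type in TRUNCATIONS:
        return 0
    if mutation_type == "Silent":  # assumed label for silent mutations
        return 2
    return 1  # missense and, by assumption, every other type


def position(location):
    """Extract the residue number from a label such as 'p.R233*'."""
    match = re.search(r"\d+", location)
    return int(match.group()) if match else float("inf")


def pick_default(mutations):
    """mutations: list of (type, location) pairs for one sample and gene."""
    return min(mutations, key=lambda m: (type_rank(m[0]), position(m[1])))


# The two samples shown above.
print(pick_default([("Missense_Mutation", "p.R130Q"), ("Nonsense_Mutation", "p.R233*")]))
# ('Nonsense_Mutation', 'p.R233*')
print(pick_default([("Frame_Shift_Ins", "p.H118Qfs*8"), ("Nonsense_Mutation", "p.Y180*")]))
# ('Frame_Shift_Ins', 'p.H118Qfs*8')
```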

To prioritize a particular type of mutation, or a particular location, include it in the mutations_filter list. Below, we tell the function to prioritize nonsense mutations over all other mutations. Notice how in sample C3L-00137 (Sample_ID S008), the nonsense mutation is now selected instead of the frameshift insertion, even though the nonsense mutation occurs later in the peptide sequence.

In [12]:

PTEN_simple_filter = en.multi_join({"awg proteomics": ["AURKA", "TP53"],
                                    "awg somatic_mutation": "PTEN"},
                                   mutations_filter=["Nonsense_Mutation"])
PTEN_simple_filter.loc[["C3L-00006", "C3L-00137"]]

cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 69 samples for the PTEN gene (<ipython-input-12-e925d3d5980f>, line 1)

Out[12]:

Name	AURKA_awg_proteomics	TP53_awg_proteomics	PTEN_Mutation	PTEN_Location	PTEN_Mutation_Status	Sample_Status
Patient_ID
C3L-00006	NaN	0.295	Nonsense_Mutation	p.R233*	Multiple_mutation	Tumor
C3L-00137	NaN	0.441	Nonsense_Mutation	p.Y180*	Multiple_mutation	Tumor

You can include multiple mutation types and/or locations in the mutations_filter list. Values earlier in the list are prioritized over values later in the list. For example, with the filter we specify below, the function selects the missense mutation of sample C3L-00006 (Sample_ID S001) over its nonsense mutation, because the location of that missense mutation is the first value in our filter list. Nonsense_Mutation is still in the filter list, but it comes after that location, so the missense mutation wins for C3L-00006. For all other samples, unless they also have a mutation at that same location, the function continues to prioritize nonsense mutations, as we see in sample C3L-00137 (Sample_ID S008).

In [13]:

PTEN_complex_filter = en.multi_join({"awg proteomics": ["AURKA", "TP53"],
                                    "awg somatic_mutation": "PTEN"}, 
                                    mutations_filter=["p.R130Q", "Nonsense_Mutation"])
PTEN_complex_filter.loc[["C3L-00006", "C3L-00137"]]

cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 69 samples for the PTEN gene (<ipython-input-13-3cf83de88378>, line 1)

Out[13]:

Name	AURKA_awg_proteomics	TP53_awg_proteomics	PTEN_Mutation	PTEN_Location	PTEN_Mutation_Status	Sample_Status
Patient_ID
C3L-00006	NaN	0.295	Missense_Mutation	p.R130Q	Multiple_mutation	Tumor
C3L-00137	NaN	0.441	Nonsense_Mutation	p.Y180*	Multiple_mutation	Tumor
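
The ordering rule for mutations_filter can be sketched as follows: each mutation gets the index of the first filter value it matches, by type or by location, and the lowest index wins. This is an illustrative approximation, not cptac's implementation. filter_priority and pick_with_filter are hypothetical names, and for brevity ties are broken by position only, leaving out the default type hierarchy.

```python
import re


def position(location):
    """Extract the residue number from a label such as 'p.Y180*'."""
    match = re.search(r"\d+", location)
    return int(match.group()) if match else float("inf")


def filter_priority(mutation, mutations_filter):
    """Index of the first filter value matching the mutation's type or location."""
    mutation_type, location = mutation
    for index, wanted in enumerate(mutations_filter):
        if wanted in (mutation_type, location):
            return index
    return len(mutations_filter)  # matched nothing: lowest priority


def pick_with_filter(mutations, mutations_filter):
    """mutations: list of (type, location) pairs for one sample and gene."""
    return min(
        mutations,
        key=lambda m: (filter_priority(m, mutations_filter), position(m[1])),
    )


c3l_00006 = [("Missense_Mutation", "p.R130Q"), ("Nonsense_Mutation", "p.R233*")]
c3l_00137 = [("Frame_Shift_Ins", "p.H118Qfs*8"), ("Nonsense_Mutation", "p.Y180*")]

complex_filter = ["p.R130Q", "Nonsense_Mutation"]
print(pick_with_filter(c3l_00006, complex_filter))  # ('Missense_Mutation', 'p.R130Q')
print(pick_with_filter(c3l_00137, complex_filter))  # ('Nonsense_Mutation', 'p.Y180*')
```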

Join metadata to mutations¶

Joining metadata to mutation data works exactly like joining other datatypes. Just like any time you are using somatic_mutation data, you can filter multiple mutations with the mutations_filter parameter. Here are some examples:

In [14]:

hist_and_PTEN = en.multi_join(
    {"awg clinical": 'Histologic_type',
    "awg somatic_mutation": "PTEN"})

hist_and_PTEN.head()

cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 78 samples for the PTEN gene (<ipython-input-14-176be404a675>, line 1)

Out[14]:

Name	Histologic_type	PTEN_Mutation	PTEN_Location	PTEN_Mutation_Status	Sample_Status
Patient_ID
C3L-00006	Endometrioid	[Missense_Mutation, Nonsense_Mutation]	[p.R130Q, p.R233*]	Multiple_mutation	Tumor
C3L-00008	Endometrioid	[Missense_Mutation]	[p.G127R]	Single_mutation	Tumor
C3L-00032	Endometrioid	[Nonsense_Mutation]	[p.W111*]	Single_mutation	Tumor
C3L-00084	Carcinosarcoma	[Wildtype_Tumor]	[No_mutation]	Wildtype_Tumor	Tumor
C3L-00090	Endometrioid	[Missense_Mutation]	[p.R130G]	Single_mutation	Tumor

With multiple mutations filtered:

In [15]:

hist_and_PTEN = en.multi_join(
    {"awg clinical": "Histologic_type",
    "awg somatic_mutation": "PTEN"},
    mutations_filter=["Nonsense_Mutation"])

hist_and_PTEN.head()

cptac warning: In joining the somatic_mutation table, no mutations were found for the following samples, so they were filled with Wildtype_Tumor or Wildtype_Normal: 78 samples for the PTEN gene (<ipython-input-15-c858f3802047>, line 1)

Out[15]:

Name	Histologic_type	PTEN_Mutation	PTEN_Location	PTEN_Mutation_Status	Sample_Status
Patient_ID
C3L-00006	Endometrioid	Nonsense_Mutation	p.R233*	Multiple_mutation	Tumor
C3L-00008	Endometrioid	Missense_Mutation	p.G127R	Single_mutation	Tumor
C3L-00032	Endometrioid	Nonsense_Mutation	p.W111*	Single_mutation	Tumor
C3L-00084	Carcinosarcoma	Wildtype_Tumor	No_mutation	Wildtype_Tumor	Tumor
C3L-00090	Endometrioid	Missense_Mutation	p.R130G	Single_mutation	Tumor

Exporting dataframes¶

If you wish to export a dataframe to a file, simply call the dataframe's to_csv method, passing the path you wish to save the file to, and the value separator you want:

In [16]:

hist_and_PTEN.to_csv(path_or_buf="histologic_type_and_PTEN_mutation.tsv", sep='\t')
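
To check an exported file, or to read it without pandas, a tab-separated file can be parsed with Python's standard csv module. The sketch below uses an in-memory stand-in for the exported file. Its header layout, with Patient_ID as the first column, is an assumption about what to_csv writes for this dataframe, and the rows are copied from Out[15].

```python
import csv
import io

# Stand-in for the exported TSV file; the header layout is an assumption.
exported = (
    "Patient_ID\tHistologic_type\tPTEN_Mutation\tPTEN_Location\n"
    "C3L-00006\tEndometrioid\tNonsense_Mutation\tp.R233*\n"
    "C3L-00008\tEndometrioid\tMissense_Mutation\tp.G127R\n"
)

# For a real file, replace io.StringIO(exported) with open(path, newline="").
reader = csv.DictReader(io.StringIO(exported), delimiter="\t")
rows = {row["Patient_ID"]: row for row in reader}

print(rows["C3L-00006"]["PTEN_Mutation"])  # Nonsense_Mutation
```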
