Regulome Explorer Notebook¶

This notebook computes significant association scores between pairwise data types available in the PanCancer Atlas dataset of ISB-CGC. The specific statistical tests implmeneted are described 'here', and a description of the original Regulomen Explorer is avaiable 'here'.

The output of the notebook is a table of significacnt associations specified by correltions and p-values. This notebook also performs a more detailed analysis from a user specified pair of features names generating figures and additional statistics.

Authentication¶

The first step is to authorize access to BigQuery and the Google Cloud. For more information see 'Quick Start Guide to ISB-CGC' and alternative authentication methods can be found here.

Import Python libraries¶

In [1]:

%load_ext autoreload
%autoreload 2
%matplotlib inline
from google.cloud import bigquery
import pandas as pd
import re_module.bq_functions as regulome

Specify Parameters¶

The parameters for this experiment are the cancer type (study), a list of genes, a couple of molecular features (), the significance level, and the minimum number of samples required for the statistical analysis.

In [2]:

[study, feature1, feature2, gene_names, size, cohortlist, significance] = regulome.makeWidgets()

HBox(children=(HTML(value='<em>Select Feature1 </em>'), Dropdown(options=('Gene Expression', 'Somatic Copy Num…

HBox(children=(HTML(value='<em>Feature1 labels </em>'), Text(value='IGF2, ADAM6', placeholder='Type gene names…

HBox(children=(HTML(value='<em>Select Feature2 </em>'), Dropdown(options=('Gene Expression', 'Somatic Mutation…

HBox(children=(HTML(value='<em>Select a study </em>'), Dropdown(index=30, options=('ACC', 'BLCA', 'BRCA', 'CES…

HBox(children=(HTML(value='<em>Significance level </em>'), SelectionSlider(continuous_update=False, index=1, o…

HBox(children=(HTML(value='<em>Minimum number of samples</em>'), IntSlider(value=25, max=50, min=5)))

HBox(children=(HTML(value='<em>Cohort list</em>'), FileUpload(value={}, description='Upload')))

Build the query¶

The Bigquery query to compute associations between feature 1 and 2 are created using functions in the 'regulome' module. Please refer to our github repository to access the notebooks with description of the methods used for each possible combination of features available in TCGA: https://github.com/isb-cgc/Community-Notebooks/tree/master/RegulomeExplorer.

In [16]:

SampleList, PatientList  = regulome.readcohort( cohortlist )
LabelList = [ x.strip() for x in gene_names.value.split(',') ]

funct1 = regulome.approx_significant_level( )
table1, table2 = regulome.get_feature_tables(study.value,feature1.value,feature2.value,SampleList,PatientList,LabelList)
str_summarized = regulome.get_summarized_pancanatlas( feature1.value, feature2.value )
str_stats = regulome.get_stat_pancanatlas(feature1.value, feature2.value, size.value, significance.value  )

sql = (funct1 +  'WITH' + table1 + ',' + table2 + ',' + str_summarized + str_stats)
print(sql)

CREATE TEMP FUNCTION erfcc(x  FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
  
  var t; 
  var z; 
  var ans;
  z = Math.abs(x) ;
  t = 1.0 / (1.0 + 0.5*z ) ;
  
  ans= t * Math.exp(-z*z-1.26551223+t*(1.00002368+t*(0.37409196+t*(0.09678418+
t*(-0.18628806+t*(0.27886807+t*(-1.13520398+t*(1.48851587+
t*(-0.82215223+t*0.17087277)))))))));
  
  if ( x >= 0 ) {
    return ans ;
  } else {
    return 2.0 - ans;
  }
""";

WITH
table1 AS (
SELECT
   symbol,
   (RANK() OVER (PARTITION BY symbol ORDER BY data ASC)) + (COUNT(*) OVER ( PARTITION BY symbol, CAST(data as STRING)) - 1)/2.0 AS rnkdata,
   ParticipantBarcode
FROM (
   SELECT
      Symbol AS symbol, 
      AVG( LOG10( normalized_count + 1 ) )  AS data,
      ParticipantBarcode AS ParticipantBarcode
   FROM `pancancer-atlas.Filtered.EBpp_AdjustPANCAN_IlluminaHiSeq_RNASeqV2_genExp_filtered`
   WHERE Study = 'PAAD'    # cohort 
         AND Symbol IN UNNEST(@GENELIST)  # labels 
         AND normalized_count IS NOT NULL  
   GROUP BY
      ParticipantBarcode, symbol
   )
)
,
table2 AS (
SELECT
   symbol,
   (RANK() OVER (PARTITION BY symbol ORDER BY data ASC)) + (COUNT(*) OVER ( PARTITION BY symbol, CAST(data as STRING)) - 1)/2.0 AS rnkdata,
   ParticipantBarcode
FROM (
   SELECT
      Symbol AS symbol, 
      AVG( LOG10( normalized_count + 1 ) )  AS data,
      ParticipantBarcode AS ParticipantBarcode
   FROM `pancancer-atlas.Filtered.EBpp_AdjustPANCAN_IlluminaHiSeq_RNASeqV2_genExp_filtered`
   WHERE Study = 'PAAD'    # cohort 
         AND Symbol IS NOT NULL  # labels 
         AND normalized_count IS NOT NULL  
   GROUP BY
      ParticipantBarcode, symbol
   )
)
,
summ_table AS (
SELECT 
   n1.symbol as symbol1,
   n2.symbol as symbol2,
   COUNT( n1.ParticipantBarcode ) as n,
   CORR(n1.rnkdata , n2.rnkdata) as correlation
    
FROM
   table1 AS n1
INNER JOIN
   table2 AS n2
ON
   n1.ParticipantBarcode = n2.ParticipantBarcode
   AND n2.symbol NOT IN UNNEST(@GENELIST)
GROUP BY
   symbol1, symbol2
UNION ALL
SELECT 
   n1.symbol as symbol1,
   n2.symbol as symbol2,
   COUNT( n1.ParticipantBarcode ) as n,
   CORR(n1.rnkdata , n2.rnkdata) as correlation
    
FROM
   table1 AS n1
INNER JOIN
   table1 AS n2
ON
   n1.ParticipantBarcode = n2.ParticipantBarcode
   AND n1.symbol < n2.symbol
GROUP BY
   symbol1, symbol2
)
SELECT symbol1, symbol2, n, correlation
FROM summ_table
WHERE 
    n > 25 AND n < 500 AND NOT IS_NAN( correlation)
GROUP BY 1,2,3,4
HAVING  `cgc-05-0042.functions.significance_level_ttest2`(n-2, ABS(correlation)*SQRT((n-2)/((1+correlation)*(1-correlation)))) <= 0.05
UNION ALL
SELECT symbol1, symbol2, n, correlation 
FROM summ_table
WHERE 
    n >= 500 AND NOT IS_NAN( correlation)
GROUP BY 1,2,3,4
HAVING erfcc( ABS(correlation)*SQRT(n)/1.414213562373095 ) <= 0.05
ORDER BY ABS(correlation) DESC

Run the Bigquery¶

In [17]:

bqclient = bigquery.Client()
df_results = regulome.runQuery ( bqclient, sql, LabelList, SampleList, PatientList, dryRun=False )
regulome.pvalues_dataframe( df_results )
df_results

 in runQuery ... 
    this query processed 7757877633 bytes 
    Approx. elpased time : 5605 miliseconds

Out[17]:

	symbol1	symbol2	n	correlation	p-value
0	SMAD4	WDR7	151	0.738707	0.000000
1	SMAD4	ZNF24	151	0.732436	0.000000
2	SMAD4	TSHZ1	151	0.727285	0.000000
3	SMAD4	DTNA	151	0.723987	0.000000
4	SMAD4	ELAC1	151	0.714381	0.000000
...	...	...	...	...	...
11757	SMAD4	FAM111A	151	-0.159906	0.049848
11758	SMAD4	ACER3	151	0.159878	0.049888
11759	SMAD4	LBR	151	0.159840	0.049944
11760	SMAD4	TIMD4	151	0.159830	0.049957
11761	SMAD4	SSX4	151	-0.159820	0.049972

11762 rows × 5 columns

Analyze a pair of labels¶

From the table above please select a pair of features names to perform a statistical analysis and display the data. You can print the variable 'pair_query' to obtain the query used to retrieve the data. pair_query is the query used to retreive the necessary data for the statistical test.

In [7]:

[name1 , name2 ] = regulome.makeWidgetsPair()

HBox(children=(HTML(value='<em>Type label 1 </em>'), Text(value='', placeholder='label name')))

HBox(children=(HTML(value='<em>Type label 2 </em>'), Text(value='', placeholder='label name')))

In [18]:

pair_query = regulome.get_query_pair(name1.value,name2.value,study.value,SampleList,feature1.value,feature2.value)
#print(pair_query)
df_pair = regulome.runQuery( bqclient, pair_query, LabelList, SampleList, PatientList, dryRun=False )
regulome.plot_statistics_pair ( df_pair, feature2.value, name1.value, name2.value, size.value )

 in runQuery ... 
    this query processed 7757877633 bytes 
    Approx. elpased time : 1264 miliseconds 
SpearmanrResult(correlation=0.7387068665040083, pvalue=2.6102587012639067e-27)

In [ ]: