Regulome Explorer Kruskal-Wallis test for numerical and categorical data

CCheck out more notebooks at our 'Regulome Explorer Repository'!

In this notebook we describe how Regulome Explorer uses Kruskal-Wallis test to compute the significance of associations between a numerical feature (Gene expression, Somatic copy number, etc.) and a categorical feature. Details of the Kruskal-Wallist test can be found in the following link:

To describe the implementation of the test using BigQuery, we will use Gene expresion data of a user defined gene and a user defined clinical feature. This data is read from a BigQuery table in the pancancer-atlas dataset.

Authenticate with Google (IMPORTANT)

The first step is to authorize access to BigQuery and the Google Cloud. For more information see 'Quick Start Guide to ISB-CGC' and alternative authentication methods can be found here.

Import Python libraries

In [1]:
%matplotlib inline
from import bigquery
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import mstats
import seaborn as sns
import re_module.bq_functions as regulome

User defined Parameters

The parameters for this experiment are the cancer type, the name of gene for which gene expression data will be obtained, and the clinical feature name. Categorical groups with number of samples smaller than 'MinSampleSize' will be ignored in the test.

In [2]:
cancer_type = 'LGG'
gene_name = 'IGF2'
clinical_feature = 'icd_o_3_histology'
MinSampleSize = 26

bqclient = bigquery.Client()

Data from BigQuery tables

Gene expression data from the BigQuery: The following query string retrieves the gene expression data of the user specified gene ('gene_name') from the 'Filtered.EBpp_AdjustPANCAN_IlluminaHiSeq_RNASeqV2_genExp_filtered' table available in pancancer-atlas dataset.

In [3]:
query_table1 = """table1 AS (
SELECT  symbol, data, ParticipantBarcode
         Symbol AS symbol, AVG( LOG10( normalized_count + 1 )) AS data, ParticipantBarcode
   FROM  `pancancer-atlas.Filtered.EBpp_AdjustPANCAN_IlluminaHiSeq_RNASeqV2_genExp_filtered` 
   WHERE Study = '{0}' AND Symbol ='{1}' AND normalized_count IS NOT NULL
         ParticipantBarcode, symbol
""".format(cancer_type, gene_name )

Clinical data from the BigQuery: The following string query will retrieve clinical data fromthe 'pancancer-atlas.Filtered.clinical_PANCAN_patient_with_followup_filtered' table available in pancancer-atlas dataset. It is worth noting that some of the values of the clinical feature may be 'indetermined' or 'not-evaluated'; typically these values are inside square brackets. The 'REGEXP_CONTAINS' command is used to avoid using those values in the test.

In [4]:
query_table2 = """table2 AS (
   avgdata AS data,
      '{0}' AS symbol, 
      {0} AS avgdata,
      bcr_patient_barcode AS ParticipantBarcode
   FROM `pancancer-atlas.Filtered.clinical_PANCAN_patient_with_followup_filtered`
   WHERE acronym = '{1}' AND {0} IS NOT NULL  
         AND NOT REGEXP_CONTAINS({0},r"^(\[.*\]$)")     
""".format(clinical_feature, cancer_type)

The following query combines the two tables based on Participant barcodes. T

In [5]:
table_data = """table_data AS (
SELECT as data1, as data2,
   table1 AS n1
   table2 AS n2
   n1.ParticipantBarcode = n2.ParticipantBarcode

At this point we can take a look at output table

In [6]:
sql_data = 'WITH\n' +query_table1+','+query_table2+','+table_data 

sql = (sql_data + '\n' +
"""SELECT * FROM table_data 
   ORDER BY  data2

df_data = regulome.runQuery ( bqclient, sql, [] , dryRun=False )
 in runQuery ... 
    the results for this query were previously cached 
data1 data2 ParticipantBarcode
1 2.359532 9382/3 TCGA-E1-A7YW
2 2.868692 9382/3 TCGA-FG-7637
3 2.713119 9382/3 TCGA-TQ-A7RG
4 3.064997 9382/3 TCGA-DB-5280
5 2.554518 9382/3 TCGA-IK-8125
6 2.799724 9382/3 TCGA-DU-8163
7 2.800062 9382/3 TCGA-HT-8018
8 2.558207 9382/3 TCGA-DU-7019
9 2.793514 9382/3 TCGA-DU-5852

We can use a 'violinplot' to visualize the populations in each category.

In [7]:
new_data = df_data[ df_data.data2.str.contains('^\[.*\]$',na=True,regex=True) == False ]
new_data.rename(columns={ "data1": gene_name , "data2": clinical_feature }, inplace=True)
sns.violinplot( x=new_data[clinical_feature], y=new_data[gene_name], palette="Pastel1")
<matplotlib.axes._subplots.AxesSubplot at 0x7f3e4421a5f8>

BigQuery to Compute statistical association

The Kruskal-Wallis score (H) is computed by using the following equation:

$$H = \frac{(N-1)\sum_{i=1}^{g} n_i (\bar{r_{i}} -\bar{r} )^2 }{ \sum_{i=1}^{g} \sum_{j=1}^{n_i} (r_{ij}-\bar{r})^2 }$$


  • $n_i$ is the number of observations in category $i$
  • $r_{ij}$ is the rank (among all participants) of the gene expression of participant $j$ that belongs to category $i$
  • $N$ is the total number of participants considered in the test
  • $\bar{r_i}$ is the averange rank of gene expression values for particpants in category $i$
  • $\bar{r}$ is the average of all $r_{ij}$

To avoid reading that table multiple times, we rearranged the equations above as :

$$H = (N-1)\frac{ \sum_{i=1}^{g}S_i^2/n_i - (\sum_{i=1}^{g}S_i)^2 / N }{ \sum_{i=1}^{g}Q_i - (\sum_{i=1}^{g}S_i)^2 / N }$$

Where $S_i = \sum_{j=1}^{n_i}r_{ij}$ and $Q_i = \sum_{j=1}^{n_i}r_{ij}^2$

The following query string computes $S_i$ and $Q_i$:

In [8]:
summ_table = """
summ_table  AS (
   COUNT( ParticipantBarcode) AS ni,
   SUM( rnkdata ) AS Si,
   SUM( rnkdata * rnkdata ) AS Qi,
FROM (    
      (RANK() OVER (ORDER BY data1 ASC)) + (COUNT(*) OVER ( PARTITION BY CAST(data1 as STRING)) - 1)/2.0 AS rnkdata,
      data2, ParticipantBarcode
   WHERE data2 IN ( SELECT data2 from table_data GROUP BY data2 HAVING count(data2)>{0} )   
""".format( str(MinSampleSize) )

The query above ingnores the categories that have a number of participants smaller or equal than 'MinSampleSize'. Moreover, the gene expression is ranked, assigning average of ranks to the similar values( Finally, The Kruskall-Wallis score ($H$) is computed by the following BigQuery string.

In [9]:
query_hscore = """
    N as Nsamples,        
    (N-1)*( sumSi2overni - (sumSi *sumSi)/N ) / (  sumQi  - (sumSi *sumSi)/N )    AS  Hscore 
      SUM( ni ) As N, 
      SUM( Si ) AS sumSi,
      SUM( Qi ) AS sumQi,
      SUM( Si * Si  / ni ) AS sumSi2overni,
      COUNT ( data2 ) AS Ngroups    
  FROM  summ_table
   Ngroups > 1

sql = ( sql_data + ',\n' + summ_table + query_hscore )
df_hscore = regulome.runQuery ( bqclient, sql, [] , dryRun=False )
 in runQuery ... 
    the results for this query were previously cached 
Ngroups Nsamples Hscore
0 5 513 7.857342

To test our implementation we can use the 'kruskalwallis' function available in python:

In [10]:
CategoryData = []
CategoryNames = [] 

for name, group in new_data.groupby( clinical_feature ) :
    data =  group[ gene_name ].values 
    if ( len( data ) > MinSampleSize ) :
         CategoryData.append( data )
         CategoryNames.append( name )
if len( CategoryData ) > 1 :
    print( mstats.kruskalwallis( *[ mydata for mydata in CategoryData   ] ) )
KruskalResult(statistic=7.8573421918435997, pvalue=0.096945947430261636)
In [ ]: