Regulome Explorer T test for numerical and binary data

Check out more notebooks at our 'Regulome Explorer Repository'!

In this notebook we describe the implementation of a student's t test to compute the significance of associations between a numerical feature (Gene expression, Somatic copy number, etc.) and Somatic mutation data which can be categorized into two groups according to the presence or absence of somatic mutations in a user defined gene. Detail of the test can be found the following link:

To describe the implementation of the test using BigQuery, we will use Gene expresion data of a user defined gene and the precense or absence of somatic mutation of a user defined gene. This data is read from a BigQuery table in the pancancer-atlas dataset.

Authenticate with Google (IMPORTANT)

The first step is to authorize access to BigQuery and the Google Cloud. For more information see 'Quick Start Guide to ISB-CGC' and alternative authentication methods can be found here.

Import Python libraries

In [1]:
from import bigquery
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import mstats
import seaborn as sns
import re_module.bq_functions as regulome

User defined Parameters

The parameters for this experiment are the cancer type, the name of gene for which gene expression data will be obtained, and the clinical feature name. Categorical groups with number of samples smaller than 'MinSampleSize' will be ignored in the test.

In [2]:
cancer_type = 'LGG'
gene_expre = 'DRG2'
gene_mutation = 'TP53'
MinSampleSize = 10

bqclient = bigquery.Client()

Data from BigQuery tables

The first step is to select all participants in the selected study with a least one mutation.

In [3]:
barcode_set = """barcodes AS (
   SELECT Tumor_SampleBarcode AS SampleBarcode   
   FROM `pancancer-atlas.Filtered.MC3_MAF_V5_one_per_tumor_sample`
   WHERE Study = '{0}'       

Gene expression data from the BigQuery: The following query string retrieves the gene expression data of the user specified gene ('gene_expre') from the 'Filtered.EBpp_AdjustPANCAN_IlluminaHiSeq_RNASeqV2_genExp_filtered' table available in pancancer-atlas dataset. The gene expression of a participant is computed as the average gene expression of the tumor samples of the participant. Moreover, we are considering only samples with a least somatic mutation.

In [4]:
query_table1 = """table1 AS (
SELECT Symbol, data, ParticipantBarcode
         Symbol AS symbol, AVG( LOG10( normalized_count + 1 )) AS data, ParticipantBarcode
   FROM  `pancancer-atlas.Filtered.EBpp_AdjustPANCAN_IlluminaHiSeq_RNASeqV2_genExp_filtered` 
   WHERE Study = '{0}' AND Symbol ='{1}' AND normalized_count IS NOT NULL
         AND SampleBarcode  IN (SELECT * FROM barcodes)
         ParticipantBarcode, symbol
""".format(cancer_type, gene_expre )

Somatic mutation data from the BigQuery: The following string query will retrieve a table with patients with at least one Somatic mutation in the user defined gene ('mutation_name'). This information is extracted from the 'pancancer-atlas.Filtered.MC3_MAF_V5_one_per_tumor_sample' table, available in pancancer-atlas dataset. Notice that we only use samples in which FILTER = 'PASS'.

In [5]:
query_table2 = """table2 AS (
      ParticipantBarcode AS ParticipantBarcode
   FROM `pancancer-atlas.Filtered.MC3_MAF_V5_one_per_tumor_sample`
   WHERE Study = '{0}' AND Hugo_Symbol = '{1}'
         AND FILTER = 'PASS'  
   GROUP BY ParticipantBarcode
""".format(cancer_type, gene_mutation )

At this point we can take a look at the combined data (Gene expression and Somatic mutation) by using a simple LEFT JOIN command. Participants with and without somatic mutations are labeled as 'YES' and 'NO' respectively.

In [6]:
sql_data = 'WITH\n'+ barcode_set+','+ query_table1+','+query_table2  

sql = (sql_data + 
SELECT as data1,  
    IF( n2.ParticipantBarcode is null, 'NO', 'YES') as data2,  
    table1 n1  
LEFT JOIN  table2   n2 
ON  n1.ParticipantBarcode = n2.ParticipantBarcode

df_data = regulome.runQuery ( bqclient, sql, [] , dryRun=False )
 in runQuery ... 
    the results for this query were previously cached 
data1 data2 ParticipantBarcode
1 2.986545 NO TCGA-VW-A7QS
2 2.398801 YES TCGA-DB-5280
3 2.513401 YES TCGA-HT-7611
4 3.063967 NO TCGA-DH-5141
5 2.549122 NO TCGA-HT-A616
6 2.769918 NO TCGA-HT-7694
7 2.527377 YES TCGA-CS-6290
8 2.832638 NO TCGA-HW-7495
9 2.373629 YES TCGA-WY-A85C

To visualize the gene expression data in both groups with or without somatic mutation, we can use a 'violinplot' plot:

In [8]:
df_data.rename(columns={ "data1": gene_expre, "data2": gene_mutation }, inplace=True)
sns.violinplot( x=df_data[gene_mutation], y=df_data[gene_expre], palette="Pastel1")
<matplotlib.axes._subplots.AxesSubplot at 0x1a256eb2e8>

BigQuery to Compute statistical association

The T-score (T), assuming the distributions of gene expression with and without mutation have unequal variances, is computed by using the following equation:

$$T = \frac{ \bar{g}_y - \bar{g}_n }{\sqrt{ \frac{s_y^2}{N_y} + \frac{s_n^2}{N_n} } }$$


  • $\bar{g}_y$ and $\bar{g}_n$ are mean gene expression of participants with and without mutation.
  • $N_y$ and $N_n$ are the number of participants in the group with and without mutation.
  • $s_y^2$ and $s_n^2$ are the variance of gene expression for the participants with and without mutation, respectively.

Since the Somatic mutation table contains information of positive somatic mutation only, the averages and variances needed to compute the $T$ score are computed as a function $S_y=\sum_{i=1}^{N_y}{g_i}$ and $Q_y=\sum_{i=1}^{N_y}{g_i^2}$, the summs over the gene expression and squared gene expression for articipants with somatic mutation. The following query string computes $S_y$ and $Q_y$:

In [9]:
summ_table = """
summ_table AS (
   COUNT( n1.ParticipantBarcode) as Ny,
   SUM( )  as Sy,
   SUM( * ) as Qy
   table1 AS n1
   table2 AS n2
   n1.ParticipantBarcode = n2.ParticipantBarcode

After computing $S_y$ and $Q_y$ we can compute the mean and the variance as : $$\bar{g}_y =\frac{S_y}{N_y}$$ $$s_y^2 = (N_y-1)^{-1} \left[ Q_y - \frac{S_y^2}{ N_y} \right]$$

To compute $S_n$ and $Q_n$, we first compute the sums of the gene expression and squeared gene expression using all the samples and then substract $S_y$ and $Q_y$. The following query uses this approach to compute the necessary variances and means, and then computes $T$ score. The $T$ score is only computed if the variances are greater than zero and if the number of participants in each group is greater than a user defined threshold.

In [10]:
query_tscore = """
    Ny, Nn,
    avg_y, avg_n,
    ABS(avg_y - avg_n)/ SQRT( var_y /Ny + var_n/Nn )  as tscore
       Sy / Ny as avg_y,
       ( Qy - Sy*Sy/Ny )/(Ny - 1) as var_y, 
       Nt - Ny as Nn,
       (St - Sy)/(Nt - Ny) as avg_n,
       (Qt - Qy - (St-Sy)*(St-Sy)/(Nt - Ny) )/(Nt - Ny -1 ) as var_n
FROM  summ_table as n1
LEFT JOIN ( SELECT Symbol, COUNT( ParticipantBarcode ) as Nt, SUM( data ) as St, SUM( data*data ) as Qt
            FROM table1 GROUP BY Symbol
           ) as n2
ON n1.Symbol = n2.Symbol      
   Ny > {0} AND Nn > {0} AND var_y > 0 and var_n > 0

sql = ( sql_data  + ',\n' + summ_table + query_tscore )
df_tscore = regulome.runQuery ( bqclient, sql, [] , dryRun=False )
 in runQuery ... 
    the results for this query were previously cached 
Ny Nn avg_y avg_n tscore
0 248 261 2.495134 2.894686 24.426213

To test our implementation we can use the 'ttest_ins' function available in python:

In [12]:
print( df_data.groupby(gene_mutation).agg(['mean', 'count']) )
Set1 = df_data[df_data[gene_mutation]=='NO']
Set2 = df_data[df_data[gene_mutation]=='YES']
print( stats.ttest_ind(Set1[gene_expre], Set2[gene_expre], equal_var=False ) )
          mean count
NO    2.894686   261
YES   2.495134   248
Ttest_indResult(statistic=24.426213134587776, pvalue=4.38554327085175e-84)
In [ ]: