Check out more notebooks at our Community Notebooks Repository!
This notebook demostrates how to use the kmeans_current
function available in the isb-cgc-bq.functions
dataset for clustering numerical data. This notebook uses Breast cancer gene expression data of two user defined genes in the BigQuery table isb-cgc.tcga_201607_beta.mRNA_UNC_RSEM
, but this example should also work on other numerical BigQuery table.
#pip install --upgrade google-cloud-bigquery
from google.cloud import bigquery
from google.colab import auth
import google.auth
import pandas as pd
import seaborn as sns
Before using BigQuery we need to get authorized for access to BigQuery and the Google Cloud. For more information see 'Quick Start Guide to ISB-CGC' and alternative authentication methods can be found here.
auth.authenticate_user()
my_project_id = "my_google_project" # write your project id here
bqclient = bigquery.Client( my_project_id )
In this experiment we use the gene expression of two genes (gene1
and gene2
) in Breast cancer. The clustering of the data is performed by using the k-means
method implemented as BigQuery function, isb-cgc-bq.function.kmeans_current
. In addition to the gene expression data, the function requires two additional parameteres, the number of iterations and the number of cluster.
gene1 = 'ESR1'
gene2 = 'EGFR'
study = 'BRCA'
niterations = 200
nclusters = 2
The following code generates a sub query to access the gene expression data of a gene. This subquery requires the table name({0}
), the cancer type ({1}
), and the gene symbol ({2}
).
gequery = """{0} AS (
SELECT
AliquotBarcode AS barcode,
AVG(LOG(normalized_count+1, 2)) AS count
FROM
`isb-cgc.tcga_201607_beta.mRNA_UNC_RSEM`
WHERE
Study = '{1}'
AND SampleTypeLetterCode = 'TP'
AND HGNC_gene_symbol = '{2}'
AND normalized_count >= 0
GROUP BY
AliquotBarcode, gene_id )
"""
The folowing query first combines the gene expression of the two genes using the JOIN
command, and then calls the isb-cgc-bq.functions.kmeans_current
on the gene expression (m1.count
and m2.count
) data.
clustering = """results AS (
SELECT
ARRAY_AGG(m1.count ORDER BY m1.barcode) AS esr1_a,
ARRAY_AGG(m2.count ORDER BY m1.barcode) AS egfr_a,
`isb-cgc-bq.functions.kmeans_current`( ARRAY_AGG( STRUCT([m1.count,m2.count]) ORDER BY m1.barcode ), {0},{1} ) AS cluster
FROM
gene1 AS m1
JOIN
gene2 AS m2
ON
m1.barcode = m2.barcode
)""".format( str(niterations), str(nclusters) )
Finally we generate the query that will perform the clustering on genexpression data.
query = "WITH " + \
gequery.format('gene1',study,gene1) + "," + \
gequery.format('gene2',study,gene2) + "," + \
clustering + """
SELECT cluster,
esr1_a[OFFSET(myoffset)] AS ESR1,
egfr_a[OFFSET(myoffset)] AS EGFR
FROM results,
results.cluster as cluster WITH OFFSET AS myoffset
"""
print(query)
The following commands run the generated query in google BigQuery. The results are saved in a pandas dataframe where the first column represents the cluster id and the second and third columns represent the gene expression of gene1 and gene2 respectively.
try:
query_job = bqclient.query ( query )
except:
print ( " FATAL ERROR: query execution failed " )
df = query_job.to_dataframe()
df
Finally we can generate a scatter plot coloring the gene expression data by the cluster id.
sns.lmplot( x="EGFR", y="ESR1", data=df, fit_reg=False, hue='cluster')
You can experiment with different numbers of clusters and genes.
Check out https://github.com/isb-cgc/Community-Notebooks for more examples!