This notebook uses the Clustergrammer-Widget to visualize gene expression data from the Cancer Cell Line Encyclopedia (Broad Institute CCLE). The CCLE project measured genetic data from over 1,000 cancer cell lines. We start by importing the required libraries and initializing the Clustergrammer Network object:
from clustergrammer_widget import *
import pandas as pd
import numpy as np
net = Network(clustergrammer_widget)
We are using a slightly reformatted version of the CCLE gene expression data with modified formatting of the cell-line metadata (categories). You can see below how categorical cell-line information (e.g. tissue) is encoded in the column tuples. The matrix has 18,874 rows (genes) and 1,037 columns (cell lines).
net.load_file('../original_data/CCLE.txt')
ccle = net.export_df()
print(ccle.shape)
ccle.head()
(18874, 1037)
Gene | (cell line: LN18, tissue: central_nervous_system, histology: glioma, sub-histology: astrocytoma_Grade_IV, gender: M) | (cell line: 769P, tissue: kidney, histology: carcinoma, sub-histology: clear_cell_renal_cell_carcinoma, gender: F) | (cell line: 786O, tissue: kidney, histology: carcinoma, sub-histology: clear_cell_renal_cell_carcinoma, gender: M) | (cell line: CAOV3, tissue: ovary, histology: carcinoma, sub-histology: adenocarcinoma, gender: F) | (cell line: HEPG2, tissue: liver, histology: carcinoma, sub-histology: hepatocellular_carcinoma, gender: M) | (cell line: MOLT4, tissue: haematopoietic_and_lymphoid_tissue, histology: lymphoid_neoplasm, sub-histology: acute_lymphoblastic_T_cell_leukaemia, gender: M) | (cell line: NCIH524, tissue: lung, histology: carcinoma, sub-histology: small_cell_carcinoma, gender: M) | (cell line: NCIH209, tissue: lung, histology: carcinoma, sub-histology: small_cell_carcinoma, gender: M) | (cell line: MIAPACA2, tissue: pancreas, histology: carcinoma, sub-histology: ductal_carcinoma, gender: M) | (cell line: MCAS, tissue: ovary, histology: carcinoma, sub-histology: adenocarcinoma, gender: F) | ... | (cell line: SLR21, tissue: kidney, histology: carcinoma, sub-histology: renal_cell_carcinoma, gender: NA) | (cell line: LNZ308, tissue: central_nervous_system, histology: glioma, sub-histology: astrocytoma_Grade_IV, gender: NA) | (cell line: LN340, tissue: central_nervous_system, histology: glioma, sub-histology: astrocytoma_Grade_IV, gender: NA) | (cell line: HCC827GR5, tissue: lung, histology: carcinoma, sub-histology: adenocarcinoma, gender: NA) | (cell line: SLR20, tissue: kidney, histology: carcinoma, sub-histology: renal_cell_carcinoma, gender: NA) | (cell line: HK2, tissue: kidney, histology: other, sub-histology: immortalized_epithelial, gender: NA) | (cell line: EW8, tissue: bone, histology: Ewings_sarcoma-peripheral_primitive_neuroectodermal_tumour, sub-histology: NS, gender: NA) | (cell line: UOK101, tissue: kidney, histology: carcinoma, sub-histology: clear_cell_renal_cell_carcinoma, gender: NA) | (cell line: JHESOAD1, tissue: oesophagus, histology: carcinoma, sub-histology: barrett_associated_adenocarcinoma, gender: NA) | (cell line: CH157MN, tissue: central_nervous_system, histology: meningioma, sub-histology: NS, gender: NA) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LOC100009676 | 5.987545 | 5.444892 | 5.838828 | 6.074743 | 5.788600 | 5.459675 | 5.755560 | 7.190493 | 5.449818 | 5.801820 | ... | 5.473156 | 5.517208 | 5.858379 | 5.196033 | 5.831437 | 5.362021 | 5.799747 | 5.865606 | 5.463812 | 5.720593 |
AKT3 | 6.230233 | 7.544216 | 7.328450 | 4.270720 | 4.478293 | 6.212102 | 7.562398 | 8.642669 | 5.556191 | 6.808673 | ... | 6.375324 | 6.119814 | 6.561409 | 4.521773 | 6.830904 | 7.031690 | 4.881235 | 6.914640 | 5.313795 | 5.757825 |
MED6 | 9.363550 | 8.715909 | 8.410834 | 9.845271 | 9.761157 | 10.532820 | 10.393960 | 9.478429 | 9.112954 | 9.815614 | ... | 8.849773 | 8.767192 | 8.521635 | 8.224544 | 9.325785 | 8.362727 | 8.990524 | 8.958629 | 9.748100 | 9.758431 |
NR2E3 | 3.803069 | 4.173643 | 3.776557 | 3.934091 | 3.822202 | 3.949198 | 3.807546 | 3.930186 | 4.161937 | 4.028581 | ... | 3.717506 | 3.977377 | 3.659459 | 3.933996 | 4.515748 | 4.434658 | 4.127832 | 3.942736 | 4.062648 | 4.074257 |
NAALAD2 | 3.586430 | 3.663081 | 4.047007 | 3.817250 | 6.444302 | 4.081071 | 5.462774 | 4.252446 | 3.932451 | 3.835827 | ... | 3.520843 | 4.036661 | 4.168351 | 3.535915 | 4.445632 | 3.622032 | 5.436580 | 3.666404 | 3.556565 | 3.728828 |
5 rows × 1037 columns
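Each column label above bundles several 'title: value' strings into one tuple. As a minimal sketch of how such a label could be unpacked into a dictionary of categories, assuming every tuple element is a string of that form (the helper name parse_col_label is hypothetical and not part of Clustergrammer):

```python
def parse_col_label(label):
    """Split a tuple of 'title: value' strings into a dict (hypothetical helper)."""
    parsed = {}
    for item in label:
        # Split on the first ': ' only, so values containing colons survive.
        title, _, value = item.partition(': ')
        parsed[title] = value
    return parsed


# Label copied from the first column of the table above.
example = (
    'cell line: LN18',
    'tissue: central_nervous_system',
    'histology: glioma',
    'sub-histology: astrocytoma_Grade_IV',
    'gender: M',
)

info = parse_col_label(example)
print(info['tissue'])
```

A dictionary like this makes it easy to group or filter columns by any single category without string matching on the whole label.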
The coarse-grained overview of the CCLE gene expression data shows that cell lines cluster according to their tissue, and using Enrichr we verified that clusters of differentially expressed genes are informative about their associated tissues. Here we look at expression in specific tissues and see whether we can identify sub-clusters within tissues (e.g. based on their histology).
We will do this using Pandas to filter for specific cell lines of interest.
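As a rough illustration of tissue filtering done directly in Pandas, the sketch below keeps only the columns whose tuple label contains a given category string. The small DataFrame is made-up toy data (not CCLE measurements) and the helper select_columns is hypothetical; it only assumes that column labels are tuples of 'title: value' strings.

```python
import pandas as pd

# Toy stand-in for the exported CCLE DataFrame: columns are tuples of
# 'title: value' strings, rows are genes. All values are made up.
cols = [
    ('cell line: A', 'tissue: bone'),
    ('cell line: B', 'tissue: lung'),
    ('cell line: C', 'tissue: bone'),
]
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=['GENE1', 'GENE2'],
    columns=cols,
)


def select_columns(df, category):
    """Keep columns whose tuple label contains `category` (hypothetical helper)."""
    # Iterating over the columns yields one tuple per column.
    mask = [category in label for label in df.columns]
    return df.loc[:, mask]


bone = select_columns(toy, 'tissue: bone')
print(bone.shape)
```

A boolean mask is used so the selection works the same way whether Pandas stores the tuple labels as a MultiIndex or as a flat index of tuples.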
Next, we will visualize gene expression across bone tissue. We will again process the data to highlight genes with variable expression across bone cancers, which will help us characterize subsets of bone cancers.
# filter for columns with category 1 bone
net.filter_cat('col', 1, 'tissue: bone')
net.filter_N_top('row', 500, 'var')
net.normalize(axis='row', norm_type='zscore')
net.dat['mat'].shape
(500, 29)
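To make the two preprocessing steps concrete, here is a plain Pandas sketch of variance-based row filtering followed by row-wise Z-scoring. It is an approximation for illustration only: the data are random, the function names are hypothetical, and Clustergrammer's own filter_N_top and normalize may differ in details such as how rows are ranked or which standard-deviation convention is used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Made-up data: 6 genes x 5 samples, with a different spread per gene.
scales = np.array([0.1, 0.5, 1.0, 2.0, 4.0, 8.0])
data = rng.normal(size=(6, 5)) * scales[:, None]
df = pd.DataFrame(data, index=[f'G{i}' for i in range(6)])


def top_variance_rows(df, n_top):
    """Keep the n_top rows with the largest variance across columns."""
    keep = df.var(axis=1).nlargest(n_top).index
    return df.loc[keep]


def zscore_rows(df):
    """Subtract each row's mean and divide by its sample standard deviation."""
    centered = df.sub(df.mean(axis=1), axis=0)
    return centered.div(df.std(axis=1), axis=0)


filtered = zscore_rows(top_variance_rows(df, 3))
print(filtered.shape)
```

After Z-scoring, every remaining gene has mean 0 and unit standard deviation across samples, so the heatmap colors reflect relative rather than absolute expression.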
net.cluster(enrichrgram=True)
net.widget()
Osteosarcoma, giant cell tumor, and chondrosarcoma cluster separately from Ewing's sarcoma/peripheral primitive neuroectodermal tumors.
We can see that the up-regulated genes in chondrosarcomas and osteosarcomas are enriched in extracellular-matrix-related terms, which makes sense for bone cancers. These extracellular-matrix terms do not 'point' to the up-regulated genes in Ewing's sarcoma, so we use the row dendrogram crop button to investigate these genes further. We see that these genes are enriched for behavior and neuronal-related terms, which seems to agree with the neuroectodermal association of this tumor type.
Here we will visualize the expression in lung tissue and we will process the data such that we highlight the variability of gene expression across all lung cell lines. This will help us understand the differences between the lung cell lines and hopefully identify subpopulations.
There are 187 lung tissue cell lines. We will start by selecting the lung tissue cell lines, then filter for the top 500 differentially expressed genes (ranked by variance), and finally normalize each gene across all cell lines to easily compare differential expression across the cell lines.
net.load_df(ccle)
net.filter_cat('col', 1, 'tissue: lung')
Note that we can reuse the same net object as before, since loading a new DataFrame clears out the old data.
net.filter_N_top('row', 500, 'var')
net.normalize(axis='row', norm_type='zscore')
net.dat['mat'].shape
(500, 187)
net.cluster(enrichrgram=True)
net.widget()
Almost all lung cell lines have the same histology, carcinoma, but they have several sub-histologies. We see that cell lines cluster according to these sub-histologies, and using Enrichrgram we can see the biological processes occurring in these clusters.
If we do GO Biological Process enrichment we get terms related to regulation of endopeptidase activity, response to acid and inorganic substances, extracellular matrix organization, etc. These terms point to up-regulated genes in adenocarcinomas and large cell carcinomas (NSCLC), but not to the genes up-regulated in small cell carcinomas (SCLC).
To investigate the function of these up-regulated SCLC genes we can enrich for just this cluster of up-regulated genes. We see with the same enrichment analysis that the most specific functions are commonly neuron related. SCLC are known to display characteristics of neuronal cells (Onganer et al. 2005).
We would like to get an overview of the entire CCLE gene expression data, but the dataset is too large to visualize directly using Clustergrammer. Also, we are probably not interested in the expression data of all 18,874 genes, but only in a subset of genes, e.g. those that are 'differentially expressed' across some subset of tissues.
We will use downsampling and filtering to get a more manageable dataset, which we can visualize using Clustergrammer. First, we will use K-means to downsample the 1,037 cell lines to 100 clusters, and then we will filter for the top 2,000 differentially expressed genes.
We'll do the downsampling first and save the result to ccle_ds:
net.load_df(ccle)
net.downsample(ds_type='kmeans', axis='col', num_samples=100)
ccle_ds = net.export_df()
ccle_ds.shape
(18874, 100)
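The idea behind K-means downsampling of columns can be sketched with plain Lloyd's algorithm in NumPy: similar columns are grouped, and each group is replaced by its mean column. This is a toy illustration on made-up data with a hypothetical function name; Clustergrammer's downsample method may use a different K-means variant and initialization.

```python
import numpy as np


def kmeans_downsample_cols(mat, k, n_iter=20, seed=0):
    """Cluster the columns of `mat` with Lloyd's algorithm (illustrative only).

    Returns (centroids, labels): centroids has shape (n_rows, k) and labels
    assigns each original column to one of the k clusters.
    """
    rng = np.random.default_rng(seed)
    points = mat.T  # one point per column

    def assign(centers):
        # Squared Euclidean distance from every point to every centroid.
        # Fine for toy data; this full broadcast would be memory-hungry
        # on a matrix the size of the real CCLE data.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    # Initialize centroids from k distinct columns.
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(centers)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                centers[j] = members.mean(axis=0)
    return centers.T, assign(centers)


# Made-up data: 4 rows, 12 columns forming two well-separated groups.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=0.1, size=(4, 6))
group_b = rng.normal(loc=10.0, scale=0.1, size=(4, 6))
toy = np.hstack([group_a, group_b])

centroids, labels = kmeans_downsample_cols(toy, k=2)
sizes = np.bincount(labels, minlength=2)
print(centroids.shape, sizes)
```

The cluster sizes computed here correspond to the kind of per-cluster count that the downsampled column labels keep track of.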
Now our downsampled data, ccle_ds, has only 100 columns. We have also dropped some column categories and are only keeping track of the majority categories (e.g. the majority tissue) in each cell-line cluster and the number of cell lines in each cluster. We can see how this is encoded in the column names as tuples below:
ccle_ds.head()
Gene | (Cluster: cluster-0, Majority-tissue: haematopoietic_and_lymphoid_tissue, Majority-histology: lymphoid_neoplasm, Majority-sub-histology: mycosis_fungoides-Sezary_syndrome, Majority-gender: M, number in clust: 2) | (Cluster: cluster-1, Majority-tissue: lung, Majority-histology: carcinoma, Majority-sub-histology: NS, Majority-gender: F, number in clust: 2) | (Cluster: cluster-2, Majority-tissue: upper_aerodigestive_tract, Majority-histology: carcinoma, Majority-sub-histology: squamous_cell_carcinoma, Majority-gender: M, number in clust: 47) | (Cluster: cluster-3, Majority-tissue: autonomic_ganglia, Majority-histology: neuroblastoma, Majority-sub-histology: NS, Majority-gender: M, number in clust: 11) | (Cluster: cluster-4, Majority-tissue: skin, Majority-histology: malignant_melanoma, Majority-sub-histology: NS, Majority-gender: M, number in clust: 50) | (Cluster: cluster-5, Majority-tissue: lung, Majority-histology: carcinoma, Majority-sub-histology: NS, Majority-gender: F, number in clust: 26) | (Cluster: cluster-6, Majority-tissue: haematopoietic_and_lymphoid_tissue, Majority-histology: lymphoid_neoplasm, Majority-sub-histology: diffuse_large_B_cell_lymphoma, Majority-gender: M, number in clust: 31) | (Cluster: cluster-7, Majority-tissue: large_intestine, Majority-histology: carcinoma, Majority-sub-histology: adenocarcinoma, Majority-gender: M, number in clust: 2) | (Cluster: cluster-8, Majority-tissue: lung, Majority-histology: carcinoma, Majority-sub-histology: NS, Majority-gender: M, number in clust: 15) | (Cluster: cluster-9, Majority-tissue: liver, Majority-histology: carcinoma, Majority-sub-histology: hepatocellular_carcinoma, Majority-gender: M, number in clust: 16) | ... | (Cluster: cluster-90, Majority-tissue: stomach, Majority-histology: carcinoma, Majority-sub-histology: tubular_adenocarcinoma, Majority-gender: M, number in clust: 1) | (Cluster: cluster-91, Majority-tissue: breast, Majority-histology: carcinoma, Majority-sub-histology: NS, Majority-gender: F, number in clust: 1) | (Cluster: cluster-92, Majority-tissue: haematopoietic_and_lymphoid_tissue, Majority-histology: lymphoid_neoplasm, Majority-sub-histology: Hodgkin_lymphoma, Majority-gender: M, number in clust: 1) | (Cluster: cluster-93, Majority-tissue: central_nervous_system, Majority-histology: glioma, Majority-sub-histology: astrocytoma_Grade_IV, Majority-gender: M, number in clust: 17) | (Cluster: cluster-94, Majority-tissue: autonomic_ganglia, Majority-histology: neuroblastoma, Majority-sub-histology: NS, Majority-gender: F, number in clust: 2) | (Cluster: cluster-95, Majority-tissue: thyroid, Majority-histology: carcinoma, Majority-sub-histology: papillary_carcinoma, Majority-gender: F, number in clust: 1) | (Cluster: cluster-96, Majority-tissue: stomach, Majority-histology: carcinoma, Majority-sub-histology: adenocarcinoma, Majority-gender: M, number in clust: 1) | (Cluster: cluster-97, Majority-tissue: oesophagus, Majority-histology: carcinoma, Majority-sub-histology: squamous_cell_carcinoma, Majority-gender: F, number in clust: 1) | (Cluster: cluster-98, Majority-tissue: kidney, Majority-histology: carcinoma, Majority-sub-histology: clear_cell_renal_cell_carcinoma, Majority-gender: NA, number in clust: 2) | (Cluster: cluster-99, Majority-tissue: liver, Majority-histology: carcinoma, Majority-sub-histology: hepatocellular_carcinoma, Majority-gender: M, number in clust: 8) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LOC100009676 | 5.665593 | 5.284315 | 5.673244 | 5.363738 | 6.057420 | 5.840425 | 5.841230 | 5.685825 | 5.680019 | 5.610437 | ... | 6.599907 | 5.425919 | 6.363237 | 5.773302 | 4.631328 | 5.855566 | 6.583266 | 5.077296 | 5.626497 | 5.672558 |
AKT3 | 6.435427 | 6.952967 | 5.605452 | 8.122674 | 7.267956 | 6.011819 | 5.094038 | 4.558152 | 7.085084 | 6.115605 | ... | 7.688763 | 4.401639 | 7.052142 | 6.299563 | 8.859103 | 7.864197 | 4.341758 | 5.051193 | 5.370956 | 4.430994 |
MED6 | 9.518722 | 8.762060 | 9.502653 | 9.341522 | 8.839631 | 9.507497 | 9.699576 | 9.673724 | 8.788507 | 8.855337 | ... | 9.539184 | 8.672265 | 9.594567 | 8.579562 | 9.669472 | 9.377676 | 9.125494 | 10.045597 | 8.782129 | 9.287517 |
NR2E3 | 3.989407 | 3.901817 | 4.051622 | 3.875381 | 3.804977 | 3.931573 | 3.993905 | 3.990742 | 3.864638 | 3.927532 | ... | 4.132590 | 4.118239 | 3.706916 | 3.829621 | 3.748747 | 4.071493 | 3.678901 | 3.869146 | 4.106636 | 3.887731 |
NAALAD2 | 4.389125 | 4.678008 | 3.844582 | 7.318395 | 4.123212 | 4.142046 | 3.873645 | 3.872935 | 3.751903 | 4.138593 | ... | 3.763763 | 3.873275 | 3.599781 | 3.703137 | 6.849774 | 3.681915 | 3.948763 | 3.678124 | 3.535314 | 6.393138 |
5 rows × 100 columns
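The 'Majority-tissue' and 'number in clust' entries in these labels summarize the cell lines that fell into each K-means cluster. A minimal sketch of how such a summary could be derived for one cluster, using a made-up membership list and a hypothetical helper (this is not Clustergrammer's code):

```python
from collections import Counter


def summarize_cluster(member_tissues):
    """Return (majority tissue, cluster size) for one cluster (hypothetical helper)."""
    counts = Counter(member_tissues)
    # most_common(1) returns the most frequent value; ties are resolved
    # in favor of the value that was encountered first.
    majority, _ = counts.most_common(1)[0]
    return majority, len(member_tissues)


# Made-up cluster membership, for illustration only.
tissues = ['lung', 'lung', 'kidney', 'lung', 'liver']
majority, size = summarize_cluster(tissues)
label = ('Majority-tissue: ' + majority, 'number in clust: ' + str(size))
print(label)
```

Note that a majority label hides minority members: in this toy cluster two of the five cell lines are not lung, which is worth keeping in mind when reading the colored category bands.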
Next, we will filter out genes based on variance -- we will only keep the top 2,000 genes based on their variance. After this, we will Z-score normalize the genes across all cell-line clusters to more easily compare their differential expression. We will perform these operations within the net object:
net.load_df(ccle_ds)
net.filter_N_top('row', 2000, rank_type='var')
net.normalize(axis='row', norm_type='zscore', keep_orig=True)
print('now our matrix has 2000 rows and 100 columns')
net.dat['mat'].shape
now our matrix has 2000 rows and 100 columns
(2000, 100)
Finally, we will use Clustergrammer to hierarchically cluster and visualize the downsampled and filtered dataset.
net.cluster(enrichrgram=True)
net.widget()
We can immediately see that cell-line clusters (referred to as cell lines) cluster based on their tissue (specifically the majority tissue in the cluster) -- note the colored bands under the column labels. The second category, 'number in clust', gives the number of cell lines represented by the K-means cluster -- the darker the category, the more cell lines in the cluster.
From the column dendrogram (gray trapezoids under the heatmap) we can see that cell lines group into six large clusters. On the right we can see that 'Haematopoietic and Lymphoid Tissue' forms a big cluster with a set of highly expressed genes on the bottom right.
This overview shows us that we have four big gene clusters and six cell line 'clusters' (made up of K-means clusters). We can zoom into the clusters to find out which genes are differentially expressed, and mouse over specific gene names to bring up their full names and descriptions (via Harmonizome).
Clustergrammer leverages other Ma'ayan lab web-tools (e.g. Enrichr) to facilitate the exploration of biological gene-level data. We can use the Enrichrgram functionality to find biological information specific to our genes of interest. We will do this by first selecting our cluster of interest, the up-regulated genes in Haematopoietic/Lymphoid Tissue, using the dendrogram crop button (the triangle pointing at the dendrogram cluster). Once we have filtered for only these genes we can either export our genes to Enrichr or import Enrichr results into our visualization.
We'll enrich for Gene Ontology Biological Process and we see that we get a lot of immune-related processes that 'point' right at the cluster of up-regulated genes, which makes sense. We can also enrich for upstream transcription factors that might be responsible for the expression of these genes. We see that the transcription factor IRF1 targets a subset of up-regulated genes and is known to be involved in immune response. We also see that the transcription factor MECOM targets these up-regulated genes; this transcription factor is known to be involved in hematopoiesis.
We can also use Enrichr to investigate other subsets.
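For intuition about what an enrichment result means, the sketch below computes a one-sided hypergeometric p-value for the overlap between a query gene list and one annotated term. This is only the basic over-representation idea under made-up gene sets; it is not Enrichr's implementation, and Enrichr's reported scores may be computed differently.

```python
from math import comb


def hypergeom_pvalue(n_background, n_term, n_query, n_overlap):
    """P(overlap >= n_overlap) when n_query genes are drawn without replacement."""
    total = comb(n_background, n_query)
    upper = min(n_term, n_query)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Made-up gene sets, for illustration only.
background = {f'gene{i}' for i in range(20)}
term_genes = {f'gene{i}' for i in range(5)}        # genes annotated with a term
query = {'gene0', 'gene1', 'gene2', 'gene10'}      # e.g. a cropped gene cluster

overlap = query & term_genes
p_value = hypergeom_pvalue(len(background), len(term_genes), len(query), len(overlap))
print(len(overlap), round(p_value, 4))
```

A small p-value indicates that the query cluster shares more genes with the term than would be expected from a random draw of the same size from the background.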