3.0 2,700 PBMC scRNA-seq¶

Single cell RNA-seq (scRNA-seq) is a powerful method to interrogate gene expression across thousands of single cells. This method provides thousands of measurements (single cells) across thousands of dimensions (genes). This notebook uses Clustergrammer2 to interactively explore an example dataset measuring the gene expression of 2,700 PBMCs obtained from 10X Genomics. Bulg gene expression signatures of cell types from CIBERSORT were used to obtain a tentative cell type for each cell.

In [1]:

from clustergrammer2 import net
df = {}

>> clustergrammer2 backend version 0.15.0 -- hdbscan

In [2]:

from sklearn.metrics import f1_score
import pandas as pd
import numpy as np
from copy import deepcopy

import matplotlib.pyplot as plt
%matplotlib inline 

import warnings
warnings.filterwarnings('ignore')

In [3]:

def calc_mean_var_disp(df_inst):
    mean_arr = []
    var_arr = []
    mean_names = []
    for inst_gene in df_inst.index.tolist():
        mean_arr.append( df_inst.loc[inst_gene].mean() )
        var_arr.append(df_inst.loc[inst_gene].var())
        mean_names.append(inst_gene)

    ser_mean = pd.Series(data=mean_arr, index=mean_names)
    ser_var = pd.Series(data=var_arr, index=mean_names)    
    return ser_mean, ser_var

In [4]:

def cell_umi_count(df):
    sum_arr = []
    sum_names = []
    for inst_cell in df:
        sum_arr.append( df[inst_cell].sum() )
        sum_names.append(inst_cell)
    
    ser_sum = pd.Series(data=sum_arr, index=sum_names)
    return ser_sum

Load Data¶

In [5]:

df = net.load_gene_exp_to_df('../data/pbmc3k_filtered_gene_bc_matrices/hg19/')
df.shape

Out[5]:

(32738, 2700)

Remove Ribosomal and Mitochondrial Genes¶

In [6]:

all_genes = df.index.tolist()
print(len(all_genes))
keep_genes = [x for x in all_genes if 'RPL' not in x]
keep_genes = [x for x in keep_genes if 'RPS' not in x]
print(len(keep_genes))

df = df.loc[keep_genes]
df.shape

# Removing Mitochondrial Genes
list_mito_genes = ['MTRNR2L11', 'MTRF1', 'MTRNR2L12', 'MTRNR2L13', 'MTRF1L', 'MTRNR2L6', 'MTRNR2L7',
                'MTRNR2L10', 'MTRNR2L8', 'MTRNR2L5', 'MTRNR2L1', 'MTRNR2L3', 'MTRNR2L4']

all_genes = df.index.tolist()
mito_genes = [x for x in all_genes if 'MT-' == x[:3] or 
             x.split('_')[0] in list_mito_genes]
print(mito_genes)

keep_genes = [x for x in all_genes if x not in mito_genes]
df = df.loc[keep_genes]

32738
32546
['MTRNR2L11', 'MTRNR2L12', 'MTRNR2L13', 'MTRF1L', 'MTRNR2L6', 'MTRNR2L10', 'MTRNR2L7', 'MTRNR2L5', 'MTRNR2L8', 'MTRF1', 'MTRNR2L4', 'MTRNR2L1', 'MTRNR2L3', 'MT-ND1', 'MT-ND2', 'MT-CO1', 'MT-CO2', 'MT-ATP8', 'MT-ATP6', 'MT-CO3', 'MT-ND3', 'MT-ND4L', 'MT-ND4', 'MT-ND5', 'MT-ND6', 'MT-CYB']

Keep top 5K Expressing Genes¶

In [7]:

ser_mean, ser_var = calc_mean_var_disp(df)

num_keep_mean = 5000
num_top_var = 250

# filter for top expressing genes
keep_mean = ser_mean.sort_values(ascending=False)[:num_keep_mean].index.tolist()

df = df.loc[keep_mean]
df.shape

Out[7]:

(5000, 2700)

Find top 250 Variable Genes¶

In [8]:

ser_keep_var = ser_var[keep_mean]
# filter for top variance based
keep_var = ser_keep_var.sort_values(ascending=False).index.tolist()[:num_top_var]
len(keep_var)

Out[8]:

UMI Normalize GEX Data¶

In [9]:

%%time
ser_sum = cell_umi_count(df)
df = df.div(ser_sum)
print(df.shape)
print(df.sum().head())

(5000, 2700)
AAACATACAACCAC    1.0
AAACATTGAGCTAC    1.0
AAACATTGATCAGC    1.0
AAACCGTGCTTCCG    1.0
AAACCGTGTATGCG    1.0
dtype: float64
CPU times: user 967 ms, sys: 247 ms, total: 1.21 s
Wall time: 1.03 s

Find top expressing genes¶

In [10]:

ser_keep_var = ser_var[keep_mean]
# filter for top variance based
keep_var = ser_keep_var.sort_values(ascending=False).index.tolist()[:num_top_var]

ArcSinh Transform and Z-score GEX Data¶

In [11]:

# ArcSinh transform
df = np.arcsinh(df/5)

# Z-score genes
net.load_df(df)
net.normalize(axis='row', norm_type='zscore')

# round to two decimal points
df = net.export_df().round(2)

print(df.shape)

(5000, 2700)

In [12]:

# df.columns = [(x, 'Cell Type: Unknown') for x in df.columns.tolist()]

Unlabeled Cells¶

In [13]:

net.load_df(df.loc[keep_var])
net.clip(lower=-5, upper=5)
# net.manual_category(col='Cell Type')
net.widget()

CGM2(network='{"row_nodes": [{"name": "FTL", "ini": 250, "clust": 236, "rank": 245, "rankvar": 247}, {"name": …

In [14]:

# man_cat = net.get_manual_category('col', 'Cell Type')
# man_cat['Cell Type'].value_counts()

Load CIBERSORT gene sigantures¶

In [15]:

net.load_file('../data/cell_type_signatures/nm3337_narrow_cell_type_sigs.txt')
net.normalize(axis='row', norm_type='zscore')
df_sig = net.export_df()
print(df_sig.shape)

rows = df_sig.index.tolist()
new_rows = [x.split('_')[0] for x in rows]
df_sig.index = new_rows

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-15-d61724600f92> in <module>()
----> 1 net.load_file('../data/cell_type_signatures/nm3337_narrow_cell_type_sigs.txt')
      2 net.normalize(axis='row', norm_type='zscore')
      3 df_sig = net.export_df()
      4 print(df_sig.shape)
      5 

~/Documents/clustergrammer2/clustergrammer2/clustergrammer_fun/__init__.py in load_file(self, filename)
     53     Load TSV file.
     54     '''
---> 55     load_data.load_file(self, filename)
     56 
     57   def load_file_as_string(self, file_string, filename=''):

~/Documents/clustergrammer2/clustergrammer2/clustergrammer_fun/load_data.py in load_file(net, filename)
     23   f.close()
     24 
---> 25   load_file_as_string(net, file_string, filename)
     26 
     27 def load_file_as_string(net, file_string, filename=''):

~/Documents/clustergrammer2/clustergrammer2/clustergrammer_fun/load_data.py in load_file_as_string(net, file_string, filename)
     41     filename = filename.split('/')[-1]
     42 
---> 43   net.load_tsv_to_net(buff, filename)
     44 
     45 def load_stdin(net):

~/Documents/clustergrammer2/clustergrammer2/clustergrammer_fun/__init__.py in load_tsv_to_net(self, file_buffer, filename)
     73     be possible to load data without having to read from a file.
     74     '''
---> 75     load_data.load_tsv_to_net(self, file_buffer, filename)
     76 
     77   def load_vect_post_to_net(self, vect_post):

~/Documents/clustergrammer2/clustergrammer2/clustergrammer_fun/load_data.py in load_tsv_to_net(net, file_buffer, filename)
     69   df = proc_df_labels.main(df)
     70 
---> 71   net.df_to_dat(df, True)
     72   net.dat['filename'] = filename
     73 

~/Documents/clustergrammer2/clustergrammer2/clustergrammer_fun/__init__.py in df_to_dat(self, df, define_cat_colors)
    246     Load Pandas DataFrame (will be deprecated).
    247     '''
--> 248     data_formats.df_to_dat(self, df, define_cat_colors)
    249 
    250   def set_matrix_colors(self, pos='red', neg='blue'):

~/Documents/clustergrammer2/clustergrammer2/clustergrammer_fun/data_formats.py in df_to_dat(net, df, define_cat_colors)
    114           net.dat['node_info'][axis][cat_name] = cat_values
    115 
--> 116   categories.dict_cat(net, define_cat_colors=define_cat_colors)
    117 
    118 def dat_to_df(net):

~/Documents/clustergrammer2/clustergrammer2/clustergrammer_fun/categories.py in dict_cat(net, define_cat_colors)
    131             for inst_full_name in inst_dict:
    132 
--> 133                 inst_name = inst_full_name.split(': ')[1]
    134                 inst_color = inst_dict[inst_full_name]
    135 

IndexError: list index out of range

In [ ]:

ct_color = {}
ct_color['T cells CD8'] = 'red'
ct_color['T cells CD4 naive'] = 'blue'
ct_color['T cells CD4 memory activated'] = 'blue'
ct_color['T cells CD4 memory resting'] = '#87cefa' # sky blue
ct_color['B cells naive'] = 'purple'
ct_color['B cells memory'] = '#DA70D6' # orchid
ct_color['NK cells activated'] = 'yellow'
ct_color['NK cells resting'] = '#FCD116' # sign yellow
ct_color['Monocytes'] = '#98ff98' # mint green
ct_color['Macrophages M0'] = '#D3D3D3' # light grey
ct_color['Macrophages M1'] = '#C0C0C0' # silver
ct_color['Macrophages M2'] = '#A9A9A9' # dark grey
ct_color['N.A.'] = 'white'

In [ ]:

def set_cat_colors(axis, cat_index, cat_title=False):
    for inst_ct in ct_color:
        if cat_title != False:
            cat_name = cat_title + ': ' + inst_ct
        else:
            cat_name = inst_ct
            
        inst_color = ct_color[inst_ct]
        net.set_cat_color(axis=axis, cat_index=cat_index, cat_name=cat_name, inst_color=inst_color)

In [ ]:

set_cat_colors('col', 1)

In [ ]:

gene_sig = df_sig.idxmax(axis=1)
gs_dict = {}
for inst_gene in gene_sig.index.tolist():
    gs_dict[inst_gene] = gene_sig[inst_gene][0]
df_sig_cat = deepcopy(df_sig)
rows = df_sig_cat.index.tolist()
new_rows = [(x, 'Cell Type: ' + gs_dict[x]) if x in gs_dict else (x, 'N.A.') for x in rows ]
df_sig_cat.index = new_rows

net.load_df(df_sig_cat)
set_cat_colors('row', 1, 'Cell Type')

In [ ]:

net.load_df(df_sig_cat)
net.clip(lower=-5, upper=5)
net.widget()

Predict Cell Types using CIBERSORT Signatures¶

In [ ]:

df_pred_cat, df_sig_sim, y_info = net.predict_cats_from_sigs(df, df_sig, 
                                                                   predict_level='Cell Type', unknown_thresh=0.05)
df.columns = df_pred_cat.columns.tolist()
print(df_pred_cat.shape)

Cell Type Similarity¶

In [ ]:

df_sig_sim = df_sig_sim.round(2)
net.load_df(df_sig_sim)
set_cat_colors('col', 1, cat_title='Cell Type')
set_cat_colors('row', 1)

In [ ]:

df_sig_sim.columns = df_pred_cat.columns.tolist()
net.load_df(df_sig_sim)
net.widget()

Cells in CIBERSORT GEX Space¶

In [ ]:

rows = df_pred_cat.index.tolist()
new_rows = [(x, 'Cell Type: ' + gs_dict[x]) if x in gs_dict else (x, 'N.A.') for x in rows ]
df_pred_cat.index = new_rows

In [ ]:

net.load_df(df_pred_cat)
net.clip(lower=-5, upper=5)
net.widget()

Cells with CIBERSORT Predictions, Top Genes Based on Variance¶

In [ ]:

df = df.loc[keep_var]
rows = df.index.tolist()
new_rows = [(x, 'Cell Type: ' + gs_dict[x]) if x in gs_dict else (x, 'N.A.') for x in rows ]
df.index = new_rows

In [ ]:

net.load_df(df)
net.clip(lower=-5, upper=5)
net.load_df(net.export_df().round(2))
net.widget()

In [ ]:

!mkdir ../jsons
net.save_dict_to_json(net.viz, '../jsons/pbmc_2700.json')

In [ ]: