scNT-seq human hematopoiesis dynamics

In this tutorial, we will demonstrate how to perform absolute total RNA velocity analyses with the metabolic labeling datasets that was reported in the dynamo Cell paper. This notebook reproduces the total RNA velocity analyses reported in figure 3B of the original cell study. Note that dynamo is equiped with the processed hematopoiesis dataset produced following the steps presented in this notebook. You can download the processed data via dyn.sample_data.hematopoiesis.

In [1]:
%%capture

import dynamo as dyn
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings("ignore")

First let us download the raw hematopoiesis data via dyn.sample_data.hematopoiesis_raw. Thenew and total layers correspond to the labeled RNA and total RNA and will be used for total RNA velocity analyses estimation. Note that in order to ensure the reproducibility, we will also included the umap embedding in .obsm['X_umap'] generated previously.

In [2]:
adata_hsc_raw = dyn.sample_data.hematopoiesis_raw()
adata_hsc_raw
Out[2]:
AnnData object with n_obs × n_vars = 1947 × 26193
    obs: 'batch', 'cell_type', 'time'
    var: 'gene_name_mapping'
    uns: 'genes_to_use'
    obsm: 'X_umap'
    layers: 'new', 'spliced', 'total', 'unspliced'

We use the monocle recipe to preprocess the adata object. Importantly, to ensure the reproducibility, we also use a predefined gene list that includes the highly variable genes and known markers genes based on previous reports (Paul, cell, 2015, Weintrab, Science, 2020, etc). The gene list is stored in adata_hsc_raw.uns["genes_to_use"] of the hematopoiesis raw dataset came with dynamo.

Note that the time key indicates the RNA metabolic labeling duration. Since we labeled cells from the same harvest time point for a single time pulse, so the experiment_type is set to be "one-shot" (although we labeled cells at day 7 for 3 hours while cells at day 10 for 5 hours).

In [3]:
selected_genes_to_use = adata_hsc_raw.uns["genes_to_use"]
In [5]:
dyn.pp.recipe_monocle(
    adata_hsc_raw,
    tkey="time",
    experiment_type="one-shot",
    genes_to_use=selected_genes_to_use,
    n_top_genes=len(selected_genes_to_use),
    # feature_selection_layer="new",
    maintain_n_top_genes=True,
)
|-----> apply Monocole recipe to adata...
|-----> convert ensemble name to official gene name
|-----? Your adata object uses non-official gene names as gene index. 
Dynamo is converting those names to official gene names.
Error: The requests_cache python module is required to use request caching.
See - https://requests-cache.readthedocs.io/en/latest/user_guide.html#installation
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-13000...done.
querying 13001-14000...done.
querying 14001-15000...done.
querying 15001-16000...done.
querying 16001-17000...done.
querying 17001-18000...done.
querying 18001-19000...done.
querying 19001-20000...done.
querying 20001-21000...done.
querying 21001-22000...done.
querying 22001-23000...done.
querying 23001-24000...done.
querying 24001-25000...done.
querying 25001-26000...done.
querying 26001-26193...done.
Finished.
1 input query terms found dup hits:
	[('ENSG00000229425', 2)]
34 input query terms found no hit:
	['ENSG00000168078', 'ENSG00000189144', 'ENSG00000203812', 'ENSG00000204929', 'ENSG00000221995', 'ENS
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
|-----> <insert> pp to uns in AnnData Object.
|-----------> <insert> has_splicing to uns['pp'] in AnnData Object.
|-----------> <insert> has_labling to uns['pp'] in AnnData Object.
|-----------> <insert> splicing_labeling to uns['pp'] in AnnData Object.
|-----------> <insert> has_protein to uns['pp'] in AnnData Object.
|-----> ensure all cell and variable names unique.
|-----> ensure all data in different layers in csr sparse matrix format.
|-----> ensure all labeling data properly collapased
|-----------> <insert> tkey to uns['pp'] in AnnData Object.
|-----------> <insert> experiment_type to uns['pp'] in AnnData Object.
|-----> filtering cells...
|-----> <insert> pass_basic_filter to obs in AnnData Object.
|-----> 1947 cells passed basic filters.
|-----> filtering gene...
|-----> <insert> pass_basic_filter to var in AnnData Object.
|-----> 2885 genes passed basic filters.
|-----> calculating size factor...
|-----> <insert> use_for_pca to var in AnnData Object.
|-----> <insert> frac to var in AnnData Object.
|-----> size factor normalizing the data, followed by log1p transformation.
|-----> applying PCA ...
|-----> <insert> pca_fit to uns in AnnData Object.
|-----> <insert> ntr to obs in AnnData Object.
|-----> <insert> ntr to var in AnnData Object.
|-----> cell cycle scoring...
|-----> <insert> cell_cycle_phase to obs in AnnData Object.
|-----> <insert> cell_cycle_scores to obsm in AnnData Object.
|-----> [Cell Cycle Scores Estimation] in progress: 100.0000%
|-----> [Cell Cycle Scores Estimation] finished [0.2453s]
|-----> [recipe_monocle preprocess] in progress: 100.0000%
|-----> [recipe_monocle preprocess] finished [66.7180s]
In [6]:
adata_hsc_raw.var.use_for_pca.sum()
Out[6]:
1766
In [7]:
dyn.tl.reduceDimension(adata_hsc_raw)
|-----> retrive data for non-linear dimension reduction...
|-----? adata already have basis umap. dimension reduction umap will be skipped! 
set enforce=True to re-performing dimension reduction.
|-----> Start computing neighbor graph...
|-----------> X_data is None, fetching or recomputing...
|-----> fetching X data from layer:None, basis:pca
|-----> method arg is None, choosing methods automatically...
|-----------> method ball_tree selected
|-----> <insert> connectivities to obsp in AnnData Object.
|-----> <insert> distances to obsp in AnnData Object.
|-----> <insert> neighbors to uns in AnnData Object.
|-----> <insert> neighbors.indices to uns in AnnData Object.
|-----> <insert> neighbors.params to uns in AnnData Object.
|-----> [dimension_reduction projection] in progress: 100.0000%
|-----> [dimension_reduction projection] finished [0.2599s]

Estimate RNA velocity with the Model 2

In general, dynamo supports two major models for estimating kinetic parameters and RNA velocity for tscRNA-seq data. The Model 2 doesn't consider RNA splicing while Monocle 3 does (see Fig. SI2. A).

Note that we also use labeling time to group cells for gene expression smoothing via dyn.tl.moments.

In [8]:
dyn.tl.moments(adata_hsc_raw, group="time")
|-----> calculating first/second moments...
2022-04-14 01:11:51.964395: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-04-14 01:11:51.964446: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
|-----> [moments calculation] in progress: 100.0000%
|-----> [moments calculation] finished [97.6468s]

Since we actually have unsplicing/splicing data in our adata, dynamo's preprocess module automatically recognizes this and then tag the adata to have both splicing and labeling information. In order to use Model 2, here we purposely set has_splicing to be false, which then considers labeling data (new/total) while ignores unsplicing/splicing information.

Note that in order to ensure the reproducibility, we set one_shot_method="sci_fate", model="deterministic" but running with default parameters will give you very similar results.

In [9]:
adata_hsc_raw.uns["pp"]["has_splicing"] = False
dyn.tl.dynamics(adata_hsc_raw, group="time", one_shot_method="sci_fate", model="deterministic");
|-----> calculating first/second moments...
|-----> [moments calculation] in progress: 100.0000%
|-----> [moments calculation] finished [5.6319s]
estimating gamma: 100%|██████████| 1766/1766 [00:19<00:00, 92.38it/s] 
estimating alpha: 100%|██████████| 1766/1766 [00:00<00:00, 36315.57it/s]
|-----> calculating first/second moments...
|-----> [moments calculation] in progress: 100.0000%
|-----> [moments calculation] finished [4.1257s]
estimating gamma: 100%|██████████| 1766/1766 [00:13<00:00, 128.37it/s]
estimating alpha: 100%|██████████| 1766/1766 [00:00<00:00, 44588.02it/s]

Next, because we actually quantified both the labeling and splicing information, we used the second formula that involves both splicing and labeling data to define total RNA velocity ($\dot{r} = n / (1 - e^{-rt}) \cdot r - \gamma s$) where $r, n, t, \gamma, s$ are total RNA, new RNA, labeling time, splicing rate and spliced RNA respectively.

Once the high-dimensional total RNA velocities are calculated, we will then projected them to two-dimensional UMAP space and visualized with the streamline plot, using dynamo with default parameters (dyn.tl.cell_velocities).

In [10]:
adata_hsc_raw.obs.time.unique()
Out[10]:
array([3, 5])
In [12]:
adata_hsc_raw
Out[12]:
AnnData object with n_obs × n_vars = 1947 × 20919
    obs: 'batch', 'cell_type', 'time', 'nGenes', 'nCounts', 'pMito', 'pass_basic_filter', 'new_Size_Factor', 'initial_new_cell_size', 'Size_Factor', 'initial_cell_size', 'spliced_Size_Factor', 'initial_spliced_cell_size', 'unspliced_Size_Factor', 'initial_unspliced_cell_size', 'total_Size_Factor', 'initial_total_cell_size', 'ntr', 'cell_cycle_phase'
    var: 'gene_name_mapping', 'query', 'scopes', '_id', '_score', 'symbol', 'notfound', 'nCells', 'nCounts', 'pass_basic_filter', 'use_for_pca', 'frac', 'ntr', 'time_3_alpha', 'time_3_beta', 'time_3_gamma', 'time_3_half_life', 'time_3_alpha_b', 'time_3_alpha_r2', 'time_3_gamma_b', 'time_3_gamma_r2', 'time_3_gamma_logLL', 'time_3_delta_b', 'time_3_delta_r2', 'time_3_bs', 'time_3_bf', 'time_3_uu0', 'time_3_ul0', 'time_3_su0', 'time_3_sl0', 'time_3_U0', 'time_3_S0', 'time_3_total0', 'time_3_beta_k', 'time_3_gamma_k', 'time_5_alpha', 'time_5_beta', 'time_5_gamma', 'time_5_half_life', 'time_5_alpha_b', 'time_5_alpha_r2', 'time_5_gamma_b', 'time_5_gamma_r2', 'time_5_gamma_logLL', 'time_5_bs', 'time_5_bf', 'time_5_uu0', 'time_5_ul0', 'time_5_su0', 'time_5_sl0', 'time_5_U0', 'time_5_S0', 'time_5_total0', 'time_5_beta_k', 'time_5_gamma_k', 'use_for_dynamics'
    uns: 'genes_to_use', 'pp', 'PCs', 'explained_variance_ratio_', 'pca_mean', 'pca_fit', 'feature_selection', 'cell_phase_genes', 'neighbors', 'dynamics'
    obsm: 'X_umap', 'X_pca', 'X', 'cell_cycle_scores'
    layers: 'new', 'spliced', 'total', 'unspliced', 'X_new', 'X_spliced', 'X_unspliced', 'X_total', 'M_u', 'M_uu', 'M_s', 'M_us', 'M_t', 'M_tt', 'M_n', 'M_tn', 'M_ss', 'M_nn', 'velocity_N', 'velocity_T'
    obsp: 'distances', 'connectivities', 'moments_con'

We have two time points in hsc dataset. Here we split the dataset based on time points and prepare data for calculation next.

In [14]:
pca_genes = adata_hsc_raw.var.use_for_pca
new_expr = adata_hsc_raw[:, pca_genes].layers["M_n"]
time_3_gamma = adata_hsc_raw[:, pca_genes].var.time_3_gamma.astype(float)
time_5_gamma = adata_hsc_raw[:, pca_genes].var.time_5_gamma.astype(float)

t = adata_hsc_raw.obs.time.astype(float)
M_s = adata_hsc_raw.layers["M_s"][:, pca_genes]

time_3_cells = adata_hsc_raw.obs.time == 3
time_5_cells = adata_hsc_raw.obs.time == 5

Next, we will calculate total RNA velocity according to $$\dot{r} = n / (1 - e^{-rt}) \cdot r - \gamma s$$

In [15]:
def alpha_minus_gamma_s(new, gamma, t, M_s):
    # equation: alpha = new / (1 - e^{-rt}) * r
    alpha = new.A.T / (1 - np.exp(-gamma.values[:, None] * t.values[None, :])) * gamma.values[:, None]
    
    gamma_s = gamma.values[:, None] * M_s.A.T
    alpha_minus_gamma_s = alpha - gamma_s
    return alpha_minus_gamma_s

time_3_velocity_n = alpha_minus_gamma_s(new_expr[time_3_cells, :], time_3_gamma, t[time_3_cells], M_s[time_3_cells, :])
time_5_velocity_n = alpha_minus_gamma_s(new_expr[time_5_cells, :], time_5_gamma, t[time_5_cells], M_s[time_5_cells, :])

velocity_n = adata_hsc_raw.layers["velocity_N"].copy()

valid_velocity_n = velocity_n[:, pca_genes].copy()
valid_velocity_n[time_3_cells, :] = time_3_velocity_n.T
valid_velocity_n[time_5_cells, :] = time_5_velocity_n.T
velocity_n[:, pca_genes] = valid_velocity_n.copy()

adata_hsc_raw.layers["velocity_alpha_minus_gamma_s"] = velocity_n.copy()

The results are stored in adata_hsc_raw.layers["velocity_alpha_minus_gamma_s"], which can be further projected to low dimension space for visualization.

In [17]:
dyn.tl.cell_velocities(
    adata_hsc_raw,
    enforce=True,
    X=adata_hsc_raw.layers["M_t"],
    V=adata_hsc_raw.layers["velocity_alpha_minus_gamma_s"],
    method="cosine",
);

Now let us plot the total RNA stream line plot and visualize the PF4 gene expression on the UMAP space with default parameters. This reproduces total RNA velocity streamline plot in Figure 3B and Figure 3C.

In [18]:
dyn.pl.streamline_plot(
    adata_hsc_raw,
    color=["batch", "cell_type", "PF4"],
    ncols=4,
    basis="umap",
)