Zebrafish pigementation

In our previous zebrafish tutorial, we have shown how dynamo goes beyond discrete RNA velocity vectors to continous RNA vector field functions. In this tutorial, we will demonstrate a set of awesome downsgtream differential geometry and dynamical systems based analyses, enabled by the differentiable vector field functions, to gain deep functional and predictive insights of cell fate transition during zebrafish pigementation (Saunders, et al. 2019).

With differential geometry analysis of the continous vector field fuctions, we can calculate the RNA Jacobian (see our primer on differential geometry), which is a cell by gene by gene tensor, encoding the gene regulatory network in each cell. With the Jacobian matrix, we can further derive the RNA acceleration, curvature, which are cell by gene matrices, just like gene expression dataset.

In general (see figure below), we can perform differential analyses and gene-set enrichment analyses based on top-ranked acceleration or curvature genes, as well as the top-ranked genes with the strongest self-interactions, top-ranked regulators/targets, or top-ranked interactions for each gene in individual cell types or across all cell types, with either raw or absolute values with the Jacobian tensor. Integrating that ranking information, we can build regulatory networks across different cell types, which can then be visualized with ArcPlot, CircosPlot, or other tools.

In this tutorial, we will cover following topics:

  • learn contionus RNA velocity vector field functions in different spaces (e.g. umap or pca space)
  • calculate RNA acceleration, curvature matrices (cell by gene)
  • rank genes based on RNA velocity, curvature and acceleration matrices
  • calculate RNA Jacobian tensor (cell by gene by gene) for genes with high PCA loadings.
  • rank genes based on the jacobian tensor, which including:
    • rank genes with strong postive or negative self-interaction (divergence ranking)
    • other rankings, ranking modes including full_reg, full_eff, eff, reg and int
  • build and visualize gene regulatory network with top ranked genes
  • gene enrichment analyses of top ranked genes
  • visualize Jacobian derived regulatory interactions across cells
  • visualize gene expression, velocity, acceleration and curvature kinetics along pseudotime trajectory
  • learn and visualize models of cell-fate transitions

Import relevant packages

In [75]:
# !pip install dynamo-release --upgrade --quiet

import dynamo as dyn

# set white background
dyn.configuration.set_figure_params(background='white') 

import matplotlib.pyplot as plt 
import numpy as np 
import pandas as pd
from gseapy.plot import barplot, dotplot

import warnings
warnings.filterwarnings('ignore')

Set the logging level. Various logging level can be setted according to your needs:

  • DEBUG: useful for dynamo development, show all logging information, including those debugging information
  • INFO: useful for most dynamo users, show detailed dynamo running information
  • WARNING: show only warning information
  • ERROR: show only exception or error information
  • CRITICAL: show only critical information
In [76]:
%matplotlib inline
from dynamo.dynamo_logger import main_info, LoggerManager
LoggerManager.main_logger.setLevel(LoggerManager.INFO)

Load processed data or data preprocessing

If you followed the zebrafish pigmentation tutorial, you can load the processed zebrafish adata object here for all downstream analysis.

In [77]:
adata = dyn.sample_data.zebrafish()
adata
|-----> Downloading data to ./data/zebrafish.h5ad
Out[77]:
AnnData object with n_obs × n_vars = 4181 × 16940
    obs: 'split_id', 'sample', 'Size_Factor', 'condition', 'Cluster', 'Cell_type', 'umap_1', 'umap_2', 'batch'
    layers: 'spliced', 'unspliced'
In [78]:
adata = dyn.sample_data.zebrafish()

dyn.pp.recipe_monocle(adata)
dyn.tl.dynamics(adata, cores=3)

dyn.tl.reduceDimension(adata)
dyn.tl.cell_velocities(adata)

dyn.tl.cell_velocities(adata)
dyn.pl.streamline_plot(adata, color=['Cell_type'])
|-----> Downloading data to ./data/zebrafish.h5ad
|-----> recipe_monocle_keep_filtered_cells_key is None. Using default value from DynamoAdataConfig: recipe_monocle_keep_filtered_cells_key=True
|-----> recipe_monocle_keep_filtered_genes_key is None. Using default value from DynamoAdataConfig: recipe_monocle_keep_filtered_genes_key=True
|-----> recipe_monocle_keep_raw_layers_key is None. Using default value from DynamoAdataConfig: recipe_monocle_keep_raw_layers_key=True
|-----> apply Monocole recipe to adata...
|-----> <insert> pp to uns in AnnData Object.
|-----------> <insert> has_splicing to uns['pp'] in AnnData Object.
|-----------> <insert> has_labling to uns['pp'] in AnnData Object.
|-----------> <insert> splicing_labeling to uns['pp'] in AnnData Object.
|-----------> <insert> has_protein to uns['pp'] in AnnData Object.
|-----> ensure all cell and variable names unique.
|-----> ensure all data in different layers in csr sparse matrix format.
|-----> ensure all labeling data properly collapased
|-----------> <insert> tkey to uns['pp'] in AnnData Object.
|-----------> <insert> experiment_type to uns['pp'] in AnnData Object.
|-----> filtering cells...
|-----> <insert> pass_basic_filter to obs in AnnData Object.
|-----> 4167 cells passed basic filters.
|-----> filtering gene...
|-----> <insert> pass_basic_filter to var in AnnData Object.
|-----> 4194 genes passed basic filters.
|-----> calculating size factor...
|-----> selecting genes in layer: X, sort method: SVR...
|-----> <insert> frac to var in AnnData Object.
|-----> size factor normalizing the data, followed by log1p transformation.
|-----> Set <adata.X> to normalized data
|-----> applying PCA ...
|-----> <insert> pca_fit to uns in AnnData Object.
|-----> <insert> ntr to obs in AnnData Object.
|-----> <insert> ntr to var in AnnData Object.
|-----> cell cycle scoring...
|-----> computing cell phase...
|-----> [cell phase estimation] in progress: 100.0000%
|-----> [cell phase estimation] finished [644.1496s]
|-----> <insert> cell_cycle_phase to obs in AnnData Object.
|-----> <insert> cell_cycle_scores to obsm in AnnData Object.
|-----? 
Dynamo is not able to perform cell cycle staging for you automatically. 
Since dyn.pl.phase_diagram in dynamo by default colors cells by its cell-cycle stage, 
you need to set color argument accordingly if confronting errors related to this.
|-----> [recipe_monocle preprocess] in progress: 100.0000%
|-----> [recipe_monocle preprocess] finished [3.2558s]
|-----> dynamics_del_2nd_moments_key is None. Using default value from DynamoAdataConfig: dynamics_del_2nd_moments_key=False
|-----------> removing existing M layers:[]...
|-----------> making adata smooth...
|-----> calculating first/second moments...
|-----> [moments calculation] in progress: 100.0000%
|-----> [moments calculation] finished [12.2996s]
|-----> retrive data for non-linear dimension reduction...
|-----> perform umap...
|-----> [dimension_reduction projection] in progress: 100.0000%
|-----> [dimension_reduction projection] finished [17.8196s]
|-----> incomplete neighbor graph info detected: connectivities and distances do not exist in adata.obsp, indices not in adata.uns.neighbors.
|-----> Neighbor graph is broken, recomputing....
|-----> Start computing neighbor graph...
|-----------> X_data is None, fetching or recomputing...
|-----> fetching X data from layer:None, basis:pca
|-----> method arg is None, choosing methods automatically...
|-----------> method ball_tree selected
|-----> <insert> connectivities to obsp in AnnData Object.
|-----> <insert> distances to obsp in AnnData Object.
|-----> <insert> neighbors to uns in AnnData Object.
|-----> <insert> neighbors.indices to uns in AnnData Object.
|-----> <insert> neighbors.params to uns in AnnData Object.
|-----> 0 genes are removed because of nan velocity values.
|-----> [calculating transition matrix via pearson kernel with sqrt transform.] in progress: 100.0000%
|-----> [calculating transition matrix via pearson kernel with sqrt transform.] finished [5.7484s]
|-----> [projecting velocity vector to low dimensional embedding] in progress: 100.0000%
|-----> [projecting velocity vector to low dimensional embedding] finished [0.7405s]
|-----> 0 genes are removed because of nan velocity values.
Using existing pearson_transition_matrix found in .obsp.
|-----> [projecting velocity vector to low dimensional embedding] in progress: 100.0000%
|-----> [projecting velocity vector to low dimensional embedding] finished [0.7422s]

If you confronted errors when saving dynamo processed adata object, please see the very end of this tutorial.

If you would like to start from scratch, use the following code to preprocess the zebrafish adata object (or use your own dataset):

adata = dyn.sample_data.zebrafish()

dyn.pp.recipe_monocle(adata)
dyn.tl.dynamics(adata, cores=3)

dyn.tl.reduceDimension(adata)
dyn.tl.cell_velocities(adata)

dyn.tl.cell_velocities(adata)
dyn.pl.streamline_plot(adata, color=['Cell_type'])

Differential geometry analysis

In this part we will demonstrate how to leverage dynamo to estimate RNA jacobian (reveals state-dependent regulation), RNA acceleration/curvature (reveals earlier drivers and fate decision points), etc.

To gain functional and biological insights, we can perform a series of downstream analysis with the computed differential geometric quantities. We can first rank genes across all cells or in each cell group for any of those differential geometric quantities, followed by gene set enrichment analyses of the top ranked genes, as well as regulatory network construction and visualization.

The differential geometry and dynamical systems (i.e. fixed points, nullclines, etc mentioned in the previous zebrafish tutorial) are conventionally used to describe small-scale systems, while the vector field we build comes from high-dimensional genomics datasets. From this, you can appreciate that with dynamo, we are bridging small-scale systems-biology/physics type of thinking with high-dimensional genomics using ML, something really unimaginable until very recently!

In order to calculate RNA jacobian, acceleration and curvature, we can either learn the vector field function directly in the gene expression space or on the PCA space but then project the differential geometric quantities learned in PCA space back to the original gene expression space. Since we often have thousands of genes, we generally learn vector field in PCA space to avoid the curse of dimensionality and to improve the efficiency and accuracy of our calculation.

Vector field learning in PCA space

To learn PCA basis based RNA velocity vector field function, we need to first project the RNA velocities into PCA space.

In [79]:
dyn.tl.cell_velocities(adata, basis='pca');
|-----> 0 genes are removed because of nan velocity values.
Using existing pearson_transition_matrix found in .obsp.
|-----> [projecting velocity vector to low dimensional embedding] in progress: 100.0000%
|-----> [projecting velocity vector to low dimensional embedding] finished [0.9930s]

Then we will use the dyn.vf.VectorField function to learns the vector field function in PCA space. This function relies on sparseVFC to learn the high dimensional vector field function in the entire expression space from sparse single cell velocity vector samples robustly.

Note that if you don't provide any basis, vector field will be learned in the original gene expression and you can learn vector field for other basis too, as long as you have the RNA velocities projected in that basis.

Related information for the learned vector field are stored in adata.

In [80]:
dyn.vf.VectorField(adata, 
                   basis='pca', 
                   M=100)
|-----> VectorField reconstruction begins...
|-----> Retrieve X and V based on basis: PCA. 
        Vector field will be learned in the PCA space.
|-----> Learning vector field with method: sparsevfc.
|-----> [SparseVFC] begins...
|-----> Sampling control points based on data velocity magnitude...
|-----> [SparseVFC] in progress: 100.0000%
|-----> [SparseVFC] finished [0.3066s]
|-----> <insert> velocity_pca_SparseVFC to obsm in AnnData Object.
|-----> <insert> X_pca_SparseVFC to obsm in AnnData Object.
|-----> <insert> VecFld_pca to uns in AnnData Object.
|-----> <insert> control_point_pca to obs in AnnData Object.
|-----> <insert> inlier_prob_pca to obs in AnnData Object.
|-----> <insert> obs_vf_angle_pca to obs in AnnData Object.
|-----> [VectorField] in progress: 100.0000%
|-----> [VectorField] finished [0.8875s]

Velocity, acceleration and curvature ranking

To gain functional insights of the biological process under study, we design a set of ranking methods to rank gene's absolute, positive, negative vector field quantities in different cell groups that you can specify. Here we will first demonstrate how to rank genes based on their velocity matrix.

Basically, the rank functions in the vector field submodule (vf) of dynamo is organized as rank_{quantities}_genes where {quantities} can be any differential geometry quantities, including, velocity, divergence, acceleration, curvature, jacobian:

  • dyn.vf.rank_velocity_genes(adata, groups='Cell_type')
  • dyn.vf.rank_divergence_genes(adata, groups='Cell_type')
  • dyn.vf.rank_acceleration_genes(adata, groups='Cell_type')
  • dyn.vf.rank_curvature_genes(adata, groups='Cell_type')
  • dyn.vf.rank_jacobian_genes(adata, groups='Cell_type')

Gene ranking for different quantities (except jacobian, see below) are done based on both their raw and absolute velocities for each cell group when groups is set or for all cells if it is not set.

In [81]:
dyn.vf.rank_velocity_genes(adata, 
                           groups='Cell_type', 
                           vkey="velocity_S");

Ranking results are saved in .uns with the pattern rank_{quantities}_genes or rankabs{quantities}_genes where {quantities} can be any differential geometry quantities and the one with _abs indicates the ranking is based on absolute values instead of raw values.

We can save the speed ranking information to rank_speed or rank_abs_speed for future usages if needed.

In [82]:
rank_speed = adata.uns['rank_velocity_S'];
rank_abs_speed = adata.uns['rank_abs_velocity_S'];

Next we usedyn.vf.acceleration to compute acceleration for each cell with the learned vector field in adata. Note that we use PCA basis to calculate acceleration, but dyn.vf.acceleration will by default project acceleration_pca back to the original high dimension gene-wise space. You can check the resulted adata which will have both acceleration (in .layers) and acceleration_pca (in .obsm). We can also rank acceleration in the same fashion as what we did to velocity.

In [83]:
dyn.vf.acceleration(adata, basis='pca')
|-----> [Calculating acceleration] in progress: 100.0000%
|-----> [Calculating acceleration] finished [0.1588s]
|-----> <insert> acceleration to layers in AnnData Object.
In [84]:
dyn.vf.rank_acceleration_genes(adata, 
                               groups='Cell_type', 
                               akey="acceleration", 
                               prefix_store="rank");
rank_acceleration = adata.uns['rank_acceleration'];
rank_abs_acceleration = adata.uns['rank_abs_acceleration'];

Similarly, we can also use dyn.vf.curvature to calculate curvature for each cell with the reconstructed vector field function stored in adata. dyn.vf.rank_curvature_genes ranks genes based on their raw or absolute curvature values in different cell groups.

In [85]:
dyn.vf.curvature(adata, basis='pca');
|-----> [Calculating acceleration] in progress: 100.0000%
|-----> [Calculating acceleration] finished [0.1446s]
|-----> [Calculating curvature] in progress: 100.0000%
|-----> [Calculating curvature] finished [0.1690s]
|-----> <insert> curvature_pca to obs in AnnData Object.
|-----> <insert> curvature_pca to obsm in AnnData Object.
|-----> <insert> curvature to layers in AnnData Object.
In [86]:
dyn.vf.rank_curvature_genes(adata, groups='Cell_type');

Now we estimated RNA acceleration and RNA curvature, we can visualize the acceleration or curvature for individual genes just like what we can do with gene expression or velocity, etc.

Let us show the velocity for gene tfec and pnp4a. bwr (blue-white-red) colormap is used here because velocity has both positive and negative values. The same applies to acceleration and curvature.

In [87]:
dyn.pl.umap(adata, color=['tfec', 'pnp4a'], layer='velocity_S', frontier=True)

This is for acceleration of genes tfec and pnp4a.

In [88]:
dyn.pl.umap(adata, color=['tfec', 'pnp4a'], layer='acceleration', frontier=True)