This tutorial uses data from Saunders, et al (2019). Special thanks also go to Lauren for the tutorial improvement.
In this study, the authors profiled thousands of neural crest-derived cells from trunks of post-embryonic zebrafish. These cell classes include pigment cells, multipotent pigment cell progenitors, peripheral neurons, Schwann cells, chromaffin cells and others. These cells were collected during an active period of post-embryonic development, which has many similarities to fetal and neonatal development in mammals, when many of these cell types are migrating and differentiating as the animal transitions into its adult form. This study also explores the role of thyroid hormone (TH), a common endocrine factor, on the development of these different cell types.
Such developmental and other dynamical processes are especially suitable for dynamo analysis as dynamo is designed to accurately estimate direction and magnitude of expression dynamics (RNA velocity
), predict the entire lineage trajectory of any intial cell state (vector field
), characterize the structure (vector field topology
) of full gene expression space, as well as fate commitment potential (single cell potential
).
# get the latest pypi version
# to get the latest version on github and other installations approaches, see:
# https://dynamo-release.readthedocs.io/en/latest/ten_minutes_to_dynamo.html#how-to-install
!pip install dynamo-release --upgrade --quiet
Import the package and silence some warning information (mostly is_categorical_dtype
warning from anndata)
import warnings
warnings.filterwarnings('ignore')
import dynamo as dyn
this is like R's sessionInfo() which helps you to debug version related bugs if any.
dyn.get_all_dependencies_version()
emulate ggplot2 plotting style with white background
dyn.configuration.set_figure_params('dynamo', background='white')
Dynamo comes with a few builtin sample datasets so you can familiarize with dynamo before analyzing your own dataset.
You can read your own data via read
, read_loom
, read_h5ad
, read_h5
(powered by the anndata package) or load_NASC_seq, etc. Here I just load the zebrafish sample data that comes with dynamo. This dataset has 4181 cells and 16940 genes. Its .obs
attribute also included condition
, batch
information from the original study (you should also store those information to your .obs
attribute which is essentially a Pandas Dataframe, see more at anndata). Cluster
, Cell_type
, umap coordinates that was originally analyzed with Monocle 3 are also provided.
adata = dyn.sample_data.zebrafish()
After loading data, you are ready to performs some preprocessing. You can run the recipe_monocle
function that uses similar but generalized strategy from Monocle 3 to normalize all datasets in different layers (the spliced and unspliced or new, i.e. metabolic labelled, and total mRNAs or others), followed by feature selection, log1p
transformation of the data and PCA dimension reduction. recipe_monocle
also does a few additionl steps, which include:
converting ensemble gene names to gene official name and set them as .var_names
if needed.
calculating number of expressed genes (nGenes
), total expression values (nCounts
), percentage of total mitochondria gene values (pMito
) for each cell and save them to .obs
.
detecting your experiment type (conventional scRNA-seq or time-resolved metabolic labeling datasets) and set certain proper layers (i.e. ignore some unconventional layers provided by the users) to be size factor normalized, log1p
transformed, etc.
makings cell (.obs_names
) and gene names (.var_names
) unique.
savings data in .layers
as csr
sparse matrix for the purpose of memory efficency.
adding collapsed new, total
and unspliced, spliced
layers from the uu, ul, su, sl
layers of a metabolic labeling experiment.
calculating each cell's cell cycle stage score.
calculating new to total ratio (ntr
) for each gene and cell.
Note that by default, we don't filter any cells or genes for your adata object to avoid the trouble of losing your favorite genes/cells. However, if your dataset is huge, we recommend filtering them by setting keep_filtered_cells=False, keep_filtered_genes=False
in recipe_monocle
.
dyn.pp.recipe_monocle(adata)
RNA velocity ($\frac{ds}{dt}$) for conventional scRNA-seq is just $\frac{ds}{dt} = \beta u - \gamma s$ (while $u/s$ is the unspliced or spliced mRNA respectively.$\beta$ is splicing rate and is generally assumed to be 1 while $\gamma$ is degration rate and is what we need to estimate). To estimate gamma for conventional scRNA-seq data, we provided three approaches deterministic
, stochastic
and negbin
. The first one is equivalent to velocyto's implementation or scvelo's deterministic mode while the second one scvelo's stochastic mode. Negative binomal is a novel method from us that relies on the negative binomial formulation of gene exrpession distribution at steady state. Furthermore, we support multi-core parallelism of gamma estimation so you can analyze your large single-cell datasets with dynamo efficiently.
dyn.tl.dynamics
function combines gamma estimation and velocity calculation in one-shot. Furthermore, it implicitly calls dyn.tl.moments
first, and then performs the following steps:
checks the data you have and determines the experimental type automatically, either the conventional scRNA-seq, kinetics, degradation or one-shot single-cell metabolic labelling experiment or the CITE-seq or REAP-seq co-assay, etc.
learns the velocity for each feature gene using either the original deterministic model based on a steady-state assumption from the seminal RNA velocity work or a few new methods, including the stochastic (default) or negative binomial method for conventional scRNA-seq or kinetic, degradation or one-shot models for metabolic labeling based scRNA-seq.
Those later methods are based on moment equations which basically considers both mean and uncentered variance of gene expression. The moment based models require calculation of the first and second moments of the expression data, which relies on the cell nearest neighbours graph, constructed in the reduced PCA space from the spliced or total mRNA expression.
dyn.tl.dynamics(adata, model='stochastic', cores=3)
# or dyn.tl.dynamics(adata, model='deterministic')
# or dyn.tl.dynamics(adata, model='stochastic', est_method='negbin')
Next we perform dimension reduction (by default, UMAP) and visualize the UMAP embedding of cells. The provided Cell_type
information is also used to color cells. To get cluster/cell type information for your own data, dynamo also provides facilities to perform clustering and marker gene detection. By default we use HDBSCAN for clustering. HDBSCAN package was developed also by Leland McInnes, the developer of UMAP. You may clustering your single cells in UMAP space (set basis='umap'
instead of the default pca
in HDBSCAN). See more discussion aboout this here.
For marker gene detection, please check functions in Markers and differential expressions section in our API. A more detailed tutorial designated for this will be released soon.
dyn.tl.reduceDimension(adata)
dyn.pl.umap(adata, color='Cell_type')
Kinetic estimation of the conventional scRNA-seq and metabolic labeling based scRNA-seq is often tricky and has a lot pitfalls. Sometimes you may even observed undesired backward vector flow. You can evaluate the confidence of gene-wise velocity via:
adata
adata.obs["Cell_type"]
dyn.tl.gene_wise_confidence(adata, group='Cell_type', lineage_dict={'Proliferating Progenitor': ['Schwann Cell']})
Here group
is the column for the group informations for cells in the .obs
. lineage_dict
is a dictionary indicates broad lineage information in which key points to the progenitor group while values (a list) are the possible terminal cell groups, all from the group
column.
In the following, let us have a look at the phase diagram of some genes that have velocity calculated. You will see the pvalb1
gene has a strange phase diagram with a few cells have high spliced expression values but extremely low unspliced expression values. Those kind of phase space may points to inproper intron capture of those genes during the library prepartion or sequencing and they should never be used for velocity projection and vector field analysis. A tutorial with details for identifying those genes, evaluating the confidence of velocity estimation and then correcting (briefly mentioned below) the RNA velocity results will be released soon.
dyn.pl.phase_portraits(adata, genes=adata.var_names[adata.var.use_for_dynamics][:4], figsize=(6, 4), color='Cell_type')