This tutorial introduces you to CellRank's high level API for computing initial & terminal states and fate probabilities. Once we have the fate probabilities, this tutorial shows you how to use them to plot a directed PAGA graph, to compute putative lineage drivers and to visualize smooth gene expression trends. If you want a bit more control over how initial & terminal states and fate probabilities are computed, then you should check out CellRank's low level API, composed of kernels and estimators. This really isn't any more complicated than using scikit-learn, so please do check out the Kernels and estimators tutorial.
In this tutorial, we will use RNA velocity and transcriptomic similarity to estimate cell-cell transition probabilities. Using kernels and estimators, you can apply CellRank even without RNA velocity information, check out our CellRank beyond RNA velocity tutorial. CellRank generalizes beyond RNA velocity and is a widely applicable framework to model single-cell data based on the powerful concept of Markov chains.
The first part of this tutorial is very similar to scVelo's tutorial on pancreatic endocrinogenesis. The data we use here comes from Bastidas-Ponce et al., Development 2018. For more info on scVelo, see the documentation or take a look at Bergen et al., Nat. Biotechnol. 2020.
This tutorial notebook can be downloaded using the following link.
Easiest way to start is to download Miniconda3 along with the environment file found here. To create the environment, run conda create -f environment.yml
.
import sys
if "google.colab" in sys.modules:
!pip install -q git+https://github.com/theislab/cellrank@dev
!pip install python-igraph
import scvelo as scv
import scanpy as sc
import cellrank as cr
import numpy as np
scv.settings.verbosity = 3
scv.settings.set_figure_params("scvelo")
cr.settings.verbosity = 2
import warnings
warnings.simplefilter("ignore", category=UserWarning)
warnings.simplefilter("ignore", category=FutureWarning)
warnings.simplefilter("ignore", category=DeprecationWarning)
First, we need to get the data. The following commands will download the adata
object and save it under datasets/endocrinogenesis_day15.5.h5ad
. We'll also show the fraction of spliced/unspliced reads, which we need to estimate RNA velocity.
adata = cr.datasets.pancreas()
scv.pl.proportions(adata)
adata
100%|██████████| 33.5M/33.5M [00:01<00:00, 27.4MB/s]
AnnData object with n_obs × n_vars = 2531 × 27998 obs: 'day', 'proliferation', 'G2M_score', 'S_score', 'phase', 'clusters_coarse', 'clusters', 'clusters_fine', 'louvain_Alpha', 'louvain_Beta', 'palantir_pseudotime' var: 'highly_variable_genes' uns: 'clusters_colors', 'clusters_fine_colors', 'day_colors', 'louvain_Alpha_colors', 'louvain_Beta_colors', 'neighbors', 'pca' obsm: 'X_pca', 'X_umap' layers: 'spliced', 'unspliced' obsp: 'connectivities', 'distances'
Filter out genes which don't have enough spliced/unspliced counts, normalize and log transform the data and restrict to the top highly variable genes. Further, compute principal components and moments for velocity estimation. These are standard scanpy/scvelo functions, for more information about them, see the scVelo API.
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=30, n_neighbors=30)
scv.pp.moments(adata, n_pcs=None, n_neighbors=None)
Filtered out 22024 genes that are detected 20 counts (shared). Normalized count data: X, spliced, unspliced. Extracted 2000 highly variable genes. Logarithmized X. computing moments based on connectivities finished (0:00:00) --> added 'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
We will use the dynamical model from scVelo to estimate the velocities. Please make sure to have at least version 0.2.3 of scVelo installed to make use parallelisation in scv.tl.recover_dynamics
. On my laptop, using 8 cores, the below cell takes about 1:30 min to execute.
scv.tl.recover_dynamics(adata, n_jobs=8)
recovering dynamics (using 2/2 cores) WARNING: Unable to create progress bar. Consider installing `tqdm` as `pip install tqdm` and `ipywidgets` as `pip install ipywidgets`, or disable the progress bar using `show_progress_bar=False`. finished (0:03:07) --> added 'fit_pars', fitted parameters for splicing dynamics (adata.var)
Once we have the parameters, we can use these to compute the velocities and the velocity graph. The velocity graph is a weighted graph that specifies how likely two cells are to transition into another, given their velocity vectors and relative positions.
scv.tl.velocity(adata, mode="dynamical")
scv.tl.velocity_graph(adata)
computing velocities finished (0:00:02) --> added 'velocity', velocity vectors for each individual cell (adata.layers) computing velocity graph (using 1/2 cores) finished (0:00:04) --> added 'velocity_graph', sparse matrix with cosine correlations (adata.uns)
scv.pl.velocity_embedding_stream(
adata, basis="umap", legend_fontsize=12, title="", smooth=0.8, min_mass=4
)
computing velocity embedding finished (0:00:00) --> added 'velocity_umap', embedded velocity vectors (adata.obsm)