Time series datasets

This tutorial demonstrates how to use CellRank's interface to Waddington OT (WOT) to work with time-series scRNA-seq data Schiebinger et al., Cell 2019. The interface is constructed via an external kernel. In CellRank, we call everyting that computes a cell-cell transition matrix a kernel, see our Kernels and estimators tutorial. An external kernel wraps around another method (like Waddington OT) so that you can use it from within CellRank, through the usual API. If you have your own method that you would like to integrate as an external kernel in CellRank, see our Creating a new kernel tutorial.

We'll demonstrate this using the original data from the WOT publication Schiebinger et al., Cell 2019, i.e. reprogramming of mouse embryonic fibroblasts (MEFs) towards induced pluripotent stem cells (iPSCs) across 39 time points spanning days 0-18 past reprogramming. We'll restrict our attention to the serum condition and subsample the data to speed up computations.

Import packages and data

In [1]:
import sys

# the installation can take ~10mins
# ignore the messages: ERROR: Failed building wheel for <package>
# as long as it later print: Running setup.py install for <package> ... done
if "google.colab" in sys.modules:
    !pip install -q git+https://github.com/theislab/[email protected]#egg=cellrank[external,krylov]
In [1]:
import matplotlib.pyplot as plt
import scvelo as scv
import scanpy as sc
import cellrank as cr

# import CellRank kernels and estimators
from cellrank.external.kernels import WOTKernel
from cellrank.tl.kernels import ConnectivityKernel
from cellrank.tl.estimators import GPCCA

# set verbosity levels
cr.settings.verbosity = 2
scv.settings.verbosity = 3

# figure settings
    "scvelo", dpi_save=400, dpi=80, transparent=True, fontsize=20, color_map="viridis"

Import the data.

In [2]:
adata = cr.datasets.reprogramming_schiebinger()

AnnData object with n_obs × n_vars = 236285 × 19089
    obs: 'day', 'MEF.identity', 'Pluripotency', 'Cell.cycle', 'ER.stress', 'Epithelial.identity', 'ECM.rearrangement', 'Apoptosis', 'SASP', 'Neural.identity', 'Placental.identity', 'X.reactivation', 'XEN', 'Trophoblast', 'Trophoblast progenitors', 'Spiral Artery Trophpblast Giant Cells', 'Spongiotrophoblasts', 'Oligodendrocyte precursor cells (OPC)', 'Astrocytes', 'Cortical Neurons', 'RadialGlia-Id3', 'RadialGlia-Gdf10', 'RadialGlia-Neurog2', 'Long-term MEFs', 'Embryonic mesenchyme', 'Cxcl12 co-expressed', 'Ifitm1 co-expressed', 'Matn4 co-expressed', '2-cell', '4-cell', '8-cell', '16-cell', '32-cell', 'cell_growth_rate', 'serum', '2i', 'major_cell_sets', 'cell_sets', 'batch'
    var: 'highly_variable', 'TF'
    uns: 'batch_colors', 'cell_sets_colors', 'day_colors', 'major_cell_sets_colors'
    obsm: 'X_force_directed'

This data contains cells cultured in both 2i as well as serum conditions - we'll focus on the serum condition here.

In [3]:
adata = adata[adata.obs["serum"] == "True"].copy()

Subsample to speed up the analysis - this tutorial is meant to run in a couple of minutes on a laptop. It's not a problem for CellRank to do any of the computations here on the full data, we'd just have to wait a bit longer.

In [4]:
sc.pp.subsample(adata, fraction=0.25, random_state=0)
AnnData object with n_obs × n_vars = 41473 × 19089
    obs: 'day', 'MEF.identity', 'Pluripotency', 'Cell.cycle', 'ER.stress', 'Epithelial.identity', 'ECM.rearrangement', 'Apoptosis', 'SASP', 'Neural.identity', 'Placental.identity', 'X.reactivation', 'XEN', 'Trophoblast', 'Trophoblast progenitors', 'Spiral Artery Trophpblast Giant Cells', 'Spongiotrophoblasts', 'Oligodendrocyte precursor cells (OPC)', 'Astrocytes', 'Cortical Neurons', 'RadialGlia-Id3', 'RadialGlia-Gdf10', 'RadialGlia-Neurog2', 'Long-term MEFs', 'Embryonic mesenchyme', 'Cxcl12 co-expressed', 'Ifitm1 co-expressed', 'Matn4 co-expressed', '2-cell', '4-cell', '8-cell', '16-cell', '32-cell', 'cell_growth_rate', 'serum', '2i', 'major_cell_sets', 'cell_sets', 'batch'
    var: 'highly_variable', 'TF'
    uns: 'batch_colors', 'cell_sets_colors', 'day_colors', 'major_cell_sets_colors'
    obsm: 'X_force_directed'

Let's visualize this data, using the original force-directed layout as well as annotated cell sets. We simplifed the original cell-set annotations for our purposes here.

In [5]:
scv.pl.scatter(adata, basis="force_directed", c="day", legend_loc="none")

Each dot is a cell in the force-direction embedding, colored according to one of the 39 time-points of sequencing, from early (day 0, in black) to late (day 18, in yellow).

In [6]:
scv.pl.scatter(adata, basis="force_directed", c="cell_sets", legend_loc="right")

Pre-process the data

This datset has already been normalized by total counts and log2-transformed. Further, highly variable genes have already been annotated. We can thus direclty compute PCA and a KNN graph, which we'll need for later.

In [7]:
In [8]:
sc.pp.neighbors(adata, random_state=0)

Estimate initial growth rates

WOT incorporates cellular growth and death alongside differentiation. To initialze the computation of growth rates, we can use pre-compiled sets of proliferation- and apoptosis-related genes. Note that WOT uses the concept of unbalanced optimal transport which means our initial guess doesn't have to be exact - it's optimized in an iterative fashion internally. Please see the original WOT tutorial for more information on this.

Initialize the WOT kernel based on the AnnData object

In [9]:
wk = WOTKernel(adata, time_key="day")

Estimate growth rates based on gene-sets for proliferation and apoptosis.

In [10]:
wk.compute_initial_growth_rates(organism="mouse", key_added="growth_rate_init")
    adata, c="growth_rate_init", legend_loc="right", basis="force_directed", s=10
Computing `proliferation` scores
WARNING: genes are not in var_names and ignored: ['Mlf1ip']
Computing `apoptosis` scores
WARNING: genes are not in var_names and ignored: ['Clca2', 'Ccnd2']

Compute transition matrix

During the computation of transport maps between adjacent time-points, WOT incorporates cellular growth and death. The output of the method is a list of transport maps, one for each pair of time points ($t_i, t_{i+1}$). These are saved in wk.transport_maps. Rows in these transport maps do not need to sum to one because depending on whether a cell is likely to proliferate or die, it's outgoing probability mass will be larger or smaller than one, respectively.

CellRank, however, is based on Markov chains where we need one single transition matrix with rows that sum to one. For this reason, in addition to the saved transport_maps, we accumulate the individual transport maps into one large transition matrix which we row-normalize.

Further, for N time points, transition probabilities for cells in the Nth time point aren't defined since there is nowhere to transport their probability mass. However, to treat the system as a Markov chain, we need to assign transition probabilites for final-day cells as well. In this example, we'll use transcriptomic similarity for this purpose via the argument last_time_point='connectivities'. That means we'll compute transition probabilities for earlier day cells to later day cells using optimal transport, however, once cells have reached the last time-point, we'll compute transition probabilities among final-day cells just based on their gene expression similarity.

In [11]:
    growth_iters=3, growth_rate_key="growth_rate_init", last_time_point="connectivities"
Computing transition matrix using Waddington optimal transport
Using `1479` HVGs from `adata.var['highly_variable']`
Using default cost matrices
Computing transport maps for `38` time pairs
    Finish (0:03:18)

Computing transition matrix based on `adata.obsp['connectivities']`
    Finish (0:00:00)
    Finish (0:03:25)

The method argument growth_iters defines how many iterations we use to fine-tune the cellular growth rates. For other parameters, see the original WOT documentation.

Simulate random walks

To gain some intuition for the transition matrix we have just computed, we can simualte some random walks and plot them in the embedding.

In [12]:
    start_ixs={"day": 0.0},
Simulating `300` random walks of maximum length `200`
    Finish (0:00:25)

Plotting random walks

Black dots indicate the start of a random walk (can't really see them here, but they come from the lower right corner where day 0 MEFs are located) and yellow dots inicate the end. If we compare with the cluster labels below, we can observe that most random walks terminate in either the Trophoblast, Stromal, Epithelial, Neural or IPS clusters. We'll make this more quantitative below using pyGPCCA to compute macrostates. Generalized Perron Cluster Cluster Analysis (GPCCA) is a fast and mathematically sound way to define macrostates - sets of cells that have some level of metastability Reuter et al., JCP (2019).

In [13]:
scv.pl.scatter(adata, c="cell_sets", basis="force_directed", legend_loc="right")