In the main analysis of Fig. 2 and 3, we focus on endocrine maturation, i.e. the process that leads to the hormone-producing alpha, beta, epsilon and delta cells, see for example Bastidas-Ponce et al., Development 2017. Here, we show how CellRank can be used to delineate the signatures of ductal vs. endocrine maturation.
# import standard packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
# import single-cell packages
import scanpy as sc
import scanpy.external as sce
import scvelo as scv
import cellrank as cr
# set verbosity levels
sc.settings.verbosity = 2
cr.settings.verbosity = 2
scv.settings.verbosity = 3
If you want to exactly reproduce the results shown here, please make sure that your package versions match what is printed below.
cr.logging.print_versions()
cellrank==1.5.0+g65f1562 scanpy==1.8.1 anndata==0.7.6 numpy==1.20.3 numba==0.54.0 scipy==1.7.1 pandas==1.3.3 pygpcca==1.0.2 scikit-learn==0.24.2 statsmodels==0.13.0rc0 python-igraph==0.9.1 scvelo==0.2.4 pygam==0.8.0 matplotlib==3.4.3 seaborn==0.11.2
cr.logging.print_version_and_date()
Running CellRank 1.5.0+g65f1562, on 2021-10-26 15:02.
Define the paths to load data, cache results and write figure panels.
sys.path.insert(0, "../../") # this depends on the notebook depth and must be adapted per notebook
from paths import DATA_DIR, CACHE_DIR, FIG_DIR
# we're only using one directly here, so can save some text
FIG_DIR = FIG_DIR / "suppl_figures_pancreas" / "suppl_fig_proliferation"
Set up the paths to save figures.
scv.settings.figdir = str(FIG_DIR)
sc.settings.figdir = str(FIG_DIR)
cr.settings.figdir = str(FIG_DIR)
Note: we use a caching extension called scachepy
for this analysis, see here. We do this to speed up the runtime of this notebook by avoiding the most expensive computations. Below, we check whether you have scachepy installed and if you don't, then we automatically recompute all results.
try:
import scachepy
c = scachepy.Cache(CACHE_DIR / "pancreas", separate_dirs=True)
print('Sucessfully imported scachepy. Expensive computations will be cached. ')
except ImportError:
print('Failed to import scachepy. Consider installing it from `https://github.com/theislab/scachepy`. ')
c = None
use_caching = c is not None
c
Sucessfully imported scachepy. Expensive computations will be cached.
Cache(root=/Users/marius/Projects/cellrank_reproducibility_2/cache/pancreas, ext='.pickle', compression='None')
Set some plotting parameters.
scv.settings.set_figure_params('scvelo', dpi_save=400, dpi=80, transparent=True, fontsize=20, color_map='viridis')
scv.settings.plot_prefix = ""
Set other global parameters
# should figures just be displayed or also saved?
save_figure = True
# should cashed values be used, or recompute?
force_recompute = False
If there are other global parameters for this analysis, put them here as well.
Download the pancreas data, including ductal cells, from scVelo by running scv.datasets.pancreas()
.
adata = scv.read(DATA_DIR / "pancreas" / "pancreas_ductal_endocrine.h5ad")
adata
AnnData object with n_obs × n_vars = 3696 × 27998 obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score' var: 'highly_variable_genes' uns: 'clusters_coarse_colors', 'clusters_colors', 'day_colors', 'neighbors', 'pca' obsm: 'X_pca', 'X_umap' layers: 'spliced', 'unspliced' obsp: 'distances', 'connectivities'
Often, we want to process the raw data a bit differently than the data in .X
.
adata_raw = adata.copy()
sc.pp.filter_genes(adata_raw, min_cells=10)
scv.pp.normalize_per_cell(adata_raw)
sc.pp.log1p(adata_raw)
# annotate highly variable genes, but don't filter them out
sc.pp.highly_variable_genes(adata_raw)
print(f"This detected {np.sum(adata_raw.var['highly_variable'])} highly variable genes. ")
adata.raw = adata_raw
filtered out 14177 genes that are detected in less than 10 cells Normalized count data: X, spliced, unspliced. extracting highly variable genes finished (0:00:00) This detected 1789 highly variable genes.
# filter, normalise, log transform
scv.pp.filter_and_normalize(adata, min_shared_counts=20, log=True, n_top_genes=2000)
# compute pca, knn graph and scvelo's moments
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=30, n_neighbors=30)
scv.pp.moments(adata, n_pcs=None, n_neighbors=None)
Filtered out 20801 genes that are detected 20 counts (shared). Normalized count data: X, spliced, unspliced. Extracted 2000 highly variable genes. Logarithmized X. computing PCA on highly variable genes with n_comps=50 finished (0:00:00) computing neighbors using 'X_pca' with n_pcs = 30 finished (0:00:02) computing moments based on connectivities finished (0:00:00) --> added 'Ms' and 'Mu', moments of un/spliced abundances (adata.layers)
Show proliferation signature:
scv.pl.scatter(adata, c=['S_score', 'G2M_score'], s=70, dpi=150, perc=[0, 100],
save="umap_cell_cycle_scores.pdf" if save_figure else None)
saving figure to file /Users/marius/Projects/cellrank_reproducibility_2/figures/suppl_figures_pancreas/suppl_fig_proliferation/umap_cell_cycle_scores.pdf
# compute/load from cache the dyn. model params and compute velocities
n_jobs=8
if use_caching:
c.tl.recover_dynamics(adata, fname='2021-09-22_recover_dynamics_ductal_vs_endocrine', force=force_recompute,
n_jobs=n_jobs)
else:
scv.tl.recover_dynamics(adata)
scv.tl.velocity(adata, mode='dynamical', n_jobs=n_jobs)
Loading data from: `2021-09-22_recover_dynamics_ductal_vs_endocrine.pickle`. computing velocities finished (0:00:04) --> added 'velocity', velocity vectors for each individual cell (adata.layers)
Show velocities in UMAP.
scv.tl.velocity_graph(adata, n_jobs=8)
scv.pl.velocity_embedding_stream(adata, dpi=150, save="umap_velocities.pdf" if save_figure else None )
computing velocity graph (using 8/8 cores)
0%| | 0/3696 [00:00<?, ?cells/s]
/Users/marius/miniconda3/envs/cellrank/lib/python3.8/site-packages/scvelo/core/_parallelize.py:138: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray. res = np.array(res) if as_array else res
finished (0:00:07) --> added 'velocity_graph', sparse matrix with cosine correlations (adata.uns) computing velocity embedding finished (0:00:00) --> added 'velocity_umap', embedded velocity vectors (adata.obsm) figure cannot be saved as pdf, using png instead. saving figure to file /Users/marius/Projects/cellrank_reproducibility_2/figures/suppl_figures_pancreas/suppl_fig_proliferation/umap_velocities.png
This serves as an example of how to cach a transition matrix and of the default values we should be using.
from cellrank.tl.estimators import GPCCA
# set the parameters we want to use to compute the transition matrix
weight_connectivities=0.2
mode="stochastic"
n_jobs=8
softmax_scale=None
if use_caching:
c.tl.transition_matrix(adata,
weight_connectivities=weight_connectivities,
mode=mode,
n_jobs=n_jobs,
softmax_scale=softmax_scale,
fname=f"2021-09-22_transition_matrix_{mode}_ductal_vs_endocrine",
force=force_recompute)
g_fwd = GPCCA(adata, obsp_key='T_fwd')
else:
kernel = cr.tl.transition_matrix(adata,
weight_connectivities=weight_connectivities,
mode=mode,
n_jobs=n_jobs,
softmax_scale=softmax_scale)
g_fwd = GPCCA(kernel)
print(g_fwd)
Loading data from: `2021-09-22_transition_matrix_stochastic_ductal_vs_endocrine.pickle`. GPCCA[n=3696, kernel=<PrecomputedKernel[origin=adata.obsp['T_fwd']]>]
To make sure that we're using the same transition matrix, show some stats.
# just for consistency, display some summary staes about the transition matrix
T = adata.obsp['T_fwd']
T_med, T_mean, T_var = np.median(T.data), np.mean(T.data), np.var(T.data)
T_min, T_max = np.min(T.data), np.max(T.data)
print(f'med = {T_med:.6f},\nmean = {T_mean:.6f},\nvar = {T_var:.6f},\nmin = {T_min:.6f},\nmax = {T_max:.6f}')
med = 0.018122, mean = 0.024749, var = 0.000598, min = 0.000019, max = 0.404272
g_fwd.compute_schur()
g_fwd.plot_spectrum(real_only=True, s=500, legend_loc="none", figsize=(4, 4),
save='spectrum.pdf' if save_figure else None)
Computing Schur decomposition When computing macrostates, choose a number of states NOT in `[6, 8]` Adding `adata.uns['eigendecomposition_fwd']` `.schur_vectors` `.schur_matrix` `.eigendecomposition` Finish (0:00:00)
g_fwd.compute_macrostates(cluster_key='clusters', n_states=5)
g_fwd.plot_macrostates(discrete=True, legend_loc='on data', size=100, dpi=150,
save='umap_macrostates.pdf' if save_figure else None)
Computing `5` macrostates Adding `.macrostates` `.macrostates_memberships` `.coarse_T` `.coarse_initial_distribution `.coarse_stationary_distribution` `.schur_vectors` `.schur_matrix` `.eigendecomposition` Finish (0:00:00) saving figure to file /Users/marius/Projects/cellrank_reproducibility_2/figures/suppl_figures_pancreas/suppl_fig_proliferation/umap_macrostates.pdf
T_coarse = g_fwd.coarse_T
macro_names = ['Ductal_1', 'Ductal_2', 'Epsilon', 'Alpha', 'Beta']
fig = plt.figure(None, (6, 5))
ax = fig.add_subplot(111)
sns.heatmap(T_coarse[macro_names].loc[macro_names], cmap='viridis', ax=ax, annot=False, fmt='.2f')
if save_figure:
plt.savefig(FIG_DIR / "trans_mat_coarse.pdf", bbox_inches='tight')
plt.show()
/Users/marius/miniconda3/envs/cellrank/lib/python3.8/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above. and should_run_async(code)
g_fwd.compute_terminal_states()
g_fwd.plot_terminal_states(discrete=True, legend_loc='right', size=100)
Adding `adata.obs['terminal_states']` `adata.obs['terminal_states_probs']` `.terminal_states` `.terminal_states_probabilities` `.terminal_states_memberships Finish`
/Users/marius/miniconda3/envs/cellrank/lib/python3.8/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above. and should_run_async(code)
g_fwd.compute_absorption_probabilities(keys=['Epsilon, Alpha, Beta', 'Ductal_2'])
g_fwd.plot_absorption_probabilities(same_plot=False, size=70, dpi=500,
save="umap_fate_probabilities.pdf" if save_figure else None)
Computing absorption probabilities
/Users/marius/miniconda3/envs/cellrank/lib/python3.8/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above. and should_run_async(code)
0%| | 0/2 [00:00<?, ?/s]
Adding `adata.obsm['to_terminal_states']` `.absorption_probabilities` Finish (0:00:00) saving figure to file /Users/marius/Projects/cellrank_reproducibility_2/figures/suppl_figures_pancreas/suppl_fig_proliferation/umap_fate_probabilities.pdf
Compute lineage drivers, restricted to the Ductal and Ngn3 low EP clusters:
ld_df = g_fwd.compute_lineage_drivers(clusters=['Ductal', 'Ngn3 low EP'], cluster_key='clusters', return_drivers=False,
use_raw=True)
ld_df.head(5)
Adding `adata.raw.varm['terminal_lineage_drivers']` `.lineage_drivers` Finish (0:00:00)
/Users/marius/miniconda3/envs/cellrank/lib/python3.8/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above. and should_run_async(code)
Ductal_2_corr | Ductal_2_pval | Ductal_2_qval | Ductal_2_ci_low | Ductal_2_ci_high | Alpha or Beta or Epsilon_corr | Alpha or Beta or Epsilon_pval | Alpha or Beta or Epsilon_qval | Alpha or Beta or Epsilon_ci_low | Alpha or Beta or Epsilon_ci_high | |
---|---|---|---|---|---|---|---|---|---|---|
index | ||||||||||
Lurap1l | 0.459084 | 7.268318e-65 | NaN | 0.412792 | 0.503010 | -0.459085 | 7.260169e-65 | NaN | -0.503012 | -0.412793 |
Atp1b1 | 0.358750 | 6.656695e-38 | NaN | 0.307944 | 0.407516 | -0.358752 | 6.650961e-38 | NaN | -0.407518 | -0.307946 |
Krt19 | 0.357934 | 1.008197e-37 | NaN | 0.307096 | 0.406735 | -0.357934 | 1.008287e-37 | NaN | -0.406734 | -0.307096 |
Nudt19 | 0.357873 | 1.039919e-37 | NaN | 0.307033 | 0.406676 | -0.357873 | 1.040155e-37 | NaN | -0.406676 | -0.307033 |
Jun | 0.314602 | 6.219774e-29 | NaN | 0.262197 | 0.365156 | -0.314601 | 6.220298e-29 | NaN | -0.365156 | -0.262197 |
Differentiation towards endocrine populations:
# define endocrine and ductal markers
endocrine_markers = ['Neurog3', 'Insm1']
ductal_markers = ['Sox9']
assert all([gene in adata_raw.var.index for gene in ductal_markers]), "Some genes are missing from AnnData"
assert(all([gene in adata_raw.var.index for gene in endocrine_markers])), "Some genes are missing from AnnData"
/Users/marius/miniconda3/envs/cellrank/lib/python3.8/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above. and should_run_async(code)
Check where endocrine markers rank:
endocrine_corr_df = ld_df.sort_values(by='Alpha or Beta or Epsilon_corr', ascending=False)
endocrine_corr_df['rank'] = range(endocrine_corr_df.shape[0])
{gene: endocrine_corr_df.loc[gene, 'rank'] for gene in endocrine_markers}
{'Neurog3': 0, 'Insm1': 5}
Plot Ngn3 expression:
scv.pl.scatter(adata_raw, c='Neurog3', s=70, dpi=500, perc=[0, 98],
save="umap_ngn3.pdf" if save_figure else None)
saving figure to file /Users/marius/Projects/cellrank_reproducibility_2/figures/suppl_figures_pancreas/suppl_fig_proliferation/umap_ngn3.pdf
Check for the correlction of Ngn3 with endocrne fate probabilities:
print(f"Corrleation between Ngn3 and endocrine fate probability:",
f"{endocrine_corr_df.loc['Neurog3', 'Alpha or Beta or Epsilon_corr']:.2f}.")
Corrleation between Ngn3 and endocrine fate probability: 0.66.
/Users/marius/miniconda3/envs/cellrank/lib/python3.8/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above. and should_run_async(code)
Check where ductal markers rank:
ductal_corr_df = ld_df.sort_values(by='Ductal_2_corr', ascending=False)
ductal_corr_df['rank'] = range(ductal_corr_df.shape[0])
{gene: ductal_corr_df.loc[gene, 'rank'] for gene in ductal_markers}
{'Sox9': 21}