#!/usr/bin/env python # coding: utf-8 # # Differential abundance analysis in python with `milopy` # # **Author:** Emma Dann
# **Date:** 27/08/2021 # # In this notebook I will demonstrate how to run differential abundance analysis on single-cell datasets using the python implementation of the Milo framework. # In[1]: get_ipython().run_line_magic('load_ext', 'autoreload') get_ipython().run_line_magic('autoreload', '2') # In[2]: import numpy as np import pandas as pd import scanpy as sc import scvelo ## For mouse gastrulation data import anndata import matplotlib.pyplot as plt plt.rcParams['figure.figsize']=(8,8) #rescale figures sc.settings.verbosity = 3 import milopy.core as milo import milopy.plot as milopl # ### Load and prepare example dataset # # For this vignette I will use the mouse gastrulation data from [Pijuan-Sala et al. 2019](https://www.nature.com/articles/s41586-019-0933-9). The dataset can be downloaded as a `anndata` object from the [`scvelo`](https://scvelo.readthedocs.io/scvelo.datasets.gastrulation/#scvelo.datasets.gastrulation) package (v0.2.4). # # In[3]: adata = scvelo.datasets.gastrulation() # In[4]: adata # We will use this dataset to identify for cell populations that increase with time (i.e. embryonic stage) # In[60]: sc.pl.umap(adata, color=["celltype"], legend_loc="on data"); sc.pl.umap(adata, color=["stage"]); # ### Build KNN graph # This object already contains PCA dimensionality reduction in `adata.obsm["X_pca"]`. We can use scanpy functions to build a KNN graph. We set the dimensionality and value for k to use in subsequent steps. # In[6]: d = 30 k = 50 sc.pp.neighbors(adata, n_neighbors=k, n_pcs=d) # ### Construct neighbourhoods # # This step assigns cells to a set of representative neighbourhoods on the KNN graph. # In[7]: milo.make_nhoods(adata, prop=0.1) # The assignment of cells to neighbourhoods is stored as a sparse binary matrix in `adata.obsm`. Here we see that cells have been assigned to 4970 neighbourhoods. # In[8]: adata.obsm["nhoods"] # The information on which cells are sampled as index cells of representative neighbourhoods is stored in `adata.obs`, along with the distance of the index to the kth nearest neighbor, which is used later for the SpatialFDR correction. # In[9]: adata[adata.obs['nhood_ixs_refined'] != 0].obs[['nhood_ixs_refined', 'nhood_kth_distance']] # We can visualize the distribution of neighbourhood sizes to get an idea of the right value for k # In[10]: nhood_size = np.array(adata.obsm["nhoods"].sum(0)).ravel() plt.hist(nhood_size, bins=100); # Since here cells come from 34 samples, we want the average number of cells in a neighbourhood to peak around 34 x 3 = 102 cells, to have a median number of 3 cells per sample per neighbourhood. # ### Count cells in neighbourhoods # # Milo leverages the variation in cell numbers between replicates for the same experimental condition to test for differential abundance. Therefore we have to count how many cells from each sample are in each neighbourhood. We need to use the cell metadata saved in `adata.obs` and specify which column contains the sample information. # In[11]: milo.count_nhoods(adata, sample_col="sample") # This function adds `adata.uns["nhood_adata"]`, that stores an anndata object where `obs` correspond to neighbourhoods and `vars` correspond to samples, and where `.X` stores the number of cells from each sample counted in a neighbourhood. This count matrix will be used for DA testing. # In[12]: adata.uns["nhood_adata"] # ### Differential abundance testing with GLM # # We are now ready to test for differential abundance in time. The experimental design needs to be specified with R-style formulas. In this case we need to convert the "stage" to a continuous variable, to test for linear increase over time. # In[27]: adata.obs["stage_continuous"] = adata.obs["stage"].cat.codes # In[28]: milo.DA_nhoods(adata, design="~stage_continuous") # The differential abundance test results are stored in `adata.uns["nhood_adata"].obs`. In particular: # # * `logFC`: stores the log-Fold Change in abundance (i.e. the slope of the linear model) # * `PValue` stores the p-value for the test # * `SpatialFDR` stores the p-values adjusted for multiple testing (accounting for overlap between neighbourhoods) # In[31]: adata.uns["nhood_adata"].obs # We can start inspecting the results of our DA analysis from a couple of standard diagnostic plots. # In[44]: old_figsize = plt.rcParams["figure.figsize"] plt.rcParams["figure.figsize"] = [10,5] plt.subplot(1,2,1) plt.hist(adata.uns["nhood_adata"].obs.PValue, bins=50); plt.xlabel("P-Vals"); plt.subplot(1,2,2) plt.plot(adata.uns["nhood_adata"].obs.logFC, -np.log10(adata.uns["nhood_adata"].obs.SpatialFDR), '.'); plt.xlabel("log-Fold Change"); plt.ylabel("- log10(Spatial FDR)"); plt.tight_layout() plt.rcParams["figure.figsize"] = old_figsize # ### Visualize results on embedding # To visualize DA results relating them to the embedding of single cells, we can build an abstracted graph of neighbourhoods that we can superimpose on the single-cell embedding. Here each node represents a neighbourhood, while edges indicate how many cells two neighbourhoods have in common. Here the layout of nodes is determined by the position of the index cell in the UMAP embedding of all single-cells. The neighbourhoods displaying singificant DA are colored by their log-Fold Change. # In[45]: import milopy.utils milopy.utils.build_nhood_graph(adata) # In[51]: plt.rcParams["figure.figsize"] = [10,10] milopl.plot_nhood_graph(adata, alpha=0.01, ## SpatialFDR level (1%) min_size=2 ## Size of smallest dot ) # ### Visualize result by celltype # # We might want to visualize whether DA is particularly evident in certain cell types. To do this, we assign a cell type label to each neighbourhood by finding the most abundant cell type within cells in each neighbourhood (after all, neighbourhoods are in most cases small subpopulations within the same cell type). We can label neighbourhoods in the results data.frame using the function `milopy.core.annotate_nhoods`. This also saves the fraction of cells harbouring the label. # In[61]: milopy.utils.annotate_nhoods(adata, anno_col='celltype') # We can see that for the majority of neighbourhoods, almost all cells have the same neighbourhood. We can rename neighbourhoods where less than 60% of the cells have the top label as "Mixed" # In[68]: plt.hist(adata.uns['nhood_adata'].obs["nhood_annotation_frac"]); plt.xlabel("celltype fraction") # In[69]: adata.uns['nhood_adata'].obs.loc[adata.uns['nhood_adata'].obs["nhood_annotation_frac"] < 0.6, "nhood_annotation"] = "Mixed" # Now we can plot the fold changes using standard scanpy functions # In[75]: sc.pl.violin(adata.uns['nhood_adata'], "logFC", groupby="nhood_annotation", rotation=90, show=False); plt.axhline(y=0, color='black', linestyle='--'); plt.show() # Here this shows that neighbourhoods of cells of the epiblast and primitive streak are enriched in early stages and cells from the neural crest are enriched in later stages.