scBoolSeq with scanpy (Binarisation)
In this notebook we showcase how to use scBoolSeq together with scanpy. scanpy is a popular package for scRNA-seq data analysis in Python (part of the scverse); this tutorial builds on scanpy's preprocessing and clustering tutorial.
This notebook can be seen as a complement to the Case-Study on Early-born Retinal Neurons presented in the main text.
The data can be obtained from its corresponding GEO accession GSE122466.
The reference of the original study that published the data:
Lo Giudice, Q., Leleu, M., La Manno, G., & Fabre, P. J. (2019).
Single-cell transcriptional logic of cell-fate specification and axon guidance in early-born retinal neurons.
Development (Cambridge), 146(17). https://doi.org/10.1242/dev.178103
import pickle
import warnings
warnings.filterwarnings('ignore')
import sklearn
import pandas as pd
import scanpy as sc
sc.settings.verbosity = 0
sc.logging.print_header()
import anndata as ad
import scboolseq
from scboolseq import scBoolSeq
scanpy==1.10.2 anndata==0.10.8 umap==0.5.6 numpy==1.26.4 scipy==1.14.0 pandas==2.2.2 scikit-learn==1.5.0 statsmodels==0.14.2 igraph==0.11.5 louvain==0.8.2 pynndescent==0.5.13
!test -f 'GSE122466_Merged5347cells_RAW.csv.gz' || wget -O 'GSE122466_Merged5347cells_RAW.csv.gz' 'https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE122466&format=file&file=GSE122466%5FMerged5347cells%5FRAW%2Ecsv%2Egz'
%%time
in_data = pd.read_csv("GSE122466_Merged5347cells_RAW.csv.gz", index_col=0, sep=";")
if not (in_data.columns.is_unique and in_data.index.is_unique):
print("Check for duplicated index/columns")
in_data.head()
CPU times: user 6.69 s, sys: 393 ms, total: 7.08 s Wall time: 7.09 s
Lane1_AAACCTGAGATGTCGG | Lane1_AAACCTGCAATCCAAC | Lane1_AAACCTGGTTCCTCCA | Lane1_AAACCTGTCCAATGGT | Lane1_AAACGGGAGGCAATTA | Lane1_AAACGGGCACCAGGCT | Lane1_AAACGGGCACGGCGTT | Lane1_AAACGGGCACTCAGGC | Lane1_AAACGGGCATCCAACA | Lane1_AAACGGGGTCCAAGTT | ... | Lane2_TTTGGTTCATAAAGGT | Lane2_TTTGGTTGTGGCCCTA | Lane2_TTTGGTTTCTAACTCT | Lane2_TTTGTCAAGCGTGAAC | Lane2_TTTGTCAAGGGCTCTC | Lane2_TTTGTCAAGTACCGGA | Lane2_TTTGTCACATCGATGT | Lane2_TTTGTCAGTAGAGGAA | Lane2_TTTGTCAGTCATCCCT | Lane2_TTTGTCATCAGTTCGA | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Xkr4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Mrpl15 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 2 | 1 | 0 |
Lypla1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Gm37988 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Tcea1 | 2 | 0 | 5 | 1 | 2 | 4 | 1 | 0 | 0 | 0 | ... | 1 | 3 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
5 rows × 5347 columns
batch_info = pd.DataFrame(
data={
"batch": in_data.columns.map(lambda x: x.split("_")[0]),
"tag": in_data.columns.map(lambda x: x.split("_")[-1])
},
index=in_data.columns
)
batch_info.batch.value_counts()
batch Lane2 2674 Lane1 2673 Name: count, dtype: int64
Standard preprocessing and clustering with scanpy
scanpy provides numerous functions for scRNA-seq data analysis. Here we perform a standard pipeline which consists of:
- quality control and cell/gene filtering (including doublet detection with scrublet),
- normalisation and log-transformation,
- highly variable gene selection,
- dimensionality reduction (PCA, UMAP) and Louvain clustering.
And finally: binarisation of the highly variable genes with scBoolSeq.
adata = ad.AnnData(in_data.T, obs=batch_info)
adata
AnnData object with n_obs × n_vars = 5347 × 15176 obs: 'batch', 'tag'
%%time
# mitochondrial genes
adata.var["mt"] = adata.var_names.str.startswith(("Mt-", "mt-"))
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True, log1p=True)
adata
CPU times: user 1.12 s, sys: 312 ms, total: 1.43 s Wall time: 1.43 s
AnnData object with n_obs × n_vars = 5347 × 15176 obs: 'batch', 'tag', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt' var: 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
sc.pl.violin(
adata,
["n_genes_by_counts", "total_counts", "pct_counts_mt"],
jitter=0.7,
multi_panel=True,
)
sc.pl.scatter(adata, "total_counts", "n_genes_by_counts", color="pct_counts_mt")
sc.pp.filter_cells(adata, min_genes=100)
sc.pp.filter_genes(adata, min_cells=3)
%%time
adata.layers["counts"] = adata.X.copy()
sc.pp.scrublet(adata, batch_key="batch")
CPU times: user 35.6 s, sys: 19.9 s, total: 55.5 s Wall time: 13.2 s
sc.pp.normalize_total(
adata,
target_sum=1e4,
exclude_highly_expressed=True,
inplace=True
)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(
adata, n_top_genes=2000,
batch_key="batch",
flavor="cell_ranger",
)
sc.pl.highly_variable_genes(adata)
sc.tl.pca(adata)
sc.pl.pca_variance_ratio(adata, n_pcs=50, log=True)
sc.pl.pca(
adata,
color=["batch", "batch", "pct_counts_mt", "pct_counts_mt"],
dimensions=[(0, 1), (2, 3), (0, 1), (2, 3)],
ncols=2,
)
%%time
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.louvain(adata)
CPU times: user 18.3 s, sys: 0 ns, total: 18.3 s Wall time: 16.6 s
sc.pl.umap(adata, color=["batch", "louvain", "pct_counts_mt", "doublet_score"], ncols=2)
from scboolseq import binarization as scbin
exclusion_criteria = ['doublet_score', 'pct_counts_mt']
with sklearn.config_context(transform_output="pandas"):
    # The QuantileBinarizer with parameter "alpha=1.5" is the standard "Tukey's Fences for Outlier Detection"
    exclusion = scbin.QuantileBinarizer(alpha=1.5).fit(adata.obs[exclusion_criteria])
    excluded = exclusion.binarize(adata.obs[exclusion_criteria]).fillna(0)
(
excluded.sum().to_frame().T
.join(pd.Series(excluded.all(axis=1).sum(), index=[0], name="both"))
.join(pd.Series(excluded.any(axis=1).sum(), index=[0], name="any"))
)
doublet_score | pct_counts_mt | both | any | |
---|---|---|---|---|
0 | 148.0 | 207.0 | 12 | 343 |
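For reference, the upper Tukey fence with alpha=1.5 flags a value as an outlier when it exceeds Q3 + 1.5·IQR. Below is a minimal, scBoolSeq-independent sketch of that check on the same QC columns; the resulting counts should be comparable to those reported above (whether QuantileBinarizer also applies the lower fence is not shown here).

# Sketch of the upper Tukey fence (alpha = 1.5): flag values above Q3 + alpha * IQR.
# This is independent of scBoolSeq and only illustrates the outlier criterion.
def above_upper_fence(col, alpha=1.5):
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    return col > q3 + alpha * (q3 - q1)

manual_flags = adata.obs[exclusion_criteria].apply(above_upper_fence)
manual_flags.sum()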
clean_adata = adata[~excluded.any(axis=1), :].copy()
clean_adata
AnnData object with n_obs × n_vars = 5004 × 15176 obs: 'batch', 'tag', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'n_genes', 'doublet_score', 'predicted_doublet', 'louvain' var: 'mt', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection' uns: 'scrublet', 'log1p', 'hvg', 'pca', 'batch_colors', 'neighbors', 'umap', 'louvain', 'louvain_colors' obsm: 'X_pca', 'X_umap' varm: 'PCs' layers: 'counts' obsp: 'distances', 'connectivities'
sc.pl.violin(
clean_adata,
["n_genes_by_counts", "total_counts", "pct_counts_mt"],
jitter=0.7,
multi_panel=True,
)
sc.pl.scatter(clean_adata, "total_counts", "n_genes_by_counts", color="pct_counts_mt")
sc.pp.filter_cells(clean_adata, min_genes=100)
sc.pp.filter_genes(clean_adata, min_cells=3)
sc.pl.umap(
clean_adata,
color=[
"batch", "louvain",
"total_counts", "pct_counts_mt",
"n_genes_by_counts", "doublet_score"
],
ncols=2
)
We are satisfied with these results: the previous normalisation and quality control steps were effective, as judged by the UMAP projection, where no cluster appears to be driven by the doublet_score or by the percentage of mitochondrial counts.
As explained in the main text, scBoolSeq should not be applied to the whole dataset but rather to the subset of "Highly Variable Genes", which are deemed to be explicative of the phenomenon of interest.
These can be accessed using the "highly_variable" annotation of the var attribute of an AnnData object as follows:
_hvgs = adata.var.index[adata.var['highly_variable']]
ref = adata.to_df()[_hvgs]
ref.shape
(5347, 2000)
By overriding the parameter zeroes_are=0
of scBoolSeq
, cells with zero counts for genes classified as "ZeroInflated"
will have their coarse-grained activity assigned to be inactive rather than unknown.
This is a modelling choice that can be changed depending on the application.
%time scbool = scBoolSeq(zeroes_are=0)
CPU times: user 17 μs, sys: 7 μs, total: 24 μs Wall time: 30.3 μs
%time scbool.fit(ref)
CPU times: user 7min 6s, sys: 8min 7s, total: 15min 14s Wall time: 1min 12s
scBoolSeqBinarizer(zeroes_are=0)
As shown by the IPython magic command %time, estimating all the parameters needed by scBoolSeq to coarse-grain experimental data and to generate synthetic counts from Boolean dynamics is relatively fast. Fitting the reference dataset of 5347 cells and 2000 genes took a little over a minute (Wall time: 1min 12s above) on a modern laptop with an Intel i7-10875H (16) @ 5.100GHz processor.
It is important to note that scBoolSeq objects are picklable.
Even though fitting the instance (computing the classification parameters) is relatively fast,
there is no reason to compute them more than once:
with open("scbool_GSE122466.pkl", "wb") as _out_f:
pickle.dump(scbool, _out_f)
with open("scbool_GSE122466.pkl", "rb") as _in_f:
scbool = pickle.load(scbool, _in_f)
Serialised scBoolSeq instances require little disk space:
$ du -sh scbool_GSE122466.pkl
# 808K scbool_GSE122466.pkl
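The same check can also be done from Python (a quick sketch using the standard library):

# Size of the pickled estimator on disk, comparable to the `du -sh` output above.
import os
print(f"{os.path.getsize('scbool_GSE122466.pkl') / 1024:.0f} KiB")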
scbool.criteria_.Category.value_counts()
Category ZeroInf 1362 Discarded 403 Bimodal 235 Name: count, dtype: int64
Some genes selected as being Highly Variable by scanpy
have been rejected for binarisation by our classification scheme.
%time raw_bin_ref = scbool.binarize(ref)
raw_bin_ref.head()
CPU times: user 711 ms, sys: 0 ns, total: 711 ms Wall time: 710 ms
St18 | 3110035E14Rik | Mybl1 | Mcmdc2 | Tram1 | Terf1 | Tfap2d | Tfap2b | Mcm3 | Col9a1 | ... | Hba-a2 | Gm12146 | Gm16075 | Abi3 | Slc4a1 | Tgfbi | F2rl1 | Adssl1 | Papss2 | Ffar4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Lane1_AAACCTGAGATGTCGG | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Lane1_AAACCTGCAATCCAAC | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Lane1_AAACCTGGTTCCTCCA | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Lane1_AAACCTGTCCAATGGT | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Lane1_AAACGGGAGGCAATTA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 2000 columns
As seen here, coarse-graining the expression data is almost instant.
One might notice that numerous genes have NaN
everywhere. This is expected, as 403 genes were classified as Discarded.
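The next cell simply keeps the columns with non-zero variance. An alternative sketch (assuming scbool.criteria_ is indexed by gene name) would be to drop the Discarded genes explicitly:

# Sketch: drop the genes classified as "Discarded"
# (assumes scbool.criteria_ is indexed by gene name).
kept_genes = scbool.criteria_.index[scbool.criteria_.Category != "Discarded"]
bin_ref_alt = raw_bin_ref.loc[:, raw_bin_ref.columns.intersection(kept_genes)]
bin_ref_alt.shape  # expected to match the (5347, 1597) shape obtained below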
bin_ref = raw_bin_ref[raw_bin_ref.columns[raw_bin_ref.var() > 0]]
print(raw_bin_ref.shape, bin_ref.shape)
bin_ref.head(10)[bin_ref.var().sort_values(ascending=False).index[:10]].style.background_gradient()
(5347, 2000) (5347, 1597)
Serpinh1 | Lman1 | Msmo1 | Dapl1 | Mab21l2 | Mgarp | Slc2a1 | Glo1 | Hmgb2 | Mllt11 | |
---|---|---|---|---|---|---|---|---|---|---|
Lane1_AAACCTGAGATGTCGG | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 |
Lane1_AAACCTGCAATCCAAC | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | nan | 0.000000 |
Lane1_AAACCTGGTTCCTCCA | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 |
Lane1_AAACCTGTCCAATGGT | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | nan | 1.000000 |
Lane1_AAACGGGAGGCAATTA | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 |
Lane1_AAACGGGCACCAGGCT | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 |
Lane1_AAACGGGCACGGCGTT | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |
Lane1_AAACGGGCACTCAGGC | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 |
Lane1_AAACGGGCATCCAACA | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | nan | 0.000000 |
Lane1_AAACGGGGTCCAAGTT | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
One may want to pool groups of cells together in order to perform analyses on some sort of consensus Boolean signature, rather than on single cells.
For this purpose, the module scboolseq.meta contains classes and functions facilitating the aggregation of Boolean cells.
By defining these meta-observations of Boolean values, one can reason in terms of the common/consensus activation state of cell clusters.
scboolseq.meta.CellAggregator has two main arguments:
- binary_df, the first argument, containing the coarse-grained cells (the output of scboolseq.scBoolSeq().fit_transform(log_expression));
- groups, the second argument, which should be either a pandas GroupBy object or a dictionary whose keys are the group names and whose values are pandas.Index objects (an equivalent construction using a dictionary is sketched below, after the aggregator is created).
Combining these two, the aggregator exposes numerous properties representing different aggregation strategies:
%%time
aggregator = scboolseq.meta.CellAggregator(
bin_ref,
clean_adata.obs.groupby("louvain", observed=True)
)
aggregator
CPU times: user 17 ms, sys: 4.99 ms, total: 22 ms Wall time: 20.8 ms
<scboolseq.meta.aggregations.CellAggregator at 0x7a0c82688a50>
Here, we demonstrate what we call binary strategies. These use the following statistics to yield a consensus signature:
scboolseq.meta.CellAggregator._binary_strategies
('mode', 'median', 'min', 'max')
We can assess whether these strategies permit identifying the groups (in this case, the cell clusters determined via the Louvain algorithm: scanpy.tl.louvain(adata)).
%%time
scores = {
k: scboolseq.meta.meta_marker_counter(getattr(aggregator, k))
for k in aggregator._binary_strategies
}
CPU times: user 5.35 s, sys: 20.4 ms, total: 5.37 s Wall time: 5.35 s
{k: df.mean().astype(int).to_dict() for k, df in scores.items()}
{'mode': {'n_positive_markers': 7, 'n_negative_markers': 0}, 'median': {'n_positive_markers': 7, 'n_negative_markers': 0}, 'min': {'n_positive_markers': 1, 'n_negative_markers': 0}, 'max': {'n_positive_markers': 0, 'n_negative_markers': 68}}
The mode strategy corresponds to taking, for each gene, the by-group most frequent coarse-grained value. This strategy seems to yield a high average number of positive markers (genes classified as active within the group and in no other group).
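To make this concrete, here is a minimal pandas sketch of the underlying idea (a by-group mode consensus followed by a naive positive-marker count). It is only an illustration, not the CellAggregator / meta_marker_counter implementation, whose exact marker definition may differ:

# Sketch: by-group "mode" consensus (most frequent coarse-grained value per gene),
# then count genes whose consensus is active (1) in exactly one group.
consensus = (
    bin_ref
    .join(clean_adata.obs["louvain"])
    .groupby("louvain", observed=True)
    .agg(lambda col: col.mode().iloc[0] if not col.mode().empty else float("nan"))
)
active_in = consensus.eq(1.0)
naive_positive_markers = active_in.loc[:, active_in.sum() == 1].sum(axis=1)
naive_positive_markers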
_selected_mode = "mode"
metabin = scores[_selected_mode]
metabin
n_positive_markers | n_negative_markers | |
---|---|---|
0 | 3 | 0 |
1 | 7 | 0 |
10 | 1 | 0 |
11 | 1 | 0 |
2 | 3 | 0 |
3 | 2 | 0 |
4 | 16 | 1 |
5 | 19 | 0 |
6 | 24 | 0 |
7 | 11 | 0 |
8 | 2 | 0 |
9 | 2 | 0 |
To better assess the quality of these markers, we can add them to the clean_adata.obs
DataFrame.
We can subsequently visualise the number of markers alongside the corresponding clusters on the UMAP projection.
for louv in clean_adata.obs.louvain.unique():
    _nm = metabin.loc[louv, :].to_frame().T
    for c in _nm.columns:
        clean_adata.obs.loc[clean_adata.obs.louvain == louv, f"{c}_{_selected_mode}"] = _nm.loc[louv, c]
sc.pl.umap(clean_adata, color=["louvain", f"n_positive_markers_{_selected_mode}"], ncols=2)
We observe that some clusters have few Boolean markers. The extremes, representing the starting point of the differentiation process (early retinogenesis) and the final phenotypes, have a higher number of markers, while transient clusters have less extreme expression values and therefore fewer genes whose coarse-grained activity enables their identification.
These can be compared with the figure published in the original article:
Lo Giudice, Q., Leleu, M., La Manno, G., & Fabre, P. J. (2019).
Single-cell transcriptional logic of cell-fate specification and axon guidance in early-born retinal neurons.
Development (Cambridge), 146(17). https://doi.org/10.1242/dev.178103