In this tutorial, we are going to build a reference atlas using scGen and also map two new query datasets on the top of the reference atlas.
Note: scGen requires cell-type labels for data integration. The method outputs both corrected gene expression and also latent space.
import os
import sys
sys.path.insert(0, "../")
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
import scanpy as sc
import scarches as sca
from scarches.dataset.trvae.data_handling import remove_sparsity
import matplotlib.pyplot as plt
import numpy as np
import gdown
sc.settings.set_figure_params(dpi=200, frameon=False)
sc.set_figure_params(dpi=200)
sc.set_figure_params(figsize=(4, 4))
condition_key = 'study'
cell_type_key = 'cell_type'
target_conditions = ['Pancreas CelSeq2', 'Pancreas SS2']
epoch = 50
early_stopping_kwargs = {
"early_stopping_metric": "val_loss",
"patience": 20,
"threshold": 0,
"reduce_lr": True,
"lr_patience": 13,
"lr_factor": 0.1,
}
url = 'https://drive.google.com/uc?id=1ehxgfHTsMZXy6YzlFKGJOsBKQ5rrvMnd'
output = 'pancreas.h5ad'
gdown.download(url, output, quiet=False)
Downloading... From: https://drive.google.com/uc?id=1ehxgfHTsMZXy6YzlFKGJOsBKQ5rrvMnd To: /home/mo/projects/scarches/notebooks/pancreas.h5ad 126MB [00:01, 102MB/s]
'pancreas.h5ad'
Imortant note : scGen requires normalized and log-transformed data in adata.X
adata = sc.read('pancreas.h5ad')
sc.pp.neighbors(adata)
WARNING: You’re trying to run this on 1000 dimensions of `.X`, if you really want this, set `use_rep='X'`. Falling back to preprocessing with `sc.pp.pca` and default params.
sc.tl.umap(adata)
sc.pl.umap(adata, color=['study', 'cell_type'],
frameon=False, wspace=0.6)
As observed above these Pancreas studies are seperated accroding to source of study
Here we use the CelSeq2 and SS2 studies as query data and the other 3 studies to build as reference atlas.
adata = remove_sparsity(adata) # remove sparsity
source_adata = adata[~adata.obs[condition_key].isin(target_conditions)].copy()
target_adata = adata[adata.obs[condition_key].isin(target_conditions)].copy()
Create the scgen model instance
network = sca.models.scgen(adata = source_adata,
hidden_layer_sizes=[256,128])
INITIALIZING NEW NETWORK.............. Encoder Architecture: Input Layer in, out: 1000 256 Hidden Layer 1 in/out: 256 128 Mean/Var Layer in/out: 128 10 Decoder Architecture: First Layer in, out 10 128 Hidden Layer 1 in/out: 128 256 Output Layer in/out: 256 1000
network.train(n_epochs=epoch, early_stopping_kwargs = early_stopping_kwargs)
|████████████████████| 100.0% - epoch_loss: 1.9351395184 - val_loss: 1.8963928728 Saving best state of network... Best State was in Epoch 87
This function returns corrected gene expression in adata.X
, raw uncorrected data in adata.obsm["original_data"]
. Also it returns uncorrected data in adata.layers["original_data"]
.
The low-dimensional corrected latent space in adata.obsm["latent_corrected"]
corrected_reference_adata = network.batch_removal(source_adata, batch_key="study", cell_label_key="cell_type",return_latent=True)
Corrected gene expression
sc.pp.neighbors(corrected_reference_adata)
sc.tl.umap(corrected_reference_adata)
sc.pl.umap(corrected_reference_adata, color=["study", "cell_type"], wspace=.5, frameon=False)
WARNING: You’re trying to run this on 1000 dimensions of `.X`, if you really want this, set `use_rep='X'`. Falling back to preprocessing with `sc.pp.pca` and default params. ... storing 'cell_type' as categorical
We an also use low-dim corrected reference data
sc.pp.neighbors(corrected_reference_adata, use_rep="latent_corrected")
sc.tl.umap(corrected_reference_adata)
sc.pl.umap(corrected_reference_adata,
color=['study', 'cell_type'],
frameon=False,
wspace=0.6,
)
After training, the model can be saved for later use and projection of new query studies
ref_path = './ref_model/'
network.save(ref_path, overwrite=True)
os.getcwd()
'/home/mo/projects/scarches/notebooks'
query data needs to be preprocessed same way as reference data with same genes
This function need pretrained reference model, corrected gene expression from reference data and incorrected query data
# here we pass the saved model from a file to the map query
integrated_query = sca.models.scgen.map_query_data(reference_model = network,
corrected_reference = corrected_reference_adata,
query = target_adata,
batch_key = 'study',
return_latent=True)
INITIALIZING NEW NETWORK.............. Encoder Architecture: Input Layer in, out: 1000 256 Hidden Layer 1 in/out: 256 128 Mean/Var Layer in/out: 128 10 Decoder Architecture: First Layer in, out 10 128 Hidden Layer 1 in/out: 128 256 Output Layer in/out: 256 1000
sc.pp.neighbors(integrated_query, use_rep="latent_corrected")
sc.tl.umap(integrated_query)
sc.pl.umap(
integrated_query,
color=["study", "cell_type"],
frameon=False,
wspace=0.6)
... storing 'batch' as categorical ... storing 'study' as categorical ... storing 'cell_type' as categorical ... storing 'original_batch' as categorical
sc.pp.neighbors(integrated_query)
sc.tl.umap(integrated_query)
sc.pl.umap(
integrated_query,
color=["study", "cell_type"],
frameon=False,
wspace=0.6)
WARNING: You’re trying to run this on 1000 dimensions of `.X`, if you really want this, set `use_rep='X'`. Falling back to preprocessing with `sc.pp.pca` and default params.