Reference maping using scGen¶

In this tutorial, we are going to build a reference atlas using scGen and also map two new query datasets on the top of the reference atlas.

Note: scGen requires cell-type labels for data integration. The method outputs both corrected gene expression and also latent space.

In [1]:

import os
import sys
sys.path.insert(0, "../")

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

In [2]:

import scanpy as sc
import scarches as sca
from scarches.dataset.trvae.data_handling import remove_sparsity
import matplotlib.pyplot as plt
import numpy as np
import gdown

In [3]:

sc.settings.set_figure_params(dpi=200, frameon=False)
sc.set_figure_params(dpi=200)
sc.set_figure_params(figsize=(4, 4))

set relevant anndata.obs labels and training hyperparameters¶

In [4]:

condition_key = 'study'
cell_type_key = 'cell_type'
target_conditions = ['Pancreas CelSeq2', 'Pancreas SS2']


epoch = 50

early_stopping_kwargs = {
    "early_stopping_metric": "val_loss",
    "patience": 20,
    "threshold": 0,
    "reduce_lr": True,
    "lr_patience": 13,
    "lr_factor": 0.1,
}

Download Dataset and split into reference dataset and query dataset¶

In [5]:

url = 'https://drive.google.com/uc?id=1ehxgfHTsMZXy6YzlFKGJOsBKQ5rrvMnd'
output = 'pancreas.h5ad'
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1ehxgfHTsMZXy6YzlFKGJOsBKQ5rrvMnd
To: /home/mo/projects/scarches/notebooks/pancreas.h5ad
126MB [00:01, 102MB/s]

Out[5]:

'pancreas.h5ad'

Imortant note : scGen requires normalized and log-transformed data in adata.X

In [6]:

adata = sc.read('pancreas.h5ad')

original uncorrected data UMAP¶

In [7]:

sc.pp.neighbors(adata)

WARNING: You’re trying to run this on 1000 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.

In [8]:

sc.tl.umap(adata)

In [9]:

sc.pl.umap(adata, color=['study', 'cell_type'],
           frameon=False, wspace=0.6)

As observed above these Pancreas studies are seperated accroding to source of study

Here we use the CelSeq2 and SS2 studies as query data and the other 3 studies to build as reference atlas.

In [10]:

adata = remove_sparsity(adata) # remove sparsity 
source_adata = adata[~adata.obs[condition_key].isin(target_conditions)].copy()
target_adata = adata[adata.obs[condition_key].isin(target_conditions)].copy()

Create scGen model and train it on reference dataset¶

Create the scgen model instance

In [11]:

network = sca.models.scgen(adata = source_adata, 
                           hidden_layer_sizes=[256,128])

INITIALIZING NEW NETWORK..............
Encoder Architecture:
	Input Layer in, out: 1000 256
	Hidden Layer 1 in/out: 256 128
	Mean/Var Layer in/out: 128 10
Decoder Architecture:
	First Layer in, out 10 128
	Hidden Layer 1 in/out: 128 256
	Output Layer in/out:  256 1000

In [12]:

network.train(n_epochs=epoch, early_stopping_kwargs = early_stopping_kwargs)

 |████████████████████| 100.0%  - epoch_loss: 1.9351395184 - val_loss: 1.8963928728
Saving best state of network...
Best State was in Epoch 87

Correct batches in reference data¶

This function returns corrected gene expression in adata.X, raw uncorrected data in adata.obsm["original_data"]. Also it returns uncorrected data in adata.layers["original_data"].

The low-dimensional corrected latent space in adata.obsm["latent_corrected"]

In [13]:

corrected_reference_adata = network.batch_removal(source_adata, batch_key="study", cell_label_key="cell_type",return_latent=True)

Corrected gene expression

In [14]:

sc.pp.neighbors(corrected_reference_adata)
sc.tl.umap(corrected_reference_adata)
sc.pl.umap(corrected_reference_adata, color=["study", "cell_type"], wspace=.5, frameon=False)

WARNING: You’re trying to run this on 1000 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.
... storing 'cell_type' as categorical

We an also use low-dim corrected reference data

In [15]:

sc.pp.neighbors(corrected_reference_adata, use_rep="latent_corrected")
sc.tl.umap(corrected_reference_adata)
sc.pl.umap(corrected_reference_adata,
           color=['study', 'cell_type'],
           frameon=False,
           wspace=0.6,
           )

After training, the model can be saved for later use and projection of new query studies

In [16]:

ref_path = './ref_model/'
network.save(ref_path, overwrite=True)

In [17]:

os.getcwd()

Out[17]:

'/home/mo/projects/scarches/notebooks'

Project query on top of the reference¶

query data needs to be preprocessed same way as reference data with same genes

This function need pretrained reference model, corrected gene expression from reference data and incorrected query data

In [18]:

# here we pass the saved model from a file to the map query
integrated_query = sca.models.scgen.map_query_data(reference_model = network,
                                                   corrected_reference = corrected_reference_adata,
                                                   query = target_adata, 
                                                   batch_key = 'study',
                                                   return_latent=True)

INITIALIZING NEW NETWORK..............
Encoder Architecture:
	Input Layer in, out: 1000 256
	Hidden Layer 1 in/out: 256 128
	Mean/Var Layer in/out: 128 10
Decoder Architecture:
	First Layer in, out 10 128
	Hidden Layer 1 in/out: 128 256
	Output Layer in/out:  256 1000

Plot the latent space of integrated query and reference¶

In [19]:

sc.pp.neighbors(integrated_query, use_rep="latent_corrected")
sc.tl.umap(integrated_query)
sc.pl.umap(
    integrated_query,
    color=["study", "cell_type"],
    frameon=False,
    wspace=0.6)

... storing 'batch' as categorical
... storing 'study' as categorical
... storing 'cell_type' as categorical
... storing 'original_batch' as categorical

Plot corrected gene expression space of integrated query and reference¶

In [20]:

sc.pp.neighbors(integrated_query)
sc.tl.umap(integrated_query)
sc.pl.umap(
    integrated_query,
    color=["study", "cell_type"],
    frameon=False,
    wspace=0.6)

WARNING: You’re trying to run this on 1000 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.