#!/usr/bin/env python
# coding: utf-8

# # Preprocessing Tutorial  
# This tutorial shows how to use dynamo to preprocess data. The new version introduces the `Preprocessor` class, which lets you explore different preprocessing recipes and configure the parameters of each step. The recipes currently available in `Preprocessor` are monocle, Pearson residuals, seurat and sctransform. You can also replace any preprocessing step with your own implementation. For instance, the monocle pipeline in `Preprocessor` contains `filter_cells_by_outliers`, `filter_genes_by_outliers`, `normalize_by_cells`, `select_genes` and other steps. To change the implementation or the default monocle parameters passed to these functions, replace or modify the corresponding attributes of `Preprocessor`.
# 
# In older versions, dynamo offered several recipes, among which `recipe_monocle` is the basic function that serves as a building block for the other recipes. You can still use these functions to preprocess data.
# 
# `Preprocessor` provides `config_monocle_recipe` and other `config_*_recipe` methods to help you reproduce the results of the different recipes and to integrate your own newly developed preprocessing algorithms.
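
# The replaceable-step design described above can be sketched in plain Python. This is a minimal illustration and not dynamo's implementation: `MiniPreprocessor`, its two step functions and the toy data are hypothetical names invented for this example.

# In[ ]:


def drop_empty_cells(cells):
    """Toy filter step: keep only cells with at least one count."""
    return [c for c in cells if sum(c) > 0]


def scale_to_unit_total(cells):
    """Toy normalization step: divide each cell by its total count."""
    return [[x / sum(c) for x in c] for c in cells]


class MiniPreprocessor:
    """Hypothetical stand-in: every step is an attribute that can be swapped."""

    def __init__(self):
        self.filter_cells = drop_empty_cells
        self.normalize = scale_to_unit_total

    def run(self, cells):
        # Steps are looked up at call time, so reassigned attributes take effect.
        for step in (self.filter_cells, self.normalize):
            if step is not None:
                cells = step(cells)
        return cells


toy_cells = [[2, 2], [0, 0], [1, 3]]
mini = MiniPreprocessor()
default_result = mini.run(toy_cells)

# Swap in a custom normalization without touching the rest of the pipeline.
mini.normalize = lambda cells: [[x * 10 for x in c] for c in cells]
custom_result = mini.run(toy_cells)
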

# Import packages

# In[2]:


import dynamo as dyn
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
from dynamo.configuration import DKM
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
dyn.configuration.set_figure_params('dynamo', background='white')
dyn.get_all_dependencies_version()


# ## Glossary of keys generated during preprocessing
# 
# 
# - `adata.obs.pass_basic_filter`: boolean values indicating whether each cell passes the basic filters. In the monocle recipe, basic filtering is based on thresholding of expression values.
# - `adata.var.pass_basic_filter`: boolean values indicating whether each gene passes the basic filters. In the monocle recipe, basic filtering is based on thresholding of expression values.
# - `adata.var.use_for_pca`: boolean values marking the genes used for PCA dimension reduction and the downstream analysis that follows. In many recipes, this key is equivalent to the highly variable genes.
# - `adata.var.highly_variable_scores`: float scores indicating how variable each gene is, typically generated during gene feature selection (`preprocessor.select_genes`). Note that some recipes do not produce these scores. For example, the `seuratV3` recipe implemented in dynamo does not, because it selects genes by thresholding.
# - `adata.layers.X_spliced`: spliced expression matrix after normalization, used in downstream computation
# - `adata.layers.X_unspliced`: unspliced expression matrix after normalization, used in downstream computation
# - `adata.obsm.X_pca`: the normalized expression matrix `X` after PCA transformation
#   
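
# To make the relationship between the gene-level keys concrete, the sketch below mimics them with plain Python lists. All values are toy data, and the selection rule (rank only the genes that passed the basic filter, then keep the top `n_top_genes`) is an assumption made for illustration. In a real run these keys are columns of `adata.var` filled in by `Preprocessor`.

# In[ ]:


gene_names = ["g1", "g2", "g3", "g4"]
pass_basic_filter = [True, False, True, True]
highly_variable_scores = [0.9, 2.5, 0.1, 1.7]
n_top_genes = 2

# Rank only the genes that passed the basic filter, highest score first.
candidates = [i for i, ok in enumerate(pass_basic_filter) if ok]
ranked = sorted(candidates, key=lambda i: highly_variable_scores[i], reverse=True)
top = set(ranked[:n_top_genes])

# Boolean mask in the same spirit as `use_for_pca`: one flag per gene.
use_for_pca = [i in top for i in range(len(gene_names))]
pca_genes = [g for g, keep in zip(gene_names, use_for_pca) if keep]
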

# ## Using Predefined (default) Recipe Configurations in Preprocessor

# Read zebrafish data

# In[5]:


adata = dyn.sample_data.zebrafish()
celltype_key = "Cell_type"
figsize = (10, 10)


# Import `Preprocessor` class

# In[6]:


from dynamo.preprocessing import Preprocessor


# dynamo's `Preprocessor` provides `preprocess_adata`, a simple wrapper that applies the preprocessing steps of a recipe with default settings. In this section, we go through the recipes available in `preprocess_adata` and observe how the choice of preprocessing method influences the visualization results.
# 

# ### Applying Monocle Recipe 

# **TODO: remake the flowchart**
# ![monocle-flowchart](./images/normalization-monocle-dynamo-flow-chart.png)

# In[7]:


preprocessor = Preprocessor()
preprocessor.preprocess_adata(adata, recipe="monocle")

# Alternative
# preprocessor.config_monocle_recipe(adata)
# preprocessor.preprocess_adata_monocle(adata)

default_preprocessor_monocle_adata = adata # save for usage later


# In[8]:


dyn.tl.reduceDimension(adata, basis="pca")
dyn.pl.umap(adata, color=celltype_key, figsize=(4,4),
           adjust_legend=True)


# ### Applying Pearson Residuals Recipe

# In[9]:


adata = dyn.sample_data.zebrafish()
preprocessor = Preprocessor()
# preprocessor.config_pearson_residuals_recipe(adata)
# preprocessor.preprocess_adata_pearson_residuals(adata)
preprocessor.preprocess_adata(adata, recipe="pearson_residuals")


# In[10]:


dyn.tl.reduceDimension(adata)
dyn.pl.umap(adata, color=celltype_key, figsize=(4,4),
           adjust_legend=True)
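

# For intuition about what a Pearson residuals transform computes, the sketch below applies the commonly used analytic formula to a toy count matrix. It is a plain-Python illustration under stated assumptions and not dynamo's code: the function name `toy_pearson_residuals`, the overdispersion `theta=100` and the clipping at the square root of the number of cells are conventional choices assumed here, and every cell and gene is assumed to have at least one count.

# In[ ]:


import math


def toy_pearson_residuals(counts, theta=100.0):
    """Residuals of counts against a negative binomial null model (cells x genes)."""
    n_cells = len(counts)
    n_genes = len(counts[0])
    cell_totals = [sum(row) for row in counts]
    gene_totals = [sum(row[j] for row in counts) for j in range(n_genes)]
    grand_total = sum(cell_totals)
    clip = math.sqrt(n_cells)
    residuals = []
    for i, row in enumerate(counts):
        out = []
        for j, x in enumerate(row):
            # Expected count if the gene's share were identical in every cell.
            mu = cell_totals[i] * gene_totals[j] / grand_total
            z = (x - mu) / math.sqrt(mu + mu * mu / theta)
            out.append(max(-clip, min(clip, z)))
        residuals.append(out)
    return residuals


toy_counts = [[1, 3], [2, 2]]
toy_residuals = toy_pearson_residuals(toy_counts)
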


# ### Applying Sctransform Recipe

# In[11]:


adata = dyn.sample_data.zebrafish()
preprocessor = Preprocessor()
# preprocessor.config_sctransform_recipe(adata)
# preprocessor.preprocess_adata_sctransform(adata)
preprocessor.preprocess_adata(adata, recipe="sctransform")


# In[12]:


dyn.tl.reduceDimension(adata)
dyn.pl.umap(adata, color=celltype_key, figsize=(4,4),
           adjust_legend=True)


# ### Applying Seurat Recipe

# In[13]:


adata = dyn.sample_data.zebrafish()
preprocessor = Preprocessor()
# preprocessor.config_seurat_recipe(adata)
# preprocessor.preprocess_adata_seurat(adata)
preprocessor.preprocess_adata(adata, recipe="seurat")


# In[14]:


dyn.tl.reduceDimension(adata)
dyn.pl.umap(adata, color=celltype_key, figsize=(4,4),
           adjust_legend=True)


# ## Customize Function Parameters Configured in Preprocessor
# Here we use the monocle recipe as an example. In the monocle recipe's gene selection function, `recipe` can be set to `dynamo_monocle`, `seurat`, `svr` and others to apply different criteria for selecting genes. Set the preprocessor's `select_genes_kwargs` to pass the desired parameters. In the example below, the default parameter is `recipe=dynamo_monocle`. We can change it to `seurat` and add other constraint parameters as well.

# Let's call `preprocessor.config_monocle_recipe` to set the preprocessing steps of the `monocle` recipe and their corresponding function parameters. The default constructor parameters of `Preprocessor` come from the monocle recipe used in the `dynamo` papers.

# In[22]:


adata = dyn.sample_data.zebrafish()
preprocessor = Preprocessor()
preprocessor.config_monocle_recipe(adata)


# `preprocessor.select_genes_kwargs` contains the arguments that will be passed to the `select_genes` step.

# In[23]:


preprocessor.select_genes_kwargs


# In[25]:


preprocessor.select_genes_kwargs = dict(
    n_top_genes=2000,
    SVRs_kwargs={'relative_expr': False}
)
preprocessor.select_genes_kwargs
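

# Assigning a new dictionary, as in the cell above, replaces every stored argument rather than merging with the existing ones. The plain-Python sketch below shows the difference between reassigning and updating a keyword-argument dictionary that is forwarded to a step. The function `toy_select_genes` and its `min_score` parameter are hypothetical and exist only for this illustration.

# In[ ]:


def toy_select_genes(scores, n_top_genes=3, min_score=0.0):
    """Hypothetical step: names of the top-scoring genes above a threshold."""
    kept = [g for g, s in scores.items() if s >= min_score]
    kept.sort(key=lambda g: scores[g], reverse=True)
    return kept[:n_top_genes]


toy_scores = {"g1": 0.9, "g2": 2.5, "g3": 0.1, "g4": 1.7}
step_kwargs = {"n_top_genes": 2, "min_score": 0.5}

# The stored dictionary is forwarded to the step as keyword arguments.
selected = toy_select_genes(toy_scores, **step_kwargs)

# Reassigning the dictionary drops `min_score`, which falls back to its default.
step_kwargs = {"n_top_genes": 4}
after_replace = toy_select_genes(toy_scores, **step_kwargs)

# update() changes one entry and keeps the others.
step_kwargs = {"n_top_genes": 2, "min_score": 0.5}
step_kwargs.update(n_top_genes=4)
after_update = toy_select_genes(toy_scores, **step_kwargs)
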


# In[26]:


preprocessor.preprocess_adata_monocle(adata)
dyn.tl.reduceDimension(adata, basis="pca")
dyn.pl.umap(adata, color=celltype_key, figsize=(4,4),
           adjust_legend=True)


# ## Define customized preprocessor steps and integrate with existing preprocessor