#!/usr/bin/env python
# coding: utf-8

# # Preprocessing Tutorial
# This tutorial shows how to preprocess data with dynamo. The new version introduces the `Preprocessor` class, which lets you explore different preprocessing recipes and configure the parameters of each step. The built-in recipes are monocle, Pearson residuals, seurat and sctransform. You can also replace any preprocessing step with your own implementation. For instance, the monocle pipeline in `Preprocessor` contains `filter_cells_by_outliers`, `filter_genes_by_outliers`, `normalize_by_cells`, `select_genes` and other steps. You can swap the implementation of these functions, or change the default monocle parameters passed to them, by replacing or changing attributes of `Preprocessor`.
#
# Older versions of dynamo offered several recipes, among which `recipe_monocle` is the basic building block of the others. You can still use these functions to preprocess data.
#
# `Preprocessor` provides `config_monocle_recipe` and other `config_*_recipe` methods to help you reproduce the results of different recipes and integrate your own newly developed preprocessing algorithms.

# Import packages

# In[2]:


import dynamo as dyn
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
from dynamo.configuration import DKM
import warnings

warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", message="numpy.dtype size changed")

dyn.configuration.set_figure_params('dynamo', background='white')
dyn.get_all_dependencies_version()


# ## Glossary of keys generated during preprocessing
#
# - `adata.obs.pass_basic_filter`: a list of boolean values indicating whether each cell passes the basic filters. In the monocle recipe, basic filtering is based on thresholding expression values.
# - `adata.var.pass_basic_filter`: a list of boolean values indicating whether each gene passes the basic filters.
#   In the monocle recipe, basic filtering is based on thresholding expression values.
# - `adata.var.use_for_pca`: a list of boolean values marking the genes used for PCA dimension reduction and the downstream analysis that follows. In many recipes, this key is equivalent to the highly variable genes.
# - `adata.var.highly_variable_scores`: a list of float scores indicating how variable each gene is, typically generated during gene feature selection (`preprocessor.select_genes`). Note that some recipes do not produce these scores. E.g. the `seuratV3` recipe implemented in dynamo does not have highly variable scores due to its thresholding nature.
# - `adata.layers.X_spliced`: spliced expression matrix after normalization, used in downstream computation
# - `adata.layers.X_unspliced`: unspliced expression matrix after normalization, used in downstream computation
# - `adata.obsm.X_pca`: normalized X after PCA transformation
#
# ## Using Predefined (default) Recipe Configurations in Preprocessor

# In[4]:


adata = dyn.sample_data.zebrafish()


# Read zebrafish data

# In[5]:


adata = dyn.sample_data.zebrafish()
celltype_key = "Cell_type"
figsize = (10, 10)


# Import `Preprocessor` class

# In[6]:


from dynamo.preprocessing import Preprocessor


# `dynamo` provides users with `preprocess_adata`, a simple wrapper that applies the preprocessing steps with default settings. In this section, we will go through the recipes available in `preprocess_adata` and observe how the choice of preprocessing method can influence visualization results.
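
# The keys listed in the glossary can be checked programmatically once a recipe has run. The helper below is a minimal sketch, not part of dynamo: `GLOSSARY_KEYS`, `missing_glossary_keys` and the stand-in `demo` object are hypothetical names introduced for illustration. It relies only on the `in` operator, which tests column names on a DataFrame and keys on a mapping, so it should also accept a real `AnnData` object.

```python
from types import SimpleNamespace

# Glossary keys grouped by the AnnData attribute that holds them.
GLOSSARY_KEYS = {
    "obs": ["pass_basic_filter"],
    "var": ["pass_basic_filter", "use_for_pca", "highly_variable_scores"],
    "layers": ["X_spliced", "X_unspliced"],
    "obsm": ["X_pca"],
}


def missing_glossary_keys(adata):
    """Return the glossary keys that are absent, as 'attribute.key' strings."""
    missing = []
    for attr, keys in GLOSSARY_KEYS.items():
        container = getattr(adata, attr)
        for key in keys:
            # `in` tests column names on a DataFrame and keys on a mapping.
            if key not in container:
                missing.append(f"{attr}.{key}")
    return missing


# Stand-in object (an assumption for this sketch) so the example runs
# without dynamo; plain dicts mimic the containers of a real AnnData.
demo = SimpleNamespace(
    obs={"pass_basic_filter": [True, False]},
    var={"pass_basic_filter": [True], "use_for_pca": [True]},
    layers={"X_spliced": None, "X_unspliced": None},
    obsm={},
)
print(missing_glossary_keys(demo))
```

# With a real dataset, the call would be `missing_glossary_keys(adata)` after `preprocess_adata`. A missing `var.highly_variable_scores` entry is not necessarily an error, since some recipes do not compute these scores.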
#
# ### Applying Monocle Recipe
# **TODO: remake the flowchart**
# ![monocle-flowchart](./images/normalization-monocle-dynamo-flow-chart.png)

# In[7]:


preprocessor = Preprocessor()
preprocessor.preprocess_adata(adata, recipe="monocle")

# Alternative
# preprocessor.config_monocle_recipe(adata)
# preprocessor.preprocess_adata_monocle(adata)

default_preprocessor_monocle_adata = adata  # save for usage later


# In[8]:


dyn.tl.reduceDimension(adata, basis="pca")
dyn.pl.umap(adata, color=celltype_key, figsize=(4, 4), adjust_legend=True)


# ### Applying Pearson Residuals Recipe

# In[9]:


adata = dyn.sample_data.zebrafish()
preprocessor = Preprocessor()
# preprocessor.config_pearson_residuals_recipe(adata)
# preprocessor.preprocess_adata_pearson_residuals(adata)
preprocessor.preprocess_adata(adata, recipe="pearson_residuals")


# In[10]:


dyn.tl.reduceDimension(adata)
dyn.pl.umap(adata, color=celltype_key, figsize=(4, 4), adjust_legend=True)


# ### Applying Sctransform Recipe

# In[11]:


adata = dyn.sample_data.zebrafish()
preprocessor = Preprocessor()
# preprocessor.config_sctransform_recipe(adata)
# preprocessor.preprocess_adata_sctransform(adata)
preprocessor.preprocess_adata(adata, recipe="sctransform")


# In[12]:


dyn.tl.reduceDimension(adata)
dyn.pl.umap(adata, color=celltype_key, figsize=(4, 4), adjust_legend=True)


# ### Applying Seurat Recipe

# In[13]:


adata = dyn.sample_data.zebrafish()
preprocessor = Preprocessor()
# preprocessor.config_seurat_recipe()
# preprocessor.preprocess_adata_seurat(adata)
preprocessor.preprocess_adata(adata, recipe="seurat")


# In[14]:


dyn.tl.reduceDimension(adata)
dyn.pl.umap(adata, color=celltype_key, figsize=(4, 4), adjust_legend=True)


# ## Customize Function Parameters Configured in Preprocessor
# Here we are going to use the monocle recipe as an example. In the monocle recipe's gene selection function, we can set `recipe` to `dynamo_monocle`, `seurat`, `svr` and others to apply different criteria for selecting genes.
# We can set the preprocessor's `select_genes_kwargs` to pass the desired parameters. In the example below, the default parameter is `recipe=dynamo_monocle`. We can change it to `seurat` and add other constraint parameters as well.

# Let's call `preprocessor.config_monocle_recipe` to set the `monocle` recipe's preprocessing steps and the corresponding function parameters. The default constructor parameters of `Preprocessor` come from the monocle recipe used in the `dynamo` papers.

# In[22]:


adata = dyn.sample_data.zebrafish()
preprocessor = Preprocessor()
preprocessor.config_monocle_recipe(adata)


# `preprocessor.select_genes_kwargs` contains the arguments that will be passed to the `select_genes` step.

# In[23]:


preprocessor.select_genes_kwargs


# In[25]:


preprocessor.select_genes_kwargs = dict(
    n_top_genes=2000,
    SVRs_kwargs={'relative_expr': False}
)
preprocessor.select_genes_kwargs


# In[26]:


preprocessor.preprocess_adata_monocle(adata);
dyn.tl.reduceDimension(adata, basis="pca")
dyn.pl.umap(adata, color=celltype_key, figsize=(4, 4), adjust_legend=True)


# ## Define customized preprocessor steps and integrate with existing preprocessor