Before we dig into cell type annotation, we need to filter out cells that don't meet QC criteria. We'll apply the following filters:
From scrublet: predicted_doublet == False
to remove doublets
From scanpy: pct_counts_mito < 10
to remove high mitochondrial content
From pipeline metadata: n_genes > 200
to remove low-content cells
From pipeline metadata: n_genes < 5000
to remove abnormal cells that may be doublets
After filtering, we use Harmony to integrate samples across our cohorts so that cohort is not a major contributor to the variation in the dataset (including both age of subjects as well as collection and sample handling), and our clustering will mainly focus on differences between cell types. Harmony is described in this publication:
Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019)
After integration, we use scanpy to perform PCA, UMAP, and Leiden clustering, then store the results in HISE for use in cell type assignment.
from datetime import date
import hisepy
import os
import pandas as pd
import re
import scanpy as sc
import scanpy.external as sce
h5ad_uuid = 'ca8cb1b9-f37b-453e-b606-89ae15f711ac'
hise_res = hisepy.reader.cache_files([h5ad_uuid])
h5ad_path = '/home/jupyter/cache/{u}'.format(u = h5ad_uuid)
h5ad_filename = os.listdir(h5ad_path)[0]
h5ad_file = '{p}/{f}'.format(p = h5ad_path, f = h5ad_filename)
adata = sc.read_h5ad(h5ad_file)
adata
AnnData object with n_obs × n_vars = 2093078 × 33538 obs: 'barcodes', 'batch_id', 'cell_name', 'cell_uuid', 'chip_id', 'hto_barcode', 'hto_category', 'n_genes', 'n_mito_umis', 'n_reads', 'n_umis', 'original_barcodes', 'pbmc_sample_id', 'pool_id', 'well_id', 'sample.sampleKitGuid', 'cohort.cohortGuid', 'subject.subjectGuid', 'subject.biologicalSex', 'subject.race', 'subject.ethnicity', 'subject.birthYear', 'sample.visitName', 'sample.drawDate', 'file.id', 'subject.cmv', 'subject.bmi', 'celltypist.low', 'seurat.l1', 'seurat.l1.score', 'seurat.l2', 'seurat.l2.score', 'seurat.l2.5', 'seurat.l2.5.score', 'seurat.l3', 'seurat.l3.score', 'predicted_doublet', 'doublet_score'
Remove doublets based on Scrublet calls:
n_start = adata.shape[0]
n_doublets = sum(adata.obs['predicted_doublet'] == True)
perc_doublets = round(n_doublets / adata.shape[0] * 100, 2)
n_retain = adata.shape[0] - n_doublets
perc_retain = round(n_retain / adata.shape[0] * 100, 2)
doublet_message = 'Removing {n} Doublets = {p}%'.format(n = n_doublets, p = perc_doublets)
print(doublet_message)
retain_message = 'Retaining {n} Singlets = {p}%'.format(n = n_retain, p = perc_retain)
print(retain_message)
Removing 27907 Doublets = 1.33% Retaining 2065171 Singlets = 98.67%
adata = adata[adata.obs['predicted_doublet'] == False]
Compute Mitochondrial RNA QC metrics
adata.var["mito"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mito"], inplace=True)
/tmp/ipykernel_35889/2209369727.py:1: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual. adata.var["mito"] = adata.var_names.str.startswith("MT-")
sc.pl.violin(
adata,
['n_genes_by_counts', 'total_counts', 'pct_counts_mito'],
multi_panel = True
)
/opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead. with pd.option_context('mode.use_inf_as_na', True): /opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead. with pd.option_context('mode.use_inf_as_na', True): /opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead. with pd.option_context('mode.use_inf_as_na', True): /opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead. with pd.option_context('mode.use_inf_as_na', True): /opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead. with pd.option_context('mode.use_inf_as_na', True): /opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead. with pd.option_context('mode.use_inf_as_na', True):
Remove cells with high mitochondrial fraction
mito_cut = 10
n_mito = sum(adata.obs["pct_counts_mito"] >= mito_cut)
perc_mito = round(n_mito / adata.shape[0] * 100, 2)
n_retain = adata.shape[0] - n_mito
perc_retain = round(n_retain / adata.shape[0] * 100, 2)
mito_message = 'Removing {n} High Mito = {p}%'.format(n = n_mito, p = perc_mito)
print(mito_message)
retain_message = 'Retaining {n} Low Mito = {p}%'.format(n = n_retain, p = perc_retain)
print(retain_message)
Removing 109940 High Mito = 5.32% Retaining 1955231 Low Mito = 94.68%
adata = adata[(adata.obs["pct_counts_mito"] < mito_cut)]
Filter cells with low or abnormally high gene counts
low_cut = 200
n_low = sum(adata.obs["n_genes"] <= low_cut)
perc_low = round(n_low / adata.shape[0] * 100, 2)
n_retain = adata.shape[0] - n_low
perc_retain = round(n_retain / adata.shape[0] * 100, 2)
low_message = 'Removing {n} Low Gene Count = {p}%'.format(n = n_low, p = perc_low)
print(low_message)
retain_message = 'Retaining {n} Cells = {p}%'.format(n = n_retain, p = perc_retain)
print(retain_message)
Removing 1306 Low Gene Count = 0.07% Retaining 1953925 Low Mito = 99.93%
adata = adata[adata.obs["n_genes"] > low_cut]
high_cut = 5000
n_high = sum(adata.obs["n_genes"] >= high_cut)
perc_high = round(n_high / adata.shape[0] * 100, 2)
n_retain = adata.shape[0] - n_high
perc_retain = round(n_retain / adata.shape[0] * 100, 2)
high_message = 'Removing {n} High Gene Count = {p}%'.format(n = n_high, p = perc_high)
print(high_message)
retain_message = 'Retaining {n} Cells = {p}%'.format(n = n_retain, p = perc_retain)
print(retain_message)
Removing 1797 High Mito = 0.09% Retaining 1952128 Low Mito = 99.91%
adata = adata[adata.obs["n_genes"] < high_cut]
n_removed = n_start - adata.shape[0]
perc_removed = round(n_removed / n_start * 100, 2)
n_retain = adata.shape[0]
perc_retain = round(n_retain / n_start * 100, 2)
total_message = 'Removed {n} in Total = {p}%'.format(n = n_removed, p = perc_removed)
print(total_message)
retain_message = 'Retaining {n} in Total = {p}%'.format(n = n_retain, p = perc_retain)
print(retain_message)
Removed 140950 in Total = 6.73% Retaining 1952128 in Total = 93.27%
Because we're interested in cell types, not age-associated variation in our reference, we'll use Harmony to remove variation associated with cohort.cohortGuid
, which corresponds to the major age groups in our dataset: UP1 = Pediatric, BR1 = Young Adult, BR2 = Older Adult.
adata.raw = adata
sc.pp.normalize_total(adata, target_sum = 1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)
/opt/conda/lib/python3.10/site-packages/scanpy/preprocessing/_highly_variable_genes.py:220: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. disp_grouped = df.groupby('mean_bin')['dispersions']
adata = adata[:, adata.var_names[adata.var['highly_variable']]].copy()
sc.pp.scale(adata)
%%time
sc.tl.pca(adata, svd_solver = 'arpack')
CPU times: user 48min 28s, sys: 16min 59s, total: 1h 5min 27s Wall time: 2min 30s
%%time
sce.pp.harmony_integrate(
adata,
'cohort.cohortGuid',
max_iter_harmony = 20
)
2024-02-19 18:06:14,216 - harmonypy - INFO - Computing initial centroids with sklearn.KMeans... 2024-02-19 18:18:00,701 - harmonypy - INFO - sklearn.KMeans initialization complete. 2024-02-19 18:18:09,465 - harmonypy - INFO - Iteration 1 of 20 2024-02-19 18:33:22,619 - harmonypy - INFO - Iteration 2 of 20 2024-02-19 18:49:39,548 - harmonypy - INFO - Iteration 3 of 20 2024-02-19 19:04:38,934 - harmonypy - INFO - Iteration 4 of 20 2024-02-19 19:20:20,379 - harmonypy - INFO - Iteration 5 of 20 2024-02-19 19:36:15,926 - harmonypy - INFO - Converged after 5 iterations
CPU times: user 12h 21min 42s, sys: 13h 30min 44s, total: 1d 1h 52min 27s Wall time: 1h 30min 2s
Find Nearest Neighbors and perform UMAP for visualization using harmonized PCs.
%%time
sc.pp.neighbors(
adata,
n_neighbors = 50,
use_rep = 'X_pca_harmony',
n_pcs = 30
)
%%time
sc.tl.umap(adata)
CPU times: user 12h 5min 42s, sys: 2h 24min 25s, total: 14h 30min 8s Wall time: 1h 17min 30s
out_dir = 'output'
if not os.path.isdir(out_dir):
os.makedirs(out_dir)
filtered_h5ad = 'output/ref_pbmc_filtered_{d}.h5ad'.format(d = date.today())
adata.write_h5ad(filtered_h5ad)
Perform Leiden clustering for use with identifying major cell classes and subsetting.
sc.pl.umap(
adata,
color = ['seurat.l2.5'],
size = 2,
show = False,
ncols = 1,
frameon = False
)
/opt/conda/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning color_vector = pd.Categorical(values.map(color_map)) /opt/conda/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored cax = scatter(
<Axes: title={'center': 'seurat.l2.5'}, xlabel='UMAP1', ylabel='UMAP2'>
sc.pl.umap(
adata,
color = ['celltypist.low'],
size = 2,
show = False,
ncols = 1,
frameon = False
)
/opt/conda/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning color_vector = pd.Categorical(values.map(color_map)) /opt/conda/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored cax = scatter(
<Axes: title={'center': 'celltypist.low'}, xlabel='UMAP1', ylabel='UMAP2'>
sc.tl.leiden(adata)
Note: Leiden clustering took ~24hr
Save clustered AnnData object to .h5ad
clustered_h5ad = 'output/ref_pbmc_clustered_{d}.h5ad'.format(d = date.today())
adata.write_h5ad(clustered_h5ad)
out_files = os.listdir('output')
clustered_h5ad = list(filter(lambda f: 'clustered' in f, out_files))[0]
adata = sc.read_h5ad('output/' + clustered_h5ad)
sc.pl.umap(
adata,
color = ['leiden'],
size = 2,
show = False,
ncols = 1 ,
frameon = False
)
/opt/conda/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:1251: FutureWarning: The default value of 'ignore' for the `na_action` parameter in pandas.Categorical.map is deprecated and will be changed to 'None' in a future version. Please set na_action to the desired value to avoid seeing this warning color_vector = pd.Categorical(values.map(color_map)) /opt/conda/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:394: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored cax = scatter(
<Axes: title={'center': 'leiden'}, xlabel='UMAP1', ylabel='UMAP2'>
umap_mat = adata.obsm['X_umap']
umap_df = pd.DataFrame(umap_mat, columns = ['umap_1', 'umap_2'])
obs = adata.obs
obs['umap_1'] = umap_df['umap_1']
obs['umap_2'] = umap_df['umap_2']
out_csv = 'output/ref_pbmc_clustered_umap_meta_{d}.csv'.format(d = date.today())
obs.to_csv(out_csv)
out_parquet = 'output/ref_pbmc_clustered_umap_meta_{d}.parquet'.format(d = date.today())
obs = obs.to_parquet(out_parquet)
Finally, we'll use hisepy.upload.upload_files()
to send a copy of our output to HISE to use for downstream analysis steps.
study_space_uuid = '64097865-486d-43b3-8f94-74994e0a72e0'
title = 'Clustered PBMC .h5ad {d}'.format(d = date.today())
in_files = [h5ad_uuid]
out_files = ['output/' + clustered_h5ad, out_csv, out_parquet]
hisepy.upload.upload_files(
files = out_files,
study_space_id = study_space_uuid,
title = title,
input_file_ids = in_files
)
you are trying to upload file_ids... ['output/ref_pbmc_clustered_2024-02-20.h5ad', 'output/ref_pbmc_clustered_umap_meta_2024-02-20.csv', 'output/ref_pbmc_clustered_umap_meta_2024-02-20.parquet']. Do you truly want to proceed?
{'trace_id': 'dff21171-2134-4de1-9126-3917d98b2541', 'files': ['output/ref_pbmc_clustered_2024-02-20.h5ad', 'output/ref_pbmc_clustered_umap_meta_2024-02-20.csv', 'output/ref_pbmc_clustered_umap_meta_2024-02-20.parquet']}
import session_info
session_info.show()
----- anndata 0.10.3 hisepy 0.3.0 matplotlib 3.8.0 pandas 2.1.4 scanpy 1.9.6 session_info 1.0.0 -----
PIL 10.0.1 anyio NA arrow 1.3.0 asttokens NA attr 23.2.0 attrs 23.2.0 babel 2.14.0 beatrix_jupyterlab NA brotli NA cachetools 5.3.1 certifi 2023.11.17 cffi 1.16.0 charset_normalizer 3.3.2 cloudpickle 2.2.1 colorama 0.4.6 comm 0.1.4 cryptography 41.0.7 cycler 0.10.0 cython_runtime NA dateutil 2.8.2 db_dtypes 1.1.1 debugpy 1.8.0 decorator 5.1.1 defusedxml 0.7.1 deprecated 1.2.14 exceptiongroup 1.2.0 executing 2.0.1 fastjsonschema NA fqdn NA google NA greenlet 2.0.2 grpc 1.58.0 grpc_status NA h5py 3.10.0 idna 3.6 igraph 0.10.8 importlib_metadata NA ipykernel 6.28.0 ipython_genutils 0.2.0 ipywidgets 8.1.1 isoduration NA jedi 0.19.1 jinja2 3.1.2 joblib 1.3.2 json5 NA jsonpointer 2.4 jsonschema 4.20.0 jsonschema_specifications NA jupyter_events 0.9.0 jupyter_server 2.12.1 jupyterlab_server 2.25.2 jwt 2.8.0 kiwisolver 1.4.5 leidenalg 0.10.1 llvmlite 0.41.0 lz4 4.3.2 markupsafe 2.1.3 matplotlib_inline 0.1.6 mpl_toolkits NA mpmath 1.3.0 natsort 8.4.0 nbformat 5.9.2 numba 0.58.0 numpy 1.24.0 opentelemetry NA overrides NA packaging 23.2 parso 0.8.3 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA platformdirs 4.1.0 plotly 5.18.0 prettytable 3.9.0 prometheus_client NA prompt_toolkit 3.0.42 proto NA psutil NA ptyprocess 0.7.0 pure_eval 0.2.2 pyarrow 13.0.0 pydev_ipython NA pydevconsole NA pydevd 2.9.5 pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.17.2 pynvml NA pyparsing 3.1.1 pyreadr 0.5.0 pythonjsonlogger NA pytz 2023.3.post1 referencing NA requests 2.31.0 rfc3339_validator 0.1.4 rfc3986_validator 0.1.1 rpds NA scipy 1.11.4 send2trash NA shapely 1.8.5.post1 six 1.16.0 sklearn 1.3.2 sniffio 1.3.0 socks 1.7.1 sql NA sqlalchemy 2.0.21 sqlparse 0.4.4 stack_data 0.6.2 sympy 1.12 termcolor NA texttable 1.7.0 threadpoolctl 3.2.0 torch 2.1.2+cu121 torchgen NA tornado 6.3.3 tqdm 4.66.1 traitlets 5.9.0 typing_extensions NA uri_template NA urllib3 1.26.18 wcwidth 0.2.12 webcolors 1.13 websocket 1.7.0 wrapt 1.15.0 xarray 2023.12.0 yaml 6.0.1 zipp NA zmq 25.1.2 zoneinfo NA zstandard 0.22.0
----- IPython 8.19.0 jupyter_client 8.6.0 jupyter_core 5.6.1 jupyterlab 4.0.10 notebook 6.5.4 ----- Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] Linux-5.15.0-1051-gcp-x86_64-with-glibc2.31 ----- Session information updated at 2024-02-20 20:24