Our cell hashing process catches and removes many doublets generated by mixing of cells from different samples, but a fraction of doublets (~7-8%) arise from collisions between cells of the same sample, and these cannot be detected by hashing.
To detect and remove them, we'll use the Scrublet package. Scrublet's approach to doublet identification is described in this publication:
Wolock, S. L., Lopez, R. & Klein, A. M. Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Syst 8, 281–291.e9 (2019)
We'll use Scrublet through its integration in the scanpy package's scanpy.external tools.
Here, we apply Scrublet to each of our .h5 files and store the results in HISE for downstream analysis steps.
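For orientation, here is the core operation in isolation. This is a minimal sketch on a synthetic counts matrix (our real AnnData objects come from HISE .h5 files below); the Poisson toy data and the explicit threshold are illustrative assumptions, not part of the pipeline.

import anndata
import numpy as np
import scanpy.external as sce

# Toy counts matrix: 500 cells x 200 genes (illustrative only)
rng = np.random.default_rng(3030)
adata = anndata.AnnData(rng.poisson(0.5, size = (500, 200)).astype(np.float32))

# Scrublet simulates doublets from the observed cells, scores each cell by
# its similarity to the simulations, and writes 'doublet_score' and
# 'predicted_doublet' into adata.obs
sce.pp.scrublet(
    adata,
    threshold = 0.25,  # explicit cutoff; random toy data may not have a bimodal score distribution
    random_state = 3030,
    verbose = False
)
adata.obs[['doublet_score', 'predicted_doublet']].head()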
Packages used in this notebook:

anndata: Data structures for scRNA-seq
concurrent.futures: Parallelization methods
datetime: Date and time functions
h5py: HDF5 file I/O
hisepy: The HISE SDK for Python
numpy: Mathematical data structures and computation
os: Operating system calls
pandas: DataFrame data structures
re: Regular expressions
scanpy: scRNA-seq analysis
scipy.sparse: Sparse matrix data structures
# Suppress FutureWarning and UserWarning messages for cleaner output
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
import anndata
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor
from datetime import date
import h5py
import hisepy
import numpy as np
import os
import pandas as pd
import re
import scanpy as sc
import scanpy.external as sce
import scipy.sparse as scs
# UUID of the sample metadata file stored in HISE
sample_meta_file_uuid = '2da66a1a-17cc-498b-9129-6858cf639caf'
file_query = hisepy.reader.read_files(
    [sample_meta_file_uuid]
)
meta_data = file_query['values']
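As a quick sanity check, meta_data should come back as a pandas DataFrame with one row per sample, including the file.id column used below:

print(meta_data.shape)
meta_data['file.id'].head()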
The functions below retrieve data from HISE, read each .h5 file as AnnData for use with Scrublet, and apply Scrublet to each file.
# define a function to read count data
def read_mat(h5_con):
mat = scs.csc_matrix(
(h5_con['matrix']['data'][:], # Count values
h5_con['matrix']['indices'][:], # Row indices
h5_con['matrix']['indptr'][:]), # Pointers for column positions
shape = tuple(h5_con['matrix']['shape'][:]) # Matrix dimensions
)
return mat
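To make the CSC layout read_mat() consumes concrete, here is a toy example (not from our data) showing how data, indices, and indptr map to a dense matrix:

import numpy as np
import scipy.sparse as scs

# Column j's values live in data[indptr[j]:indptr[j+1]],
# with their row positions in the matching slice of indices
data    = np.array([5, 2, 7])
indices = np.array([0, 2, 1])
indptr  = np.array([0, 2, 2, 3])  # column 0 has 2 values, column 1 has 0, column 2 has 1
toy = scs.csc_matrix((data, indices, indptr), shape = (3, 3))
print(toy.toarray())
# [[5 0 0]
#  [0 0 7]
#  [2 0 0]]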
# define a function to read observation metadata (i.e. cell metadata)
def read_obs(h5_con):
    bc = h5_con['matrix']['barcodes'][:]
    bc = [x.decode('UTF-8') for x in bc]
    # Initialize the DataFrame with cell barcodes
    obs_df = pd.DataFrame({ 'barcodes' : bc })
    obs_df = obs_df.set_index('barcodes', drop = False)
    obs_df['barcodes'] = obs_df['barcodes'].astype("category")
    return obs_df
# define a function to construct an anndata object from an h5 connection
def read_h5_anndata(h5_con):
    # extract the expression matrix
    mat = read_mat(h5_con)
    # extract gene names
    genes = h5_con['matrix']['features']['name'][:]
    genes = [x.decode('UTF-8') for x in genes]
    # extract cell metadata
    obs_df = read_obs(h5_con)
    # construct anndata (transpose so rows are cells, columns are genes)
    adata = anndata.AnnData(mat.T, obs = obs_df)
    # assign gene names and make them unique
    adata.var_names = genes
    adata.var_names_make_unique()
    return adata
def get_adata(uuid):
    # Load the file using HISE
    res = hisepy.reader.read_files([uuid])
    # If there's an error, read_files returns a list instead of a dictionary.
    # We should raise an exception with the message when this happens.
    if isinstance(res, list):
        error_message = res[0]['message']
        raise Exception(error_message)
    # Read the file to adata
    h5_con = res['values'][0]
    adata = read_h5_anndata(h5_con)
    # Clean up the file connection now that we're done with it
    h5_con.close()
    return adata
def process_file(file_uuid):
    # Read the .h5 file from HISE as AnnData
    adata = get_adata(file_uuid)
    # Score cells with Scrublet; this adds 'predicted_doublet' and
    # 'doublet_score' columns to adata.obs
    sce.pp.scrublet(
        adata,
        random_state = 3030,
        verbose = False
    )
    result = adata.obs[['barcodes', 'predicted_doublet', 'doublet_score']]
    return result
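Before fanning out to all samples, it can be worth spot-checking a single file (an optional step, not part of the original run):

# Run Scrublet on the first sample only and inspect the output
test_uuid = meta_data['file.id'].tolist()[0]
test_result = process_file(test_uuid)
test_result.head()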
Here, we'll use concurrent.futures to apply the function above to our samples in parallel.
results = []
file_uuids = meta_data['file.id'].tolist()
with ThreadPoolExecutor(max_workers = 20) as executor:
for result in executor.map(process_file, file_uuids):
results.append(result)
final_result = pd.concat(results, ignore_index = True)
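Note that executor.map() raises as soon as any file fails, discarding the remaining results. As an aside, a submit()/as_completed() variant (a sketch of an alternative, not what we ran here) reports per-file failures while letting the rest finish; results arrive in completion order rather than input order.

results = []
with ThreadPoolExecutor(max_workers = 20) as executor:
    # Map each future back to its file UUID so failures can be attributed
    futures = {executor.submit(process_file, uuid): uuid for uuid in file_uuids}
    for future in concurrent.futures.as_completed(futures):
        try:
            results.append(future.result())
        except Exception as exc:
            print('Scrublet failed for file {u}: {e}'.format(u = futures[future], e = exc))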
prediction_counts = final_result['predicted_doublet'].value_counts()
prediction_counts
predicted_doublet
False    2065171
True       27907
Name: count, dtype: int64
prediction_counts[1] / sum(prediction_counts)
0.013332995712534363
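About 1.3% of cells are flagged as within-sample doublets. In downstream steps, these calls can be joined back onto cell-level data by barcode; a sketch (not executed here), assuming a combined AnnData `adata` whose adata.obs['barcodes'] matches these barcodes, and that barcodes are unique across the combined results:

# hypothetical downstream step: drop predicted doublets from a combined AnnData
scored = final_result.set_index('barcodes')  # index results by cell barcode
adata.obs = adata.obs.join(scored, on = 'barcodes')  # assumes unique barcodes
adata = adata[~adata.obs['predicted_doublet'].astype(bool)].copy()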
out_dir = 'output'
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)
out_file = os.path.join(out_dir, 'ref_scrublet_results_{d}.csv'.format(d = date.today()))
final_result.to_csv(out_file)
Finally, we'll use hisepy.upload.upload_files() to send a copy of our output to HISE to use for downstream analysis steps.
study_space_uuid = '64097865-486d-43b3-8f94-74994e0a72e0'
title = 'Reference Scrublet results'
in_files = [sample_meta_file_uuid] + meta_data['file.id'].to_list()
out_files = [out_file]
hisepy.upload.upload_files(
files = out_files,
study_space_id = study_space_uuid,
title = title,
input_file_ids = in_files
)
you are trying to upload file_ids... ['output/ref_scrublet_results_2024-02-18.csv']. Do you truly want to proceed?
{'trace_id': '6b0ec5e3-86ec-41a5-9900-9c52a464681c', 'files': ['output/ref_scrublet_results_2024-02-18.csv']}
import session_info
session_info.show()
-----
anndata             0.10.3
h5py                3.10.0
hisepy              0.3.0
numpy               1.24.0
pandas              2.1.4
scanpy              1.9.6
scipy               1.11.4
session_info        1.0.0
-----
PIL                 10.0.1
annoy               NA
anyio               NA
arrow               1.3.0
asttokens           NA
attr                23.2.0
attrs               23.2.0
babel               2.14.0
beatrix_jupyterlab  NA
brotli              NA
cachetools          5.3.1
certifi             2023.11.17
cffi                1.16.0
charset_normalizer  3.3.2
cloudpickle         2.2.1
colorama            0.4.6
comm                0.1.4
cryptography        41.0.7
cycler              0.10.0
cython_runtime      NA
dateutil            2.8.2
db_dtypes           1.1.1
debugpy             1.8.0
decorator           5.1.1
defusedxml          0.7.1
deprecated          1.2.14
exceptiongroup      1.2.0
executing           2.0.1
fastjsonschema      NA
fqdn                NA
google              NA
greenlet            2.0.2
grpc                1.58.0
grpc_status         NA
idna                3.6
igraph              0.10.8
importlib_metadata  NA
ipykernel           6.28.0
ipython_genutils    0.2.0
ipywidgets          8.1.1
isoduration         NA
jedi                0.19.1
jinja2              3.1.2
joblib              1.3.2
json5               NA
jsonpointer         2.4
jsonschema          4.20.0
jsonschema_specifications  NA
jupyter_events      0.9.0
jupyter_server      2.12.1
jupyterlab_server   2.25.2
jwt                 2.8.0
kiwisolver          1.4.5
lazy_loader         NA
leidenalg           0.10.1
llvmlite            0.41.0
lz4                 4.3.2
markupsafe          2.1.3
matplotlib          3.8.0
matplotlib_inline   0.1.6
mpl_toolkits        NA
mpmath              1.3.0
natsort             8.4.0
nbformat            5.9.2
numba               0.58.0
opentelemetry       NA
overrides           NA
packaging           23.2
parso               0.8.3
pexpect             4.8.0
pickleshare         0.7.5
pkg_resources       NA
platformdirs        4.1.0
plotly              5.18.0
prettytable         3.9.0
prometheus_client   NA
prompt_toolkit      3.0.42
proto               NA
psutil              NA
ptyprocess          0.7.0
pure_eval           0.2.2
pyarrow             13.0.0
pycparser           2.21
pydev_ipython       NA
pydevconsole        NA
pydevd              2.9.5
pydevd_file_utils   NA
pydevd_plugins      NA
pydevd_tracing      NA
pygments            2.17.2
pynvml              NA
pyparsing           3.1.1
pyreadr             0.5.0
pythonjsonlogger    NA
pytz                2023.3.post1
referencing         NA
requests            2.31.0
rfc3339_validator   0.1.4
rfc3986_validator   0.1.1
rpds                NA
scrublet            NA
send2trash          NA
shapely             1.8.5.post1
six                 1.16.0
skimage             0.21.0
sklearn             1.3.2
sniffio             1.3.0
socks               1.7.1
sparse              0.14.0
sql                 NA
sqlalchemy          2.0.21
sqlparse            0.4.4
stack_data          0.6.2
sympy               1.12
termcolor           NA
texttable           1.7.0
threadpoolctl       3.2.0
torch               2.1.2+cu121
torchgen            NA
tornado             6.3.3
tqdm                4.66.1
traitlets           5.9.0
typing_extensions   NA
uri_template        NA
urllib3             1.26.18
wcwidth             0.2.12
webcolors           1.13
websocket           1.7.0
wrapt               1.15.0
xarray              2023.12.0
yaml                6.0.1
zipp                NA
zmq                 25.1.2
zoneinfo            NA
zstandard           0.22.0
-----
IPython             8.19.0
jupyter_client      8.6.0
jupyter_core        5.6.1
jupyterlab          4.0.10
notebook            6.5.4
-----
Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
Linux-5.15.0-1051-gcp-x86_64-with-glibc2.31
-----
Session information updated at 2024-02-18 05:05