Our cell hashing process catches and removes many doublets generated by mixing of cells from different samples, but a fraction of doublets (~7-8%) arise from collisions between cells of the same sample, and these cannot be detected by hashing.
To detect and remove them, we'll use the Scrublet package. Scrublet's approach to doublet identification is described in this publication:
Wolock, S. L., Lopez, R. & Klein, A. M. Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Syst 8, 281–291.e9 (2019)
We'll use Scrublet through its integration in the scanpy package's scanpy.external tools.
Here, we apply Scrublet to each of our .h5 files and store the results in HISE for downstream analysis steps.
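For orientation, here is the core operation in isolation. This is a minimal sketch on a synthetic counts matrix (our real AnnData objects come from HISE .h5 files below); the Poisson toy data and the explicit threshold are illustrative assumptions, not part of the pipeline.

import anndata
import numpy as np
import scanpy.external as sce

# Toy counts matrix: 500 cells x 200 genes (illustrative only)
rng = np.random.default_rng(3030)
adata = anndata.AnnData(rng.poisson(0.5, size = (500, 200)).astype(np.float32))

# Scrublet simulates doublets from the observed cells, scores each cell by
# its similarity to the simulations, and writes 'doublet_score' and
# 'predicted_doublet' into adata.obs
sce.pp.scrublet(
    adata,
    threshold = 0.25,  # explicit cutoff; random toy data may not have a bimodal score distribution
    random_state = 3030,
    verbose = False
)
adata.obs[['doublet_score', 'predicted_doublet']].head()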
Packages used in this notebook:

anndata: Data structures for scRNA-seq
concurrent.futures: Parallelization methods
datetime: Date and time functions
h5py: HDF5 file I/O
hisepy: The HISE SDK for Python
numpy: Mathematical data structures and computation
os: Operating system calls
pandas: DataFrame data structures
re: Regular expressions
scanpy: scRNA-seq analysis
scipy.sparse: Sparse matrix data structures
# Suppress FutureWarning and UserWarning messages for cleaner output
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
import anndata
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor
from datetime import date
import h5py
import hisepy
import numpy as np
import os
import pandas as pd
import re
import scanpy as sc
import scanpy.external as sce
import scipy.sparse as scs
# UUID of the sample metadata file stored in HISE
sample_meta_file_uuid = '2da66a1a-17cc-498b-9129-6858cf639caf'
file_query = hisepy.reader.read_files(
    [sample_meta_file_uuid]
)
meta_data = file_query['values']
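As a quick sanity check, meta_data should come back as a pandas DataFrame with one row per sample, including the file.id column used below:

print(meta_data.shape)
meta_data['file.id'].head()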
The functions below retrieve data from HISE, read each .h5 file as AnnData for use with Scrublet, and apply Scrublet to each file.
# define a function to read count data
def read_mat(h5_con):
mat = scs.csc_matrix(
(h5_con['matrix']['data'][:], # Count values
h5_con['matrix']['indices'][:], # Row indices
h5_con['matrix']['indptr'][:]), # Pointers for column positions
shape = tuple(h5_con['matrix']['shape'][:]) # Matrix dimensions
)
return mat
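To make the CSC layout read_mat() consumes concrete, here is a toy example (not from our data) showing how data, indices, and indptr map to a dense matrix:

import numpy as np
import scipy.sparse as scs

# Column j's values live in data[indptr[j]:indptr[j+1]],
# with their row positions in the matching slice of indices
data    = np.array([5, 2, 7])
indices = np.array([0, 2, 1])
indptr  = np.array([0, 2, 2, 3])  # column 0 has 2 values, column 1 has 0, column 2 has 1
toy = scs.csc_matrix((data, indices, indptr), shape = (3, 3))
print(toy.toarray())
# [[5 0 0]
#  [0 0 7]
#  [2 0 0]]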
# define a function to read observation metadata (i.e. cell metadata)
def read_obs(h5_con):
    bc = h5_con['matrix']['barcodes'][:]
    bc = [x.decode('UTF-8') for x in bc]
    # Initialize the DataFrame with cell barcodes
    obs_df = pd.DataFrame({ 'barcodes' : bc })
    obs_df = obs_df.set_index('barcodes', drop = False)
    obs_df['barcodes'] = obs_df['barcodes'].astype("category")
    return obs_df
# define a function to construct an anndata object from an h5 connection
def read_h5_anndata(h5_con):
    # extract the expression matrix
    mat = read_mat(h5_con)
    # extract gene names
    genes = h5_con['matrix']['features']['name'][:]
    genes = [x.decode('UTF-8') for x in genes]
    # extract cell metadata
    obs_df = read_obs(h5_con)
    # construct anndata (transpose so rows are cells, columns are genes)
    adata = anndata.AnnData(mat.T, obs = obs_df)
    # assign gene names and make them unique
    adata.var_names = genes
    adata.var_names_make_unique()
    return adata
def get_adata(uuid):
    # Load the file using HISE
    res = hisepy.reader.read_files([uuid])
    # If there's an error, read_files returns a list instead of a dictionary.
    # We should raise an exception with the message when this happens.
    if isinstance(res, list):
        error_message = res[0]['message']
        raise Exception(error_message)
    # Read the file to adata
    h5_con = res['values'][0]
    adata = read_h5_anndata(h5_con)
    # Clean up the file connection now that we're done with it
    h5_con.close()
    return adata
def process_file(file_uuid):
    # Read the .h5 file from HISE as AnnData
    adata = get_adata(file_uuid)
    # Score cells with Scrublet; this adds 'predicted_doublet' and
    # 'doublet_score' columns to adata.obs
    sce.pp.scrublet(
        adata,
        random_state = 3030,
        verbose = False
    )
    result = adata.obs[['barcodes', 'predicted_doublet', 'doublet_score']]
    return result
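Before fanning out to all samples, it can be worth spot-checking a single file (an optional step, not part of the original run):

# Run Scrublet on the first sample only and inspect the output
test_uuid = meta_data['file.id'].tolist()[0]
test_result = process_file(test_uuid)
test_result.head()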
Here, we'll use concurrent.futures to apply the function above to our samples in parallel.
results = []
file_uuids = meta_data['file.id'].tolist()
with ThreadPoolExecutor(max_workers = 20) as executor:
for result in executor.map(process_file, file_uuids):
results.append(result)
final_result = pd.concat(results, ignore_index = True)
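Note that executor.map() raises as soon as any file fails, discarding the remaining results. As an aside, a submit()/as_completed() variant (a sketch of an alternative, not what we ran here) reports per-file failures while letting the rest finish; results arrive in completion order rather than input order.

results = []
with ThreadPoolExecutor(max_workers = 20) as executor:
    # Map each future back to its file UUID so failures can be attributed
    futures = {executor.submit(process_file, uuid): uuid for uuid in file_uuids}
    for future in concurrent.futures.as_completed(futures):
        try:
            results.append(future.result())
        except Exception as exc:
            print('Scrublet failed for file {u}: {e}'.format(u = futures[future], e = exc))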
prediction_counts = final_result['predicted_doublet'].value_counts()
prediction_counts
predicted_doublet
False    2065171
True       27907
Name: count, dtype: int64
prediction_counts[1] / sum(prediction_counts)
0.013332995712534363
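About 1.3% of cells are flagged as within-sample doublets. In downstream steps, these calls can be joined back onto cell-level data by barcode; a sketch (not executed here), assuming a combined AnnData `adata` whose adata.obs['barcodes'] matches these barcodes, and that barcodes are unique across the combined results:

# hypothetical downstream step: drop predicted doublets from a combined AnnData
scored = final_result.set_index('barcodes')  # index results by cell barcode
adata.obs = adata.obs.join(scored, on = 'barcodes')  # assumes unique barcodes
adata = adata[~adata.obs['predicted_doublet'].astype(bool)].copy()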
out_dir = 'output'
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)
out_file = os.path.join(out_dir, 'ref_scrublet_results_{d}.csv'.format(d = date.today()))
final_result.to_csv(out_file)
Finally, we'll use hisepy.upload.upload_files() to send a copy of our output to HISE to use for downstream analysis steps.
study_space_uuid = '64097865-486d-43b3-8f94-74994e0a72e0'
title = 'Reference Scrublet results'
in_files = [sample_meta_file_uuid] + meta_data['file.id'].to_list()
out_files = [out_file]
hisepy.upload.upload_files(
files = out_files,
study_space_id = study_space_uuid,
title = title,
input_file_ids = in_files
)
you are trying to upload file_ids... ['output/ref_scrublet_results_2024-02-18.csv']. Do you truly want to proceed?
{'trace_id': '6b0ec5e3-86ec-41a5-9900-9c52a464681c', 'files': ['output/ref_scrublet_results_2024-02-18.csv']}
import session_info
session_info.show()
-----
anndata             0.10.3
h5py                3.10.0
hisepy              0.3.0
numpy               1.24.0
pandas              2.1.4
scanpy              1.9.6
scipy               1.11.4
session_info        1.0.0
-----
PIL                 10.0.1
annoy               NA
anyio               NA
arrow               1.3.0
asttokens           NA
attr                23.2.0
attrs               23.2.0
babel               2.14.0
beatrix_jupyterlab  NA
brotli              NA
cachetools          5.3.1
certifi             2023.11.17
cffi                1.16.0
charset_normalizer  3.3.2
cloudpickle         2.2.1
colorama            0.4.6
comm                0.1.4
cryptography        41.0.7
cycler              0.10.0
cython_runtime      NA
dateutil            2.8.2
db_dtypes           1.1.1
debugpy             1.8.0
decorator           5.1.1
defusedxml          0.7.1
deprecated          1.2.14
exceptiongroup      1.2.0
executing           2.0.1
fastjsonschema      NA
fqdn                NA
google              NA
greenlet            2.0.2
grpc                1.58.0
grpc_status         NA
idna                3.6
igraph              0.10.8
importlib_metadata  NA
ipykernel           6.28.0
ipython_genutils    0.2.0
ipywidgets          8.1.1
isoduration         NA
jedi                0.19.1
jinja2              3.1.2
joblib              1.3.2
json5               NA
jsonpointer         2.4
jsonschema          4.20.0
jsonschema_specifications  NA
jupyter_events      0.9.0
jupyter_server      2.12.1
jupyterlab_server   2.25.2
jwt                 2.8.0
kiwisolver          1.4.5
lazy_loader         NA
leidenalg           0.10.1
llvmlite            0.41.0
lz4                 4.3.2
markupsafe          2.1.3
matplotlib          3.8.0
matplotlib_inline   0.1.6
mpl_toolkits        NA
mpmath              1.3.0
natsort             8.4.0
nbformat            5.9.2
numba               0.58.0
opentelemetry       NA
overrides           NA
packaging           23.2
parso               0.8.3
pexpect             4.8.0
pickleshare         0.7.5
pkg_resources       NA
platformdirs        4.1.0
plotly              5.18.0
prettytable         3.9.0
prometheus_client   NA
prompt_toolkit      3.0.42
proto               NA
psutil              NA
ptyprocess          0.7.0
pure_eval           0.2.2
pyarrow             13.0.0
pycparser           2.21
pydev_ipython       NA
pydevconsole        NA
pydevd              2.9.5
pydevd_file_utils   NA
pydevd_plugins      NA
pydevd_tracing      NA
pygments            2.17.2
pynvml              NA
pyparsing           3.1.1
pyreadr             0.5.0
pythonjsonlogger    NA
pytz                2023.3.post1
referencing         NA
requests            2.31.0
rfc3339_validator   0.1.4
rfc3986_validator   0.1.1
rpds                NA
scrublet            NA
send2trash          NA
shapely             1.8.5.post1
six                 1.16.0
skimage             0.21.0
sklearn             1.3.2
sniffio             1.3.0
socks               1.7.1
sparse              0.14.0
sql                 NA
sqlalchemy          2.0.21
sqlparse            0.4.4
stack_data          0.6.2
sympy               1.12
termcolor           NA
texttable           1.7.0
threadpoolctl       3.2.0
torch               2.1.2+cu121
torchgen            NA
tornado             6.3.3
tqdm                4.66.1
traitlets           5.9.0
typing_extensions   NA
uri_template        NA
urllib3             1.26.18
wcwidth             0.2.12
webcolors           1.13
websocket           1.7.0
wrapt               1.15.0
xarray              2023.12.0
yaml                6.0.1
zipp                NA
zmq                 25.1.2
zoneinfo            NA
zstandard           0.22.0
-----
IPython             8.19.0
jupyter_client      8.6.0
jupyter_core        5.6.1
jupyterlab          4.0.10
notebook            6.5.4
-----
Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
Linux-5.15.0-1051-gcp-x86_64-with-glibc2.31
-----
Session information updated at 2024-02-18 05:05