Combine raw data across all samples¶

In this notebook, we'll use the file manifest we previously assembled to retrieve all of the human PBMC data that we'll use to assemble our reference dataset.

Here, we'll retrieve data from each each individual sample, stored in HDF5 format in HISE, and concatenate them into a single AnnData object.

Load Packages¶

anndata: Data structures for scRNA-seq
datetime: Date/Time methods
h5py: HDF5 file I/O
hisepy: The HISE SDK for Python
numpy: Mathematical data structures and computation
os: operating system calls
pandas: DataFrame data structures
re: Regular expressions
scanpy: scRNA-seq analysis
scipy.sparse: Spare matrix data structures
shutil: Shell utilities

In [1]:

import anndata
from datetime import date
import h5py
import hisepy
import numpy as np
import os
import pandas as pd
from pandas.api.types import is_object_dtype
import re
import scanpy as sc
import scipy.sparse as scs
import shutil

Helper functions¶

These functions assist in reading from our pipeline .h5 file format into AnnData format:

In [2]:

# define a function to read count data
def read_mat(h5_con):
    mat = scs.csc_matrix(
        (h5_con['matrix']['data'][:], # Count values
         h5_con['matrix']['indices'][:], # Row indices
         h5_con['matrix']['indptr'][:]), # Pointers for column positions
        shape = tuple(h5_con['matrix']['shape'][:]) # Matrix dimensions
    )
    return mat

# define a function to read obeservation metadata (i.e. cell metadata)
def read_obs(h5con):
    bc = h5con['matrix']['barcodes'][:]
    bc = [x.decode('UTF-8') for x in bc]

    # Initialized the DataFrame with cell barcodes
    obs_df = pd.DataFrame({ 'barcodes' : bc })

    # Get the list of available metadata columns
    obs_columns = h5con['matrix']['observations'].keys()

    # For each column
    for col in obs_columns:
        # Read the values
        values = h5con['matrix']['observations'][col][:]
        # Check for byte storage
        if(isinstance(values[0], (bytes, bytearray))):
            # Decode byte strings
            values = [x.decode('UTF-8') for x in values]
        # Add column to the DataFrame
        obs_df[col] = values

    obs_df = obs_df.set_index('barcodes', drop = False)
    
    return obs_df

# define a function to construct anndata object from a h5 file
def read_h5_anndata(h5_con):
    #h5_con = h5py.File(h5_file, mode = 'r')
    # extract the expression matrix
    mat = read_mat(h5_con)
    # extract gene names
    genes = h5_con['matrix']['features']['name'][:]
    genes = [x.decode('UTF-8') for x in genes]
    # extract metadata
    obs_df = read_obs(h5_con)
    # construct anndata
    adata = anndata.AnnData(mat.T,
                             obs = obs_df)
    # make sure the gene names aligned
    adata.var_names = genes

    adata.var_names_make_unique()
    return adata

Retrieve file list from HISE¶

First, we'll pull the manifest of samples and associated files that we want to retrieve for building our reference dataset. We previously assembled this file and loaded it into HISE via a Watchfolder.

In [3]:

sample_meta_file_uuid = '2da66a1a-17cc-498b-9129-6858cf639caf'
file_query = hisepy.reader.read_files(
    [sample_meta_file_uuid]
)

In [4]:

meta_data = file_query['values']

read_files will return a dictionary with an entry, values, that contains a list of h5py.File() objects. We can use these directly to read in each .h5 file to an AnnData object with our helper function, read_h5_anndata(), defined above.

In [5]:

def get_adata(uuid):
    # Load the file using HISE
    res = hisepy.reader.read_files([uuid])
    
    # Read the file to adata
    h5_con = res['values'][0]
    adata = read_h5_anndata(h5_con)
    
    # Clean up the file now that we're done with it
    h5_con.close()

    return(adata)

Here, we'll iterate over each file in our manifest

In [6]:

adata_list = []
for i in range(meta_data.shape[0]):
    uuid = meta_data['file.id'][i]
    adata_list.append(get_adata(uuid))

Concatenate all of the datasets into a single object

In [7]:

adata = anndata.concat(adata_list)

Add sample metadata¶

When retrieved from HISE, our .h5 files have a sample identifier (pbmc_sample_id), but don't carry other sample and subject metadata. Let's add this information from our meta_data DataFrame to make this information more complete.

First, we'll convert pbmc_sample_id to sample.sampleKitGuid using a regular expression. PBMC samples are derived from kits in our LIMS system, so both share the same numerical core. The difference is that there can be multiple PBMC samples collected at the same time, so PBMC samples are prefixed with PB to indicate their sample type, and suffixed with -XX to indicate an aliquot number.

In [8]:

obs = adata.obs

In [9]:

def sample_to_kit(sample):
    kit = re.sub('PB([0-9]+)-.+','KT\\1',sample)
    return(kit)

In [10]:

obs['sample.sampleKitGuid'] = [sample_to_kit(sample) for sample in obs['pbmc_sample_id']]

We only need to keep some of the metadata columns that pertain to cohort, subject, and sample. We'll also keep the originating File GUID to help us keep track of provenance. Let's select just these columns:

In [11]:

keep_meta = [
    'cohort.cohortGuid',
    'subject.subjectGuid', 'subject.biologicalSex', 
    'subject.race', 'subject.ethnicity', 'subject.birthYear',
    'sample.sampleKitGuid', 'sample.visitName', 'sample.drawDate',
    'file.id'
]

In [12]:

selected_meta = meta_data[keep_meta]

Now, we'll join this metadata to our observations using those sample IDs:

In [13]:

obs = obs.merge(
    selected_meta,
    how = 'left',
    on = 'sample.sampleKitGuid'
)

Retrieve labs, labels and doublet detection from HISE¶

In previous steps, we retrieved CMV and BMI clinical lab data and performed label transfer with CellTypist and Seurat, as well as doublet detection using Scrublet. We'll retrieve these results and add them to the observations in our AnnData file to assist with cell type labeling.

For this purpose, we'll use the CellTypist labels generated using the Immune All Low model, which has (ironically) high-resolution cell type labels.

In [14]:

obs_uuids = {
    'cmv': 'be338216-0c90-47df-923c-7f4d7c35bac4',
    'bmi': '9e39e86d-025c-4ad1-bbfb-63302b99c3f1',
    'celltypist': 'dbe10e73-4959-480a-8618-e40652901ab7',
    'seurat': 'ef5baaf1-1364-4e07-b01c-0252fd917fbb',
    'scrublet': '69d4e3a1-ff03-4173-a7e1-e4acff1f3d09',
}

In [15]:

obs_dfs = {}
for name,uuid in obs_uuids.items():
    hise_res = hisepy.reader.read_files([uuid])
    hise_df = hise_res['values']
    obs_dfs[name] = hise_df

We only need to retain some of the columns from each to assemble our cell metadata. Let's make and apply some column selections:

In [16]:

keep_columns = {
    'cmv': ['subject.subjectGuid', 'subject.cmv'],
    'bmi': ['subject.subjectGuid', 'subject.bmi'],
    'celltypist': ['barcodes', 'predicted_labels'],
    'seurat': ['barcodes',
               'predicted.celltype.l1', 'predicted.celltype.l1.score',
               'predicted.celltype.l2', 'predicted.celltype.l2.score',
               'predicted.celltype.l2.5', 'predicted.celltype.l2.5.score',
               'predicted.celltype.l3', 'predicted.celltype.l3.score'
              ],
    'scrublet': ['barcodes', 'predicted_doublet', 'doublet_score']
}

In [17]:

for name,df in obs_dfs.items():
    keep_cols = keep_columns[name]
    selected_df = df[keep_cols]
    obs_dfs[name] = selected_df

To make the sources clear later, let's rename the celltypist and seurat columns:

In [18]:

obs_dfs['celltypist'] = obs_dfs['celltypist'].rename(
    columns = {'predicted_labels': 'celltypist.low'}
)
obs_dfs['seurat'] = obs_dfs['seurat'].rename(
    columns = {
        'predicted.celltype.l1': 'seurat.l1', 'predicted.celltype.l1.score': 'seurat.l1.score',
        'predicted.celltype.l2': 'seurat.l2', 'predicted.celltype.l2.score': 'seurat.l2.score',
        'predicted.celltype.l2.5': 'seurat.l2.5', 'predicted.celltype.l2.5.score': 'seurat.l2.5.score',
        'predicted.celltype.l3': 'seurat.l3', 'predicted.celltype.l3.score': 'seurat.l3.score'
    }
)

And now we can add our sample metadata and these labels to the observations stored in the anndata object:

In [19]:

join_on_dict = {
    'cmv': 'subject.subjectGuid',
    'bmi': 'subject.subjectGuid',
    'celltypist': 'barcodes',
    'seurat': 'barcodes',
    'scrublet': 'barcodes'
}

In [20]:

for name,df in obs_dfs.items():
    join_col = join_on_dict[name]
    obs = obs.merge(
        df,
        how = 'left',
        on = join_col
    )

To keep things tidy, we'll also drop the seurat_pbmc_type, seurat_pbmc_type_score, and UMAP coordinates generated by our processing pipeline.

These cell type assignments are from a now-outdated reference dataset, and the UMAP coordinates are generated for viewing individual samples - not helpful for our full dataset.

In [21]:

obs = obs.drop([
    'seurat_pbmc_type','seurat_pbmc_type_score',
    'umap_1', 'umap_2'
], axis = 1)

Next, we'll convert all of our text columns to categorical. This is used to make storage of text data more efficient when we write our output file, as we'll only need to store a single instance of our strings.

We do this for all columns except barcodes, which we need to retain as a string type for use as an index.

In [22]:

cat_obs = obs
for i in range(cat_obs.shape[1]):
    col_name = cat_obs.dtypes.index.tolist()[i]
    col_type = cat_obs.dtypes[col_name]
    if col_name == 'barcodes':
        cat_obs[col_name] = cat_obs[col_name].astype(str)
    elif is_object_dtype(col_type):
        cat_obs[col_name] = cat_obs[col_name].astype('category')

In [23]:

cat_obs.head()

Out[23]:

	barcodes	batch_id	cell_name	cell_uuid	chip_id	hto_barcode	hto_category	n_genes	n_mito_umis	n_reads	...	seurat.l1	seurat.l1.score	seurat.l2	seurat.l2.score	seurat.l2.5	seurat.l2.5.score	seurat.l3	seurat.l3.score	predicted_doublet	doublet_score
0	cf71f47048b611ea8957bafe6d70929e	B001	weathered_pernicious_polliwog	cf71f47048b611ea8957bafe6d70929e	B001-P1C1	TGATGGCCTATTGGG	singlet	1081	115	9307	...	other T	0.634406	MAIT	0.634406	MAIT	0.634406	MAIT	0.634406	False	0.045226
1	cf71f54248b611ea8957bafe6d70929e	B001	untidy_emulsive_hamadryad	cf71f54248b611ea8957bafe6d70929e	B001-P1C1	TGATGGCCTATTGGG	singlet	1923	178	22729	...	CD4 T	0.974953	CD4 TCM	0.974953	CD4 TCM	0.974953	CD4 TCM_1	0.974953	False	0.110978
2	cf71fa1048b611ea8957bafe6d70929e	B001	impatient_familial_cuckoo	cf71fa1048b611ea8957bafe6d70929e	B001-P1C1	TGATGGCCTATTGGG	singlet	1246	204	11107	...	Mono	1.000000	CD14 Mono	1.000000	CD14 Mono	1.000000	CD14 Mono	1.000000	False	0.047836
3	cf71fb7848b611ea8957bafe6d70929e	B001	long_weakminded_roebuck	cf71fb7848b611ea8957bafe6d70929e	B001-P1C1	TGATGGCCTATTGGG	singlet	1118	77	12990	...	CD4 T	0.995058	CD4 TCM	0.950569	CD4 TCM	0.950569	CD4 TCM_1	0.684051	False	0.040517
4	cf71ffba48b611ea8957bafe6d70929e	B001	dastardly_wintery_airedale	cf71ffba48b611ea8957bafe6d70929e	B001-P1C1	TGATGGCCTATTGGG	singlet	1965	363	15979	...	Mono	1.000000	CD14 Mono	1.000000	CD14 Mono	1.000000	CD14 Mono	1.000000	False	0.046076

5 rows × 38 columns

In [24]:

cat_obs.columns

Out[24]:

Index(['barcodes', 'batch_id', 'cell_name', 'cell_uuid', 'chip_id',
       'hto_barcode', 'hto_category', 'n_genes', 'n_mito_umis', 'n_reads',
       'n_umis', 'original_barcodes', 'pbmc_sample_id', 'pool_id', 'well_id',
       'sample.sampleKitGuid', 'cohort.cohortGuid', 'subject.subjectGuid',
       'subject.biologicalSex', 'subject.race', 'subject.ethnicity',
       'subject.birthYear', 'sample.visitName', 'sample.drawDate', 'file.id',
       'subject.cmv', 'subject.bmi', 'celltypist.low', 'seurat.l1',
       'seurat.l1.score', 'seurat.l2', 'seurat.l2.score', 'seurat.l2.5',
       'seurat.l2.5.score', 'seurat.l3', 'seurat.l3.score',
       'predicted_doublet', 'doublet_score'],
      dtype='object')

Finally, we need to reinstate the cell barcodes as the index of the DataFrame, and substitute our new metadata for the observations in the AnnData object.

In [25]:

cat_obs = cat_obs.set_index('barcodes', drop = False)

In [26]:

adata.obs = cat_obs

In [27]:

out_dir = 'output'
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

Save combined object to .h5ad

In [28]:

out_h5ad = 'output/ref_pbmc_raw_{d}.h5ad'.format(d = date.today())
adata.write_h5ad(out_h5ad)

We'll also save just the observations as a .csv file in case we need it for generating figures about the full, raw dataset and labels.

For CSV, we don't need the version with categorical columns.

In [ ]:

out_csv = 'output/ref_pbmc_raw_meta_{d}.csv'.format(d = date.today())
obs.to_csv(out_csv)

Upload assembled data to HISE¶

Finally, we'll use hisepy.upload.upload_files() to send a copy of our output to HISE to use for downstream analysis steps.

In [30]:

study_space_uuid = '64097865-486d-43b3-8f94-74994e0a72e0'
title = 'Assembled Raw PBMC .h5ad {d}'.format(d = date.today())

In [39]:

obs_uuid_list = list(obs_uuids.values())[0]

In [40]:

in_files = [sample_meta_file_uuid] + [obs_uuid_list] + meta_data['file.id'].to_list()

In [42]:

out_files = [out_h5ad, out_csv]

In [43]:

hisepy.upload.upload_files(
    files = out_files,
    study_space_id = study_space_uuid,
    title = title,
    input_file_ids = in_files
)

you are trying to upload file_ids... ['output/ref_pbmc_raw_2024-02-18.h5ad', 'output/ref_pbmc_raw_meta_2024-02-18.csv']. Do you truly want to proceed?

Out[43]:

{'trace_id': 'a0719172-a68d-4b02-896e-9eaa02a0710b',
 'files': ['output/ref_pbmc_raw_2024-02-18.h5ad',
  'output/ref_pbmc_raw_meta_2024-02-18.csv']}

In [44]:

import session_info
session_info.show()

Out[44]:

Click to view session information

-----
anndata             0.10.3
h5py                3.10.0
hisepy              0.3.0
numpy               1.24.0
pandas              2.1.4
scanpy              1.9.6
scipy               1.11.4
session_info        1.0.0
-----

Click to view modules imported as dependencies

PIL                         10.0.1
anyio                       NA
arrow                       1.3.0
asttokens                   NA
attr                        23.2.0
attrs                       23.2.0
babel                       2.14.0
beatrix_jupyterlab          NA
brotli                      NA
cachetools                  5.3.1
certifi                     2023.11.17
cffi                        1.16.0
charset_normalizer          3.3.2
cloudpickle                 2.2.1
colorama                    0.4.6
comm                        0.1.4
cryptography                41.0.7
cycler                      0.10.0
cython_runtime              NA
dateutil                    2.8.2
db_dtypes                   1.1.1
debugpy                     1.8.0
decorator                   5.1.1
defusedxml                  0.7.1
deprecated                  1.2.14
exceptiongroup              1.2.0
executing                   2.0.1
fastjsonschema              NA
fqdn                        NA
google                      NA
greenlet                    2.0.2
grpc                        1.58.0
grpc_status                 NA
idna                        3.6
igraph                      0.10.8
importlib_metadata          NA
ipykernel                   6.28.0
ipython_genutils            0.2.0
ipywidgets                  8.1.1
isoduration                 NA
jedi                        0.19.1
jinja2                      3.1.2
joblib                      1.3.2
json5                       NA
jsonpointer                 2.4
jsonschema                  4.20.0
jsonschema_specifications   NA
jupyter_events              0.9.0
jupyter_server              2.12.1
jupyterlab_server           2.25.2
jwt                         2.8.0
kiwisolver                  1.4.5
leidenalg                   0.10.1
llvmlite                    0.41.0
lz4                         4.3.2
markupsafe                  2.1.3
matplotlib                  3.8.0
matplotlib_inline           0.1.6
mpl_toolkits                NA
mpmath                      1.3.0
natsort                     8.4.0
nbformat                    5.9.2
numba                       0.58.0
opentelemetry               NA
overrides                   NA
packaging                   23.2
parso                       0.8.3
pexpect                     4.8.0
pickleshare                 0.7.5
pkg_resources               NA
platformdirs                4.1.0
plotly                      5.18.0
prettytable                 3.9.0
prometheus_client           NA
prompt_toolkit              3.0.42
proto                       NA
psutil                      NA
ptyprocess                  0.7.0
pure_eval                   0.2.2
pyarrow                     13.0.0
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.9.5
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pygments                    2.17.2
pynvml                      NA
pyparsing                   3.1.1
pyreadr                     0.5.0
pythonjsonlogger            NA
pytz                        2023.3.post1
referencing                 NA
requests                    2.31.0
rfc3339_validator           0.1.4
rfc3986_validator           0.1.1
rpds                        NA
send2trash                  NA
shapely                     1.8.5.post1
six                         1.16.0
sklearn                     1.3.2
sniffio                     1.3.0
socks                       1.7.1
sql                         NA
sqlalchemy                  2.0.21
sqlparse                    0.4.4
stack_data                  0.6.2
sympy                       1.12
termcolor                   NA
texttable                   1.7.0
threadpoolctl               3.2.0
torch                       2.1.2+cu121
torchgen                    NA
tornado                     6.3.3
tqdm                        4.66.1
traitlets                   5.9.0
typing_extensions           NA
uri_template                NA
urllib3                     1.26.18
wcwidth                     0.2.12
webcolors                   1.13
websocket                   1.7.0
wrapt                       1.15.0
xarray                      2023.12.0
yaml                        6.0.1
zipp                        NA
zmq                         25.1.2
zoneinfo                    NA
zstandard                   0.22.0

-----
IPython             8.19.0
jupyter_client      8.6.0
jupyter_core        5.6.1
jupyterlab          4.0.10
notebook            6.5.4
-----
Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
Linux-5.15.0-1051-gcp-x86_64-with-glibc2.31
-----
Session information updated at 2024-02-18 22:17

In [ ]: