The Python version of the HISE SDK allows retrieval of sample metadata regardless of pipeline sample processing. We'll use this SDK to pull clinical lab results (CMV status, height, and weight) for the subjects in our reference.
datetime: Used to add today's date to our output files
hisepy: the HISE SDK
os: Operating System files (used to make an output folder)
pandas: DataFrames for Python
session_info: displays the versions of Python and all of the packages we used
warnings: Used to suppress some annoying warnings that don't impact data retrieval
from datetime import date
import hisepy
import os
import pandas
import session_info
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
out_dir = 'output'
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)
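As an aside, os.makedirs() accepts an exist_ok argument, so the check-and-create step above can be collapsed into a single call. A minimal sketch, using a temporary parent directory (an assumption made only so the example does not touch real output):

```python
import os
import tempfile

# Throwaway parent directory, used here only for illustration.
parent = tempfile.mkdtemp()
out_dir = os.path.join(parent, 'output')

# exist_ok=True makes the call a no-op when the folder is already present.
os.makedirs(out_dir, exist_ok=True)
os.makedirs(out_dir, exist_ok=True)  # the second call does not raise

created = os.path.isdir(out_dir)
```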
sample_meta_file_uuid = '2da66a1a-17cc-498b-9129-6858cf639caf'
res = hisepy.reader.read_files([sample_meta_file_uuid])
sample_meta = res['values']
First, we make a dictionary (with curly braces) that defines what we want to get. In this case, we want all of the samples from the subjects listed in our sample metadata. In this query, subject is found as 'subjectGuid'.
Each entry in the dictionary has to be a list (square brackets), even if it has a single entry.
subject_ids = sample_meta['subject.subjectGuid'].tolist()
query_dict = {
    'subjectGuid': subject_ids
}
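Because several samples can come from the same subject, the ID list may contain repeats. The sketch below builds the same kind of query dictionary from de-duplicated IDs. The subject IDs are invented, and whether HISE benefits from de-duplicated input is an assumption rather than documented behavior.

```python
import pandas

# Hypothetical stand-in for sample_meta; the IDs are invented for illustration.
sample_meta = pandas.DataFrame(
    {'subject.subjectGuid': ['SUBJ-A', 'SUBJ-B', 'SUBJ-A']}
)

# unique() keeps first-seen order; tolist() yields the plain list the query needs.
subject_ids = sample_meta['subject.subjectGuid'].unique().tolist()

# Every value in the query dictionary is a list, even for a single entry.
query_dict = {'subjectGuid': subject_ids}
```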
Now, we send this dictionary to HISE via hisepy
sample_data = hisepy.reader.read_samples(
    query_dict = query_dict
)
What we get back is a dictionary containing multiple kinds of information.
We can see what these are called with the .keys() method:
sample_data.keys()
dict_keys(['metadata', 'specimens', 'survey', 'labResults'])
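To get a feel for what came back, it can help to look at the size of each entry. The sketch below uses a tiny invented dictionary in place of the real result and assumes that every entry is a pandas DataFrame (the steps here only rely on that for 'labResults').

```python
import pandas

# Hypothetical miniature of the returned dictionary; real tables are far larger.
sample_data = {
    'metadata': pandas.DataFrame({'subjectGuid': ['SUBJ-A']}),
    'labResults': pandas.DataFrame(
        {'subjectGuid': ['SUBJ-A', 'SUBJ-A'], 'Height': ['170', None]}
    ),
}

# Record (rows, columns) for each table.
shapes = {name: table.shape for name, table in sample_data.items()}
for name, shape in shapes.items():
    print(name, shape)
```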
CMV and the weight and height data we'll need for our analysis are available in labResults.
CMV status is stored as 'CMV IgG Serology Result Interpretation' or 'CMV Ab Screen Result'.
Height and weight are simply 'Height' and 'Weight'.
Let's select columns from these results to get just subject, sample kit, CMV status, height, and weight.
For CMV status, the results can be assigned to different samples, and to different columns. We'll select only rows where there's a result and then check that we have results for all of our subjects.
keep_columns = ['subjectGuid', 'CMV IgG Serology Result Interpretation', 'CMV Ab Screen Result']
cmv_data = sample_data['labResults'][keep_columns]
Keep rows where either of the result columns has a Positive or Negative value:
keep_rows = []
keep_values = ['Negative','Positive']
for i in range(cmv_data.shape[0]):
    ig = cmv_data['CMV IgG Serology Result Interpretation'][i]
    ab = cmv_data['CMV Ab Screen Result'][i]
    if (ig in keep_values) or (ab in keep_values):
        keep_rows.append(True)
    else:
        keep_rows.append(False)
cmv_data = cmv_data.iloc[keep_rows,:]
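The same row filter can also be written without an explicit loop. This is a sketch of an alternative, not the approach used above, and the lab values in it are invented for illustration.

```python
import pandas

# Hypothetical lab results; the values are invented for illustration.
cmv_data = pandas.DataFrame({
    'subjectGuid': ['SUBJ-A', 'SUBJ-A', 'SUBJ-B', 'SUBJ-C'],
    'CMV IgG Serology Result Interpretation': ['Positive', None, None, 'Equivocal'],
    'CMV Ab Screen Result': [None, None, 'Negative', None],
})

keep_values = ['Negative', 'Positive']
result_columns = ['CMV IgG Serology Result Interpretation', 'CMV Ab Screen Result']

# isin() tests every cell; any(axis=1) flags rows with at least one match.
keep_rows = cmv_data[result_columns].isin(keep_values).any(axis=1)
filtered = cmv_data.loc[keep_rows]
```

Because the mask is a boolean Series aligned on the index, this version does not depend on the rows being numbered 0 to n-1.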
Keep rows where the subjects correspond to the subjects under study:
keep_rows = cmv_data['subjectGuid'].isin(sample_meta['subject.subjectGuid'])
cmv_data = cmv_data.iloc[keep_rows.tolist(),:]
cmv_data.shape
(115, 3)
Some subjects have multiple entries. We'll combine these, and assign Positive if any entry is Positive, and Negative otherwise.
ig_results = []
ab_results = []
cmv_results = []
for s in sample_meta['subject.subjectGuid'].tolist():
    # All lab result rows for this subject
    sample_cmv = cmv_data.iloc[cmv_data.subjectGuid.isin([s]).tolist(),:]
    ig = sample_cmv['CMV IgG Serology Result Interpretation'].tolist()
    ab = sample_cmv['CMV Ab Screen Result'].tolist()
    # Summarize the IgG serology results
    if 'Positive' in ig:
        ig_results.append('Positive')
    elif 'Negative' in ig:
        ig_results.append('Negative')
    else:
        ig_results.append('NA')
    # Summarize the Ab screen results
    if 'Positive' in ab:
        ab_results.append('Positive')
    elif 'Negative' in ab:
        ab_results.append('Negative')
    else:
        ab_results.append('NA')
    # Positive if any entry from either test is Positive
    both = ig + ab
    if 'Positive' in both:
        cmv_results.append('Positive')
    else:
        cmv_results.append('Negative')
cmv_df = pandas.DataFrame(
    {
        'subject.subjectGuid': sample_meta['subject.subjectGuid'].tolist(),
        'CMV IgG Serology Result Interpretation': ig_results,
        'CMV Ab Screen Result': ab_results,
        'subject.cmv': cmv_results
    }
)
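The "Positive if any entry is Positive" rule can also be expressed with a groupby. The sketch below is an alternative formulation on invented data, not the code used to produce the results here. Unlike the loop, it only returns subjects that have at least one row in the table.

```python
import pandas

# Hypothetical filtered lab results; the values are invented for illustration.
cmv_data = pandas.DataFrame({
    'subjectGuid': ['SUBJ-A', 'SUBJ-A', 'SUBJ-B'],
    'CMV IgG Serology Result Interpretation': ['Negative', 'Positive', None],
    'CMV Ab Screen Result': [None, None, 'Negative'],
})
result_columns = ['CMV IgG Serology Result Interpretation', 'CMV Ab Screen Result']

# True for rows where either result column is Positive.
row_positive = cmv_data[result_columns].eq('Positive').any(axis=1)

# A subject is Positive if any of its rows is Positive, Negative otherwise.
subject_positive = row_positive.groupby(cmv_data['subjectGuid']).any()
subject_cmv = subject_positive.map({True: 'Positive', False: 'Negative'})
```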
cmv_file = 'output/ref_subject_cmv_lab_results_{date}.csv'.format(date = date.today())
cmv_df.to_csv(cmv_file)
Height and weight can be selected using the specific sampleKitGuid that corresponds to the samples in our reference.
keep_columns = ['subjectGuid', 'sampleKitGuid', 'Height', 'Weight']
hw_data = sample_data['labResults'][keep_columns]
keep_rows = hw_data['sampleKitGuid'].isin(sample_meta['sample.sampleKitGuid']).tolist()
hw_data = hw_data.iloc[keep_rows,:]
There are some duplicate entries with missing values. We'll remove these.
keep_rows = [not x for x in hw_data['Height'].isnull()]
hw_data = hw_data.iloc[keep_rows,:]
hw_data.shape
(104, 4)
Based on the number of remaining rows, we'll be missing values for 4 samples.
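Rather than inferring the gap from row counts alone, the affected sample kits can be listed directly. A minimal sketch with invented kit IDs, assuming the same column names used above:

```python
import pandas

# Hypothetical stand-ins for sample_meta and hw_data; the IDs are invented.
sample_meta = pandas.DataFrame(
    {'sample.sampleKitGuid': ['KT001', 'KT002', 'KT003']}
)
hw_data = pandas.DataFrame({
    'sampleKitGuid': ['KT001', 'KT003'],
    'Height': ['170', '182'],
})

# Flag reference samples that have a height/weight row, then keep the rest.
has_hw = sample_meta['sample.sampleKitGuid'].isin(hw_data['sampleKitGuid'])
missing_kits = sample_meta.loc[~has_hw, 'sample.sampleKitGuid'].tolist()
```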
We can compute BMI from Height (in cm) and Weight (in kg):
BMI = weight / (height^2), where weight is in kg and height is in meters.
Since Height is recorded in cm, we'll use BMI = Weight / ( (Height / 100)^2 ) to convert it to meters first.
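As a quick worked example of the formula, a height of 180 cm and a weight of 81 kg give 81 / (1.8^2) = 81 / 3.24 = 25. In Python the ^ operator is bitwise XOR, so the square is written with ** or pow(). The function name below is made up for this illustration.

```python
def bmi_from_cm_kg(height_cm, weight_kg):
    """Return BMI for a height in centimeters and a weight in kilograms."""
    height_m = height_cm / 100  # convert cm to meters
    return weight_kg / height_m ** 2

# 180 cm and 81 kg: 81 / 3.24, which is 25 up to floating point rounding.
example_bmi = bmi_from_cm_kg(180.0, 81.0)
```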
h_results = []
w_results = []
bmi_results = []
for s in sample_meta['subject.subjectGuid'].tolist():
    sample_hw = hw_data.iloc[hw_data.subjectGuid.isin([s]).tolist(),:]
    if sample_hw.shape[0] == 0:
        # No height/weight entry for this subject
        h_results.append('NA')
        w_results.append('NA')
        bmi_results.append('NA')
    else:
        h = sample_hw['Height'].tolist()[0]
        if h == '':
            h_results.append('NA')
        else:
            h = float(h)
            h_results.append(h)
        w = sample_hw['Weight'].tolist()[0]
        if w == '':
            w_results.append('NA')
        else:
            w = float(w)
            w_results.append(w)
        # h or w is still a str only if its value was empty
        if isinstance(h, str) or isinstance(w, str):
            bmi_results.append('NA')
        else:
            bmi = w / ( pow(h / 100,2) )
            bmi_results.append(bmi)
len(bmi_results)
108
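For comparison, the conversion and BMI calculation can be done on whole columns at once. This sketch is an alternative on invented data, not the loop used above. It represents missing values as NaN rather than the string 'NA', which changes how they appear in a CSV.

```python
import pandas

# Hypothetical height/weight rows stored as text; the values are invented.
hw_data = pandas.DataFrame({
    'subjectGuid': ['SUBJ-A', 'SUBJ-B', 'SUBJ-C'],
    'Height': ['180', '', '165'],
    'Weight': ['81', '70', ''],
})

# errors='coerce' turns '' and other non-numeric text into NaN.
height_cm = pandas.to_numeric(hw_data['Height'], errors='coerce')
weight_kg = pandas.to_numeric(hw_data['Weight'], errors='coerce')

# NaN propagates, so rows missing either measurement get a NaN BMI.
bmi = weight_kg / (height_cm / 100) ** 2
```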
hw_df = pandas.DataFrame(
    {
        'subject.subjectGuid': sample_meta['subject.subjectGuid'].tolist(),
        'sample.sampleKitGuid': sample_meta['sample.sampleKitGuid'].tolist(),
        'Height': h_results,
        'Weight': w_results,
        'subject.bmi': bmi_results
    }
)
hw_df = hw_df.sort_values('subject.subjectGuid')
hw_file = 'output/ref_subject_bmi_results_{date}.csv'.format(date = date.today())
hw_df.to_csv(hw_file)
Finally, we'll use hisepy.upload.upload_files() to send a copy of our output to HISE to use for downstream analysis steps.
study_space_uuid = '64097865-486d-43b3-8f94-74994e0a72e0'
title = 'PBMC Ref. CMV and BMI clinical labs {d}'.format(d = date.today())
in_files = [sample_meta_file_uuid]
out_files = [cmv_file, hw_file]
hisepy.upload.upload_files(
    files = out_files,
    study_space_id = study_space_uuid,
    title = title,
    input_file_ids = in_files
)
you are trying to upload file_ids... ['output/ref_subject_cmv_lab_results_2024-02-18.csv', 'output/ref_subject_bmi_results_2024-02-18.csv']. Do you truly want to proceed?
{'trace_id': 'f8ff0c05-b47c-4476-a98c-fa580728f317', 'files': ['output/ref_subject_cmv_lab_results_2024-02-18.csv', 'output/ref_subject_bmi_results_2024-02-18.csv']}
session_info.show()
----- hisepy 0.3.0 pandas 2.1.4 session_info 1.0.0 -----
PIL 10.0.1 anyio NA arrow 1.3.0 asttokens NA attr 23.2.0 attrs 23.2.0 babel 2.14.0 beatrix_jupyterlab NA brotli NA cachetools 5.3.1 certifi 2023.11.17 cffi 1.16.0 charset_normalizer 3.3.2 cloudpickle 2.2.1 colorama 0.4.6 comm 0.1.4 cryptography 41.0.7 cycler 0.10.0 cython_runtime NA dateutil 2.8.2 db_dtypes 1.1.1 debugpy 1.8.0 decorator 5.1.1 defusedxml 0.7.1 deprecated 1.2.14 exceptiongroup 1.2.0 executing 2.0.1 fastjsonschema NA fqdn NA google NA greenlet 2.0.2 grpc 1.58.0 grpc_status NA h5py 3.10.0 idna 3.6 importlib_metadata NA ipykernel 6.28.0 ipython_genutils 0.2.0 ipywidgets 8.1.1 isoduration NA jedi 0.19.1 jinja2 3.1.2 json5 NA jsonpointer 2.4 jsonschema 4.20.0 jsonschema_specifications NA jupyter_events 0.9.0 jupyter_server 2.12.1 jupyterlab_server 2.25.2 jwt 2.8.0 kiwisolver 1.4.5 markupsafe 2.1.3 matplotlib 3.8.0 matplotlib_inline 0.1.6 mpl_toolkits NA nbformat 5.9.2 numpy 1.26.2 opentelemetry NA overrides NA packaging 23.2 parso 0.8.3 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA platformdirs 4.1.0 plotly 5.18.0 prettytable 3.9.0 prometheus_client NA prompt_toolkit 3.0.42 proto NA psutil NA ptyprocess 0.7.0 pure_eval 0.2.2 pyarrow 13.0.0 pydev_ipython NA pydevconsole NA pydevd 2.9.5 pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.17.2 pyparsing 3.1.1 pyreadr 0.5.0 pythonjsonlogger NA pytz 2023.3.post1 referencing NA requests 2.31.0 rfc3339_validator 0.1.4 rfc3986_validator 0.1.1 rpds NA send2trash NA shapely 1.8.5.post1 six 1.16.0 sniffio 1.3.0 socks 1.7.1 sql NA sqlalchemy 2.0.21 sqlparse 0.4.4 stack_data 0.6.2 termcolor NA tornado 6.3.3 tqdm 4.66.1 traitlets 5.9.0 typing_extensions NA uri_template NA urllib3 1.26.18 wcwidth 0.2.12 webcolors 1.13 websocket 1.7.0 wrapt 1.15.0 xarray 2023.12.0 yaml 6.0.1 zipp NA zmq 25.1.2 zoneinfo NA zstandard 0.22.0
----- IPython 8.19.0 jupyter_client 8.6.0 jupyter_core 5.6.1 jupyterlab 4.0.10 notebook 6.5.4 ----- Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] Linux-5.15.0-1051-gcp-x86_64-with-glibc2.31 ----- Session information updated at 2024-02-18 02:15