Create ISA-API Investigation from Datascriptor Study Design configuration

Crossover Study with two drug treatments on human subjects

In this notebook I will show you how you can use a study design configuration in JSON format, as produced by Datascriptor (https://gitlab.com/datascriptor/datascriptor), to generate a single-study ISA investigation, and how you can then serialise it in JSON and tabular (i.e. ISA-Tab) format.

Our study design configuration consists of:

a 4-arm study design. Each arm has 10 subjects
subjects are humans. There is an observational factor, named "status", with two values: "healthy" and "diseased"
a crossover of two drug treatments: an active treatment ("hypertena" 20 mg/day for 14 days) and a control treatment ("placebo" 20 mg/day for 14 days)
three non-treatment phases: screen (7 days), washout (14 days) and follow-up (180 days)
two sample types collected: blood and saliva
two assay types:
- DNA methylation profiling using nucleic acid sequencing on saliva samples
- clinical chemistry with marker panel on blood samples
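To make the shape of this design concrete, here is a minimal plain-Python sketch of the four arms. It assumes that the arms come from crossing the two "status" values with the two possible treatment orders, and that each arm runs screen, first treatment, washout, second treatment, follow-up in that order. The dictionary layout and the names `build_arm`, `STATUSES` and `TREATMENT_ORDERS` are hypothetical; this is not the Datascriptor schema.

```python
from itertools import product

# Assumption: arms = status x treatment order (not read from the real config).
STATUSES = ["healthy", "diseased"]
TREATMENT_ORDERS = [("hypertena", "placebo"), ("placebo", "hypertena")]


def build_arm(status, order):
    """Return one arm as an ordered list of (element name, duration in days)."""
    first, second = order
    return {
        "status": status,
        "elements": [
            ("screen", 7),
            (first, 14),
            ("washout", 14),
            (second, 14),
            ("follow-up", 180),
        ],
    }


arms = [build_arm(status, order) for status, order in product(STATUSES, TREATMENT_ORDERS)]

for index, arm in enumerate(arms):
    sequence = " -> ".join(name for name, _ in arm["elements"])
    total_days = sum(days for _, days in arm["elements"])
    print(index, arm["status"], sequence, total_days)
```

The arm indices printed here are only positions in the list; they are not meant to match the GRP0 to GRP3 labels that ISA-API generates.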

1. Setup

Let's import all the required libraries

In [23]:

from time import time
import os
import json

In [24]:

## ISA-API related imports
from isatools.model import Investigation, Study

In [25]:

## ISA-API create mode related imports
from isatools.create.model import StudyDesign
from isatools.create.connectors import generate_study_design

# serializer from ISA Investigation to JSON
from isatools.isajson import ISAJSONEncoder

# ISA-Tab serialisation
from isatools import isatab

In [26]:

## ISA-API create mode related imports
from isatools.create import model
from isatools import isajson

2. Load the Study Design JSON configuration

First of all, we load the study design configuration with all the specs defined above.

In [27]:

with open(os.path.abspath(os.path.join(
    "isa-study-design-as-json", "datascriptor", "crossover-study-human.json"
)), "r") as config_file:
    study_design_config = json.load(config_file)
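Before handing the configuration to ISA-API it can be useful to look at what was loaded. The sketch below summarises the top-level entries of any JSON object file without assuming a particular schema. The helper `summarise_config` and the keys in the stand-in file (`name`, `arms`, `subjectSize`) are hypothetical and are not taken from the real Datascriptor format.

```python
import json
import os
import tempfile


def summarise_config(path):
    """Load a JSON object and report (type name, size) for each top-level key."""
    with open(path, "r") as fp:
        config = json.load(fp)
    summary = {}
    for key, value in config.items():
        size = len(value) if isinstance(value, (list, dict, str)) else None
        summary[key] = (type(value).__name__, size)
    return summary


# Stand-in configuration with made-up keys, written to a temporary file.
demo_config = {"name": "demo design", "arms": [{}, {}, {}, {}], "subjectSize": 10}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo-config.json")
    with open(demo_path, "w") as fp:
        json.dump(demo_config, fp)
    config_summary = summarise_config(demo_path)

for key, (kind, size) in config_summary.items():
    print(key, kind, size)
```

Pointing `summarise_config` at the real configuration file should work the same way, provided its top level is a JSON object.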

3. Generate the ISA Study Design from the JSON configuration

To perform the conversion we just need to call the function generate_study_design(), imported above from isatools.create.connectors (the name may be subject to change).

In [28]:

study_design = generate_study_design(study_design_config)
assert isinstance(study_design, StudyDesign)

4. Generate the ISA Study from the StudyDesign and embed it into an ISA Investigation

The StudyDesign.generate_isa_study() method returns the complete ISA-API Study object.

In [29]:

start = time()
study = study_design.generate_isa_study()
end = time()
print('The generation of the study design took {:.2f} s.'.format(end - start))
assert isinstance(study, Study)
investigation = Investigation(identifier='inv01', studies=[study])

The generation of the study design took 2.02 s.
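The same start/end timing pattern recurs in the following cells. As an optional alternative, here is a small sketch of a reusable context manager built on `time.perf_counter`. The name `timed` and the `results` dictionary are my own additions, and the timed workload is just a stand-in for a call such as `generate_isa_study()`.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label, results=None):
    """Print how long the enclosed block took; optionally store it in results."""
    start = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print("{} took {:.2f} s.".format(label, elapsed))


timings = {}
with timed("Summing one million integers", timings):
    total = sum(range(1_000_000))
```

`perf_counter` is meant for measuring intervals, so it is usually a safer choice than `time()` for this kind of benchmark.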

5. Serialize and save the JSON representation of the generated ISA Investigation

In [30]:

start = time()
inv_json = json.dumps(investigation, cls=ISAJSONEncoder, sort_keys=True, indent=4, separators=(',', ': '))
end = time()
print('The JSON serialisation of the ISA investigation took {:.2f} s.'.format(end - start))

The JSON serialisation of the ISA investigation took 0.55 s.

In [31]:

directory = os.path.abspath(os.path.join('output', 'crossover-2-treatments-mice'))
os.makedirs(directory, exist_ok=True)
with open(os.path.abspath(os.path.join(directory, 'isa-investigation-crossover-2-treatments-mice.json')), 'w') as out_fp:
    json.dump(json.loads(inv_json), out_fp)
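Note that parsing the JSON string and dumping it again, as done above, writes a compact file and discards the indentation chosen in `json.dumps`. If the pretty-printed layout should be kept, the string can be written to disk directly. The sketch below shows the idea with a toy class and a toy encoder (`Sample` and `SampleEncoder` are hypothetical stand-ins, not part of ISA-API).

```python
import json
import os
import tempfile


class Sample:
    """Toy stand-in for a model object that json cannot serialise natively."""

    def __init__(self, name, organism_part):
        self.name = name
        self.organism_part = organism_part


class SampleEncoder(json.JSONEncoder):
    """Toy encoder: turns Sample objects into plain dictionaries."""

    def default(self, o):
        if isinstance(o, Sample):
            return {"name": o.name, "organismPart": o.organism_part}
        return super().default(o)


sample = Sample("GRP0_SBJ01_A0E1_SMP-Blood-Sample-1", "Blood Sample")
text = json.dumps(sample, cls=SampleEncoder, sort_keys=True, indent=4)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")
    with open(path, "w") as out_fp:
        out_fp.write(text)  # keeps the indentation produced by json.dumps
    with open(path, "r") as in_fp:
        on_disk = in_fp.read()

reloaded = json.loads(on_disk)
print(on_disk)
```

Either approach produces equivalent JSON content; the difference is only in how readable the file is on disk.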

6. Dump the ISA Investigation to ISA-Tab

In [32]:

start = time()
isatab.dump(investigation, directory)
end = time()
print('The Tab serialisation of the ISA investigation took {:.2f} s.'.format(end - start))

The Tab serialisation of the ISA investigation took 21.58 s.

To work with the tables directly in the notebook, we can instead dump them to pandas DataFrames, using the dump_tables_to_dataframes function rather than dump.

In [33]:

dataframes = isatab.dump_tables_to_dataframes(investigation)

In [34]:

len(dataframes)

Out[34]:

3

7. Check the correctness of the ISA-Tab DataFrames

We have 1 study file and 2 assay files (one for DNA methylation profiling and one for the clinical chemistry marker panel). Let's check the names:

In [35]:

for key in dataframes.keys():
    display(key)

's_study_01.txt'

'a_AT5_DNA-methylation-profiling_nucleic-acid-sequencing.txt'

'a_AT11_clinical-chemistry_marker-panel.txt'
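ISA-Tab file names carry a prefix that tells the table type: `s_` for study files and `a_` for assay files. The following sketch splits the names shown above into their parts. Treating the first token after `a_` as an assay identifier and the remaining tokens as descriptive labels is my reading of these three names, not a documented ISA-API rule, and `describe_isatab_file` is a hypothetical helper.

```python
def describe_isatab_file(filename):
    """Classify an ISA-Tab table file name by its prefix and split its parts."""
    stem = filename[:-4] if filename.endswith(".txt") else filename
    prefix, _, rest = stem.partition("_")
    if prefix == "s":
        return {"kind": "study", "identifier": rest, "labels": []}
    if prefix == "a":
        parts = rest.split("_")
        return {"kind": "assay", "identifier": parts[0], "labels": parts[1:]}
    return {"kind": "unknown", "identifier": stem, "labels": []}


file_names = [
    "s_study_01.txt",
    "a_AT5_DNA-methylation-profiling_nucleic-acid-sequencing.txt",
    "a_AT11_clinical-chemistry_marker-panel.txt",
]
descriptions = {name: describe_isatab_file(name) for name in file_names}

for name, info in descriptions.items():
    print(info["kind"], info["identifier"], info["labels"])
```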

7.1 Count of subjects and samples

We have 10 subjects in each of the 4 arms, for a total of 40 subjects.

We collect:

5 blood samples per subject (50 samples per arm * 4 arms = 200 total samples)
2 saliva samples per subject (20 samples per arm * 4 arms = 80 total samples)

Across the 4 study arms a total of 280 samples are collected (70 samples per arm).
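The expected counts can be worked out with a few lines of arithmetic. The per-subject numbers (5 blood and 2 saliva samples) are the ones that match the study and assay tables in this notebook; the constant names are only illustrative.

```python
ARMS = 4
SUBJECTS_PER_ARM = 10
SAMPLES_PER_SUBJECT = {"blood": 5, "saliva": 2}

# Samples of each type collected within one arm, then across all arms.
per_arm = {kind: n * SUBJECTS_PER_ARM for kind, n in SAMPLES_PER_SUBJECT.items()}
totals = {kind: n * ARMS for kind, n in per_arm.items()}

samples_per_arm = sum(per_arm.values())
total_samples = sum(totals.values())

print("per arm:", per_arm, "->", samples_per_arm)
print("all arms:", totals, "->", total_samples)
```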

In [36]:

study_frame = dataframes['s_study_01.txt']
count_arm0_samples = len(study_frame[study_frame['Source Name'].apply(lambda el: 'GRP0' in el)])
count_arm1_samples = len(study_frame[study_frame['Source Name'].apply(lambda el: 'GRP1' in el)])
count_arm2_samples = len(study_frame[study_frame['Source Name'].apply(lambda el: 'GRP2' in el)])
count_arm3_samples = len(study_frame[study_frame['Source Name'].apply(lambda el: 'GRP3' in el)])
print("There are {} samples in the GRP0 arm (i.e. group)".format(count_arm0_samples))
print("There are {} samples in the GRP1 arm (i.e. group)".format(count_arm1_samples))
print("There are {} samples in the GRP2 arm (i.e. group)".format(count_arm2_samples))
print("There are {} samples in the GRP3 arm (i.e. group)".format(count_arm3_samples))

There are 70 samples in the GRP0 arm (i.e. group)
There are 70 samples in the GRP1 arm (i.e. group)
There are 70 samples in the GRP2 arm (i.e. group)
There are 70 samples in the GRP3 arm (i.e. group)
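The four near-identical lines above can be collapsed into one pass over the source names. The sketch below uses plain Python and synthetic names as a stand-in for the `Source Name` column (7 rows per subject, as in this study). It also anchors the group label with a regular expression, which avoids a pitfall of substring tests: `'GRP1' in el` would also match a hypothetical `GRP10`.

```python
import re
from collections import Counter

GROUP_PATTERN = re.compile(r"^(GRP\d+)_")


def count_rows_per_group(source_names):
    """Count rows per group label, taken from the start of each source name."""
    counts = Counter()
    for name in source_names:
        match = GROUP_PATTERN.match(name)
        counts[match.group(1) if match else "unmatched"] += 1
    return counts


# Synthetic stand-in for study_frame['Source Name']: 4 arms, 10 subjects, 7 rows each.
source_names = [
    "GRP{}_SBJ{:02d}".format(group, subject)
    for group in range(4)
    for subject in range(1, 11)
    for _ in range(7)
]
arm_counts = count_rows_per_group(source_names)

for group in sorted(arm_counts):
    print("There are {} samples in the {} arm (i.e. group)".format(arm_counts[group], group))
```

With the real DataFrame, `count_rows_per_group(study_frame['Source Name'])` should give the same per-arm figures as the cell above.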

7.2 Study Table Overview

The study table provides an overview of the subjects (sources) and samples

In [37]:

study_frame

Out[37]:

	Source Name	Characteristics[Study Subject]	Term Accession Number	Characteristics[status]	Protocol REF	Parameter Value[Sampling order]	Parameter Value[Study cell]	Date	Performer	Sample Name	Characteristics[organism part]	Term Accession Number.1	Comment[study step with treatment]	Factor Value[Sequence Order]	Factor Value[DURATION]	Unit	Factor Value[AGENT]	Factor Value[INTENSITY]	Unit.1
0	GRP0_SBJ01	Homo sapiens	http://purl.obolibrary.org/obo/NCBITaxon_9606	healthy	sample collection	031	A0E3	2021-06-30	Unknown	GRP0_SBJ01_A0E3_SMP-Blood-Sample-1	Blood Sample	http://purl.obolibrary.org/obo/NCIT_C17610	YES	3	14	days	placebo	20.0	mg/day
1	GRP0_SBJ01	Homo sapiens	http://purl.obolibrary.org/obo/NCBITaxon_9606	healthy	sample collection	001	A0E1	2021-06-30	Unknown	GRP0_SBJ01_A0E1_SMP-Saliva-Sample-1	Saliva Sample	http://purl.obolibrary.org/obo/NCIT_C174119	YES	1	14	days	hypertena	20.0	mg/day
2	GRP0_SBJ01	Homo sapiens	http://purl.obolibrary.org/obo/NCBITaxon_9606	healthy	sample collection	042	A0E4	2021-06-30	Unknown	GRP0_SBJ01_A0E4_SMP-Blood-Sample-2	Blood Sample	http://purl.obolibrary.org/obo/NCIT_C17610	NO	4	180	days
3	GRP0_SBJ01	Homo sapiens	http://purl.obolibrary.org/obo/NCBITaxon_9606	healthy	sample collection	011	A0E1	2021-06-30	Unknown	GRP0_SBJ01_A0E1_SMP-Blood-Sample-1	Blood Sample	http://purl.obolibrary.org/obo/NCIT_C17610	YES	1	14	days	hypertena	20.0	mg/day
4	GRP0_SBJ01	Homo sapiens	http://purl.obolibrary.org/obo/NCBITaxon_9606	healthy	sample collection	043	A0E4	2021-06-30	Unknown	GRP0_SBJ01_A0E4_SMP-Blood-Sample-3	Blood Sample	http://purl.obolibrary.org/obo/NCIT_C17610	NO	4	180	days
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
275	GRP3_SBJ10	Homo sapiens	http://purl.obolibrary.org/obo/NCBITaxon_9606	diseased	sample collection	242	A3E3	2021-06-30	Unknown	GRP3_SBJ10_A3E3_SMP-Blood-Sample-1	Blood Sample	http://purl.obolibrary.org/obo/NCIT_C17610	YES	3	14	days	hypertena	20.0	mg/day
276	GRP3_SBJ10	Homo sapiens	http://purl.obolibrary.org/obo/NCBITaxon_9606	diseased	sample collection	256	A3E4	2021-06-30	Unknown	GRP3_SBJ10_A3E4_SMP-Blood-Sample-3	Blood Sample	http://purl.obolibrary.org/obo/NCIT_C17610	NO	4	180	days
277	GRP3_SBJ10	Homo sapiens	http://purl.obolibrary.org/obo/NCBITaxon_9606	diseased	sample collection	254	A3E4	2021-06-30	Unknown	GRP3_SBJ10_A3E4_SMP-Blood-Sample-1	Blood Sample	http://purl.obolibrary.org/obo/NCIT_C17610	NO	4	180	days
278	GRP3_SBJ10	Homo sapiens	http://purl.obolibrary.org/obo/NCBITaxon_9606	diseased	sample collection	232	A3E3	2021-06-30	Unknown	GRP3_SBJ10_A3E3_SMP-Saliva-Sample-1	Saliva Sample	http://purl.obolibrary.org/obo/NCIT_C174119	YES	3	14	days	hypertena	20.0	mg/day
279	GRP3_SBJ10	Homo sapiens	http://purl.obolibrary.org/obo/NCBITaxon_9606	diseased	sample collection	212	A3E1	2021-06-30	Unknown	GRP3_SBJ10_A3E1_SMP-Saliva-Sample-1	Saliva Sample	http://purl.obolibrary.org/obo/NCIT_C174119	YES	1	14	days	placebo	20.0	mg/day

280 rows × 19 columns
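The generated sample names follow a regular pattern, so they can be split into their parts when filtering the table. The sketch below is based only on the names visible above. Reading `A0E3` as "arm 0, element 3" is my assumption (it matches the `Parameter Value[Study cell]` column), and `parse_sample_name` is a hypothetical helper.

```python
import re

SAMPLE_NAME_PATTERN = re.compile(
    r"^GRP(?P<group>\d+)_SBJ(?P<subject>\d+)"
    r"_A(?P<arm>\d+)E(?P<element>\d+)"
    r"_SMP-(?P<sample_type>.+)-(?P<index>\d+)$"
)


def parse_sample_name(name):
    """Split a generated sample name into its parts, or return None if it does not match."""
    match = SAMPLE_NAME_PATTERN.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    for key in ("group", "subject", "arm", "element", "index"):
        parts[key] = int(parts[key])
    return parts


first = parse_sample_name("GRP0_SBJ01_A0E3_SMP-Blood-Sample-1")
last = parse_sample_name("GRP3_SBJ10_A3E1_SMP-Saliva-Sample-1")
print(first)
print(last)
```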

7.3 First Assay: DNA Methylation Profiling using nucleic acid sequencing

This assay takes saliva samples as input.

In [38]:

dataframes['a_AT5_DNA-methylation-profiling_nucleic-acid-sequencing.txt']

Out[38]:

	Sample Name	Comment[study step with treatment]	Protocol REF	Parameter Value[cross linking]	Parameter Value[DNA fragmentation]	Parameter Value[DNA fragment size]	Parameter Value[immunoprecipitation antibody]	Performer	Extract Name	Characteristics[extract type]	Protocol REF.1	Parameter Value[instrument]	Parameter Value[library_orientation]	Parameter Value[library_strategy]	Parameter Value[library_selection]	Parameter Value[multiplex identifier]	Performer.1	Raw Data File
0	GRP0_SBJ01_A0E1_SMP-Saliva-Sample-1	YES	extraction	di-tert-butyl peroxide	nebulization	a	d	Unknown	AT5-S81-Extract-R1	gDNA	library_preparation	GridION	single	MBD-Seq	MF	f	Unknown	AT5-S81-raw_data_file-R2.raw
1	GRP0_SBJ01_A0E1_SMP-Saliva-Sample-1	YES	extraction	uv-light	nebulization	a	d	Unknown	AT5-S1-Extract-R1	DNA	library_preparation	GridION	paired	MBD-Seq	MF	f	Unknown	AT5-S1-raw_data_file-R1.raw
2	GRP0_SBJ01_A0E1_SMP-Saliva-Sample-1	YES	extraction	di-tert-butyl peroxide	nebulization	a	d	Unknown	AT5-S81-Extract-R1	gDNA	library_preparation	GridION	paired	MBD-Seq	MF	f	Unknown	AT5-S81-raw_data_file-R3.raw
3	GRP0_SBJ01_A0E1_SMP-Saliva-Sample-1	YES	extraction	uv-light	nebulization	a	d	Unknown	AT5-S1-Extract-R1	DNA	library_preparation	GridION	paired	MBD-Seq	MF	f	Unknown	AT5-S1-raw_data_file-R2.raw
4	GRP0_SBJ01_A0E1_SMP-Saliva-Sample-1	YES	extraction	di-tert-butyl peroxide	nebulization	a	d	Unknown	AT5-S81-Extract-R1	gDNA	library_preparation	GridION	paired	MBD-Seq	MF	f	Unknown	AT5-S81-raw_data_file-R4.raw
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1275	GRP3_SBJ10_A3E3_SMP-Saliva-Sample-1	YES	extraction	di-tert-butyl peroxide	nebulization	a	d	Unknown	AT5-S152-Extract-R1	gDNA	library_preparation	GridION	single	MBD-Seq	MF	f	Unknown	AT5-S152-raw_data_file-R1.raw
1276	GRP3_SBJ10_A3E3_SMP-Saliva-Sample-1	YES	extraction	di-tert-butyl peroxide	nebulization	a	d	Unknown	AT5-S152-Extract-R2	DNA	library_preparation	GridION	single	MBD-Seq	MF	f	Unknown	AT5-S152-raw_data_file-R5.raw
1277	GRP3_SBJ10_A3E3_SMP-Saliva-Sample-1	YES	extraction	di-tert-butyl peroxide	nebulization	a	d	Unknown	AT5-S152-Extract-R2	DNA	library_preparation	GridION	single	MBD-Seq	MF	f	Unknown	AT5-S152-raw_data_file-R6.raw
1278	GRP3_SBJ10_A3E3_SMP-Saliva-Sample-1	YES	extraction	di-tert-butyl peroxide	nebulization	a	d	Unknown	AT5-S152-Extract-R1	gDNA	library_preparation	GridION	paired	MBD-Seq	MF	f	Unknown	AT5-S152-raw_data_file-R3.raw
1279	GRP3_SBJ10_A3E3_SMP-Saliva-Sample-1	YES	extraction	uv-light	nebulization	a	d	Unknown	AT5-S72-Extract-R2	gDNA	library_preparation	GridION	single	MBD-Seq	MF	f	Unknown	AT5-S72-raw_data_file-R6.raw

1280 rows × 18 columns

7.3.1 Nucleic acid sequencing stats

For this assay we have 80 saliva samples (2 for each of the 40 subjects). From these, 320 distinct extracts are produced (an average of 4 per sample), and the library preparation step yields 1280 raw data files (an average of 4 per extract), one for each row of the assay table. Three parameters take two values each: cross linking, extract type and library orientation. All sequencing is run on a GridION instrument.

In [39]:

dataframes['a_AT5_DNA-methylation-profiling_nucleic-acid-sequencing.txt'].nunique(axis=0, dropna=True)

Out[39]:

Sample Name                                        80
Comment[study step with treatment]                  1
Protocol REF                                        1
Parameter Value[cross linking]                      2
Parameter Value[DNA fragmentation]                  1
Parameter Value[DNA fragment size]                  1
Parameter Value[immunoprecipitation antibody]       1
Performer                                           1
Extract Name                                      320
Characteristics[extract type]                       2
Protocol REF.1                                      1
Parameter Value[instrument]                         1
Parameter Value[library_orientation]                2
Parameter Value[library_strategy]                   1
Parameter Value[library_selection]                  1
Parameter Value[multiplex identifier]               1
Performer.1                                         1
Raw Data File                                    1280
dtype: int64
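A quick sanity check on these numbers is to confirm that each level of the assay fans out evenly into the next. The sketch below copies three counts from the output above into a plain dictionary; `fan_out` is a hypothetical helper, and the ratios it returns are averages, not a statement about how ISA-API assigns replicates.

```python
# Unique counts copied from the nunique() output above.
sequencing_counts = {"Sample Name": 80, "Extract Name": 320, "Raw Data File": 1280}


def fan_out(counts, parent, child):
    """Return how many child items exist per parent, requiring an even split."""
    ratio, remainder = divmod(counts[child], counts[parent])
    if remainder:
        raise ValueError("{} does not divide evenly into {}".format(child, parent))
    return ratio


extracts_per_sample = fan_out(sequencing_counts, "Sample Name", "Extract Name")
files_per_extract = fan_out(sequencing_counts, "Extract Name", "Raw Data File")
saliva_samples_expected = 40 * 2  # 40 subjects, 2 saliva samples each

print(extracts_per_sample, files_per_extract, saliva_samples_expected)
```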

7.4 Second Assay: Clinical Chemistry Marker Panel

This assay takes blood samples as input

In [40]:

dataframes['a_AT11_clinical-chemistry_marker-panel.txt']

Out[40]:

	Sample Name	Comment[study step with treatment]	Protocol REF	Performer	Raw Data File
0	GRP0_SBJ01_A0E1_SMP-Blood-Sample-1	YES	sample preparation	Unknown	AT11-S1-raw_data_file-R1
1	GRP0_SBJ01_A0E3_SMP-Blood-Sample-1	YES	sample preparation	Unknown	AT11-S11-raw_data_file-R1
2	GRP0_SBJ01_A0E4_SMP-Blood-Sample-1	NO	sample preparation	Unknown	AT11-S21-raw_data_file-R1
3	GRP0_SBJ01_A0E4_SMP-Blood-Sample-2	NO	sample preparation	Unknown	AT11-S22-raw_data_file-R1
4	GRP0_SBJ01_A0E4_SMP-Blood-Sample-3	NO	sample preparation	Unknown	AT11-S23-raw_data_file-R1
...	...	...	...	...	...
195	GRP3_SBJ10_A3E1_SMP-Blood-Sample-1	YES	sample preparation	Unknown	AT11-S152-raw_data_file-R1
196	GRP3_SBJ10_A3E3_SMP-Blood-Sample-1	YES	sample preparation	Unknown	AT11-S162-raw_data_file-R1
197	GRP3_SBJ10_A3E4_SMP-Blood-Sample-1	NO	sample preparation	Unknown	AT11-S174-raw_data_file-R1
198	GRP3_SBJ10_A3E4_SMP-Blood-Sample-2	NO	sample preparation	Unknown	AT11-S175-raw_data_file-R1
199	GRP3_SBJ10_A3E4_SMP-Blood-Sample-3	NO	sample preparation	Unknown	AT11-S176-raw_data_file-R1

200 rows × 5 columns

7.4.1 Marker Panel Stats

For this assay we use 200 blood samples (5 for each of the 40 subjects). Each sample goes through one sample preparation process, producing a total of 200 raw data files.

In [41]:

dataframes['a_AT11_clinical-chemistry_marker-panel.txt'].nunique(axis=0, dropna=True)

Out[41]:

Sample Name                           200
Comment[study step with treatment]      2
Protocol REF                            1
Performer                               1
Raw Data File                         200
dtype: int64
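Equal unique counts for `Sample Name` and `Raw Data File`, together with a row count of 200, indicate a one-to-one pairing between samples and files. The sketch below shows how such a pairing can be verified, using the first five rows of the table above as stand-in data; `is_one_to_one` is a hypothetical helper.

```python
def is_one_to_one(pairs):
    """True if every left value pairs with exactly one right value and vice versa."""
    distinct_pairs = set(pairs)
    lefts = {left for left, _ in distinct_pairs}
    rights = {right for _, right in distinct_pairs}
    return len(lefts) == len(rights) == len(distinct_pairs)


# (Sample Name, Raw Data File) pairs from the first five rows of the assay table.
head_rows = [
    ("GRP0_SBJ01_A0E1_SMP-Blood-Sample-1", "AT11-S1-raw_data_file-R1"),
    ("GRP0_SBJ01_A0E3_SMP-Blood-Sample-1", "AT11-S11-raw_data_file-R1"),
    ("GRP0_SBJ01_A0E4_SMP-Blood-Sample-1", "AT11-S21-raw_data_file-R1"),
    ("GRP0_SBJ01_A0E4_SMP-Blood-Sample-2", "AT11-S22-raw_data_file-R1"),
    ("GRP0_SBJ01_A0E4_SMP-Blood-Sample-3", "AT11-S23-raw_data_file-R1"),
]

head_is_one_to_one = is_one_to_one(head_rows)
print(head_is_one_to_one)
```

On the full DataFrame the same check could be run over `zip(frame['Sample Name'], frame['Raw Data File'])`.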