#!/usr/bin/env python
# coding: utf-8

# Jupyter Notebooks can sometimes be hard to work with. A few magic commands come in handy when things don't seem to work out. The following cell reloads all changed modules.

# In[1]:

get_ipython().run_line_magic('load_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')


# # Table of Contents
# 1. [Installation](#installation)
# 2. [Motivation](#motivation)
# 3. [Importing a dataset](#importing)
# 4. [Structure](#structure)
# 5. [Saving and loading a dataset](#saving)

# ## Installation

# In[2]:

# !pip install MRdataset


# In[3]:

from pathlib import Path
from MRdataset import import_dataset  # if this import works, the install succeeded

# check the installed version
# !pip show MRdataset

# upgrade to the latest release
# !pip install --upgrade MRdataset


# ## Motivation

# Large-scale neuroimaging datasets play an essential role in studying brain-behavior relationships. While neuroimaging studies have shown promising results, reproducibility can be affected by differences in acquisition parameters at the scanner level. The motivation behind creating MRdataset is to provide a unified interface to access image acquisition data across various formats such as XNAT, BIDS, and LONI.
#
# A unified interface to image acquisition data is important because it allows users to access and manipulate the data easily and consistently, regardless of the specific format or source of the data. This saves time and reduces the potential for errors, as users do not need to deal with the nuances of different data formats or sources. In addition, a unified interface makes it easier to integrate image acquisition data with other systems and processes, allowing for more efficient and effective analysis and use of the data.

# ## Importing a dataset

# Let's jump straight into a concrete example. We will work through this tutorial with an example DICOM dataset; note that the outputs will look quite different for your own data. The library also supports BIDS datasets, and example code for a BIDS dataset is shown later in this tutorial.
#
# Let's get started!

# A dataset can be imported from disk simply by using the function `import_dataset`. Note that it accepts a `name` for the dataset, as well as a `ds_format`, which can be either `dicom` or `bids`.

# In[4]:

import zipfile

MRdataset_root = Path(__name__).resolve().parents[1]
DATA_ARCHIVE = MRdataset_root / 'MRdataset/tests/resources/example_dicom_data.zip'
DATA_ROOT = Path('/tmp/')
with zipfile.ZipFile(DATA_ARCHIVE, 'r') as zip_ref:
    zip_ref.extractall(DATA_ROOT)


# In[5]:

DATA_ARCHIVE.is_file()


# In[6]:

import tempfile

tmp_output_dir = tempfile.gettempdir()
config_filepath = MRdataset_root / 'MRdataset/tests/resources/mri-config.json'
dicom_dataset = import_dataset(data_source=DATA_ROOT/'example_dicom_data',
                               ds_format='dicom',
                               name='dummy_study_experiment',
                               config_path=config_filepath,
                               output_dir=tmp_output_dir)


# If a dataset is empty, there is no data stored in it. This is a problem if the dataset is supposed to contain data needed for a particular analysis or task: the absence of data can prevent the analysis from being performed, or lead to incorrect or incomplete results.
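# As a quick sanity check after importing, you can confirm that the dataset is non-empty. Below is a minimal sketch that uses the `get_sequence_ids()` method introduced later in this tutorial; adapt the check to your own pipeline:
#
# ```python
# seq_ids = list(dicom_dataset.get_sequence_ids())
# if not seq_ids:
#     print(f"{dicom_dataset.name} appears to be empty -- "
#           "check data_source and config_path")
# ```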
# Using `print()` to see the contents of a dataset is handy because it lets the user quickly and easily view the data, without having to write additional code to extract and display it.

# In[7]:

print(dicom_dataset)


# It prints concise information about the dataset, i.e. the number of subjects in the dataset (10), as well as the number of sessions. Any file identified as a localizer is skipped. In general, localizers are not required, but if you still want to include them in your dataset, you can specify so in the config file.

# ## Structure

# Going further, let's dig deeper into the elements present in our dataset. Describing the elements of a dataset is essential because it provides context and information about the data it contains. This is useful for other users or researchers working with the MRdataset, as it allows them to understand the structure and how it should be interpreted. In addition, describing the elements of a dataset helps ensure that the data structure is used correctly and consistently, and it facilitates the integration of the dataset with other data sources or systems.
#
# The library has a hierarchical structure, as displayed below:

# ![MRdataset hierarchy](images/hierarchy.jpg "MRdataset hierarchy")

# The above figure shows a simple schematic depicting the structure of an MRdataset object.
#
# Different MRI modalities, such as T1-weighted, T2-weighted, and diffusion-weighted imaging, can provide different types of information about the structure and composition of tissues in the body. Additionally, MRI scans are often performed on multiple subjects, such as healthy individuals and patients with a specific condition, in order to compare and contrast the differences in their anatomy and physiology. This can help researchers better understand the underlying mechanisms of a particular condition or disease, and develop more effective treatments.
#
# Accordingly, the MRdataset object is a hierarchical data structure made up of the different elements of a neuroimaging experiment, such as modalities, subjects, sessions, and runs. Each element is represented as a node in a tree, and the edges connect the nodes to show the hierarchical relationship between data elements.
#
# The **dataset** is at the top of the tree, with the various modalities, like T1-weighted, T2-weighted, and diffusion-weighted, branching out of it. The term **modality/sequence** refers to the specific technique used to acquire the imaging data. Each modality contains several subjects that are part of the experiments. Note that different modalities typically share common subjects, in order to compare and contrast the differences in their brain anatomy and function.
#
# Each **subject** may have one or more sessions for a modality. The term **session** refers to a specific imaging session performed on a given subject. Typically, there are multiple sessions in order to obtain multiple sets of data for a given subject. Often, a subject returns to the MR research center several times over a span of 1-2 years, which helps in tracking longitudinal changes in the brain.
#
# Finally, a **run** refers to a specific set of imaging data acquired during a given session. Often, a single session involves multiple runs to obtain a comprehensive acquisition. For example, an fMRI experiment may involve multiple runs, each of which acquires information about a particular brain region or uses a different behavioral task.
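# To make the hierarchy concrete, here is a minimal sketch of how the elements nest, using plain Python dicts. This is illustrative only; the actual MRdataset classes carry far more information than this:
#
# ```python
# # dataset -> modality/sequence -> subject -> session -> runs
# dataset = {
#     '3D_T2_FLAIR': {                         # modality/sequence
#         'sub-01': {                          # subject
#             'ses-01': ['run-01'],            # session with a single run
#             'ses-02': ['run-01', 'run-02'],  # a later session with two runs
#         },
#         'sub-02': {
#             'ses-01': ['run-01'],
#         },
#     },
# }
# ```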
# Our `dicom_dataset` object follows this hierarchical structure, and the library provides methods for accessing each of these elements: `traverse_horizontal` and `traverse_vertical2`. We can traverse through all the subjects for a given sequence/modality using the `traverse_horizontal` method, and through pairs of sequences for a given subject using the `traverse_vertical2` method. Let's see an example.

# In[8]:

print(f"{dicom_dataset.name} dataset contains the following modalities:")
for sequence in dicom_dataset.get_sequence_ids():
    print('\t', sequence)


# We can browse through each of the modalities to see that they contain several subjects.

# In[9]:

seq_name = '3D_T2_FLAIR'
for subject, session, run, sequence in dicom_dataset.traverse_horizontal(seq_name):
    print(f"Subject: {subject},\nSession: {session},\nRun: {run},\nSequence: {sequence}")
    break


# Note that it returns a protocol.BaseSequence object. This object contains information about the sequence, such as its name and the various acquisition parameters used to acquire the data. For more details about the Sequence object, please refer to the documentation of the protocol library.
#
# Similarly, we can use the `traverse_vertical2` method to traverse through all pairs of two given sequences for each subject. This is helpful, for example, for retrieving EPI and fieldmap scans for a given subject. Let's see an example.

# In[10]:

seq_id1 = 'me_fMRI'
seq_id2 = 'me_FieldMap_GRE'
for subject, session, run1, run2, seq1, seq2 in dicom_dataset.traverse_vertical2(seq_id1, seq_id2):
    print(seq1)
    print(seq2)
    break


# ## A similar example with a BIDS dataset

# In[11]:

import zipfile

MRdataset_root = Path(__name__).resolve().parents[1]
DATA_ARCHIVE = MRdataset_root / 'MRdataset/tests/resources/example_bids_data.zip'
DATA_ROOT = Path('/tmp/')
with zipfile.ZipFile(DATA_ARCHIVE, 'r') as zip_ref:
    zip_ref.extractall(DATA_ROOT)
print((DATA_ROOT/'example_bids_dataset').exists())

bids_dataset = import_dataset(data_source=DATA_ROOT/'example_bids_dataset',
                              ds_format='bids')


# In[12]:

print(bids_dataset)


# In[13]:

seq_name = 'func'
for subject, session, run, sequence in bids_dataset.traverse_horizontal(seq_name):
    print(f"Subject: {subject},\nSession: {session},\nRun: {run},\nSequence: {sequence}")


# ## Saving and Loading a dataset

# Saving and loading a dataset is important because it allows you to store and retrieve your data for later use. This is especially useful when you have a large dataset that takes a long time to process or generate, or when you want to share your dataset with others.
#
# By saving your dataset, you avoid having to recreate it each time you want to use it, which can save a significant amount of time and resources. Additionally, storing your dataset in a structured and organized way makes it easier to analyze and manipulate later on.
#
# Overall, the ability to save and load a dataset is a valuable tool that helps you work more efficiently and effectively with your data.
#
# We use `save_mr_dataset` and `load_mr_dataset` to save and load MRdataset objects, respectively. The function `save_mr_dataset` takes the filepath to which the dataset is to be saved; the dataset is saved with the extension *.mrds.pkl*. Let's see an example.

# In[16]:

from MRdataset import save_mr_dataset, load_mr_dataset

save_mr_dataset(filepath=DATA_ROOT/'example_dicom.mrds.pkl', mrds_obj=dicom_dataset)


# If you receive an error like this:
# ```python
# PicklingError: Can't pickle : it's not the same object as protocol.imaging.DwellTime
# ```
# the issue is due to the Jupyter Notebook environment. You can fix the problem by restarting the kernel and running the notebook again. This is a known issue with Jupyter Notebook. If you are using a Python script, you won't face this issue.

# We can read back the dataset we just saved to disk.

# In[17]:

saved_dataset = load_mr_dataset(filepath=DATA_ROOT/'example_dicom.mrds.pkl')
print(saved_dataset)


# Cross-checking that the saved dataset is the same as the original ensures the integrity of the data. It verifies that the data was not modified or corrupted while saving, improving the reliability of your analysis and modeling.
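# One way to perform such a cross-check is to compare a few high-level attributes of the two objects. A minimal sketch, assuming that comparing the dataset name and its sequence ids is sufficient for your purposes (a stricter, parameter-level comparison may be warranted):
#
# ```python
# assert saved_dataset.name == dicom_dataset.name
# assert sorted(saved_dataset.get_sequence_ids()) == \
#        sorted(dicom_dataset.get_sequence_ids())
# ```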
# The library is highly extensible, and a developer can extend it to support their own neuroimaging formats. For example, to create an interface for a new format, say NID (NeuroImaging Dataset), inherit from ``MRdataset.base.BaseDataset`` in a file ``NID_dataset.py``.

# In[18]:

from MRdataset import BaseDataset

class NIDDataset(BaseDataset):
    def __init__(self, data_source):
        super().__init__(data_source)

    def load(self):
        # walk the NID data source and populate subjects, sessions and runs here
        pass
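# Such a subclass could then be used just like the built-in formats. A hypothetical sketch (the data path and the explicit `load()` call are assumptions for illustration):
#
# ```python
# nid_dataset = NIDDataset(data_source='/path/to/nid_data')
# nid_dataset.load()
# print(nid_dataset)
# ```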