#!/usr/bin/env python
# coding: utf-8

# ## About this resource
#
# All of the tutorial notebooks, as well as information about the dependent package (`nma-ibl`), can be found at the [nma-ibl GitHub repository](https://github.com/int-brain-lab/nma-ibl).

# ## Setting up the environment (particularly for Colab users)
#
# Please execute the cells below to install the necessary dependencies and prepare the environment.

# In[ ]:

# install the IBL pipeline package to access and navigate the pipeline
get_ipython().system('pip install --quiet nma-ibl')

# # Explore the IBL data pipeline
#
# Welcome! This notebook gives you a brief introduction to the [International Brain Laboratory (IBL)](https://www.internationalbrainlab.com/) data pipeline and provides quick guidance on how to explore it using [DataJoint](https://datajoint.io) -- the framework in which the pipeline has been developed.

# ## A little bit about DataJoint
#
# [DataJoint](https://datajoint.io) is a general data pipeline framework, offered in Python and MATLAB, that allows users to interact with a relational database (e.g. MySQL) intuitively and efficiently. Among many features, it offers a way to develop data pipelines with built-in integrity and consistency checks, and it provides facilities for defining computation tables, with tools to quickly parallelize the processing over multiple computers! To learn more, please visit [DataJoint.io](https://datajoint.io), where you will find links to interactive playgrounds and tutorials.
#
# [Vathes](https://vathes.com) and, in particular, [DataJoint Neuro](https://djneuro.io) form the team that develops and maintains DataJoint, and they also help researchers and labs adopt DataJoint and design pipelines.

# Using the DataJoint framework, the IBL project has been processing its data daily, and it provides a website displaying data and visualizations of the daily training results at https://data.internationalbrainlab.org.

# ## Navigating the IBL data pipeline
#
# Here we will go through some useful tools to navigate through, and get familiar with, the data as found in the pipeline.
#
# First things first, let's **import DataJoint and the IBL pipeline package**.

# In[ ]:

import datajoint as dj

# We'll then need to configure DataJoint to connect to the IBL public database hosted at datajoint-public.internationalbrainlab.org.
# Please use the public credentials `ibl-public`:

# In[ ]:

dj.config['database.host'] = 'datajoint-public.internationalbrainlab.org'
dj.config['database.user'] = 'ibl-public'
dj.config['database.password'] = 'ibl-public'

# Now we'll import the specific IBL data pipeline resources used to navigate and access the data (this may take some time to run, as it establishes access and fetches relevant resources from the database).

# In[ ]:

from nma_ibl import reference, subject, action, acquisition, data, behavior, behavior_analyses

# # 1. Browse schemas you have access to - `dj.list_schemas()`
#
# One or more related tables are grouped together into "schemas", providing logical structure and organization for the tables. Tables belonging to a schema are related to each other in an intuitive fashion.
#
# You can list the schemas found in the IBL data pipeline as follows:

# In[ ]:

dj.list_schemas()
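# `dj.list_schemas()` returns a plain Python list of schema names, so you can manipulate it like any list. For example, here is a small sketch to keep only the IBL schemas (assuming, as in the listing below, that they share the `ibl` prefix):

# In[ ]:

# keep only the schemas whose names start with "ibl"
ibl_schemas = [s for s in dj.list_schemas() if s.startswith('ibl')]
ibl_schemas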
# ## Major schemas
#
# Different schemas in the pipeline correspond to different groupings of related information, and they are typically associated with different *data sources*:
#
# 1. Metadata stored in and imported from [**Alyx**](https://alyx.readthedocs.io/en/latest/):
#     1) `ibl_reference`: general information about labs, users, and projects
#     2) `ibl_subject`: information about the experimental subjects, including housing and genetics
#     3) `ibl_action`: information about performed surgeries
#     4) `ibl_acquisition`: information about experimental sessions
#     5) `ibl_data`: information pertaining to underlying data resource handling
# 2. Behavioral data stored in and imported from the **FlatIron** data stores: `ibl_behavior` -- trial information for the task.
# 3. Computed analysis results on the behavioral data: `ibl_analyses_behavior`

# # 2. Browse tables in a schema - `dj.Diagram`
#
# `dj.Diagram` shows the tables and the dependencies among them. There are four types of DataJoint tables, referred to as different *tiers*, and several types of dependencies.

# ### Table tiers
#
# **Manual table**: (green box) Entries in a manual table are inserted "manually" or by recording systems. This is the kind of data that, if lost, basically cannot be recovered without proper backups. A typical example of a manual table is the `subject` table that stores information about animal subjects.
#
# **Lookup table**: (gray box) Lookup tables hold general facts that rarely change, such as a listing of possible genetic lines of animals, or parameter information for a model.
#
# **Imported table**: (blue oval) Entries in an imported table are loaded from an external data source, such as a recording file produced by the recording system. DataJoint offers a clean way to trigger and process the "ingestion" of such external data into an imported table. If deleted, the entries may be recoverable as long as the external data source still exists. Examples of imported tables are tables for experimental recordings (e.g. behavioral traces or electrophysiology signals).
#
# **Computed table**: (red circle) Entries in a computed table are computed from data entirely present in other tables within the pipeline. If deleted, the entries are typically easily recoverable through recomputation. Examples of computed tables are tables for behavioral analysis results.
#
# **Part table**: (plain text) Part tables are tightly linked to their master table, signifying that entries within a part table should not be deleted without a corresponding deletion in the master table. Data for part tables are usually imported through the master table's logic.

# ### Dependencies
#
# **One-to-one primary**: (thick solid line) Tables with a one-to-one primary dependency have the exact same definition of the primary key. Under this dependency, at most one entry can exist in the child table for every entry in the parent table.
#
# **One-to-many primary**: (thin solid line) The child table inherits the primary key definition from its parent, but has one or more additional fields as part of the primary key. Any number of child table entries may correspond to a single entry in the parent table.
#
# **Secondary foreign key reference**: (dashed line) A secondary attribute references another table, but does not establish a primary key dependency. If any attributes are renamed when the dependency is established, the link contains an orange dot.
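# For orientation, here is how these tiers and dependencies look when tables are declared in DataJoint code. The schema and table names below are hypothetical, purely for illustration -- declaring tables requires write access, so this is a sketch rather than something to run against the read-only public database:

# In[ ]:

# hypothetical declarations for illustration only (requires a writable database)
example_schema = dj.schema('tutorial_sketch')  # hypothetical schema name

@example_schema
class Mouse(dj.Manual):  # manual tier: drawn as a green box
    definition = """
    mouse_id : int        # unique mouse identifier
    ---
    dob      : date       # date of birth
    """

@example_schema
class Session(dj.Manual):
    definition = """
    -> Mouse              # one-to-many primary dependency (thin solid line)
    session_idx : int     # extra primary key attribute beyond the parent's
    ---
    experimenter : varchar(64)
    """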
# Below we'll see how you can use `dj.Diagram` to view various parts of a schema, as well as to combine multiple schemas into one diagram to capture dependencies across schemas.

# ### Show tables in the whole schema

# In[ ]:

dj.Diagram(reference)

# ### Show a diagram combining arbitrary parts of the pipeline
#
# A combination of arbitrary tables:

# In[ ]:

dj.Diagram(subject.Subject) + dj.Diagram(subject.Line) + dj.Diagram(subject.SubjectLab) + dj.Diagram(subject.SubjectProject)

# The tables may come from different schemas:

# In[ ]:

dj.Diagram(subject.Subject) + dj.Diagram(acquisition.Session)

# # 3. Getting the detailed definition of a table - `table.describe()`
#
# You can get the definition of any table by calling its `.describe()` method:

# In[ ]:

subject.Subject.describe();

# # 4. Browsing of data - queries
#
# We now show how you can navigate and access the contents of tables by writing *queries*. Below we provide a very brief overview of DataJoint queries with examples. For more thorough discussions and examples, please visit [DataJoint.io](https://datajoint.io) and the [DataJoint Tutorials](https://tutorials.datajoint.io).
#
# The simplest query is the table itself, which corresponds to all entries in the table.

# In[ ]:

subject.Subject()

# Notice that each query only gives you a quick preview of the data. We'll soon talk about how to "fetch" the data once you have formulated a query capturing the data you want to grab.

# ## Restriction `&`: filtering data
#
# The restriction operation narrows down a query to the entries matching certain criteria.

# ### Restriction: query one subject

# In[ ]:

# restrict by a string
subject.Subject & 'subject_nickname="IBL-T1"'

# In[ ]:

# restrict by a dictionary
subject.Subject & {'subject_uuid': 'cff66dc1-765e-4e04-abef-92de05bf4f19'}

# In[ ]:

subject.Subject & {'sex': 'm'}

# ### Restriction: query subjects born after a date

# In[ ]:

subject.Subject & 'subject_birth_date > "2019-01-01"'

# ### Restriction: subjects within a range of dates

# In[ ]:

subject.Subject & 'subject_birth_date between "2019-01-01" and "2019-04-01"'

# ### Restriction: query subjects on multiple attributes

# In[ ]:

subject.Subject & 'subject_birth_date > "2019-01-01"' & 'sex="M"'

# ### Restriction: query dead subjects

# In[ ]:

# subjects that are dead
subject.Subject & subject.Death

# In[ ]:

# subjects that are alive
subject.Subject - subject.Death

# ## Join `*`: gather information from different tables
#
# You can combine, or "join", two tables with the join operator (`*`). The result represents *every valid combination* of entries across the two tables.

# In[ ]:

subject.Subject * acquisition.Session

# ## Projection `.proj()`: focus on attributes of interest
#
# Projection narrows down a table by dropping unnecessary attributes (except for primary key attributes, which cannot be dropped).

# In[ ]:

# drop all but the primary key attributes
subject.Subject.proj()

# In[ ]:

# keep `subject_birth_date` and `sex`
subject.Subject.proj('subject_birth_date', 'sex')

# ### Rename attributes with `proj()`
#
# You can also use `proj` to rename attributes.

# In[ ]:

subject.Subject.proj('sex', dob='subject_birth_date')
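# The query operators compose freely, so restriction and projection can be chained in a single expression. A quick sketch using the same tables as above:

# In[ ]:

# female subjects, keeping only the nickname and date of birth
(subject.Subject & 'sex="F"').proj('subject_nickname', 'subject_birth_date')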
# ### Perform simple computations with `proj`
#
# Projection can also compute a new field on the fly from attributes present in the query.
#
# **Example 1: get the date of a session:**

# In[ ]:

sessions_with_date = acquisition.Session.proj(session_date='DATE(session_start_time)')

# In[ ]:

sessions_with_date

# **Example 2: what was the age of the animal when performing each session?**

# In[ ]:

# First, get the date of birth and the session date into the same query
q = subject.Subject * acquisition.Session

# In[ ]:

q

# In[ ]:

# Then compute the age
q_with_age = q.proj(age='DATEDIFF(session_start_time, subject_birth_date)')
q_with_age

# ### Aggregation `.aggr()`: simple computation of one table against another table
#
# The aggregation operation groups the entries of one table by the matching entries of another, and performs computations within each group.
#
# Example: how many sessions has each subject performed so far?

# In[ ]:

subject.Subject.aggr(acquisition.Session, 'subject_nickname', n='count(*)')

# ## Combining queries
#
# As you saw, the result of a query can itself be used in the next query without any restriction. This makes it possible to build fairly complex queries by combining multiple steps of simple queries!

# # 5. Fetching data
#
# Now that you have seen how to manipulate queries to capture exactly the data you are interested in, you can "fetch" that data to pull it into your working environment and start performing computations on it.

# ## Fetch all fields: `fetch()`
#
# You can fetch everything by calling `.fetch()` on the query. This returns a NumPy structured array by default.

# In[ ]:

# fetch all data from a table
subjs = subject.Subject.fetch()
subjs

# In[ ]:

subjs['subject_uuid']

# In[ ]:

subjs['subject_birth_date']

# Alternatively, you can fetch the entries as a list of dictionaries.

# In[ ]:

# fetch as a list of dictionaries
subjs_dict = subject.Subject.fetch(as_dict=True)
subjs_dict

# ...or as a [pandas](https://pandas.pydata.org/) dataframe (the `pandas` library has to be installed).

# In[ ]:

# fetch as a pandas dataframe
subjs_df = subject.Subject.fetch(format='frame')
subjs_df

# You can use the special keyword `'KEY'` to fetch just the primary keys as a list of dictionaries:

# In[ ]:

# fetch the primary key
pk = subject.Subject.fetch('KEY')
pk

# You can also specify the exact attributes you would like to retrieve, and they will be returned as their own (index-aligned) arrays:

# In[ ]:

# fetch specific attributes
dob, sex = subject.Subject.fetch('subject_birth_date', 'sex')

# In[ ]:

dob[:10]

# In[ ]:

sex[:10]

# If you pass `as_dict=True` while specifying attributes, you get back a list of dictionaries containing only those attributes:

# In[ ]:

info = subject.Subject.fetch('subject_birth_date', 'sex', as_dict=True)
info

# ## Fetch data from only one entry: `fetch1`
#
# Sometimes you know that your query matches exactly one entry. In that case, rather than getting back a list with a single element, you can retrieve the entry directly by using `fetch1` instead.

# In[ ]:

CSHL_015 = (subject.Subject & {'subject_nickname': 'CSHL_015'}).fetch1('KEY')  # "fetch1()" because we know there's only one

# In[ ]:

CSHL_015

# In[ ]:

IBL_T1 = (subject.Subject & {'subject_nickname': 'IBL-T1'}).fetch1()

# In[ ]:

IBL_T1
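# One more fetching tip: `fetch` also accepts keyword arguments such as `order_by` and `limit` to sort and cap the returned entries. A small sketch using the attributes from above:

# In[ ]:

# primary keys of the five most recently born subjects, newest first
subject.Subject.fetch('KEY', order_by='subject_birth_date DESC', limit=5)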
# # 6. Behavioral trial information
#
# Among the most interesting and important information in the IBL public data pipeline, as it stands, is the behavioral trial data for the two-alternative forced choice (2AFC) task.
#
# Let's go into a bit more detail on the relevant tables housing the behavioral information -- the `behavior` schema.
#
# Among the most important tables in this schema is `behavior.TrialSet`:

# In[ ]:

behavior.TrialSet()

# The `TrialSet` table stores summary statistics for each session of the behavioral experiment. To also show information about the subject, we can join the table with the `subject`-related tables.

# In[ ]:

behavior.TrialSet * subject.Subject * subject.SubjectLab * subject.SubjectProject

# We can restrict to one session, for example a behavioral session that started exactly at `2019-09-16 13:44:46`:

# In[ ]:

import datetime
q = behavior.TrialSet * subject.Subject * subject.SubjectLab * subject.SubjectProject & \
    'subject_nickname="CSHL_015"' & {'session_start_time': datetime.datetime(2019, 9, 16, 13, 44, 46)}
q

# Alternatively, you can use a string for the datetime value:

# In[ ]:

q = behavior.TrialSet * subject.Subject * subject.SubjectLab * subject.SubjectProject & \
    'subject_nickname="CSHL_015"' & {'session_start_time': "2019-9-16 13:44:46"}
q

# The trial-by-trial information is stored in a **part table** of `behavior.TrialSet`: `behavior.TrialSet.Trial`.

# In[ ]:

behavior.TrialSet.Trial()

# We can check a brief description of each of the columns with `describe()`:

# In[ ]:

behavior.TrialSet.Trial.describe();

# To fetch part of the data, we use the fetch method after formulating a query that narrows down the results.
#
# For example, you might like to see the stimulus contrasts and the animal's choices for the one session that we already queried and saved as `q`. To do so, we could do:

# In[ ]:

# fetch the fields as a list of dictionaries, only for trials with a response;
# q corresponds to the single session we narrowed down in the query above
data = (behavior.TrialSet.Trial & q & 'trial_response_choice !="No Go"').fetch(
    'trial_stim_contrast_left', 'trial_stim_contrast_right', 'trial_response_choice', as_dict=True)

# In[ ]:

# We can then convert the data to a dataframe for further analyses
import pandas as pd
df = pd.DataFrame(data)
df

# Now we can explore and perform computations on the fetched data.
#
# As an example, let us first compute the signed contrast, so that the presence of a contrast on the right is indicated by a positive value and on the left by a negative value.

# In[ ]:

df['signed_contrast'] = df['trial_stim_contrast_right'] - df['trial_stim_contrast_left']

# In[ ]:

# a counterclockwise ("CCW") wheel turn corresponds to reporting "right"
df['report_right'] = df['trial_response_choice'] == "CCW"

# In[ ]:

# average within each signed contrast; numeric_only skips the string-valued choice column
report_right = df.groupby('signed_contrast', as_index=False).mean(numeric_only=True)
report_right

# Now, let's plot the psychometric curve: the probability of reporting right versus the signed contrast.

# In[ ]:

import matplotlib.pyplot as plt

plt.plot(report_right['signed_contrast'], report_right['report_right'], 'o')
plt.xlabel('Signed Contrast')
plt.ylabel('Probability of reporting right')

# Some basic analyses, such as the computation of psychometric curves, have been performed in advance and saved in downstream tables such as `behavior_analyses.PsychResults`.
#
# You can find additional learning resources for DataJoint at [DataJoint.io](https://datajoint.io), and you can also sign up to the DataJoint Slack group (link on the website) to join the vibrant DataJoint user community!
#
# In the next notebook, we will step through how to fetch data from the computational analysis result tables present in the data pipeline to recreate the psychometric curves for any particular session.
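# As a sneak peek, those precomputed result tables can be queried with exactly the same operators covered above. A minimal sketch, restricting `behavior_analyses.PsychResults` by the single-session query `q` from earlier (use `describe()` to see which columns it offers):

# In[ ]:

# preview the precomputed psychometric results for the session captured by `q`
behavior_analyses.PsychResults & q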