Access Saved Data

This tutorial shows Databroker 2.0.0, which is in prerelease. It is currently being evaluated and tested at some NSLS-II beamlines.

To run this, we first need to point Jupyter at a separate Python environment. Execute the line below, then click "Python 3" in the top-right corner of this notebook and choose "Python 3 (preview)" from the options that appear.

In [ ]:
!/srv/conda/envs/preview/bin/python -m ipykernel install --user --name preview --display-name "Python 3 (preview)"

In this tutorial we will acquire data with the Bluesky RunEngine, persist it in a database (MongoDB), and then use the Databroker/Tiled Python client to access it.

In [ ]:
from bluesky_tutorial_utils import setup_data_saving_future_version
from bluesky import RunEngine

RE = RunEngine()
catalog = setup_data_saving_future_version(RE)
In [ ]:
from bluesky.plans import scan, count
from ophyd.sim import det, motor, motor1, motor2

from bluesky.preprocessors import SupplementalData

# Record positions of motor1 and motor2 at the beginning and end of
# every run in the "baseline" stream.
sd = SupplementalData(baseline=[motor1, motor2])
RE.preprocessors.append(sd)
In [ ]:
RE(count([det], 3), purpose="calibration")
In [ ]:
RE(scan([det], motor, -1, 1, 5), mood="optimistic", sample={"color": "red", "composition": "Ni"})
In [ ]:
RE(scan([det], motor, -1, 1, 5), mood="skeptical", sample={"color": "red", "composition": "Ni"})
In [ ]:
(uid,) = RE(scan([det], motor, -1, 1, 5), mood="optimistic", sample={"color": "blue", "composition": "Cu"})

What can you do with a Catalog?

In [ ]:
catalog

Look up by recency.

In [ ]:
catalog[-1]

Look up by scan_id.

In [ ]:
catalog[1]

Look up by (partial) universally unique ID.

In [ ]:
uid
In [ ]:
catalog[uid]
In [ ]:
uid[:8]
In [ ]:
catalog[uid[:8]]

Iterate over entries like a dictionary.

In [ ]:
for uid, run in catalog.items():
    print(f"{uid[:8]}: {run}")

Or do anything you can do with a (read-only) dict. This shows that catalog implements Python's standard "mapping" interface.

In [ ]:
import collections.abc

isinstance(catalog, collections.abc.Mapping)
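
To make the "mapping" idea concrete, here is a minimal sketch of a read-only mapping built on Python's collections.abc.Mapping. ToyCatalog and its contents are hypothetical stand-ins, not Databroker's implementation. The point is only that defining __getitem__, __iter__, and __len__ is enough to get keys(), items(), get(), and "in" for free.

```python
from collections.abc import Mapping


class ToyCatalog(Mapping):
    """A hypothetical, minimal read-only catalog keyed by uid."""

    def __init__(self, runs):
        self._runs = dict(runs)

    def __getitem__(self, key):
        return self._runs[key]

    def __iter__(self):
        return iter(self._runs)

    def __len__(self):
        return len(self._runs)


# Made-up entries: short uid -> plan name.
toy = ToyCatalog({"6f3ee9a1": "count", "a1b2c3d4": "scan"})

# These all come from the Mapping base class, not from ToyCatalog itself.
uids = list(toy.keys())
pairs = list(toy.items())
has_scan = "a1b2c3d4" in toy
missing = toy.get("ffffffff", None)
```

Because ToyCatalog defines no __setitem__, an assignment such as toy["x"] = 1 raises a TypeError, which is what "read-only" means here.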

In summary:

catalog[-1]  # the most recent Run
catalog[-5]  # the fifth-most-recent Run
catalog[3]  # 'scan_id' == 3 (if ambiguous, returns the most recent match)
catalog["6f3ee9a1-ff4b-47ba-a439-9027cd9e6ced"]  # a full globally unique ID...
catalog["6f3ee9"]  # ...or just enough characters to uniquely identify it (6-8 usually suffices)

The globally unique ID is best for use in scripts, but the others are nice for interactive use. All of these incantations return a BlueskyRun.

In [ ]:
run = catalog[-1]
run
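
The lookup rules summarized above can be sketched in plain Python. This is a toy model under stated assumptions, not how Databroker resolves keys internally: lookup is a hypothetical helper, the runs are plain dicts listed oldest first, and all but the first uid are made up.

```python
def lookup(runs, key):
    """Toy lookup over a list of run dicts ordered oldest first.

    Hypothetical helper for illustration only.
    """
    if isinstance(key, int):
        if key < 0:
            # Negative integers count back from the most recent run.
            return runs[key]
        # Other integers are treated as a scan_id; if several runs
        # share it, prefer the most recent one.
        matches = [r for r in runs if r["scan_id"] == key]
        if not matches:
            raise KeyError(key)
        return matches[-1]
    # Strings are treated as a full or partial uid and must be unambiguous.
    matches = [r for r in runs if r["uid"].startswith(key)]
    if len(matches) != 1:
        raise KeyError(key)
    return matches[0]


runs = [
    {"uid": "6f3ee9a1-ff4b-47ba-a439-9027cd9e6ced", "scan_id": 1},
    {"uid": "b2c4d6e8-0000-4000-8000-000000000002", "scan_id": 2},
    {"uid": "c3d5e7f9-0000-4000-8000-000000000003", "scan_id": 1},  # scan_id reused
]

most_recent = lookup(runs, -1)
by_scan_id = lookup(runs, 1)
by_partial_uid = lookup(runs, "6f3ee9")
```

The reused scan_id in the third entry shows why the unique ID is the safer choice in scripts: scan_id 1 silently resolves to the newer run.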

Catalogs also support search.

In [ ]:
from databroker.queries import FullText, TimeRange, RawMongo  # more to come...
In [ ]:
catalog.search(RawMongo(start={"plan_name": "count"}))
In [ ]:
catalog.search(FullText("optimistic"))
In [ ]:
catalog.search(TimeRange(since="2020", until="2020-03-01", timezone="Canada/Central"))
In [ ]:
catalog.search(TimeRange(since="2020", timezone="Canada/Central"))

When you search on a Catalog, you get another Catalog with a subset of the entries. You can search on this in turn, progressively narrowing the results.

In [ ]:
catalog.search(RawMongo(start={"sample.color": "red"}))
In [ ]:
catalog.search(RawMongo(start={"sample.color": "red"})).search(FullText("optimistic"))
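
The "search returns another Catalog" pattern can be illustrated with plain dicts. This is a rough sketch, not Databroker's query engine: search and get_nested are hypothetical helpers, the entries mimic the metadata passed to the RunEngine earlier, and the toy supports only exact matches on dotted keys, with no full-text search.

```python
_MISSING = object()  # sentinel for "key not present"


def get_nested(doc, dotted_key):
    """Follow a MongoDB-style dotted key such as "sample.color" into nested dicts."""
    value = doc
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return _MISSING
        value = value[part]
    return value


def search(entries, query):
    """Return a new dict with only the entries that match every item in query."""
    return {
        uid: start
        for uid, start in entries.items()
        if all(get_nested(start, key) == value for key, value in query.items())
    }


entries = {
    "run-a": {"plan_name": "count", "purpose": "calibration"},
    "run-b": {"plan_name": "scan", "mood": "optimistic",
              "sample": {"color": "red", "composition": "Ni"}},
    "run-c": {"plan_name": "scan", "mood": "skeptical",
              "sample": {"color": "red", "composition": "Ni"}},
    "run-d": {"plan_name": "scan", "mood": "optimistic",
              "sample": {"color": "blue", "composition": "Cu"}},
}

# Each search returns the same kind of object, so searches can be chained.
red = search(entries, {"sample.color": "red"})
red_and_optimistic = search(red, {"mood": "optimistic"})
```

The original entries are never modified; each step produces a smaller view that can be searched again.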

Exercise

Try various searches.

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 

What can you do with a BlueskyRun?

A BlueskyRun bundles together some metadata and several logical tables ("streams") of data. First, the metadata. It always comes in two sections, "start" and "stop".

In [ ]:
run.start  # Everything we know before the measurements start.

The above contains a mixture of things that bluesky automatically recorded (e.g. the time), things the bluesky plan reported (e.g. which motor(s) are scanned), and things the user told us (e.g. the name of the operator).

In [ ]:
run.stop  # Everything we only know after the measurements stop.

You can dig into the contents in the usual way.

In [ ]:
run.start["num_points"]
In [ ]:
run.stop["exit_status"] == "success"
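
As a small sketch of combining the two sections, the helper below computes how long a run took and whether it succeeded. The start and stop values here are plain dicts with made-up contents standing in for run.start and run.stop; summarize is a hypothetical helper. It assumes the "time" entries are seconds since the Unix epoch, as Bluesky documents record them.

```python
from datetime import datetime, timezone

# Made-up stand-ins for run.start and run.stop.
start = {
    "uid": "6f3ee9a1-ff4b-47ba-a439-9027cd9e6ced",
    "time": 1583020800.0,
    "plan_name": "scan",
    "num_points": 5,
}
stop = {
    "time": 1583020812.5,
    "exit_status": "success",
    "num_events": {"primary": 5, "baseline": 2},
}


def summarize(start, stop):
    """Combine start and stop metadata into a short summary dict."""
    started_at = datetime.fromtimestamp(start["time"], tz=timezone.utc)
    return {
        # .get() tolerates keys that a given plan did not record.
        "plan_name": start.get("plan_name"),
        "started_at": started_at.isoformat(),
        "duration_s": stop["time"] - start["time"],
        "succeeded": stop.get("exit_status") == "success",
    }


summary = summarize(start, stop)
```

Using .get() for user-supplied or plan-specific keys avoids a KeyError when comparing runs that recorded different metadata.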

As we said, a Run bundles together any number of "streams" of data. Picture these as tables or spreadsheets. The stream names are shown when we print run.

In [ ]:
run

We can also list them programmatically.

In [ ]:
list(run)

We can access a particular stream like run["primary"].read(). Dot access also works — run.primary.read() — if the stream name is a valid Python identifier and does not collide with any other attributes.
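
To see why dot access has those two conditions, consider this toy container. ToyRun is hypothetical and is not how BlueskyRun is implemented; it only shows the general Python mechanism, where __getattr__ is consulted after normal attribute lookup fails.

```python
class ToyRun:
    """Hypothetical container offering both run["name"] and run.name access."""

    def __init__(self, streams):
        self._streams = dict(streams)

    def __getitem__(self, name):
        return self._streams[name]

    def __iter__(self):
        return iter(self._streams)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so real
        # attributes and methods always win over stream names.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._streams[name]
        except KeyError:
            raise AttributeError(name) from None

    def metadata(self):
        return "a method, not a stream"


toy_run = ToyRun({
    "primary": "primary table",
    "metadata": "stream hidden by the method of the same name",
    "my-stream": "name that is not a valid identifier",
})

same = toy_run.primary == toy_run["primary"]        # dot access works
collides = callable(toy_run.metadata)               # the method wins
valid_identifier = "my-stream".isidentifier()       # False: needs brackets
```

Bracket access works for every stream name, so it is the safer form in scripts; dot access is a convenience for interactive use.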

In [ ]:
ds = run["primary"].read()
ds

This is an xarray.Dataset. At this point Bluesky and Data Broker have served their purpose and handed us a useful, general-purpose scientific Python data structure with our data in it.

What can you do with an xarray.Dataset?

We can easily generate scatter plots of one dimension vs another.

In [ ]:
ds.plot.scatter(x="time", y="det")

We can pull out specific columns. (Each column in an xarray.Dataset is called an xarray.DataArray.)

In [ ]:
motor = ds["motor"]
motor

Inside this xarray.DataArray is a plain old numpy array.

In [ ]:
type(motor.values)

The extra context provided by xarray is very useful. Notice that the dimensions have names, so we can perform aggregations over named axes without remembering the order of the dimensions.
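
As a small illustration of named dimensions, the sketch below builds a tiny Dataset by hand and aggregates over "time" by name. The numbers are made up and toy_ds is only a stand-in for the ds read from the run above; it assumes numpy and xarray are installed.

```python
import numpy as np
import xarray as xr

# A hand-built stand-in for a "primary" stream with three readings.
toy_ds = xr.Dataset(
    {
        "det": ("time", np.array([1.0, 2.0, 3.0])),
        "motor": ("time", np.array([-1.0, 0.0, 1.0])),
    },
    coords={"time": np.array([0.0, 1.0, 2.0])},
)

# Aggregate over the dimension by name rather than by axis number.
mean_det = float(toy_ds["det"].mean(dim="time"))

# Find where the detector peaked, then read the motor at that index.
peak_index = int(toy_ds["det"].argmax(dim="time"))
motor_at_peak = float(toy_ds["motor"].isel(time=peak_index))
```

With a plain numpy array the same operations would need axis=0, which is easy to get wrong once the data has more than one dimension.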

The plot method on xarray.DataArray often just "does the right thing" based on the dimensionality of the data. It even labels our axes for us!

In [ ]:
motor.plot()

For a quick overview of xarray see the xarray documentation.

Exercises

  1. Coming back to our run
In [ ]:
run

read the "baseline" stream. The baseline stream conventionally includes readings taken just before and after a scan to record all potentially relevant positions and temperatures and to note whether they have drifted.

In [ ]:
# Try your solution here.
In [ ]:
%load solutions/access_baseline_data.py