Owner(s): Phil Marshall (@drphilmarshall), Rob Morgan (@rmorgan10)
Last Verified to Run: 2019-08-13
Verified Stack Release: 18.1
In this notebook we'll take a look at some of the datasets available on the LSST Science Platform.
After working through this tutorial, you should be able to use the stackclub.Taster to:
2. Report on the available data in a given dataset;
3. Plot the patches and tracts in a given dataset on the sky.
Outstanding Issue: The Taster augments the functionality of the Gen-2 butler, which provides limited capabilities for discovering what data actually exist. Specifically, the Taster relies heavily on the queryMetadata functionality of the Gen-2 butler, which covers only a small number of dataset types and does not actually guarantee that those datasets exist. Beware of over-interpreting the true existence of datasets queried by the Taster. This should improve greatly with the Gen-3 butler.
This notebook is intended to be runnable on lsst-lsp-stable.ncsa.illinois.edu from a local git clone of https://github.com/LSSTScienceCollaborations/StackClub.
We'll need the stackclub package to be installed. If you are not developing this package, you can install it using pip, like this:
pip install git+git://github.com/LSSTScienceCollaborations/StackClub.git#egg=stackclub
If you are developing the stackclub package (e.g. by adding modules to it to support the Stack Club tutorial that you are writing), you'll need to make a local, editable installation. In the top-level folder of the StackClub repo, do:
! cd .. && python setup.py -q develop --user && cd -
You may need to restart the kernel after doing this. When editing the stackclub package files, we want the latest version to be imported when we re-run the import command. To enable this, we need the %autoreload magic command.
%load_ext autoreload
%autoreload 2
To just get a taste of the data that the Butler will deliver for a chosen dataset, we have added a Taster class to the stackclub library. All needed imports are contained in that module, so we only need to import the stackclub library to work through this notebook.
import numpy as np
%matplotlib inline
import stackclub
You can check which version of the Stack this notebook is running by using eups list -s on the terminal command line:
# What version of the Stack am I using?
! echo $HOSTNAME
! eups list -s lsst_distrib
First, let's look at what is currently available. There are several shared data folders on the LSP: the read-only /datasets folder, the project-group-writeable folder /project/shared/data, and the Stack Club shared directory /project/stack-club. Let's take a look at what's in /project/shared/data. Specifically, we want to see butler-friendly data repositories, distinguished by their containing a file called _mapper or repositoryCfg.yaml in their top level.
/project/shared/data: These datasets are designed to be small test sets, ideal for tutorials.
shared_repos_with_mappers = ! ls -d /project/shared/data/*/_mapper | grep -v README | cut -d'/' -f1-5 | sort | uniq
shared_repos_with_yaml_files = ! ls -d /project/shared/data/*/repositoryCfg.yaml | grep -v README | cut -d'/' -f1-5 | sort | uniq
shared_repos = np.unique(shared_repos_with_mappers + shared_repos_with_yaml_files)
shared_repos
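As an aside, you could do the same search in pure Python. Here's a minimal sketch using glob; the variable name python_shared_repos is ours, chosen so that it doesn't clash with the shared_repos variable defined above:
import glob
# Look for either of the two Gen-2 repo marker files in each top-level folder:
candidates = glob.glob('/project/shared/data/*/_mapper') \
           + glob.glob('/project/shared/data/*/repositoryCfg.yaml')
# Keep just the parent repo directories, dropping duplicates:
python_shared_repos = sorted({path.rsplit('/', 1)[0] for path in candidates})
python_shared_repos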
for repo in shared_repos:
    ! du -sh $repo
/datasets: These are typically much bigger: to measure the size, uncomment the second cell below and edit it to target the dataset you are interested in. Running du on all folders takes several minutes.
repos_with_mappers = ! ls -d /datasets/*/repo/_mapper |& grep -v "No such" | cut -d'/' -f1-4 | sort | uniq
repos_with_yaml_files = ! ls -d /datasets/*/repo/repositoryCfg.yaml |& grep -v "No such" | cut -d'/' -f1-4 | sort | uniq
repos = np.unique(repos_with_mappers + repos_with_yaml_files)
repos
"""
for repo in repos:
! du -sh $repo
""";
The stackclub library provides a Taster class to explore the datasets in a given repo. As an example, let's take a look at some HSC data using the Taster. When instantiating the Taster, if you plan to use it for visualizing sky coverage, you can provide it with a path to the tracts, relative to the main repo.
# Parent repo
repo = '/datasets/hsc/repo/'
# Location of tracts for a particular rerun and depth, relative to the main repo
rerun = 'DM-13666' # DM-13666, DM-10404
depth = 'WIDE' # WIDE, DEEP, UDEEP
tract_location = 'rerun/' + rerun + '/' + depth
Execute one of the following two cells. The second will make tarquin aware of the tracts for the dataset, while the first will just look at the repo as a whole and not visualize any sky area.
tarquin = stackclub.Taster(repo, vb=True)
tarquin = stackclub.Taster(repo, vb=True, path_to_tracts=tract_location)
The taster, tarquin, carries a butler around with it:
type(tarquin.butler)
If we ask the taster to investigate a folder that is not a repo, its butler will be None:
failed = stackclub.Taster('not-a-repo', vb=True)
print(failed.butler)
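Since a non-repo gives you a None butler, you can use that as a simple guard before asking for anything else; a small sketch reusing the failed taster above:
# Only ask for a report if the folder really was a butler-friendly repo:
if failed.butler is None:
    print("Not a valid repo - nothing to taste")
else:
    failed.report()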
The taster uses its butler to query the metadata of the repo for datasets, skymaps, etc.
tarquin.look_for_datasets_of_type(['raw', 'calexp', 'deepCoadd_calexp', 'deepCoadd_mergeDet'])
PROBLEM: these last two datatypes are not listed in the repo metadata. This is one of the issues with the Gen-2 butler, and the Taster is not smart enough to search the tract folders for catalog files. This should be improved with Gen-3.
tarquin.look_for_skymap()
The what_exists method searches for everything "interesting". In the taster.py module, "interesting" currently consists of 'raw', 'calexp', 'src', 'deepCoadd_calexp', and 'deepCoadd_meas', but this method can easily be updated to include more dataset types.
tarquin.what_exists()
If you wish to check the existence of all dataset types, you can use the all parameter of the what_exists() method to do exactly that. Checking all dataset types may take a minute or so (while the Taster does a lot of database queries).
tarquin.what_exists(all=True)
A dictionary with existence information is stored in the exists attribute:
tarquin.exists
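Since exists is an ordinary Python dictionary (mapping dataset type names to True/False, as the printout above shows), you can filter it directly; for example:
# Pull out only the dataset types that the Taster found to exist:
existing_types = [name for name, found in tarquin.exists.items() if found]
print(existing_types)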
The Taster can report on the data available, counting the number of visits, sources, etc., according to what's in the repo. It uses methods like this one:
tarquin.estimate_sky_area()
and this one:
tarquin.count_things()
print(tarquin.counts)
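counts is also a plain dictionary, so you can pull out individual entries by name. The 'visits' key below is only a guess; check the printed dictionary above for the keys actually present in your repo:
# Spot-check a single count; .get() avoids a KeyError if the key name differs.
print(tarquin.counts.get('visits', 'no such key - see the printed dictionary above'))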
When the estimate_sky_area method runs, tarquin collects all the tracts associated with the repo. A list of the tracts is stored in the attribute tarquin.tracts.
tarquin.tracts
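The tract list ultimately comes from the repo's skymap, which you can also fetch directly from the butler. Here's a hedged sketch, assuming the Gen-2 deepCoadd_skyMap dataset is available (as the skymap search above indicated):
# Compare the tracts that have data to the full skymap:
skymap = tarquin.butler.get('deepCoadd_skyMap')
print(len(tarquin.tracts), 'tracts with data, out of', len(skymap), 'tracts in the skymap')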
Using the tracts, we can get a rough estimate of which parts of the sky have been targeted in the dataset. The method for doing this is tarquin.plot_sky_coverage, and it follows the example code given in Exploring_A_Data_Repo.ipynb.
tarquin.plot_sky_coverage()
To have your Taster do all the above, and just report on what it finds, do:
tarquin.report()
If you are interested in learning which fields, filters, visits, etc. have been counted by tarquin, remember that tarquin carries an instance of the Butler with it, so you can run typical Butler methods. For example, if you found it odd that there are 13 filters, you can list them like this:
tarquin.butler.queryMetadata('calexp', ['filter'])
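queryMetadata accepts a list of keys, so you can ask for several registry columns at once. A hedged example, assuming 'filter' and 'visit' are both valid keys for 'calexp' in this repo (they are standard columns in the HSC Gen-2 registry):
# One (filter, visit) tuple per matching entry in the registry:
filters_and_visits = tarquin.butler.queryMetadata('calexp', ['filter', 'visit'])
print(len(filters_and_visits), 'entries; the first few are:', filters_and_visits[:5])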
For more on the Taster's methods, do, for example:
# help(tarquin)
Let's compare the WIDE, DEEP and UDEEP parts of the HSC dataset.
repo = '/datasets/hsc/repo/'
rerun = 'DM-13666'
for depth in ['WIDE', 'DEEP', 'UDEEP']:
    tract_location = 'rerun/' + rerun + '/' + depth
    taster = stackclub.Taster(repo, path_to_tracts=tract_location)
    taster.report()
You may notice that all Metadata Characteristics beginning with "Number of" are the same for the three depths. This is because each Taster's Butler gets this information from the repo as a whole, rather than from the specific depth we specified for the tracts. There is more information on why the Butler works this way in the Exploring_A_Data_Repo.ipynb notebook.
In this notebook we took a first look at the datasets available to us in two shared directories in the LSST Science Platform filesystem, and used the stackclub.Taster class to report on their basic properties and their sky coverage. Details on the methods used by the Taster can be found in the Exploring_A_Data_Repo.ipynb notebook, or by executing the following cell:
help(tarquin)
Remaining improvements include extending the Taster so that it can handle a wider variety of datasets, and updating the Taster to use the Gen-3 butler.
The following loops over all shared datasets fail in interesting ways: some folders don't seem to be Butler-friendly. We need to do a bit more work to identify the actual repos available to us, and then use the Taster to provide a guide to all of them.
for repo in shared_repos:
    try:
        taster = stackclub.Taster(repo)
        taster.report()
    except:
        print("Taster failed to explore repo ", repo)
for repo in repos:
    try:
        taster = stackclub.Taster(repo)
        taster.report()
    except:
        print("Taster failed to explore repo ", repo)