Owner(s): Phil Marshall (@drphilmarshall), Rob Morgan (@rmorgan10)
Last Verified to Run: 2019-08-13
Verified Stack Release: 18.1
In this notebook we'll take a look at some of the datasets available on the LSST Science Platform.
After working through this tutorial, you should be able to use the stackclub.Taster to:
2. Report on the available data in a given dataset;
3. Plot the patches and tracts in a given dataset on the sky.
Outstanding Issue: The Taster augments the functionality of the Gen-2 butler, which provides limited capabilities for discovering what data actually exist. Specifically, the Taster relies heavily on the queryMetadata functionality of the Gen-2 butler, which covers only a small number of dataset types and does not actually guarantee that those datasets exist. Beware of over-interpreting the true existence of datasets queried by the Taster. This should improve greatly with the Gen-3 butler.
This notebook is intended to be runnable on lsst-lsp-stable.ncsa.illinois.edu from a local git clone of https://github.com/LSSTScienceCollaborations/StackClub.
We'll need the stackclub package to be installed. If you are not developing this package, you can install it using pip, like this:
pip install git+git://github.com/LSSTScienceCollaborations/StackClub.git#egg=stackclub
If you are developing the stackclub package (e.g. by adding modules to it to support the Stack Club tutorial that you are writing), you'll need to make a local, editable installation. In the top-level folder of the StackClub repo, do:
! cd .. && python setup.py -q develop --user && cd -
You may need to restart the kernel after doing this. When editing the stackclub package files, we want the latest version to be imported when we re-run the import command. To enable this, we need the %autoreload magic command.
%load_ext autoreload
%autoreload 2
To just get a taste of the data that the Butler will deliver for a chosen dataset, we have added a Taster class to the stackclub library. All needed imports are contained in that module, so we only need to import the stackclub library to work through this notebook.
import numpy as np
%matplotlib inline
import stackclub
You can check which version of the Stack this notebook is running by using eups list -s on the terminal command line:
# What version of the Stack am I using?
! echo $HOSTNAME
! eups list -s lsst_distrib
First, let's look at what is currently available. There are several shared data folders on the LSP: the read-only /datasets folder, the project-group-writeable folder /project/shared/data, and the Stack Club shared directory /project/stack-club. Let's take a look at what's in /project/shared/data. Specifically, we want to see butler-friendly data repositories, distinguished by their containing a file called _mapper or repositoryCfg.yaml in their top level.
/project/shared/data: These datasets are designed to be small test sets, ideal for tutorials.
shared_repos_with_mappers = ! ls -d /project/shared/data/*/_mapper | grep -v README | cut -d'/' -f1-5 | sort | uniq
shared_repos_with_yaml_files = ! ls -d /project/shared/data/*/repositoryCfg.yaml | grep -v README | cut -d'/' -f1-5 | sort | uniq
shared_repos = np.unique(shared_repos_with_mappers + shared_repos_with_yaml_files)
shared_repos
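As an aside, you could do the same search in pure Python. Here's a minimal sketch using glob; the variable name python_shared_repos is ours, chosen so that it doesn't clash with the shared_repos variable defined above:
import glob
# Look for either of the two Gen-2 repo marker files in each top-level folder:
candidates = glob.glob('/project/shared/data/*/_mapper') \
           + glob.glob('/project/shared/data/*/repositoryCfg.yaml')
# Keep just the parent repo directories, dropping duplicates:
python_shared_repos = sorted({path.rsplit('/', 1)[0] for path in candidates})
python_shared_repos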
for repo in shared_repos:
    ! du -sh $repo
/datasets: These are typically much bigger: to measure the size, uncomment the second cell below and edit it to target the dataset you are interested in. Running du on all folders takes several minutes.
repos_with_mappers = ! ls -d /datasets/*/repo/_mapper |& grep -v "No such" | cut -d'/' -f1-4 | sort | uniq
repos_with_yaml_files = ! ls -d /datasets/*/repo/repositoryCfg.yaml |& grep -v "No such" | cut -d'/' -f1-4 | sort | uniq
repos = np.unique(repos_with_mappers + repos_with_yaml_files)
repos
"""
for repo in repos:
! du -sh $repo
""";
The stackclub library provides a Taster class to explore the datasets in a given repo. As an example, let's take a look at some HSC data using the Taster. When instantiating the Taster, if you plan to use it for visualizing sky coverage, you can provide it with a path to the tracts, relative to the main repo.
# Parent repo
repo = '/datasets/hsc/repo/'
# Location of tracts for a particular rerun and depth, relative to the main repo
rerun = 'DM-13666' # DM-13666, DM-10404
depth = 'WIDE' # WIDE, DEEP, UDEEP
tract_location = 'rerun/' + rerun + '/' + depth
Execute one of the following two cells. The second will make tarquin aware of the tracts for the dataset, while the first will just look at the repo as a whole and not visualize any sky area.
tarquin = stackclub.Taster(repo, vb=True)
tarquin = stackclub.Taster(repo, vb=True, path_to_tracts=tract_location)
The taster, tarquin, carries a butler around with it:
type(tarquin.butler)
If we ask the taster to investigate a folder that is not a repo, its butler will be None:
failed = stackclub.Taster('not-a-repo', vb=True)
print(failed.butler)
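Since a non-repo gives you a None butler, you can use that as a simple guard before asking for anything else; a small sketch reusing the failed taster above:
# Only ask for a report if the folder really was a butler-friendly repo:
if failed.butler is None:
    print("Not a valid repo - nothing to taste")
else:
    failed.report()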
The taster uses its butler to query the metadata of the repo for datasets, skymaps, etc.
tarquin.look_for_datasets_of_type(['raw', 'calexp', 'deepCoadd_calexp', 'deepCoadd_mergeDet'])
PROBLEM: these last two datatypes are not listed in the repo metadata. This is one of the issues with the Gen-2 butler, and the Taster is not smart enough to search the tract folders for catalog files. This should be improved with Gen-3.
tarquin.look_for_skymap()
The what_exists method searches for everything "interesting". In the taster.py module, "interesting" currently consists of 'raw', 'calexp', 'src', 'deepCoadd_calexp', and 'deepCoadd_meas', but this method can easily be updated to include more dataset types.
tarquin.what_exists()
If you wish to check the existence of all dataset types, you can use the all parameter of the what_exists() method to do exactly that. Checking all dataset types may take a minute or so (while the Taster does a lot of database queries).
tarquin.what_exists(all=True)
A dictionary with existence information is stored in the exists attribute:
tarquin.exists
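Since exists is an ordinary Python dictionary (mapping dataset type names to True/False, as the printout above shows), you can filter it directly; for example:
# Pull out only the dataset types that the Taster found to exist:
existing_types = [name for name, found in tarquin.exists.items() if found]
print(existing_types)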
The Taster can report on the data available, counting the number of visits, sources, etc., according to what's in the repo. It uses methods like this one:
tarquin.estimate_sky_area()
and this one:
tarquin.count_things()
print(tarquin.counts)
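counts is also a plain dictionary, so you can pull out individual entries by name. The 'visits' key below is only a guess; check the printed dictionary above for the keys actually present in your repo:
# Spot-check a single count; .get() avoids a KeyError if the key name differs.
print(tarquin.counts.get('visits', 'no such key - see the printed dictionary above'))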
When the estimate_sky_area method runs, tarquin collects all the tracts associated with the repo. A list of the tracts is stored in the attribute tarquin.tracts.
tarquin.tracts
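The tract list ultimately comes from the repo's skymap, which you can also fetch directly from the butler. Here's a hedged sketch, assuming the Gen-2 deepCoadd_skyMap dataset is available (as the skymap search above indicated):
# Compare the tracts that have data to the full skymap:
skymap = tarquin.butler.get('deepCoadd_skyMap')
print(len(tarquin.tracts), 'tracts with data, out of', len(skymap), 'tracts in the skymap')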
Using the tracts, we can get a rough estimate of which parts of the sky have been targeted in the dataset. The method for doing this is tarquin.plot_sky_coverage, and it follows the example code given in Exploring_A_Data_Repo.ipynb.
tarquin.plot_sky_coverage()
To have your Taster do all the above, and just report on what it finds, do:
tarquin.report()
If you are interested in learning which fields, filters, visits, etc. have been counted by tarquin, remember that tarquin carries an instance of the Butler with it, so you can run typical Butler methods. For example, if you found it odd that there are 13 filters, you can list them like this:
tarquin.butler.queryMetadata('calexp', ['filter'])
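queryMetadata accepts a list of keys, so you can ask for several registry columns at once. A hedged example, assuming 'filter' and 'visit' are both valid keys for 'calexp' in this repo (they are standard columns in the HSC Gen-2 registry):
# One (filter, visit) tuple per matching entry in the registry:
filters_and_visits = tarquin.butler.queryMetadata('calexp', ['filter', 'visit'])
print(len(filters_and_visits), 'entries; the first few are:', filters_and_visits[:5])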
For more on the Taster's methods, do, for example:
# help(tarquin)
Let's compare the WIDE, DEEP and UDEEP parts of the HSC dataset.
repo = '/datasets/hsc/repo/'
rerun = 'DM-13666'
for depth in ['WIDE', 'DEEP', 'UDEEP']:
    tract_location = 'rerun/' + rerun + '/' + depth
    taster = stackclub.Taster(repo, path_to_tracts=tract_location)
    taster.report()
You may notice that all Metadata Characteristics beginning with "Number of" are the same for the three depths. This is because each Taster's Butler gets this information from the repo as a whole, rather than from the specific depth we specified for the tracts. There is more information on why the Butler works this way in the Exploring_A_Data_Repo.ipynb notebook.
In this notebook we took a first look at the datasets available to us in two shared directories in the LSST Science Platform filesystem, and used the stackclub.Taster class to report on their basic properties and their sky coverage. Details on the methods used by the Taster can be found in the Exploring_A_Data_Repo.ipynb notebook, or by executing the following cell:
help(tarquin)
Remaining improvements include extending the Taster so that it can handle a wider variety of datasets, and updating the Taster to use the Gen-3 butler.
The following loops over all shared datasets fail in interesting ways: some folders don't seem to be Butler-friendly. We need to do a bit more work to identify the actual repos available to us, and then use the Taster to provide a guide to all of them.
for repo in shared_repos:
    try:
        taster = stackclub.Taster(repo)
        taster.report()
    except:
        print("Taster failed to explore repo ", repo)
for repo in repos:
    try:
        taster = stackclub.Taster(repo)
        taster.report()
    except:
        print("Taster failed to explore repo ", repo)