Notebook

Data Access with the Gen-2 and Gen-3 Butlers¶

Owner: Daniel Perrefort (@djperrefort)
Updated for DC2 by: Douglas Tucker (@douglasleetucker) following in part work for DESC by Yao-Yuan Mao (@yymao) and Johann Cohen-Tanugi (@johannct)
Last Verified to Run: 2021-03-09
Verified Stack Release: w_2021_18

Core Concepts¶

This notebook provides a hands-on overview of how to interact with either the Gen-2 or the Gen-3 Butler. (A more detailed Gen-3-focused version of this tutorial can be found in DC2Gen3ButlerTutorial.ipynb.) The Butler provides a way to access information using a uniform interface without needing to keep track of how the information is internally stored or organized. Data access with Butler has three levels you need to be aware of:

Each instantiated Butler object provides access to a collection of datasets called a repository. Each repository is defined by Butler using the local file directory where the data is stored.
Each data set in a repository is assigned a unique name called a type. These types are strings that describe the data set and should not be confused with an "object type" as defined by Python.
Individual entries in a data set are identified using a unique data identifier, which is a dictionary who's allowed keys and values depend on the data set you are working with.

Learning Objectives:¶

This notebook demonstrates how to use the Gen-2 Butler object from the DM stack to access and manipulate data. After finishing this notebook, users will know how to:

Load and access a data repository using Butler
Select subsets of data and convert data into familiar data structures
Use Butler to access coordinate information and cutout postage stamps
Use Butler to access a skymap

In [ ]:

import matplotlib.pyplot as plt
import numpy as np
import matplotlib.pyplot as plt

import lsst.afw.display as afwDisplay
from lsst.daf.persistence import Butler # gen2 butler
import lsst.daf.butler as dafButler     # gen3 butler
import lsst.geom
from lsst.geom import SpherePoint, Angle

%matplotlib inline

Loading Data¶

To start we instantiate a Butler object by providing it with the directory of the repository we want to access. Next, we load a type of dataset and select data from a single data identifier. For this demonstration we consider the deepCoadd_ref dataset, which contains tables of information concerning coadded images used in the differencing image pipeline. The id values for this data set include two required values: tract and patch which denote sections of the sky.

We have a choice of two data sets: a data set from the HyperSuprimeCam (HSC), and the DESC DC2 Run2.2i WFD data set (DC2). In either case, we choose an 'arbitrary' tract and patch (Want to figure out how we found this tract and patch? Check out the notebooks Exploring_An_HSC_Data_Repo.ipynb and Exploring_A_DC2_Data_Repo.ipynb.) For later, we also choose a filter.

In [ ]:

#dataset='HSC'
dataset='DC2'
#genvers='gen2'
genvers='gen3'

# Temporary "fix" so one does not need to restart kernel 
#  when switching from DC2 to HSC...
# See also:  https://lsstc.slack.com/archives/C3UCAEW3D/p1584386779038000
#import lsst.afw.image as afwImage
#print(afwImage.Filter.getNames())
#afwImage.Filter.reset()
import lsst.obs.base as obsBase
obsBase.FilterDefinitionCollection.reset()
#print(afwImage.Filter.getNames())

if dataset == 'HSC':
    # Access HSC RC calexp gen2 repository
    repo = '/project/shared/data/DATA_ci_hsc/rerun/coaddForcedPhot'
    tract = 0
    patch = '1,1'
    filter = 'HSC-I'
    
    # Open the buter for this gen2 repo...
    butler = Butler(repo)

elif dataset == 'DC2':

    if genvers == 'gen2':

        # Access DC2 calexp gen2 repository
        repo = '/datasets/DC2/DR6/Run2.2i/patched/2021-02-10/rerun/run2.2i-coadd-wfd-dr6-v1'

        #tract = 4851
        #patch = '1,4'
        # This DC2 tract and patch should contain a cluster at z=0.66 M=1.5e15:
        tract = 4024
        patch = '3,4'
        filter = 'i'

        # Open the butler for this gen2 repo...
        butler = Butler(repo)

    elif genvers == 'gen3':

        # Access DC2 gen3 repository
        repo='/repo/dc2'
        collection='2.2i/runs/DP0.1'
        
        #tract = 4851
        #patch = 29   #gen3 analog to gen2 patch id '1,4'
        # This DC2 tract and patch should contain a cluster at z=0.66 M=1.5e15:
        tract = 4024
        patch = 31    #gen3 analog to gen2 patch id '3,4'
        filter = 'i'

        # Open the butler for this gen3 repo...
        butler = dafButler.Butler(repo,collections=collection)

    else:
        msg = "Unrecognized gen version: %s"%genvers
        raise Exception(msg)

        
else:
    msg = "Unrecognized dataset: %s"%dataset
    raise Exception(msg)    

In [ ]:

# Define the data id and data set type:
data_id = {'tract': tract, 'patch': patch}
dataset_type = 'deepCoadd_ref'

# We can check that the data exists before we try to read it.
# This works for both gen2 and gen3 bulters...
data_exists = butler.datasetExists(dataset_type, dataId=data_id)
print('Data exists for ID:', data_exists)

data_entry = butler.get(dataset_type, dataId=data_id)
data_entry

The data table returned above is formatted as a SourceCatalog object, which is essentially a collection of numpy arrays. We can see this when we index a particular column.

In [ ]:

print(type(data_entry['merge_measurement_i']))

SourceCatalog objects have their own set of methods for table manipulations (sorting, appending rows, etc.). However, we can also work with the data in a more familiar format, such as an astropy Table or a pandas DataFrame.

In [ ]:

data_frame = data_entry.asAstropy().to_pandas()
data_frame.head()

It is important to note that Butler objects do not always return tabular data. We will see an example of this later when we load and parse image data.

Selecting Subsets of Data¶

In practice, you may not know the format of the data identifier for a given data set. For the Gen-2 butler, the getKeys() method can be used to determine the key values expected in a data identifier:

In [ ]:

if genvers == 'gen2':
    data_id_format = butler.getKeys(dataset_type)
    print('Expected data id format:', data_id_format)

For the Gen-3 butler, however, you will want to use the following:

In [ ]:

if genvers == 'gen3':
    data_id_format = butler.registry.getDatasetType(dataset_type).dimensions.required.names
    # You can also use:
    # data_id_format = butler.registry.getDatasetType(dataset_type).dimensions.names
    print('Expected data id format:', data_id_format)

It is important to note that if you specify a key that is not part of the data type, the Butler will silently ignore it. This can be misleading. For example, in the previous example we read in a table that has a column of booleans named merge_footprint_i. If you specify merge_footprint_i: True in the dataID and rerun the example, Butler will ignore the extra key silently. As a result, you might expect the returned table to only include values where merge_footprint_i is True, but that isn't what happens.

Here is an example of the correct way to select data from the returned table:

In [ ]:

# An example of what not to do!!
#
# new_data_id = {'tract': tract, 'patch': patch, 'merge_footprint_i': True}
# merged_i_data = butler.get(dataset_type, dataId=new_data_id)
# assert merged_i_data['merge_measurement_i'].all()

# Do this instead...
new_data_id = {'tract': tract, 'patch': patch}
merged_i_data = butler.get(dataset_type, dataId=new_data_id)
merged_i_data = merged_i_data[merged_i_data['merge_measurement_i']].copy(True)

# Check that the returned table does in fact have only entries where
# merge_footprint_i is True
print(merged_i_data['merge_measurement_i'].all())

Important: Note the use of copy above which is required to create an array that is contiguous in memory (yay!)

You can also select all complete dataIds for a dataset type that match a partial (or empty) dataId. For example, for a Gen-2 butler, the below cell iterates over all possible ids and checks if the corresponding file exists:

In [ ]:

if genvers == 'gen2':
    subset = butler.subset(dataset_type, dataId=data_id)
    id_list = [dr.dataId for dr in subset if dr.datasetExists()]
    print(f'Available Ids:\n {id_list}')

Or, for a Gen3 butler, one can use the queryDatasets command on the butler's registry:

In [ ]:

if genvers == 'gen3':
    registry = butler.registry
    datasetRefs = registry.queryDatasets(datasetType=dataset_type,dataId=data_id, collections=collection)
    print(f'Available Ids:\n')
    for i,ref in enumerate(datasetRefs):
        print(ref.dataId.full)
        try: uri = butler.getURI(ref)
        except: print("File not found...")

Creating Postage Stamps¶

When dealing with image data, we can use Butler to generate postage stamps at a given set of coordinates. For this example, we consider the deepCoadd data set, which has one extra key value than the previous example.

In [ ]:

coadd_type = 'deepCoadd'

if genvers == 'gen2':
    butler.getKeys(coadd_type)
elif genvers == 'gen3':
    print(butler.registry.getDatasetType(coadd_type).dimensions.required.names)

In order to generate a postage stamp, we need to define the center and size of the cutout. First, we pick an RA and Dec from our previous example.

(We note that the DESC version of this notebook has a very nice query to pick some large galaxies -- see https://github.com/LSSTDESC/StackClub/blob/master/Basics/ButlerTutorial.ipynb -- but this capability does not currently exist for the NCSA RSP, as it requires installation of the easyquery package. -- Douglas Tucker, 5 March 2021)

In [ ]:

if dataset == 'HSC':

    nice_galaxy_indices = np.where((merged_i_data['base_ClassificationExtendedness_value'] == 1) & 
                                   (merged_i_data['modelfit_CModel_instFlux'] > 5000) &
                                   (merged_i_data['modelfit_CModel_instFlux'] / merged_i_data['modelfit_CModel_instFluxErr'] > 10) &
                                   (merged_i_data['base_PixelFlags_flag'] == 0) &
                                   (merged_i_data['merge_footprint_r']) & 
                                   (merged_i_data['detect_isPatchInner'])
                                  )[0]

    # Another possible option for HSC is to find indices of all targets with a flux between 100 and 500:
    # mask = np.where((merged_i_data['base_PsfFlux_flux'] > 100) & (merged_i_data['base_PsfFlux_flux'] < 500))

    
elif dataset == 'DC2':

    nice_galaxy_indices = np.where((merged_i_data['base_ClassificationExtendedness_value'] == 1) & 
                                   (merged_i_data['modelfit_CModel_instFlux'] > 5000) &
                                   (merged_i_data['modelfit_CModel_instFlux'] / merged_i_data['modelfit_CModel_instFluxErr'] > 10) &
                                   (merged_i_data['base_PixelFlags_flag'] == 0) &
                                   (merged_i_data['merge_footprint_g']) & 
                                   (merged_i_data['merge_footprint_r']) & 
                                   (merged_i_data['detect_isPatchInner'])
                                  )[0]

else:
    
    nice_galaxy_indices = np.full((2),1000)
    

print(nice_galaxy_indices)

In [ ]:

# Pick an RA and Dec
i = nice_galaxy_indices[1]
ra =  np.degrees(merged_i_data['coord_ra'][i])
dec = np.degrees(merged_i_data['coord_dec'][i])
print(ra, dec)

Next, we retrieve the image using the Butler.

In [ ]:

# Retrieve the image using butler
if genvers == 'gen2':
    coadd_id = {'tract': tract, 'patch': patch, 'filter': filter}
    image = butler.get(coadd_type, dataId=coadd_id)
elif genvers == 'gen3':
    coadd_id = {'tract': tract, 'patch': patch, 'band': filter}
    image = butler.get(coadd_type, dataId=coadd_id)

Since the postage stamp was generated using Butler, it is represented as an afwImage object. This is a data type from the DM stack that is used to represent images. Since it is a DM object, we choose to plot it using the DM afwDisplay module.

In [ ]:

# Let's take a look at the full image first
fig = plt.figure(figsize=(10,10))
display = afwDisplay.Display(frame=1, backend='matplotlib')
display.scale("linear", "zscale")
display.mtv(image.getMaskedImage().getImage())

Next we plot our cutout from the full image.

In [ ]:

# Define the center and size of our cutout
radec = SpherePoint(ra, dec, lsst.geom.degrees)
cutout_size = 300
cutout_extent = lsst.geom.ExtentI(cutout_size, cutout_size)

# Cutout and optionally save the postage stamp to file
postage_stamp = image.getCutout(radec, cutout_extent)
# postage_stamp.writeFits(<output_filename>)

# Convert RA,DEC on the sky to X,Y in the image
xy = postage_stamp.getWcs().skyToPixel(radec)

# Display image
display = afwDisplay.Display(frame=1, backend='matplotlib')
display.mtv(postage_stamp)
display.scale("linear", "zscale")
display.dot('o', xy.getX(), xy.getY(), ctype='red')
display.show_colorbar()

plt.xlabel('x')
plt.ylabel('y')
plt.show()

Note that the cutout image is aware of the masks and pixel values of the original image. This is why the axis labels in the above cutout are so large. We also note that the orientation of the postage stamp is in the x, y orientation of the original coadded image.

Creating an RGB picture of a coadd patch¶¶

NOTE: This section will not work for the HSC data set, since the HSC data set only includes 2 filter bands ('HSC-I' and 'HSC-R'). This section, however, does work for the DC2 data set. If you are not working with the DC data set, you may wish to skip down to the "Selecting an Area on the Sky with a Sky Map" section below.

A nice and simple interface is also available to create pretty pictures of patch areas (stolen from D. Boutigny). We are using the same patch as above, and define our three colors as bands r,i and g. Then we ask the deepCoadd type from the butler, which corresponds to coadded images. We finally make use of the afw.display interface to build the RGB image.

In [ ]:

if dataset == 'DC2':

    import lsst.afw.display.rgb as rgb

    dataId = {'tract':tract, 'patch':patch}

    bandpass_color_map = {'green':'r', 'red':'i', 'blue':'g'}

    exposures = {}
    for bandpass in bandpass_color_map.values():
        if genvers == 'gen2':
            dataId['filter'] = bandpass
        elif genvers == 'gen3':
            dataId['band'] = bandpass
        exposures[bandpass] = butler.get(coadd_type, dataId=dataId)

    fig = plt.figure(figsize=(10,10))
    rgb_im = rgb.makeRGB(*(exposures[bandpass_color_map[color]].getMaskedImage().getImage()
                           for color in ('red', 'green', 'blue')), Q=8, dataRange=1.0, 
                           xSize=None, ySize=None)

    rgb.displayRGB(rgb_im)
    
else:
    
    print('Sorry, it looks as though this section only works for the DC2 data set.')

In the RGB map the cluster appears very red!

Now we can also create RGB cutout images!

In [ ]:

if dataset == 'DC2':

    rgb_im = rgb.makeRGB(*(exposures[bandpass_color_map[color]].getCutout(radec, cutout_extent).getMaskedImage().getImage()
                           for color in ('red', 'green', 'blue')), Q=8, dataRange=1.0, 
                           xSize=None, ySize=None)

    rgb.displayRGB(rgb_im)

else:
    
    print('Sorry, it looks as though this section only works for the DC2 data set.')

Selecting an Area on the Sky with a Sky Map¶

As a final example, we consider a third type of data that can be accessed via Butler called a skyMap. Sky maps allow you to look up information for a given tract and patch. You may notice from the below example that data set types tend to follow the naming convention of having a base name (e.g. 'deepCoadd') followed by a descriptor (e.g. '_skyMap').

In [ ]:

if genvers == 'gen2':
    skymap = butler.get('deepCoadd_skyMap')
elif genvers == 'gen3':
    skymap = butler.get("skyMap", skymap='DC2')
tract_info = skymap[tract]
tract_info

In [ ]:

patch_info = tract_info.getPatchInfo((1,1))
patch_info

In [ ]:

tract_bbox = tract_info.getBBox()
tract_pix_corners = lsst.geom.Box2D(tract_bbox).getCorners()
print('Tract corners in pixels:\n', tract_pix_corners)

wcs = tract_info.getWcs()
tract_deg_corners = wcs.pixelToSky(tract_pix_corners)
tract_deg_corners = [[c.getRa().asDegrees(), c.getDec().asDegrees()] for c in tract_deg_corners]
print('\nTract corners in degrees:\n', tract_deg_corners)

In [ ]:

#You can also go in reverse to find the tract, patch that contains a coordinate (320.8,-0.4)
coordList = [SpherePoint(Angle(np.radians(320.8)),Angle(np.radians(-0.4)))]
tractInfo = skymap.findClosestTractPatchList(coordList)
print(tractInfo[0][0])
print(tractInfo[0][1])

In [ ]: