Owner: Daniel Perrefort (@djperrefort)
Updated for DC2 by: Douglas Tucker (@douglasleetucker) following in part work for DESC by Yao-Yuan Mao (@yymao) and Johann Cohen-Tanugi (@johannct)
Last Verified to Run: 2021-03-09
Verified Stack Release: w_2021_18
This notebook provides a hands-on overview of how to interact with either the Gen-2 or the Gen-3 Butler
. (A more detailed Gen-3-focused version of this tutorial can be found in DC2Gen3ButlerTutorial.ipynb
.) The Butler
provides a way to access information using a uniform interface without needing to keep track of how the information is internally stored or organized. Data access with Butler
has three levels you need to be aware of:
Butler
object provides access to a collection of datasets called a repository. Each repository is defined by Butler using the local file directory where the data is stored.This notebook demonstrates how to use the Gen-2 Butler
object from the DM stack to access and manipulate data. After finishing this notebook, users will know how to:
Butler
Butler
to access coordinate information and cutout postage stampsButler
to access a skymapimport matplotlib.pyplot as plt
import numpy as np
import matplotlib.pyplot as plt
import lsst.afw.display as afwDisplay
from lsst.daf.persistence import Butler # gen2 butler
import lsst.daf.butler as dafButler # gen3 butler
import lsst.geom
from lsst.geom import SpherePoint, Angle
%matplotlib inline
To start we instantiate a Butler
object by providing it with the directory of the repository we want to access. Next, we load a type of dataset and select data from a single data identifier. For this demonstration we consider the deepCoadd_ref
dataset, which contains tables of information concerning coadded images used in the differencing image pipeline. The id values for this data set include two required values: tract
and patch
which denote sections of the sky.
We have a choice of two data sets: a data set from the HyperSuprimeCam (HSC), and the DESC DC2 Run2.2i WFD data set (DC2). In either case, we choose an 'arbitrary' tract
and patch
(Want to figure out how we found this tract and patch? Check out the notebooks Exploring_An_HSC_Data_Repo.ipynb
and Exploring_A_DC2_Data_Repo.ipynb
.) For later, we also choose a filter.
#dataset='HSC'
dataset='DC2'
#genvers='gen2'
genvers='gen3'
# Temporary "fix" so one does not need to restart kernel
# when switching from DC2 to HSC...
# See also: https://lsstc.slack.com/archives/C3UCAEW3D/p1584386779038000
#import lsst.afw.image as afwImage
#print(afwImage.Filter.getNames())
#afwImage.Filter.reset()
import lsst.obs.base as obsBase
obsBase.FilterDefinitionCollection.reset()
#print(afwImage.Filter.getNames())
if dataset == 'HSC':
# Access HSC RC calexp gen2 repository
repo = '/project/shared/data/DATA_ci_hsc/rerun/coaddForcedPhot'
tract = 0
patch = '1,1'
filter = 'HSC-I'
# Open the buter for this gen2 repo...
butler = Butler(repo)
elif dataset == 'DC2':
if genvers == 'gen2':
# Access DC2 calexp gen2 repository
repo = '/datasets/DC2/DR6/Run2.2i/patched/2021-02-10/rerun/run2.2i-coadd-wfd-dr6-v1'
#tract = 4851
#patch = '1,4'
# This DC2 tract and patch should contain a cluster at z=0.66 M=1.5e15:
tract = 4024
patch = '3,4'
filter = 'i'
# Open the butler for this gen2 repo...
butler = Butler(repo)
elif genvers == 'gen3':
# Access DC2 gen3 repository
repo='/repo/dc2'
collection='2.2i/runs/DP0.1'
#tract = 4851
#patch = 29 #gen3 analog to gen2 patch id '1,4'
# This DC2 tract and patch should contain a cluster at z=0.66 M=1.5e15:
tract = 4024
patch = 31 #gen3 analog to gen2 patch id '3,4'
filter = 'i'
# Open the butler for this gen3 repo...
butler = dafButler.Butler(repo,collections=collection)
else:
msg = "Unrecognized gen version: %s"%genvers
raise Exception(msg)
else:
msg = "Unrecognized dataset: %s"%dataset
raise Exception(msg)
# Define the data id and data set type:
data_id = {'tract': tract, 'patch': patch}
dataset_type = 'deepCoadd_ref'
# We can check that the data exists before we try to read it.
# This works for both gen2 and gen3 bulters...
data_exists = butler.datasetExists(dataset_type, dataId=data_id)
print('Data exists for ID:', data_exists)
data_entry = butler.get(dataset_type, dataId=data_id)
data_entry
The data table returned above is formatted as a SourceCatalog
object, which is essentially a collection of numpy
arrays. We can see this when we index a particular column.
print(type(data_entry['merge_measurement_i']))
SourceCatalog
objects have their own set of methods for table manipulations (sorting, appending rows, etc.). However, we can also work with the data in a more familiar format, such as an astropy Table
or a pandas DataFrame
.
data_frame = data_entry.asAstropy().to_pandas()
data_frame.head()
It is important to note that Butler
objects do not always return tabular data. We will see an example of this later when we load and parse image data.
In practice, you may not know the format of the data identifier for a given data set. For the Gen-2
butler, the getKeys()
method can be used to determine the key values expected in a data identifier:
if genvers == 'gen2':
data_id_format = butler.getKeys(dataset_type)
print('Expected data id format:', data_id_format)
For the Gen-3
butler, however, you will want to use the following:
if genvers == 'gen3':
data_id_format = butler.registry.getDatasetType(dataset_type).dimensions.required.names
# You can also use:
# data_id_format = butler.registry.getDatasetType(dataset_type).dimensions.names
print('Expected data id format:', data_id_format)
It is important to note that if you specify a key that is not part of the data type, the Butler
will silently ignore it. This can be misleading. For example, in the previous example we read in a table that has a column of booleans named merge_footprint_i
. If you specify merge_footprint_i: True
in the dataID and rerun the example, Butler
will ignore the extra key silently. As a result, you might expect the returned table to only include values where merge_footprint_i
is True
, but that isn't what happens.
Here is an example of the correct way to select data from the returned table:
# An example of what not to do!!
#
# new_data_id = {'tract': tract, 'patch': patch, 'merge_footprint_i': True}
# merged_i_data = butler.get(dataset_type, dataId=new_data_id)
# assert merged_i_data['merge_measurement_i'].all()
# Do this instead...
new_data_id = {'tract': tract, 'patch': patch}
merged_i_data = butler.get(dataset_type, dataId=new_data_id)
merged_i_data = merged_i_data[merged_i_data['merge_measurement_i']].copy(True)
# Check that the returned table does in fact have only entries where
# merge_footprint_i is True
print(merged_i_data['merge_measurement_i'].all())
Important: Note the use of copy
above which is required to create an array that is contiguous in memory (yay!)
You can also select all complete dataIds for a dataset type that match a partial (or empty) dataId. For example, for a Gen-2
butler, the below cell iterates over all possible ids and checks if the corresponding file exists:
if genvers == 'gen2':
subset = butler.subset(dataset_type, dataId=data_id)
id_list = [dr.dataId for dr in subset if dr.datasetExists()]
print(f'Available Ids:\n {id_list}')
Or, for a Gen3
butler, one can use the queryDatasets
command on the butler's registry
:
if genvers == 'gen3':
registry = butler.registry
datasetRefs = registry.queryDatasets(datasetType=dataset_type,dataId=data_id, collections=collection)
print(f'Available Ids:\n')
for i,ref in enumerate(datasetRefs):
print(ref.dataId.full)
try: uri = butler.getURI(ref)
except: print("File not found...")
When dealing with image data, we can use Butler
to generate postage stamps at a given set of coordinates. For this example, we consider the deepCoadd
data set, which has one extra key value than the previous example.
coadd_type = 'deepCoadd'
if genvers == 'gen2':
butler.getKeys(coadd_type)
elif genvers == 'gen3':
print(butler.registry.getDatasetType(coadd_type).dimensions.required.names)
In order to generate a postage stamp, we need to define the center and size of the cutout. First, we pick an RA and Dec from our previous example.
(We note that the DESC version of this notebook has a very nice query to pick some large galaxies -- see https://github.com/LSSTDESC/StackClub/blob/master/Basics/ButlerTutorial.ipynb -- but this capability does not currently exist for the NCSA RSP, as it requires installation of the easyquery
package. -- Douglas Tucker, 5 March 2021)
if dataset == 'HSC':
nice_galaxy_indices = np.where((merged_i_data['base_ClassificationExtendedness_value'] == 1) &
(merged_i_data['modelfit_CModel_instFlux'] > 5000) &
(merged_i_data['modelfit_CModel_instFlux'] / merged_i_data['modelfit_CModel_instFluxErr'] > 10) &
(merged_i_data['base_PixelFlags_flag'] == 0) &
(merged_i_data['merge_footprint_r']) &
(merged_i_data['detect_isPatchInner'])
)[0]
# Another possible option for HSC is to find indices of all targets with a flux between 100 and 500:
# mask = np.where((merged_i_data['base_PsfFlux_flux'] > 100) & (merged_i_data['base_PsfFlux_flux'] < 500))
elif dataset == 'DC2':
nice_galaxy_indices = np.where((merged_i_data['base_ClassificationExtendedness_value'] == 1) &
(merged_i_data['modelfit_CModel_instFlux'] > 5000) &
(merged_i_data['modelfit_CModel_instFlux'] / merged_i_data['modelfit_CModel_instFluxErr'] > 10) &
(merged_i_data['base_PixelFlags_flag'] == 0) &
(merged_i_data['merge_footprint_g']) &
(merged_i_data['merge_footprint_r']) &
(merged_i_data['detect_isPatchInner'])
)[0]
else:
nice_galaxy_indices = np.full((2),1000)
print(nice_galaxy_indices)
# Pick an RA and Dec
i = nice_galaxy_indices[1]
ra = np.degrees(merged_i_data['coord_ra'][i])
dec = np.degrees(merged_i_data['coord_dec'][i])
print(ra, dec)
Next, we retrieve the image using the Butler
.
# Retrieve the image using butler
if genvers == 'gen2':
coadd_id = {'tract': tract, 'patch': patch, 'filter': filter}
image = butler.get(coadd_type, dataId=coadd_id)
elif genvers == 'gen3':
coadd_id = {'tract': tract, 'patch': patch, 'band': filter}
image = butler.get(coadd_type, dataId=coadd_id)
Since the postage stamp was generated using Butler
, it is represented as an afwImage
object. This is a data type from the DM stack that is used to represent images. Since it is a DM object, we choose to plot it using the DM afwDisplay
module.
# Let's take a look at the full image first
fig = plt.figure(figsize=(10,10))
display = afwDisplay.Display(frame=1, backend='matplotlib')
display.scale("linear", "zscale")
display.mtv(image.getMaskedImage().getImage())
Next we plot our cutout from the full image.
# Define the center and size of our cutout
radec = SpherePoint(ra, dec, lsst.geom.degrees)
cutout_size = 300
cutout_extent = lsst.geom.ExtentI(cutout_size, cutout_size)
# Cutout and optionally save the postage stamp to file
postage_stamp = image.getCutout(radec, cutout_extent)
# postage_stamp.writeFits(<output_filename>)
# Convert RA,DEC on the sky to X,Y in the image
xy = postage_stamp.getWcs().skyToPixel(radec)
# Display image
display = afwDisplay.Display(frame=1, backend='matplotlib')
display.mtv(postage_stamp)
display.scale("linear", "zscale")
display.dot('o', xy.getX(), xy.getY(), ctype='red')
display.show_colorbar()
plt.xlabel('x')
plt.ylabel('y')
plt.show()
Note that the cutout image is aware of the masks and pixel values of the original image. This is why the axis labels in the above cutout are so large. We also note that the orientation of the postage stamp is in the x, y orientation of the original coadded image.
NOTE: This section will not work for the HSC data set, since the HSC data set only includes 2 filter bands ('HSC-I' and 'HSC-R'). This section, however, does work for the DC2 data set. If you are not working with the DC data set, you may wish to skip down to the "Selecting an Area on the Sky with a Sky Map" section below.
A nice and simple interface is also available to create pretty pictures of patch areas (stolen from D. Boutigny). We are using the same patch as above, and define our three colors as bands r,i and g. Then we ask the deepCoadd type from the butler, which corresponds to coadded images. We finally make use of the afw.display
interface to build the RGB image.
if dataset == 'DC2':
import lsst.afw.display.rgb as rgb
dataId = {'tract':tract, 'patch':patch}
bandpass_color_map = {'green':'r', 'red':'i', 'blue':'g'}
exposures = {}
for bandpass in bandpass_color_map.values():
if genvers == 'gen2':
dataId['filter'] = bandpass
elif genvers == 'gen3':
dataId['band'] = bandpass
exposures[bandpass] = butler.get(coadd_type, dataId=dataId)
fig = plt.figure(figsize=(10,10))
rgb_im = rgb.makeRGB(*(exposures[bandpass_color_map[color]].getMaskedImage().getImage()
for color in ('red', 'green', 'blue')), Q=8, dataRange=1.0,
xSize=None, ySize=None)
rgb.displayRGB(rgb_im)
else:
print('Sorry, it looks as though this section only works for the DC2 data set.')
In the RGB map the cluster appears very red!
Now we can also create RGB cutout images!
if dataset == 'DC2':
rgb_im = rgb.makeRGB(*(exposures[bandpass_color_map[color]].getCutout(radec, cutout_extent).getMaskedImage().getImage()
for color in ('red', 'green', 'blue')), Q=8, dataRange=1.0,
xSize=None, ySize=None)
rgb.displayRGB(rgb_im)
else:
print('Sorry, it looks as though this section only works for the DC2 data set.')
As a final example, we consider a third type of data that can be accessed via Butler
called a skyMap
. Sky maps allow you to look up information for a given tract
and patch
. You may notice from the below example that data set types tend to follow the naming convention of having a base name (e.g. 'deepCoadd'
) followed by a descriptor (e.g. '_skyMap'
).
if genvers == 'gen2':
skymap = butler.get('deepCoadd_skyMap')
elif genvers == 'gen3':
skymap = butler.get("skyMap", skymap='DC2')
tract_info = skymap[tract]
tract_info
patch_info = tract_info.getPatchInfo((1,1))
patch_info
tract_bbox = tract_info.getBBox()
tract_pix_corners = lsst.geom.Box2D(tract_bbox).getCorners()
print('Tract corners in pixels:\n', tract_pix_corners)
wcs = tract_info.getWcs()
tract_deg_corners = wcs.pixelToSky(tract_pix_corners)
tract_deg_corners = [[c.getRa().asDegrees(), c.getDec().asDegrees()] for c in tract_deg_corners]
print('\nTract corners in degrees:\n', tract_deg_corners)
#You can also go in reverse to find the tract, patch that contains a coordinate (320.8,-0.4)
coordList = [SpherePoint(Angle(np.radians(320.8)),Angle(np.radians(-0.4)))]
tractInfo = skymap.findClosestTractPatchList(coordList)
print(tractInfo[0][0])
print(tractInfo[0][1])