Owners: Francois Lanusse @EiffL, Javier Sanchez @fjaviersanchez
Last Verified to Run: 2018-11-17 (by @yymao)
This notebook will illustrate the basics of accessing the DPDD-like object catalog, which contains the detected objects at the coadd level, through the Generic Catalog Reader (GCR, https://github.com/yymao/generic-catalog-reader) as well as how to select useful samples of stars/galaxies from the DM outputs.
Learning objectives:
After going through this notebook, you should be able to:
Logistics: This notebook is intended to be run through the JupyterHub NERSC interface available here: https://jupyter-dev.nersc.gov. To set up your NERSC environment, please follow the instructions available here: https://confluence.slac.stanford.edu/display/LSSTDESC/Using+Jupyter-dev+at+NERSC
Other notes:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
The GCRCatalogs package is a DESC project which aims to gather, in one convenient location, the various simulation/data catalogs made available to the collaboration.
In this section, we illustrate how to use this tool to access the object catalogs from DC2 Run1.2i.
import GCRCatalogs
# Load the object catalog
catalog = GCRCatalogs.load_catalog('dc2_object_run1.2i')
A significant number of catalogs besides the DC2 object catalogs are already available; use sorted(GCRCatalogs.get_available_catalogs(False)) to see a full list, and visit the DC2 Data Product page to see all the DC2-related catalogs.
To see the quantities available in the catalog, you can use the following:
sorted(catalog.list_all_quantities())
The meaning of these fields is documented in the SCHEMA.md file of the LSSTDESC/gcr-catalogs
repository.
As explained in that link, the values exposed here are not the native quantities produced by the Data Management stack, but instead this schema strives to follow the standard nomenclature of the LSST Data Products Definition Document DPDD.
The DPDD is an effort made by the LSST project to standardize the format of the official Data Release Products (DRP). While the native outputs of the DM stack are susceptible to change, the DPDD will be more stable. An early adoption of these conventions by the DESC will save time and energy down the road.
This being said, not all use cases and relevant quantities are covered by these conventions yet, so the GCR preserves access to the underlying native DM stack fields, all 2046 of which can be listed using:
sorted(catalog.list_all_native_quantities())
We can see that the catalog includes:
- blendedness: a measure of how much an object's flux is affected by its neighbors, defined as (1 - flux.child/flux.parent) (see Section 4.9.11 of Bosch et al. 2018)
- extendedness: classifies sources as extended or PSF-like

While Run 1.2i is still of manageable size, the full DC2 dataset will be much larger, and accessing all of it at once can be challenging. To access the data efficiently, it is important to understand how it is physically stored and how to read it one piece at a time.
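The blendedness definition quoted above can be sketched numerically; the flux values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical deblended (child) and pre-deblending (parent) fluxes
flux_child = np.array([100.0, 80.0, 50.0])
flux_parent = np.array([100.0, 100.0, 200.0])

# blendedness = 1 - flux.child / flux.parent (Bosch et al. 2018, Sec. 4.9.11):
# 0 for an isolated object, approaching 1 when neighbors dominate the flux
blendedness = 1.0 - flux_child / flux_parent
print(blendedness)  # [0.   0.2  0.75]
```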
The coadds produced by the DM stack are structured in terms of large tracts and smaller patches, illustrated here for DC2:
You can learn more about how to make such a plot of the tracts and patches here.
The GCR object catalog preserves this structure of the data, so that any particular quantity can be accessed on a tract/patch basis. The tracts available in the catalog can be listed using the following command:
# Query all available tracts and patches, only displays the first 5
catalog.available_tracts_and_patches[:5]
The DM stack includes functionality to get the tract and patch numbers corresponding to a given position (RA, Dec); however, that is out of the scope of this tutorial.
The GCR provides a native_filters mechanism, which you can use to speed up data access if you only need certain chunks of the dataset. For the object catalog, the chunks are broken up by tract and patch, and hence those are the native_filters you can use:
# Retrieve the ra,dec coordinates of all sources within tract number 4430
data = catalog.get_quantities(['ra', 'dec'], native_filters=['tract == 4430'])
# Plot a 2d histogram of sources
plt.figure(figsize=(10,7))
plt.hist2d(data['ra'], data['dec'],100); plt.gca().set_aspect('equal'); plt.colorbar(label='Number of objects')
plt.xlabel('RA [deg]');
plt.ylabel('dec [deg]');
native_filters can be used to filter a catalog by specific quantities related to its underlying data format (use catalog._native_filter_quantities to see them). More information is available here.
The data returned by the GCR is structured as a native Python dictionary:
data
But it can also easily be converted into a Pandas DataFrame, if you are so inclined ;-)
import pandas
pdata = pandas.DataFrame(data)
pdata
As a simple test, you can show the advantage of loading one tract at a time compared to the entire catalog:
%time data = catalog.get_quantities(['ra', 'dec'], native_filters=['tract == 4431'])
%time data = catalog.get_quantities(['ra', 'dec']) # This cell takes a while to execute, so if you are in a hurry you can skip it
To make accessing chunks of data convenient, catalog.get_quantities also provides the option to return an iterator, which allows you to read and work on only one piece of data at a time while looping through the entire catalog. Remember that catalogs can be very large and might not fit in memory if you try to load them in full. Since the DC2 object catalog follows the DM structure of the data, the catalog iterates over tract and patch when return_iterator is used.
You can find more general information about iterators here.
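As a reminder of how Python iterators enable this chunk-by-chunk pattern, here is a minimal, self-contained sketch, with made-up chunks standing in for the tract/patch pieces of a real catalog:

```python
def read_in_chunks():
    # A generator: each yield hands one chunk to the caller,
    # so only one chunk needs to exist in memory at a time
    for chunk in ([1, 2], [3, 4], [5, 6]):
        yield chunk

total = 0
for data_chunk in read_in_chunks():
    # Process one small piece at a time, as with return_iterator=True
    total += sum(data_chunk)
print(total)  # 21
```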
# Loop through all the patches of a given tract using an iterator
for d in catalog.get_quantities(['ra', 'dec'],
native_filters=['tract == 4850'],
return_iterator=True):
# Here we only handle a small amount of data at a time
plt.scatter(d['ra'], d['dec'], s=2);
plt.xlabel('RA');
plt.ylabel('Dec');
plt.title('Tract 4850');
To avoid returning unnecessary data, the GCR can also filter out entries as it reads the files. Note that this is different from the native_filters discussed above, which avoid reading part of the data altogether. Defining these filters requires the GCRQuery module of the GCR package; they can then be applied during the call to get_quantities:
from GCR import GCRQuery
# Simple cut to remove unreliable detections
# More cuts can be added, as a logical AND, by appending GCRQuery objects to this list
simple_cuts = [
GCRQuery('clean'), # The source has no flagged pixels (interpolated, saturated, edge, clipped...)
# and was not skipped by the deblender
]
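Conceptually, a list of cuts like this acts as the logical AND of boolean masks over the returned columns. A minimal numpy sketch of that selection logic, on made-up columns (not the real GCRQuery machinery):

```python
import numpy as np

# Made-up columns mimicking catalog quantities
clean = np.array([True, True, False, True])
extendedness = np.array([0, 1, 0, 0])

# Each cut is a boolean mask; the final selection is their logical AND
cuts = [clean, extendedness == 0]
mask = np.logical_and.reduce(cuts)
print(mask.sum())  # 2 objects survive all cuts
```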
# Loads the data after cut
data_cut = catalog.get_quantities(['ra', 'dec'],
filters = simple_cuts,
native_filters=['tract == 4849'])
# Loads data without cuts
data_full = catalog.get_quantities(['ra', 'dec'],
native_filters=['tract == 4849'])
print(len(data_cut['ra']),len(data_full['ra']))
# Plot a 2d histogram of sources
plt.figure(figsize=(15,7))
plt.subplot(121)
plt.hist2d(data_full['ra'], data_full['dec'],256); plt.gca().set_aspect('equal');
plt.xlabel('RA [deg]');
plt.ylabel('dec [deg]');
plt.title('Full sample')
plt.colorbar(label='Number of objects')
plt.subplot(122)
plt.hist2d(data_cut['ra'], data_cut['dec'],256); plt.gca().set_aspect('equal');
plt.xlabel('RA [deg]');
plt.ylabel('dec [deg]');
plt.title('Clean objects');
plt.colorbar(label='Number of objects');
Admittedly, this plot is a little underwhelming: these quality cuts remove only a very small number of objects. This is due in part to the fact that Run 1.2i was processed from imSim outputs with essentially perfect Instrument Signature Removal (ISR). In addition, the Run 1.2i coadds include only a limited number of exposures, but defects will add up as the coadds get deeper.
To get a sense of the impact of these quality flags on real data, we can use the GCR to load a tract of the HSC PDR1 data (Aihara et al. (2018)), which is made available on cori. Note that this HSC catalog follows the same schema as Run 1.2i. This tract is part of the XMM subfield of HSC (find out more about the HSC data release here and here).
Let's load this catalog, and apply the same cuts:
cat_hsc = GCRCatalogs.load_catalog('hsc-pdr1-xmm')
# Loads the data after cut
data_cut = cat_hsc.get_quantities(['ra', 'dec'],
filters = simple_cuts)
# Loads data without cuts
data_full = cat_hsc.get_quantities(['ra', 'dec'])
plt.figure(figsize=(15,7))
plt.subplot(121)
plt.hist2d(data_full['ra'], data_full['dec'],256); plt.gca().set_aspect('equal');
plt.xlabel('RA [deg]');
plt.ylabel('dec [deg]');
plt.colorbar(label='Number of objects')
plt.title('Full sample')
plt.subplot(122)
plt.hist2d(data_cut['ra'], data_cut['dec'],256); plt.gca().set_aspect('equal');
plt.xlabel('RA [deg]');
plt.ylabel('dec [deg]');
plt.colorbar(label='Number of objects')
plt.title('Clean objects');
This is a more dramatic plot, and it illustrates the importance of selecting clean samples of objects.
For now, we have extendedness == base_ClassificationExtendedness_value as a tool for star/galaxy classification. An object is considered extended if the difference between its PSF magnitude and its CModel magnitude is beyond a certain threshold (0.0164). To learn more about this, see Bosch et al. 2018, Section 4.9.10.
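The classification rule described above can be sketched with made-up magnitudes (only the 0.0164 threshold is taken from Bosch et al. 2018; the data values are invented for illustration):

```python
import numpy as np

# Hypothetical PSF and CModel magnitudes for three objects
mag_psf = np.array([20.00, 21.50, 22.00])
mag_cmodel = np.array([19.99, 21.00, 21.90])

# An object is flagged extended (extendedness == 1) when its PSF magnitude
# exceeds its CModel magnitude by more than the threshold
threshold = 0.0164
extendedness = (mag_psf - mag_cmodel > threshold).astype(int)
print(extendedness)  # [0 1 1]
```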
star_cuts = [
GCRQuery('clean'), # The source has no flagged pixels (interpolated, saturated, edge, clipped...)
# and was not skipped by the deblender
GCRQuery('extendedness==0'),
GCRQuery((np.isfinite, 'mag_g_cModel')),
GCRQuery((np.isfinite, 'mag_r_cModel')),
GCRQuery((np.isfinite, 'mag_i_cModel')),
]
quantities = ['mag_g_cModel', 'mag_r_cModel', 'mag_i_cModel']
d = catalog.get_quantities(quantities,
filters=star_cuts,
native_filters=['tract == 4849'])
Note: The cell above will output some runtime warnings related to objects that have negative or zero measured fluxes but we can safely ignore the warning since those objects will not appear in our 2D histogram.
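These warnings arise because converting a non-positive flux to a magnitude takes the logarithm of a non-positive number, yielding NaN or inf; the np.isfinite cuts above drop exactly those objects. A self-contained sketch with made-up fluxes and an arbitrary zero point:

```python
import numpy as np

# Made-up fluxes: the last two are non-positive (failed measurements)
flux = np.array([1000.0, 50.0, -3.0, 0.0])

with np.errstate(invalid='ignore', divide='ignore'):
    # Simplified flux-to-magnitude conversion (arbitrary zero point of 25)
    mag = 25.0 - 2.5 * np.log10(flux)

# np.isfinite is False for both NaN and inf, so both bad cases are removed
print(np.isfinite(mag))  # [ True  True False False]
```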
So now we have selected what we think are stars. Let's take a look at the colors of these objects:
plt.hexbin(d['mag_g_cModel']-d['mag_r_cModel'],
d['mag_r_cModel']-d['mag_i_cModel'],
bins='log', extent=[-1,2,-1,2]);
plt.xlabel('$g-r$')
plt.ylabel('$r-i$')
plt.colorbar(label='log(Number of objects)')
Q: What else can you do to improve the star selection?
If you want to see more go to Part II