Last Verifed to Run: 2020-06-09 (by @sschmidt23)
This notebook will show you how to access the "add-on" columns that provide the photometric redshift (photo-z) information for the DC2 Object Catalog (Run 2.2i).
Learning objectives: After going through this notebook, you should be able to:
Logistics: This notebook is intended to be run through the Jupyter Lab NERSC interface available here: https://jupyter.nersc.gov/. To setup your NERSC environment, please follow the instructions available here: https://confluence.slac.stanford.edu/display/LSSTDESC/Using+Jupyter+at+NERSC
Other notes: If you restart your kernel, or if it automatically restarts for some reason, all imports and variables will become undefined so, you will have to re-run everything.
import numpy as np import matplotlib.pyplot as plt %matplotlib inline
import GCRCatalogs from GCR import GCRQuery GCRCatalogs.__version__
Loading the object catalog with photo-z add-on. The catalog name is
It takes a few seconds for the catalog instance to initiate.
cat = GCRCatalogs.load_catalog('dc2_object_run2.2i_dr3_with_photoz')
There are several photo-z related quantities available in the catalog, a summary of which can be found on this Confluence page: https://confluence.slac.stanford.edu/display/LSSTDESC/List+of+available+DC2+catalogs+created+by+PhotoZ
There are photo-z estimates in the form of both a single number "point estimate" for each galaxy, as well as a 1D redshift probability density function (PDF) representing the posterior probability of the galaxy being at a given redshift calculated on a specific redshift grid.
There are multiple single point estimates:
photoz_mode: the mode of the redshift PDF, the highest peak of the posterior probability
photoz_mean: the weighted mean of the redshift PDF.
photoz_median: the redshift where the redshift CDF is equal to 0.5.
The redshift pdf is stored in the multi-valued column
photoz-pdf. The grid of redshifts at which the posterior probability is evaluated is stored in the catalog with the special attribute of
photoz_pdf_bin_centers. You can access this attribute for catalog cat with something like
zgrid = cat.photoz_pdf_bin_centers
There are three additional columns that can be used as various quality flags:
photoz_odds(see Benitez 2000) is a measure of the integrated amount or probaility within a fixed distance around
photoz_mode. If the redshift posterior is single peaked and narrow this number will be close to 1.0, if the posterior is multi-peaked and/or broad it is likely to be smaller. Thus, high values of
photoz_oddscan be used as an indicator of photo-z quality.
photoz_mode_ml_red_chi2is the reduced chi-squared value for the maximum likelihood estimate of the best fit template at the photo-z mode. If this chi-squared value is very large, it indicates that none of the SED templates employed by the photo-z code were good fits to the observed colors, and thus the redshift may be suspect. High values may also occur for very bright galaxies where photometric errors are small and thus chi-squared values can grow large.
We will demonstrate access methods for several of these quantities in detail. You can notice that all the photo-z columns have a prefix of
Let's first make sure that these columns are indeed available.
sorted(q for q in cat.list_all_quantities() if q.startswith('photoz_'))
['photoz_mean', 'photoz_median', 'photoz_mode', 'photoz_mode_ml', 'photoz_mode_ml_red_chi2', 'photoz_odds', 'photoz_pdf']
Let's now try access the photo-z data! Everything you already about the GCR access of object catalogs will still apply.
Including the use of
native_filters is used for selecting tracts mostly). We will access the single tract 4850.
data = cat.get_quantities(['photoz_mode'], filters=['photoz_mode < 0.2', 'mag_i < 26'], native_filters=['tract==4850']) # check if the filters work print((data['photoz_mode'] < 0.2).all())
/global/common/software/lsst/common/miniconda/py3.7-126.96.36.199/envs/desc-dev/lib/python3.7/site-packages/GCRCatalogs/dc2_dm_catalog.py:43: RuntimeWarning: invalid value encountered in log10 return -2.5 * np.log10(flux) + AB_mag_zp_wrt_nanoJansky
Now, if you want to make a plot of the PDF for a galaxy or galaxies, you will need to access the
photoz_pdf column. Note that each entry stores an array, so use with care!
As an example, let's just load one tract (using the
return_iterator feature) of the full PDFs and also peak and mean values at the same time. The first few tracts have very few objects, some of which are only detected in a handful of bands, so we'll load the 14th tract:
xcat = cat.get_quantities(['photoz_pdf', 'photoz_mode', 'photoz_mean'], return_iterator=True) for i in range(14): data = next(xcat)
There are 684,634 objects in this tract, and there are 301 bins in the photo-z PDF. The photoz_pdf data is a 1D array has a shape of (684634,), but each entry stores an array of 301 values.
Now, let's plot 10 example PDFs. We will plot every 100th PDF as this shows a bit more variety in PDF shapes.
The PDFs were evaluated on a set grid of redshift values. For run2.2 this grid extends to z=3.0.
To get the array of bin center values, you can access the
photoz_pdf_bin_centers attribute, as demonstrated below.
We overplot the
photoz_mean values as well.
fig, ax = plt.subplots(5, 2, figsize=(12,16)) for pdf, z_peak, z_mean, ax_this in zip(data['photoz_pdf'][::100], data['photoz_mode'][::100], data['photoz_mean'][::100], ax.flat): l = ax_this.plot(cat.photoz_pdf_bin_centers, pdf,label='p(z)'); ax_this.axvline(z_peak, color=l.get_color(), ls=':', lw=1,label='photoz_mode'); ax_this.axvline(z_mean,color='r',ls='--',lw=1,label='photoz_mean'); ax_this.set_xlabel('$z$'); ax_this.set_ylabel('$p(z)$'); ax_this.legend(loc='upper right')
We see that
photoz_mode does, indeed, correspond to the mode/peak of the posterior.
photoz_mean lies at the weighted mean redshift. For multi-peaked posteriors this position can actually be in a location of relatively low probability between two peaks. We see a variety of PDF shapes: narrow unimodal; broad and/or tailed; and multimodal distributions. This is due to a combination of the galaxy's observed colors and magnitude uncertainties.
Now that we have learned all the access methods, let's try to work out an example!
First of all, let's define a set of reasonable cuts to give us galaxies
cuts = [ GCRQuery('extendedness > 0'), # Extended objects GCRQuery((np.isfinite, 'mag_i')), # Select objects that have i-band magnitudes GCRQuery('clean'), # The source has no flagged pixels (interpolated, saturated, edge, clipped...) # and was not skipped by the deblender GCRQuery('snr_i_cModel > 20'), # SNR > 20 GCRQuery('snr_r_cModel > 20'), GCRQuery('snr_g_cModel > 20'), GCRQuery('mag_i_cModel < 23'), # cModel imag brighter than 23 GCRQuery('mag_i_cModel > 20') # cModel imag fainter than 20 (exclude super bright objects) ]
Now let's make some plots! Let's compare the histogram of photoz_mode values to the sum of the individual PDF values, the "stacked" PDF being a common (but not statistically correct) way of estimating redshift distributions:
data = cat.get_quantities(['photoz_mode', 'mag_g_cModel', 'mag_r_cModel', 'mag_i_cModel','photoz_pdf'], filters=cuts, native_filters=['tract==4850'])
/global/common/software/lsst/common/miniconda/py3.7-188.8.131.52/envs/desc-dev/lib/python3.7/site-packages/GCRCatalogs/dc2_dm_catalog.py:43: RuntimeWarning: divide by zero encountered in log10 return -2.5 * np.log10(flux) + AB_mag_zp_wrt_nanoJansky /global/common/software/lsst/common/miniconda/py3.7-184.108.40.206/envs/desc-dev/lib/python3.7/site-packages/GCRCatalogs/dc2_dm_catalog.py:43: RuntimeWarning: invalid value encountered in log10 return -2.5 * np.log10(flux) + AB_mag_zp_wrt_nanoJansky
We can construct a rough estimate for the N(z) by summing the individual galaxy PDFs in
photoz_pdf, and compare the results of this sum to the histogram of the single point
sumpdf = np.sum(data['photoz_pdf'],axis=0)
fig = plt.figure(figsize=(12,8)) plt.hist(data['photoz_mode'], 100, label="photo-z mode"); plt.plot(cat.photoz_pdf_bin_centers, sumpdf*3.,label="summed $p(z)$",lw=2,c='r'); plt.xlabel("redshift"); plt.legend(loc='upper right');
photoz_pdf shapes roughly agree, though we see a more smooth distribution with the full posteriors. We also see a reduction of the anomalous high redshift features that appear with
photoz_mode, where the posteriors are multi-peaked: using the full PDFs properly puts some of this probability at lower redshift rather than assigning to the single high redshift value.
Let's also plot a color-color diagram and color code by the
fig = plt.figure(figsize=(10,8)) plt.scatter(data['mag_g_cModel'] - data['mag_r_cModel'], data['mag_r_cModel'] - data['mag_i_cModel'], c=data['photoz_mode'], s=4, vmin=0, vmax=2.5,cmap='inferno'); plt.xlim(-1, 3); plt.ylim(-0.5, 2); plt.xlabel('$g-r$',fontsize=18); plt.ylabel('$r-i$',fontsize=18); plt.colorbar(label='photoz_mode'); # Food for thought: Look at how the photo-z values are distributed in this color-color space. Is this behavior expected?