cosmoDC2 extragalactic catalog photometric redshifts

Owner: Sam Schmidt @sschmidt23

Last Verified to Run: 2020-06-09 (by @sschmidt23)

This notebook will show you how to access the "add-on" columns that provide the photometric redshift (photo-z) information for the extragalactic catalog (cosmoDC2_v1.1.4_image).

Learning objectives: After going through this notebook, you should be able to:

  1. Load and efficiently access a DC2 extragalactic catalog (+ photo-z) via the GCR, for both a template-based and a machine-learning-based photo-z method
  2. Understand how the photo-z data are stored / represented
  3. Look at a few examples of galaxy photo-z distributions

Logistics: This notebook is intended to be run through the Jupyter Lab NERSC interface available here: To set up your NERSC environment, please follow the instructions available here:

Other notes: If you restart your kernel, or if it automatically restarts for some reason, all imports and variables will become undefined, so you will have to re-run everything. Several photo-z catalogs were renamed starting with GCRCatalogs version 0.18.0, so check that GCRCatalogs is at least this recent when running the notebook.

This notebook is very similar to the earlier tutorial on using the object catalogs, so if you have worked through that example, much of this will be familiar.

A rendered version of this notebook is available at:
if you want to follow along but do not wish to actually run the notebook live at NERSC.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
import GCRCatalogs
from GCR import GCRQuery

Load the catalog

Loading the cosmoDC2_v1.1.4_image catalog with photo-z add-on. The catalog name is cosmoDC2_v1.1.4_image_with_photozs_v1; the cell below loads the smaller cosmoDC2_v1.1.4_small_with_photozs_v1 variant. This catalog contains photometric redshifts that were computed with the BPZ template-based code. Later in the notebook we will compare these results to photo-z's produced with a machine-learning-based code.

This catalog is a composite of three large catalogs, so it may take a minute or more for the catalog instance to initialize.

In [3]:
cat = GCRCatalogs.load_catalog('cosmoDC2_v1.1.4_small_with_photozs_v1')
CPU times: user 1.35 s, sys: 1.22 s, total: 2.57 s
Wall time: 12.1 s

Photo-z access methods

There are several photo-z related quantities available in the catalog, a summary of which can be found on this Confluence page:

There are photo-z estimates in the form of both a single number "point estimate" for each galaxy, as well as a 1D redshift probability density function (PDF) representing the posterior probability of the galaxy being at a given redshift calculated on a specific redshift grid.

There are multiple single point estimates:

  1. photoz_mode: the mode of the redshift PDF, the highest peak of the posterior probability
  2. photoz_mean: the weighted mean of the redshift PDF.
  3. photoz_median: the redshift where the redshift CDF is equal to 0.5.

The redshift PDF is stored in the multi-valued column photoz_pdf. The grid of redshifts at which the posterior probability is evaluated is stored in the catalog as the special attribute photoz_pdf_bin_centers. You can access this attribute for catalog cat with something like zgrid = cat.photoz_pdf_bin_centers
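To make the relationship between the gridded PDF and the point estimates concrete, here is a minimal sketch on synthetic data. The grid below is a hypothetical stand-in for cat.photoz_pdf_bin_centers (the real grid may differ), and the two-peaked posterior is invented for illustration; this is not the code used to build the catalog.

```python
import numpy as np

# Hypothetical stand-in for cat.photoz_pdf_bin_centers; the real grid may differ.
zgrid = np.linspace(0.0, 3.0, 301)
dz = zgrid[1] - zgrid[0]

# Toy posterior: a main peak at z = 0.5 plus a weaker secondary peak at z = 2.0.
pdf = np.exp(-0.5 * ((zgrid - 0.5) / 0.05) ** 2)
pdf += 0.2 * np.exp(-0.5 * ((zgrid - 2.0) / 0.10) ** 2)
pdf /= pdf.sum() * dz  # normalize so the PDF integrates to 1 on the grid

# Mode: grid point where the posterior is highest.
z_mode = zgrid[np.argmax(pdf)]

# Mean: PDF-weighted average of the grid redshifts.
z_mean = np.sum(zgrid * pdf) * dz

# Median: first grid point where the cumulative distribution reaches 0.5.
cdf = np.cumsum(pdf) * dz
z_median = zgrid[np.searchsorted(cdf, 0.5)]

print(z_mode, z_mean, z_median)
```

In this toy case the secondary peak pulls the mean well above the mode, while the median moves only slightly, which is one reason the three estimates can disagree for multi-peaked posteriors.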

There are additional columns that can be used as quality flags:

  1. photoz_odds (see Benitez 2000) is a measure of the integrated amount of probability within a fixed distance around photoz_mode. If the redshift posterior is single-peaked and narrow, this number will be close to 1.0; if the posterior is multi-peaked and/or broad, it is likely to be smaller. Thus, high values of photoz_odds can be used as an indicator of photo-z quality.
  2. photoz_mode_ml_red_chi2 is the reduced chi-squared value for the maximum likelihood estimate of the best fit template at the photo-z mode. If this chi-squared value is very large, it indicates that none of the SED templates employed by the photo-z code were good fits to the observed colors, and thus the redshift may be suspect. High values may also occur for very bright galaxies where photometric errors are small and thus chi-squared values can grow large.
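The behaviour of an odds-like statistic can be sketched on toy posteriors. The helper name odds_around_mode and the window half-width of 0.06 * (1 + z_mode) are assumptions made for this illustration; the window actually used to compute photoz_odds in the catalog is not stated here.

```python
import numpy as np

def odds_around_mode(zgrid, pdf, half_width=0.06):
    """Fraction of the posterior within +/- half_width * (1 + z_mode) of the mode.

    Hypothetical helper: the window definition is an illustrative assumption.
    Assumes a uniformly spaced grid, so sum-normalization is sufficient.
    """
    pdf = pdf / pdf.sum()
    z_mode = zgrid[np.argmax(pdf)]
    window = np.abs(zgrid - z_mode) <= half_width * (1.0 + z_mode)
    return pdf[window].sum()

zgrid = np.linspace(0.0, 3.0, 301)

# Single narrow peak: almost all probability lies inside the window.
narrow = np.exp(-0.5 * ((zgrid - 0.5) / 0.02) ** 2)

# Same peak plus a secondary peak half as high at z = 2.0.
bimodal = narrow + 0.5 * np.exp(-0.5 * ((zgrid - 2.0) / 0.02) ** 2)

odds_narrow = odds_around_mode(zgrid, narrow)
odds_bimodal = odds_around_mode(zgrid, bimodal)
print(odds_narrow, odds_bimodal)
```

The single-peaked posterior gives a value close to 1, while the secondary peak removes roughly a third of the probability from the window and lowers the value accordingly.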

We will demonstrate access methods for several of these quantities in detail. Note that all the photo-z columns have the prefix photoz_.

Let's first make sure that these columns are indeed available.

In [4]:
# uncomment the line below to see a list of *all* available quantities in the composite catalog
# print('\n'.join(sorted(cat.list_all_quantities(False))))
# or, just print the columns associated with photoz:
sorted(q for q in cat.list_all_quantities() if q.startswith('photoz_'))
In [5]:
data = cat.get_quantities(['photoz_mask','photoz_pdf','photoz_mean','photoz_mode','photoz_odds','photoz_mode_ml_red_chi2','mag_i_lsst','mag_i_photoz','redshift'],

Now, only a subset of the full cosmoDC2_v1.1.4_image and cosmoDC2_v1.1.4_small entries have photo-z's computed: the photo-z group used a simple model to create mock photometry with 10-year depth magnitude uncertainties, and added these uncertainties to the fluxes to compute "observed" magnitudes and errors, which are stored in the catalog as mag_{ugrizy}_photoz and magerr_{ugrizy}_photoz. Only objects with mag_i_photoz < 26.5 had photometric redshifts calculated.
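The idea of scattering fluxes and then selecting on the "observed" magnitude can be sketched as follows. The error model here (constant Gaussian flux noise set to S/N = 5 at a hypothetical depth m5) and all numerical values except the 26.5 cut are assumptions for illustration; the model actually used by the photo-z group is not described in this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mag = np.array([21.0, 23.0, 25.0, 26.0, 27.5])

# Assumption: constant flux noise, chosen so that S/N = 5 at a hypothetical depth m5.
m5 = 26.0
zero_point = 27.0  # arbitrary; it cancels out of the resulting magnitudes
flux = 10 ** (-0.4 * (true_mag - zero_point))
sigma_flux = 10 ** (-0.4 * (m5 - zero_point)) / 5.0

# Add the noise in flux space, then convert back to magnitudes.
obs_flux = flux + rng.normal(0.0, sigma_flux, size=flux.size)
detected = obs_flux > 0  # a non-positive flux has no magnitude

obs_mag = np.full(flux.size, np.nan)
obs_mag[detected] = zero_point - 2.5 * np.log10(obs_flux[detected])

# First-order magnitude error from the flux error.
obs_magerr = np.full(flux.size, np.nan)
obs_magerr[detected] = 2.5 / np.log(10) * sigma_flux / obs_flux[detected]

# Toy analogue of the selection: only objects brighter than 26.5 get a photo-z.
has_photoz = np.zeros(flux.size, dtype=bool)
has_photoz[detected] = obs_mag[detected] < 26.5
print(obs_mag, obs_magerr, has_photoz)
```

Because the cut is applied to the noisy magnitude, objects near the limit can scatter in or out of the photo-z sample.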

For efficient loading of the chunks of the catalog, the cosmoDC2_v1.1.4_image_with_photozs_v1 and cosmoDC2_v1.1.4_small_with_photozs_v1 catalogs load from multiple files with different numbers of rows. In order to properly match up the subset for which photo-z's have been computed, you will need the photoz_mask quantity, and you will have to mask the cosmoDC2_v1.1.4 quantities using this mask, which is a simple boolean flag.

As an example, redshift contains almost ten times as many entries as photoz_mode before masking. After masking, the arrays have the same length and are matched row by row. In our case we will have to mask mag_i_lsst, mag_i_photoz, and redshift.
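The masking step can be illustrated with small synthetic arrays. All values below are invented, and the toy mask is built from a magnitude cut only to mimic the structure of photoz_mask; in the real catalog the mask is simply read as a quantity.

```python
import numpy as np

rng = np.random.default_rng(42)
n_all = 10

# Full-length quantities, one entry per galaxy in the base catalog.
redshift_all = rng.uniform(0.0, 3.0, n_all)
mag_i_all = rng.uniform(22.0, 29.0, n_all)

# Toy boolean mask with the same length as the base catalog.
photoz_mask = mag_i_all < 26.5

# Photo-z columns only have entries for the masked-in galaxies,
# so they are shorter than the base-catalog columns.
photoz_mode_toy = redshift_all[photoz_mask] + rng.normal(0.0, 0.05, photoz_mask.sum())

# Masking the base-catalog column aligns it row by row with the photo-z column.
redshift_matched = redshift_all[photoz_mask]
print(len(redshift_all), len(photoz_mode_toy), len(redshift_matched))
```

Any comparison between a base-catalog quantity and a photo-z quantity should use the masked array, never the full-length one.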

In [6]:
redshift = data['redshift']
photoz_mask = data['photoz_mask']
photoz_mode = data['photoz_mode']
mag_i_photoz = data['mag_i_photoz']
In [7]:
redshift = redshift[photoz_mask]
mag_i_photoz = mag_i_photoz[photoz_mask]

Let's put things in a pandas dataframe for simplicity. Note that to load the photoz_pdf arrays, it's best to do so as a list:

In [8]:
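# collect the masked truth columns and the photo-z columns; the PDFs go in as a list of arrays
pzdict = {'specz': redshift, 'zmode': photoz_mode, 'zmean': data['photoz_mean'],
          'odds': data['photoz_odds'], 'ml_chi2': data['photoz_mode_ml_red_chi2'],
          'mag_i_photoz': mag_i_photoz, 'pdf': list(data['photoz_pdf'])}
df = pd.DataFrame(pzdict)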
In [9]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1628434 entries, 0 to 1628433
Data columns (total 7 columns):
 #   Column        Non-Null Count    Dtype  
---  ------        --------------    -----  
 0   specz         1628434 non-null  float64
 1   zmode         1628434 non-null  float32
 2   zmean         1628434 non-null  float32
 3   odds          1628434 non-null  float32
 4   ml_chi2       1628434 non-null  float32
 5   mag_i_photoz  1628434 non-null  float32
 6   pdf           1628434 non-null  object 
dtypes: float32(5), float64(1), object(1)
memory usage: 55.9+ MB
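The object dtype reported for the pdf column reflects that each row holds a whole array rather than a single number. A minimal sketch with a tiny synthetic table (the names toy, pdf_2d and stacked are illustrative only) shows the pattern and how to get back a 2D array when vectorized operations over the PDFs are needed:

```python
import numpy as np
import pandas as pd

n_gal, n_grid = 4, 5
# Toy 2D array of PDFs: one row per galaxy, one column per redshift grid point.
pdf_2d = np.arange(n_gal * n_grid, dtype=float).reshape(n_gal, n_grid)

toy = pd.DataFrame({
    "zmode": np.array([0.3, 0.8, 1.2, 2.0], dtype=np.float32),
    "pdf": list(pdf_2d),  # one 1D array per row, stored with dtype object
})

# Stack the per-row arrays back into a (n_gal, n_grid) array.
stacked = np.stack(toy["pdf"].to_numpy())
print(toy.dtypes)
print(stacked.shape)
```

Keeping the PDFs as a list makes row selection with boolean masks straightforward, at the cost of needing np.stack before doing array math across galaxies.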
In [10]:
magcut = 23.6

Let's start with a simple plot of the point estimates versus the true redshift, comparing the full mag_i_photoz<26.5 sample to a higher S/N cut of mag_i_photoz<23.6, for both photoz_mode and photoz_mean:

In [11]:
# select the higher-S/N subsample with mag_i_photoz < magcut
brightmask = df['mag_i_photoz'] < magcut
brightdf = df[brightmask]

fig = plt.figure(figsize=(18,8))
fig = plt.subplot(121)
plt.legend(loc='upper left',fontsize=15)
plt.title("photoz_mode distribution",fontsize=20)
fig = plt.subplot(122)
plt.legend(loc='upper left',fontsize=15)
plt.title("photoz_mean distribution",fontsize=20)
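The plotting calls that fill the two panels are not shown in the cell above. As a minimal sketch of one way to draw them, assuming scatter plots of each point estimate against the true redshift were intended, the example below uses synthetic data; the scatter of 0.1, the sample size, and the marker styling are all invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for the dataframe columns used in this notebook.
toy = {
    "specz": rng.uniform(0.0, 3.0, n),
    "mag_i_photoz": rng.uniform(20.0, 26.5, n),
}
toy["photoz_mode"] = toy["specz"] + rng.normal(0.0, 0.1, n)
toy["photoz_mean"] = toy["specz"] + rng.normal(0.0, 0.1, n)
bright = toy["mag_i_photoz"] < 23.6  # higher-S/N subsample

fig, axes = plt.subplots(1, 2, figsize=(18, 8))
for ax, name in zip(axes, ["photoz_mode", "photoz_mean"]):
    # Full sample in black, bright subsample overplotted in red.
    ax.scatter(toy["specz"], toy[name], s=2, c="k", label="mag_i_photoz < 26.5")
    ax.scatter(toy["specz"][bright], toy[name][bright], s=2, c="r",
               label="mag_i_photoz < 23.6")
    ax.plot([0, 3], [0, 3], "b--", lw=1)  # one-to-one line for reference
    ax.set_xlabel("redshift", fontsize=15)
    ax.set_ylabel(name, fontsize=15)
    ax.legend(loc="upper left", fontsize=15)
    ax.set_title(name + " distribution", fontsize=20)
```

With the real dataframe, the same loop would read df['specz'], df['zmode'] and df['zmean'] and use brightmask in place of the toy arrays.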