Owner: Joanne Bogart @jrbogart
Last Verified to Run:
This notebook is an introduction to use of the PostgreSQL database at NERSC. Currently the only fully supported datasets are the object catalogs for Run1.2i and Run1.2p v4. Since object catalogs as well as other kinds of catalogs are also available via GCR one might question the need for another form of access. The justification (for those applications only using object catalogs) is performance. Typical queries such as the one labeled q5
below run significantly faster. Ingest also tends to be faster. The Run1.2p v4 data was available via PostgreSQL within a day of the completion of stack processing and transfer to NERSC.
Learning objectives:
After going through this notebook, you should be able to:
Logistics: This notebook is intended to be run through the JupyterHub NERSC interface available here: https://jupyter-dev.nersc.gov. To setup your NERSC environment, please follow the instructions available here: https://confluence.slac.stanford.edu/display/LSSTDESC/Using+Jupyter-dev+at+NERSC
nerscdb03.nersc.gov:54432:desc_dc2_drp:desc_dc2_drp_user:
password
This line allows you to use the desc_dc2_drp_user account, which has SELECT privileges on the database, without entering a password in plain text. There is a separate account for adding to or modifying the database. .pgpass must be protected so that only owner may read and write it.
You can obtain the file by running the script /global/common/software/lsst/dbaccess/postgres_reader.sh
. It will copy a suitable file to your home directory and set permissions.
If you already have a .pgpass
file in your home directory the script will stop without doing anything to avoid clobbering your file. In that case, see the file reader.pgpass
in the same directory. You can merge it into your .pgpass
file by hand.
This notebook uses psycopg2 directly for queries. It is also possible to use sqlalchemy but you will still need a PostgreSQL driver. Of these psycopg2 is the most popular.
import psycopg2
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
Make the db connection
dbname = 'desc_dc2_drp'
dbuser = 'desc_dc2_drp_user'
dbhost = 'nerscdb03.nersc.gov'
dbconfig = {'dbname' : dbname, 'user' : dbuser, 'host' : dbhost}
dbconn = psycopg2.connect(**dbconfig)
schema = 'run12i' # change value to 'run12p_v4' for the Run 1.2p catalog
Tables for the Run1.2i data as well as a view to make dpdd quantities more easily accessible are in the schema
(acts like a namespace) run12i. The value for schema
above will change for other datasets.
There is a special system schema, information_schema, which contains tables describing the structure of user tables. Of these information_schema.columns is most likely to be useful. The following lists all tables and views belonging to schema run12i. (I will use the convention of writing SQL keywords in all caps in queries. It's not necessary; the SQL interpreter ignores case.)
q1 = "SELECT DISTINCT table_name FROM information_schema.columns WHERE table_schema='{schema}' ORDER BY table_name".format(**locals())
with dbconn.cursor() as cursor:
# Could have several queries interspersed with other code in this block
cursor.execute(q1)
for record in cursor:
print(record[0])
_temp:forced_patch is an artifact of the ingest process which is of no interest here.
All the remaining entries with the exception of dpdd are tables. Each table has rows of data, one per object in the catalog. The rows consist of "native quantities" as produced by running the dm stack on the simulated data. dpdd is a view* making the quantities identified in GCRCatalogs/SCHEMA.md available. Information is broken across several tables because there are too many columns for a single table. All tables have a field object_id
. In the dpdd view it's called objectId
because that's the official name for it. The following code will list all quantities in the dpdd view. Note the database ignores case; all quantities appear in lower case only.
*A view definition consists of references to quantities stored in the tables or computable from them; the view has no data of its own. The view name is used in queries just like a table name.
tbl = 'dpdd'
q2 = "SELECT column_name, data_type FROM information_schema.columns WHERE table_schema='{schema}' AND table_name='{tbl}' order by column_name ".format(**locals())
print(q2)
with dbconn.cursor() as cursor:
cursor.execute(q2)
records = cursor.fetchall()
print("There are {} columns in table {}. They are:\n".format(len(records), tbl))
print("Name Data Type")
for record in records:
print("{0!s:55} {1!s:20}".format(record[0], record[1]) )
Here is a similar query for the position table.
tbl = 'position'
q2_pos = "SELECT column_name, data_type FROM information_schema.columns WHERE table_schema='{schema}' AND table_name='{tbl}'".format(**locals())
with dbconn.cursor() as cursor:
cursor.execute(q2_pos)
records = cursor.fetchall()
print("There are {} columns in table {}. They are:\n".format(len(records), tbl))
print("Name Data Type")
for record in records:
print("{0!s:55} {1!s:20}".format(record[0], record[1]) )
Here is a query which counts up objects per tract and stores the results (queries return a list of tuples) in a pandas DataFrame. It makes use of a user-defined function (UDF*) tract_from_object_id
, which is by far the fastest way to determine the tract.
*The UDF tract_from_object_id
is one of several which minimize query time by making optimal use of the structure of the database. See the second tutorial in this series for a discussion of some of the others.
q3 = "SELECT tract_from_object_id(object_id), COUNT(object_id) FROM {schema}.position WHERE detect_isprimary GROUP BY tract_from_object_id(object_id)".format(**locals())
with dbconn.cursor() as cursor:
%time cursor.execute(q3)
df = pd.DataFrame(cursor.fetchall(), columns=['tract', 'count'])
print(df)
Here is the same query, but made on the dpdd view rather than the position table. There is no need to specify "is primary" because the dpdd view already has this constraint.
q4 = "SELECT tract_from_object_id(objectid), COUNT(objectid) FROM {schema}.dpdd GROUP BY tract_from_object_id(objectid)".format(**locals())
with dbconn.cursor() as cursor:
cursor.execute(q4)
df = pd.DataFrame(cursor.fetchall(), columns=['tract', 'count'])
print(df)
The following can be compared with a similar query in the GCR Intro notebook.
q5 = "SELECT ra,dec FROM {schema}.dpdd".format(**locals())
with dbconn.cursor() as cursor:
%time cursor.execute(q5)
%time records = cursor.fetchall()
Techniques are adapted from the notebook object_pandas_stellar_locus.
Put some typical cuts in a WHERE clause: select clean
objects (no flagged pixels; not skipped by deblender) for which signal to noise in bands of interest is acceptable.
global_cuts = 'clean '
t = None
min_SNR = 25
max_err = 1/min_SNR
band_cuts = ' (magerr_g < {}) AND (magerr_i < {}) AND (magerr_r < {}) '.format(max_err,max_err,max_err)
where = ' WHERE ' + global_cuts + ' AND ' + band_cuts
q6 = "SELECT mag_g,mag_r,mag_i FROM {schema}.dpdd ".format(**locals()) + where
print(q6)
records = []
with dbconn.cursor() as cursor:
%time cursor.execute(q6)
records = cursor.fetchall()
nObj = len(records)
df = pd.DataFrame(records, columns=['mag_g', 'mag_r', 'mag_i'])
def get_stellar_locus_davenport(color1='gmr', color2='rmi',
datafile='../tutorials/assets/Davenport_2014_MNRAS_440_3430_table1.txt'):
#datafile='assets/Davenport_2014_MNRAS_440_3430_table1.txt'):
data = pd.read_table(datafile, sep='\s+', header=1)
return data[color1], data[color2]
def plot_stellar_locus(color1='gmr', color2='rmi',
color='red', linestyle='--', linewidth=2.5):
model_gmr, model_rmi = get_stellar_locus_davenport(color1, color2)
plot_kwargs = {'linestyle': linestyle, 'linewidth': linewidth, 'color': color,
'scalex': False, 'scaley': False}
plt.plot(model_gmr, model_rmi, **plot_kwargs)
def plot_color_color(z, color1, color2, range1=(-1, +2), range2=(-1, +2), bins=31, title=None):
"""Plot a color-color diagram. Overlay stellar locus. """
band1, band2 = color1[0], color1[-1]
band3, band4 = color2[0], color2[-1]
H, xedges, yedges = np.histogram2d(
z['mag_%s' % band1] - z['mag_%s' % band2],
z['mag_%s' % band3] - z['mag_%s' % band4],
range=(range1, range2), bins=bins)
zi = H.T
xi = (xedges[1:] + xedges[:-1])/2
yi = (yedges[1:] + yedges[:-1])/2
cmap = 'viridis_r'
plt.figure(figsize=(8, 8))
plt.pcolormesh(xi, yi, zi, cmap=cmap)
plt.contour(xi, yi, zi)
plt.xlabel('%s-%s' % (band1, band2))
plt.ylabel('%s-%s' % (band3, band4))
if not title == None:
plt.suptitle(title, size='xx-large', y=0.92)
plot_stellar_locus(color1, color2)
plot_color_color(df, 'gmr', 'rmi')
print('Using schema {}, cut on max err={}, found {} objects'.format(schema, max_err, nObj))
Now make the same plot, but for Run 1.2p data. The query takes noticeably longer because the Run 1.2p catalog is about 5 times bigger than the one for Run 1.2i.
schema = 'run12p_v4'
global_cuts = 'clean '
t = None
min_SNR = 25
max_err = 1/min_SNR
band_cuts = ' (magerr_g < {}) AND (magerr_i < {}) AND (magerr_r < {}) '.format(max_err,max_err,max_err)
where = ' WHERE ' + global_cuts + ' AND ' + band_cuts
q7 = "SELECT mag_g,mag_r,mag_i FROM {schema}.dpdd ".format(**locals()) + where
print(q7)
records = []
with dbconn.cursor() as cursor:
%time cursor.execute(q7)
records = cursor.fetchall()
nObj = len(records)
df = pd.DataFrame(records, columns=['mag_g', 'mag_r', 'mag_i'])
plot_color_color(df, 'gmr', 'rmi', title=t)
print(q5)
print('Using schema {} , max err={}, found {} objects'.format(schema, max_err, nObj))