Author: Julien Peloton @JulienPeloton
Last Verifed to Run: 2019-01-06
Estimated running time: < 15 min.
Context
This notebook illustrates the basics of accessing the cosmoDC2 catalog through Apache Spark as well as how to select useful samples of data to study galaxy clusters.
Science results or validation?
While it describes some of the science results you can get with the DC2 catalog (e.g. halo mass-velocity dispersion relation), it shifted towards the validation side, especially on the halo mass distribution. So it is a bit of both! After going through this notebook, you should be able to:
Table of contents:
Logistics:
This notebook is intended to be run through the JupyterHub NERSC interface with the desc-pyspark kernel. The kernel is automatically installed in your environment when you use the kernel setup script:
source /global/common/software/lsst/common/miniconda/kernels/setup.sh
Thanks to:
@aphearin, @cwwalter, @dkorytov, @rmandelb, @salmanhabib, @yymao, for their help in understanding better how DC2 catalog was made, and solving issues while making this notebook!
from typing import Iterator, Generator, Any
import numpy as np
import pandas as pd
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from IPython.display import Image
from scipy.optimize import curve_fit
from pyspark.sql.functions import pandas_udf , PandasUDFType
from pyspark.sql import functions as F
from pyspark.sql import DataFrame
Apache Spark has no efficient PySpark connector to read data in hdf5 file. Therefore we first converted the cosmoDC2 data set into parquet (similar to what DPDD tools offer). For the purpose of this notebook, we only convert a few columns of interest. The file is accessible at NERSC for DESC members.
# Path to the data
fn = "/global/cscratch1/sd/peloton/cosmodc2/xyz_v1.1.4_mass_and_mag.parquet"
# Load the data - this a lazy operation, no data movement yet!
df = spark.read.format("parquet").load(fn)
# Let's inspect the schema
df.printSchema()
# Number of objects in the catalog
print("Number of rows: {}".format(df.count()))
Let's have a look a some mass values. Apache Spark provides filter mechanisms, which you can use to speed up data access if you only need a certain chunks of the dataset:
# We reject synthetic halos - see for example discussion in
# https://github.com/LSSTDESC/cosmodc2/issues/82
cols = ["halo_mass", "stellar_mass", "blackHoleMass", "halo_id"]
df.filter("halo_id > 0").select(cols).show(5)
Note that halo_mass
is duplicated for all the members of the same halo.
We can also easily look at statistics about individual columns:
# Let's look at the stellar_mass and halo_mass distributions
df.select(["stellar_mass", "halo_mass", "redshift"]).describe().show(5)
For more about Apache Spark in the context of LSST DESC, see e.g. LSSTDESC/DC2-production, LSSTDESC/DC2-analysis, LSSTDESC/desc-spark, or connect to AstroLab Software!
To start this journey, let's look at the distribution of halo masses in the DC2 data set. We will select only clusters with positive halo-ids (i.e. we reject the synthetic halos that are added to host the ultra-faint galaxies). In addition, we will create two other populations: low-z (0.0 < z < 0.2), and high-z (2.5 < z < 3.1) clusters.
# We cache the data for speeding up later computations
df_mass = df.filter('is_central == True').filter("halo_id > 0").select("halo_mass").cache()
# Populations based on redshift range
df_mass_lowz = df_mass.filter("redshift < 0.2")
df_mass_highz = df_mass.filter("redshift > 2.5")
Since the number of clusters is rather high, we will capitalize on the fact that we are doing computation in parallel. The way to be faster is to distribute the computation which leads to the data to be plotted. Histograms are particularly easy to distribute:
# Illustration of the text above
Image('histo_spark.png')
Let's write such a method to be applied on each Spark partition to compute histograms in parallel (each would contain only a fraction of the data):
def partition_hist(partition: Iterator, bins : np.ndarray = None) -> Generator:
""" Produce histograms from partition data
Parameters
----------
partition : Iterator
Iterator containing partition data *[x].
Returns
----------
Generator yielding histograms for each partition.
Counts is an array of dimension nbins - 1.
"""
# Unwrap the iterator
mass = [*partition]
(counts, edges) = np.histogram(mass, bins)
yield counts
def parallel_hist(df: DataFrame, bins : np.ndarray, col: str = None) -> np.ndarray:
""" Build an histogram of the data in parallel
Parameters
----------
df : DataFrame
Apache Spark DataFrame.
bins : 1D numpy array
The bin edges, including the rightmost edge.
col : str, optional
Column to select. If None, the DataFrame is
expected to have only one column. Default is None.
Returns
----------
hist : 1D numpy array
Full histogram over `bins` for the selected DataFrame column.
hist has dimension len(bins) - 1.
"""
# Go from DataFrame to RDD world
if col is None:
rdd = df.rdd
else:
rdd = df.select(col).rdd
# Build the histograms in each partition
hist = rdd\
.mapPartitions(lambda partition: partition_hist(partition, bins))\
.reduce(lambda x, y: x+y)
return hist
# Binning of mass in log space
bins = np.logspace(8, 15, 50)
# Build histograms for the 3 populations
dataplot = []
for index, dataframe in enumerate([df_mass, df_mass_lowz, df_mass_highz]):
# This is the crucial part - build the plot data in parallel, and
# reduce the final result to the driver. Scalable to Billions of points!
dataplot.append(parallel_hist(dataframe, bins))
# Plot the result
matplotlib.rcParams.update({'font.size': 17})
labels = ["All redshifts", "0.0 < z < 0.2", "2.5 < z < 3.1"]
fig = plt.figure(figsize=(10, 7))
for index in range(len(dataplot)):
plt.loglog(bins[:-1], dataplot[index], label=labels[index])
print('Entries for {} : {} ({:.1f}%)'.format(
labels[index],
np.sum(dataplot[index]),
np.sum(dataplot[index])/np.sum(dataplot[0])*100))
plt.legend()
plt.xlim(2.5e10, 1e15)
plt.title('Halo Mass distribution in cosmoDC2 (v1.1.4)', fontsize=20)
plt.xlabel(r'${\rm M}_h [{\rm M}_\odot]$', fontsize=30)
plt.show()
We selected the clusters based on their halo_id
, and to avoid double counting we took only entries corresponding to central galaxies (all galaxies for a given halo_id
have the same halo_mass
).
If now we look closely at the halo_mass
distribution values, we observe a weird feature:
df1 = df.filter('is_central == True').filter("halo_id > 0").select('halo_mass')
df2 = df.filter('is_central == True').filter("halo_id > 0").select('halo_mass').distinct()
print("All halo_mass values: {}".format(df1.count()))
print("All distinct halo_mass values: {}".format(df2.count()))
There many different halos (i.e. different position on sky) with exactly the same mass! Note that we already select central galaxies to grab the value of the halo mass. Let's look at both distributions:
# Build plots in parallel - millions of points!
data1 = parallel_hist(df1, bins)
data2 = parallel_hist(df2, bins)
# Plot the result
matplotlib.rcParams.update({'font.size': 17})
plt.figure(figsize=(10, 7))
plt.loglog(bins[:-1], data1, label="All mass values")
plt.loglog(bins[:-1], data2, label="Unique mass values")
plt.title("Oooops!", fontsize=20)
plt.xlim(2.5e10, 1e15)
plt.xlabel(r'${\rm M}_h [{\rm M}_\odot]$', fontsize=30)
plt.legend()
plt.show()
That means many halos with mass lower than $10^{13} {\rm M}_\odot$ have the same mass.
Question: Is that expected?
Answer: Yes! This is the so-called mass quantization in N-body simulations: all halo masses must by definition be some integer times the particle mass value used in the simulations (a few times 10^9 solar masses). For a detailed discussion, see https://github.com/LSSTDESC/DC2-analysis/issues/55. OK, this one was easy to solve...!
Let's now focus on some selected galaxy clusters. We will select 5 rich ones ($n_{gal} > 50$), and look at their spatial morphology (in comoving cartesian coordinates), plus the 3D velocity of their members.
# We reject synthetic halos - see for example discussion in
# https://github.com/LSSTDESC/cosmodc2/issues/82
df_halo = df.filter("halo_id > 0")
# Group by halo_id, and count members
df_halo_grouped = df_halo.groupBy("halo_id").count().filter("count > 50")
df_halo_grouped.show(5)
# Take the 5 first
halos = df_halo_grouped.select("halo_id").take(5)
We will collect halos data into Pandas DataFrame. Why using Pandas? to show that you can go from Spark world to Pandas world easily (and also for fun):
# Select position, velocity, stellar mass and redshift
cols_3d = ["position_x", "position_y", "position_z", "velocity_x", "velocity_y", "velocity_z"]
cols = cols_3d + ["stellar_mass", "halo_mass", "redshift"]
pandas_dfs = []
for halo in halos:
# Keep only members of the halo
halo_id = np.array(halo)[0]
df_members = df_halo.filter("halo_id == {}".format(halo_id)).select(cols)
# Halo members
pandas_df = df_members.toPandas()
# Central galaxy
pandas_df_center = df_members.filter("is_central == True").toPandas()
pandas_dfs.append([pandas_df, pandas_df_center])
Note however that toPandas()
sends back the data from all the executors to the lonely poor driver. So be careful to not use that on TB of data!
matplotlib.rcParams.update({'font.size': 17})
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(15, 35))
fig.suptitle('Selected Halos', y=0.93, fontsize=30)
for index, halo, pdfs in zip(range(len(halos)), halos, pandas_dfs):
# First plot column: position of members, color coded with stellar mass
ax = fig.add_subplot(5, 2, 2*index + 1, projection='3d')
mass_scale = pdfs[0].stellar_mass / pdfs[0].stellar_mass.max() * 200
halo_id = np.array(halo)[0]
zmean = pdfs[1].redshift.mean()
halo_mass = pdfs[1].halo_mass.mean()
ax.scatter(
pdfs[0].position_x,
pdfs[0].position_y,
pdfs[0].position_z, s=mass_scale, c=mass_scale)
ax.scatter(
pdfs[1].position_x,
pdfs[1].position_y,
pdfs[1].position_z,
s=200, color="red", marker="x")
ax.set_title(
'Halo ID {} \n (z $\sim$ {:.2f}, $M_h = {:.2e} M_\odot$)'.format(
halo_id, zmean, halo_mass))
ax.set_xlabel(r'x [Mpc/h]', labelpad=20)
ax.set_ylabel(r'y [Mpc/h]', labelpad=20)
ax.set_zlabel(r'z [Mpc/h]', labelpad=20)
ax.tick_params(axis='both', labelsize=10)
# Second plot column: position of members + 3D velocity,
# color coded with stellar mass.
ax = fig.add_subplot(5, 2, 2*index + 2, projection='3d')
ax.scatter(
pdfs[0].position_x,
pdfs[0].position_y,
pdfs[0].position_z, s=mass_scale, c=mass_scale)
q = ax.quiver(
pdfs[0].position_x,
pdfs[0].position_y,
pdfs[0].position_z,
pdfs[0].velocity_x,
pdfs[0].velocity_y,
pdfs[0].velocity_z, length=0.0002, color='k')
ax.scatter(
pdfs[1].position_x,
pdfs[1].position_y,
pdfs[1].position_z,
s=200, color="red", marker="x")
ax.set_title(
'Halo ID {} \n (z $\sim$ {:.2f}, $M_h = {:.2e} M_\odot$)'.format(
halo_id, zmean, halo_mass))
ax.set_xlabel(r'x [Mpc/h]', labelpad=20)
ax.set_ylabel(r'y [Mpc/h]', labelpad=20)
ax.set_zlabel(r'z [Mpc/h]', labelpad=20)
ax.tick_params(axis='both', labelsize=10)
plt.show()
Color code:
Note that for the 2nd and 4th halos the velocities seem to be aligned along an axis as if they were following some flow. On the third halo, galaxies seem to rotate around the center.
Let's have a look at a wider region of space. We focus on a cube of 15 Mpc size (~ 2000 halos), compute and display the mean velocity for each halo:
# Select columns of interest
cols = ["halo_id", "position_x", "position_y", "position_z", "redshift", "is_central"]
cols_vel = ["velocity_x", "velocity_y", "velocity_z"]
cols_mass = ["halo_mass"]
df_sub = df.select(cols + cols_mass + cols_vel).filter("halo_id > 0")
# Get the mean position of our cube of data
df_stat = df_sub.select(["position_x", "position_y", "position_z"]).describe()
means = df_stat.collect()[1]
m_x = float(means['position_x'])
m_y = float(means['position_y'])
m_z = float(means['position_z'])
# Size of our data cube
Mpc = 15.
# Keep only this region of space
dfcut = df_sub.filter(
(F.abs(df_sub.position_x - m_x) < Mpc) &
(F.abs(df_sub.position_y - m_y) < Mpc) &
(F.abs(df_sub.position_z - m_z) < Mpc))
# Group data by halos and compute the mean velocity
df_disp = dfcut.groupBy("halo_id").agg({
"velocity_x": 'mean',
"velocity_y": 'mean',
"velocity_z": 'mean'})
# Add back original columns, and
# select only a few columns for the plot
data_joined = dfcut.join(df_disp, "halo_id")\
.filter("is_central == True")\
.select(
"avg(velocity_x)", "avg(velocity_y)", 'avg(velocity_z)',
"position_x", "position_y", "position_z",
'redshift', 'halo_mass')
# Show a few columns
data_joined.show(5)
# Go to Pandas world for the plot
pandas_df = data_joined.toPandas()
matplotlib.rcParams.update({'font.size': 17})
fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(1, 1, 1, projection='3d')
q = ax.quiver(
pandas_df.position_x,
pandas_df.position_y,
pandas_df.position_z,
pandas_df["avg(velocity_x)"],
pandas_df["avg(velocity_y)"],
pandas_df["avg(velocity_z)"], length=0.0005, color='k')
ax.scatter(
pandas_df.position_x,
pandas_df.position_y,
pandas_df.position_z,
s=pandas_df.halo_mass/pandas_df.halo_mass.max()*400, c='r')
plt.title(
'Mean 3D velocity of halos (z $\sim$ {:.2f})'.format(pandas_df.redshift.mean()))
ax.set_xlabel(r'x [Mpc/h]', labelpad=20)
ax.set_ylabel(r'y [Mpc/h]', labelpad=20)
ax.set_zlabel(r'z [Mpc/h]', labelpad=20)
ax.tick_params(axis='both', labelsize=10)
plt.show()
We show the 3D mean velocity (black arrow) for each halo contained in this data cube (15 Mpc size). We also superimposed the halo mass (red circle). The bigger is the circle, the more massive is the halo. It is interesting to notice that most massive halos are surrounded by halos with large velocities, they seem to flow towards them.
With more times, that would be interesting to study other properties and the environment of those halos and see whether there is a correlation with the velocities.
To end this journey, let's have a look at the mean velocity distribution as a function of redshift:
# Redshift range
redshift_start = 0.0
redshift_stop = 3.0
redshift_step = 0.5
redshift_window = 0.1
values = np.arange(redshift_start, redshift_stop, redshift_step)
# start at 0.2 because stat is poor at very low redshift
values[0] = 0.2
pdfs = []
for index, start in enumerate(values[:]):
dfcut = df_sub.filter("redshift > {}".format(start))\
.filter("redshift <= {}".format(start + redshift_window))
# Group data by halos and compute the mean velocity
df_disp = dfcut.groupBy("halo_id").agg({
"velocity_x": 'mean',
"velocity_y": 'mean',
"velocity_z": 'mean'})
# Add back original columns, and
# select only a few columns for the plot
data_joined = dfcut.join(df_disp, "halo_id")\
.filter("is_central == True")\
.select("avg(velocity_x)", "avg(velocity_y)", 'avg(velocity_z)')
pdfs.append(data_joined.toPandas())
matplotlib.rcParams.update({'font.size': 17})
fig = plt.figure(figsize=(12, 8))
plt.title("Mean 3D velocity of halos")
for index, pandas_df in enumerate(pdfs):
# simple mean
meanv = np.sqrt(
pandas_df["avg(velocity_x)"]**2 +
pandas_df["avg(velocity_y)"]**2 +
pandas_df["avg(velocity_z)"]**2)
# histogram
plt.hist(
meanv, bins=400,
label="{:.1f} < z < {:.1f}".format(values[index], values[index] + redshift_window),
histtype="step")
plt.yscale("log")
plt.xlim(0, 6000)
plt.xlabel("Mean 3D velocity of halo [km/s]")
plt.ylabel("count")
plt.legend()
plt.grid()
plt.show()
It is interesting to notice that the peak of the distribution shifts towards larger values for high redshift. The distributions are also broader at high redshift.
In this section we compute the velocity dispersion–halo mass relation. The principle is similar to the M-$\sigma$ relation for stars around black holes. The idea is to highlight the fact that gravitational interaction/friction between galaxies has an effect for the cluster galaxy evolution.
from pyspark.sql.functions import stddev_pop
@pandas_udf('double', PandasUDFType.SCALAR)
def normv(vx: Any, vy: Any, vz: Any) -> pd.Series:
""" Compute the 3D velocity norm.
To be used as a User Defined Function for Spark.
Parameters
----------
v{i} : 1D array, or string, or DataFrame column
Velocity data
Returns
----------
One-dimensional ndarray with axis labels (including time series).
"""
v = np.sqrt(vx**2 + vy**2 + vz**2)
return pd.Series(v)
# We reject synthetic halos - see for example discussion in
# https://github.com/LSSTDESC/cosmodc2/issues/82
# In addition we select halo members with stellar mass > 5e10 M_o.
df_filt = df.filter("halo_id > 0").filter("stellar_mass > 5e10")
# Add a column which is the 3D velocity norm
df_meanv = df_filt\
.withColumn("normv", normv("velocity_x", "velocity_y", "velocity_z"))\
.select(["normv", "halo_mass", "is_central", "halo_id", "redshift"])
# Group data by halos and compute the velocity dispersion
df_disp = df_meanv.groupBy("halo_id").agg(stddev_pop("normv").alias("stdv"))
# Add back the original DataFrame columns
# and select only central member for halo
# (unique halo mass and velocity dispersion for a halo)
data_joined = df_meanv.join(df_disp, "halo_id")\
.filter("is_central == True")\
.filter("stdv > 0")\
.select("stdv", "halo_mass", 'redshift')\
.dropna()
data_joined.show(5)
# Collect the data from the executors to the driver
data = data_joined.collect()
vel, mass, redshift = np.transpose(data)
print("Number of entries: {}".format(len(vel)))
Let's now plot the velocity dispersion vs halo mass
matplotlib.rcParams.update({'font.size': 17})
fig = plt.figure(figsize=(10, 7))
cs = plt.hexbin(mass, vel, gridsize=150, xscale="log", yscale="log", mincnt=1)
plt.xlabel(r'${\rm M}_h [{\rm M}_\odot]$', fontsize=30)
plt.ylabel(r'$\sigma_v$ [km/s]', fontsize=30)
plt.colorbar(cs, label="halos per bin")
title = "Velocity dispersion–halo mass relation\n"
hypotheses = r"(${\rm M}_*$ > $5\times10^{10}{\rm M}_\odot$, $n_{gal/halo}$ > 1, 0 < z < 3)"
plt.title(title + hypotheses, fontsize=20)
plt.show()
One can see that there is a good relationship between the two. One can also see regular stripes at some halo masses. Let's inspect the 1D mass distribution:
# Let's inspect these stripes
import seaborn as sns
joint_kws = dict(gridsize=200, mincnt=1)
g = sns.jointplot(
np.log10(mass),
np.log10(vel),
height=8, space=0.5,
kind='hex', xlim=(11.5, 15), ylim=(0, 3), color='k',
joint_kws=joint_kws,
marginal_kws=dict(bins=200, rug=True))
g.set_axis_labels(r'$\log({\rm M}_h \, [{\rm M}_\odot])$', r'$\log(\sigma_v \, [km/s])$')
plt.show()
Hum, the peaks in the 1D halo mass distribution seem to be regularly spaced in log space. In order to study more in details this effect, I created another notebook: hackurdc2_spark_halos_appendix.ipynb. This is also discussed in detailed in LSSTDESC/DC2-analysis/issues/57.
Long story short: The tooth shape is expected from how galaxies in the Universe Machine are sampled into the Outer Rim halo light cone. This will put a limit on the accuracy with which cosmoDC2 observations could constrain the halo mass of stacked galaxy sample, and future simulations may implement a bin-free sampling method instead.
Following e.g. 1602.00611 (see section 4.1), we can model the mean relation between velocity dispersion and halo mass at a given redshift using a simple power-law of the form:
$$\sigma_v(\rm{M}_h) = \alpha \Big( \dfrac{\rm{M}_h}{10^{14}\rm{M}_\odot} \Big)^{\beta}$$and the dependency in redshift is given by
$$\sigma_v(z) = \sigma_v(0)\sqrt{1 + z}$$Let's split our data set into redshit bins, and fit for the coefficients:
matplotlib.rcParams.update({'font.size': 17})
redshift_start = np.min(redshift)
redshift_stop = 3.0
redshift_step = 0.4
redshift_window = 0.1
values = np.arange(redshift_start, redshift_stop, redshift_step)
# start at 0.2 because stat is poor at very low redshift
values[0] = 0.2
def vel_mass_relation(mass: np.ndarray, a: float, b: float) -> np.ndarray:
""" Model for the mean relation between velocity dispersion
and halo mass at a given redshift using a simple power-law.
The pivot mass is at 10^14 solar mass.
Parameters
----------
mass : 1D numpy array
List of masses
a, b: floats
Coefficient of the power law a*x**b
Returns
----------
out : 1D numpy array
Velocity dispersion according to the power law.
"""
return a * (mass / 1e14)**b
# Use for later pretty print of coefficient fit.
table = """
| redshift | $\\alpha$ [km/s] | $\\beta$ | entries (#) |
|----------|------------------|----------|-------------|
"""
fig, ax = plt.subplots(1, 2, figsize=(13, 7))
fig.suptitle("Fit to the mean relation between velocity dispersion \n and halo mass different redshifts", y=1.1)
for index, start in enumerate(values[:]):
mask = (redshift > start) & (redshift < start + redshift_window)
res = curve_fit(vel_mass_relation, mass[mask], vel[mask])
val = res[0]
err = np.diag(res[1])
table += "{:.2f} < z < {:.2f} | {:.2f} $\pm$ {:.2f} | {:.2E} $\pm$ {:.2E} | {} | \n".format(
start, start + redshift_window, val[0], err[0], val[1], err[1], np.sum(mask))
# Plot the fit
ax[0].loglog(mass[mask], vel_mass_relation(mass[mask], val[0], val[1]),
ls='-', marker='', color="C{}".format(index),
label=r"{:.2f} < z < {:.2f}".format(start, start + redshift_window))
# Rescale with the redshift dependency
# before fitting.
res = curve_fit(vel_mass_relation, mass[mask], vel[mask] / np.sqrt(1 + redshift[mask]))
val = res[0]
err = np.diag(res[1])
ax[1].loglog(mass[mask], vel_mass_relation(mass[mask], val[0], val[1]),
ls='-', marker='', color="C{}".format(index),
label=r"{:.2f} < z < {:.2f}".format(start, start + redshift_window))
for i in range(2):
ax[i].set_ylim(10, 1000)
ax[i].set_xlim(np.min(mass), np.max(mass))
ax[i].set_xlabel(r'${\rm M}_h [{\rm M}_\odot]$', fontsize=30)
ax[0].set_ylabel(r'$\sigma_v$ [km/s]', fontsize=30)
ax[1].set_ylabel(r'$\sigma_v / \sqrt{1 + z}$ [km/s]', fontsize=30)
ax[0].legend()
plt.tight_layout()
plt.show()
# uncomment below to print coefficient values
# print(table)
This is similar to what is shown in e.g. 1602.00611 in a smaller redshift range (see section 4.1). If we look at the coefficients, $\alpha$s are somewhat similar but there are some differences in $\beta$s (note: on GitHub, the table is not correctly displayed... Try on the JupyterLab interface directly):
redshift | $\alpha$ [km/s] | $\beta$ | entries (#) |
---|---|---|---|
0.20 < z < 0.30 | 308.53 $\pm$ 46.48 | 5.17E-01 $\pm$ 2.80E-04 | 691 | 0.45 < z < 0.55 | 330.33 $\pm$ 28.55 | 5.09E-01 $\pm$ 1.27E-04 | 2231 | 0.85 < z < 0.95 | 368.03 $\pm$ 40.36 | 5.06E-01 $\pm$ 1.04E-04 | 3506 | 1.25 < z < 1.35 | 403.02 $\pm$ 83.50 | 5.12E-01 $\pm$ 1.53E-04 | 3200 | 1.65 < z < 1.75 | 425.80 $\pm$ 193.15 | 5.22E-01 $\pm$ 2.76E-04 | 2236 | 2.05 < z < 2.15 | 402.48 $\pm$ 626.99 | 4.80E-01 $\pm$ 7.84E-04 | 1237 | 2.45 < z < 2.55 | 505.46 $\pm$ 2236.88 | 5.57E-01 $\pm$ 1.66E-03 | 698 | 2.85 < z < 2.95 | 546.90 $\pm$ 6745.14 | 5.54E-01 $\pm$ 3.75E-03 | 346 |
Part of the difference can come from different cuts applied for the minimal mass, or the definition of member galaxies. We note that the $\alpha$ coefficients are poorly fitted for $z >1.6$. Maybe a different relation takes place at high redshift? Or we need more statistics? Or something else?
However the scaling in redshift ($\sigma_v \propto \sqrt{1 + z}$) seems rather good accross the whole data set!
We usually plot the stellar mass - halo mass relation, or black hole mass - halo mass relation (see for example this notebook). Here we propose to visualize it in 3D directly! By visual inspection, one can easily have an intuition of the contribution of stars or black holes to halo masses.
# Select masses
cols = ["halo_id", "position_x", "position_y", "position_z", "redshift"]
cols_mass = ["halo_mass", "stellar_mass", "blackHoleMass"]
df_sub = df.select(cols + cols_mass).filter("halo_id > 0")
# Get the mean position of our cube of data
# (we could also have re-used this from a previous cell)
df_stat = df_sub.select(["position_x", "position_y", "position_z"]).describe()
means = df_stat.collect()[1]
m_x = float(means['position_x'])
m_y = float(means['position_y'])
m_z = float(means['position_z'])
# Size of our data cube
Mpc = 30.
dfcut = df_sub.filter(
(F.abs(df_sub.position_x - m_x) < Mpc) &
(F.abs(df_sub.position_y - m_y) < Mpc) &
(F.abs(df_sub.position_z - m_z) < Mpc))
# Go to Pandas for the plot
df_disp = dfcut.groupBy("halo_id").agg(F.sum("stellar_mass").alias("total_stellar_mass"))
data_joined = dfcut.join(df_disp, "halo_id")
p = data_joined.toPandas()
matplotlib.rcParams.update({'font.size': 17})
def fmass(p: pd.DataFrame, which: str) -> pd.Series:
""" Compute the fraction of mass relative to
the halo mass (in %)
Parameters
----------
p : Pandas DataFrame
Pandas DataFrame
which : str
Column name for which we want the ratio
Returns
----------
ratio: pd.Series
1D array with fractional masses.
"""
ratio = (p[which] / p['halo_mass']) * 100
return ratio
# Stellar mass / halo mass
fig = plt.figure(figsize=(12,12))
ax = fig.add_subplot(111, projection='3d')
ax.set_title('Fraction of stellar mass in Halos (z $\sim$ {:.2f})'.format(p.redshift.mean()))
c = ax.scatter(
p.position_x,
p.position_y,
p.position_z,
c=fmass(p, "total_stellar_mass"),
s=fmass(p, "total_stellar_mass"))
plt.colorbar(c, shrink=0.75, label=r"${\rm M}_*/{\rm M}_h$[%]", orientation="horizontal", pad=0.05)
ax.set_xlabel(r'x [Mpc/h]', labelpad=20)
ax.set_ylabel(r'y [Mpc/h]', labelpad=20)
ax.set_zlabel(r'z [Mpc/h]', labelpad=20)
plt.show()
# Black hole mass / halo mass
fig = plt.figure(figsize=(12,12))
ax = fig.add_subplot(111, projection='3d')
ax.set_title('Fraction of black hole mass in Halos (z $\sim$ {:.2f})'.format(p.redshift.mean()))
c = ax.scatter(
p.position_x,
p.position_y,
p.position_z,
c=fmass(p, "blackHoleMass"),
s=fmass(p, "blackHoleMass")*1000)
plt.colorbar(c, shrink=0.75, label=r"${\rm M}_{bh}/{\rm M}_h$[%]", orientation="horizontal", pad=0.05)
ax.set_xlabel(r'x [Mpc/h]', labelpad=20)
ax.set_ylabel(r'y [Mpc/h]', labelpad=20)
ax.set_zlabel(r'z [Mpc/h]', labelpad=20)
plt.show()
No surprises, most halo masses are not dominated by stellar masses ($< 5$%), and black holes only contribute to a few ($< 0.2$%)! Of course, one could use much better visualisation tools (combined with the power of Spark!), like is done in hackurdc2_TheMovie/tutorials/cosmoDC2_TheMovie.ipynb.
In this notebook, we have done science thanks to Apache Spark. Spark, is meant to be primarily used in a context of big data. One of its strength is its scalability, namely its capability of using the same piece of code regardless the underlying data volume. The performance of the code will then only depend on the resource used. E.g. for tasks without communications, execution time will be linear with data or resource.
Keep in mind:
So in this example, the smallish vloume of data is likely to lessen Spark performance (volume is just few tens of GB here), and these tests must also be ran on hundreds of GB of data with many processors!
More on Spark in the context of DESC can be found at: https://github.com/LSSTDESC/desc-spark