
JetNet Demo¶

Raghav Kansal
UC San Diego

PyHEP 2022 Workshop
Online, 12-16 September 2022

JetNet: For developing and reproducing ML + HEP projects.

Repo: github.com/jet-net/JetNet

Docs: jetnet.readthedocs.io

Paper: arXiv:2106.11535

Introduction¶

Problems:¶

How do I get started with machine learning in high energy physics?
How do I evaluate my results?
How do we reproduce and compare results?

Solution:¶

JetNet: Python package with easy-to-access datasets, standardised evaluation metrics, and more utilities for improving accessibility and reproducibility in ML + HEP.

Note: JetNet is still under development and currently offers a limited number of datasets and metrics. Feedback and contributions welcome!¶

Today¶

Loading and looking at the JetNet dataset
Preparing the dataset for training a model

Data loading¶

We'll use the jetnet.datasets.JetNet.getData function to download and directly access the dataset.

First, we can check which particle and jet features are available in this dataset:

In [ ]:

from jetnet.datasets import JetNet
print(f"Particle features: {JetNet.all_particle_features}")
print(f"Jet features: {JetNet.all_jet_features}")

Next, let's load the data:

In [ ]:

data_args = {
    "jet_type": ["g", "t", "w"],  # gluon, top quark, and W boson jets
    "data_dir": "datasets/jetnet",
    # only selecting the kinematic features
    "particle_features": ["etarel", "phirel", "ptrel"],
    "num_particles": 30,
    "jet_features": ["type", "pt", "eta", "mass"],
}

particle_data, jet_data = JetNet.getData(**data_args)

Let's look at some of the data:

In [ ]:

print(f"Particle features of the 10 highest pT particles in the first jet\n{data_args['particle_features']}\n{particle_data[0, :10]}")
print(f"\nJet features of first jet\n{data_args['jet_features']}\n{jet_data[0]}")

We can also visualise these jets as images:

In [ ]:

from jetnet.utils import to_image
import matplotlib.pyplot as plt

num_images = 5
num_types = len(data_args["jet_type"])
im_size = 25  # number of pixels in height and width
maxR = 0.4  # max radius in (eta, phi) away from the jet axis

cm = plt.cm.jet.copy()
cm.set_under(color="white")
plt.rcParams.update({"font.size": 16})

fig, axes = plt.subplots(
    nrows=num_types,
    ncols=num_images,
    figsize=(40, 8 * num_types),
    gridspec_kw={"wspace": 0.25},
)

# get the index of each jet type using the JetNet.jet_types array
type_indices = {jet_type: JetNet.jet_types.index(jet_type) for jet_type in data_args["jet_type"]}

for j in range(num_types):
    jet_type = data_args["jet_type"][j]
    type_selector = jet_data[:, 0] == type_indices[jet_type]  # select jets based on jet_type feat

    # label each row with its jet type, to the left of the first image
    axes[j][0].annotate(
        jet_type,
        xy=(0, -1),
        xytext=(-axes[j][0].yaxis.labelpad - 15, 0),
        xycoords=axes[j][0].yaxis.label,
        textcoords="offset points",
        ha="right",
        va="center",
        fontsize=24,
    )

    for i in range(num_images):
        im = axes[j][i].imshow(
            to_image(particle_data[type_selector][i], im_size, maxR=maxR),
            cmap=cm,
            interpolation="nearest",
            vmin=1e-8,
            extent=[-maxR, maxR, -maxR, maxR],
            vmax=0.05,
        )
        axes[j][i].tick_params(which="both", bottom=False, top=False, left=False, right=False)
        # raw strings avoid invalid escape sequences such as "\p" and "\e"
        axes[j][i].set_xlabel(r"$\phi^{rel}$")
        axes[j][i].set_ylabel(r"$\eta^{rel}$")
        axes[j][i].set_title(f"Jet {i + 1}")

cbar = fig.colorbar(im, ax=axes.ravel().tolist(), fraction=0.01)
cbar.set_label("$p_T^{rel}$")
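
For intuition, turning a jet into an image amounts to a 2D histogram of its particles in (etarel, phirel), weighted by ptrel. The sketch below is a NumPy approximation under that assumption and is not the library's implementation; the function name jet_to_image_sketch and the toy jet are hypothetical.

```python
import numpy as np


def jet_to_image_sketch(jet: np.ndarray, im_size: int, maxR: float) -> np.ndarray:
    """Bin one jet into an (im_size, im_size) image of summed ptrel.

    Assumes `jet` has shape (num_particles, 3) with columns
    (etarel, phirel, ptrel), as in the data loaded above.
    """
    bins = np.linspace(-maxR, maxR, im_size + 1)
    image, _, _ = np.histogram2d(
        jet[:, 0], jet[:, 1], bins=[bins, bins], weights=jet[:, 2]
    )
    return image


# hypothetical toy jet: three particles, one of them outside the image window
toy_jet = np.array(
    [
        [0.0, 0.0, 0.5],
        [0.1, -0.1, 0.3],
        [0.9, 0.9, 0.2],  # lies beyond maxR, so it is not binned
    ]
)
toy_image = jet_to_image_sketch(toy_jet, im_size=25, maxR=0.4)
print(toy_image.shape, toy_image.sum())
```

Particles beyond maxR fall outside the histogram range and do not appear in the image, which is why the choice of maxR matters for wide jets.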

We can also calculate and plot their overall features, such as the relative jet mass:

In [ ]:

from jetnet.utils import jet_features
import numpy as np

fig = plt.figure(figsize=(12, 12))
plt.ticklabel_format(axis="y", scilimits=(0, 0), useMathText=True)

for j in range(num_types):
    jet_type = data_args["jet_type"][j]
    type_selector = jet_data[:, 0] == type_indices[jet_type]  # select jets based on jet_type feat

    jet_masses = jet_features(particle_data[type_selector][:50000])["mass"]
    _ = plt.hist(jet_masses, bins=np.linspace(0, 0.2, 100), histtype="step", label=jet_type)

plt.xlabel("Jet $m/p_{T}$")
plt.ylabel("# Jets")
plt.legend(loc=1, prop={"size": 18})
plt.title("Relative Jet Masses")
plt.show()
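
As a rough illustration of where the relative mass comes from, the sketch below sums the four-momenta of the constituents, treating each particle as massless. Because ptrel is expressed relative to the jet pT, the result approximates m/pT. This is an assumption about the physics, not a copy of the jetnet.utils.jet_features implementation; the function name relative_jet_mass_sketch and the toy jet are hypothetical.

```python
import numpy as np


def relative_jet_mass_sketch(jets: np.ndarray) -> np.ndarray:
    """Invariant mass of each jet from massless constituents.

    Assumes `jets` has shape (num_jets, num_particles, 3) with columns
    (etarel, phirel, ptrel). Zero-padded particles have ptrel = 0 and
    therefore contribute nothing to the sums.
    """
    eta, phi, pt = jets[..., 0], jets[..., 1], jets[..., 2]
    px = (pt * np.cos(phi)).sum(axis=-1)
    py = (pt * np.sin(phi)).sum(axis=-1)
    pz = (pt * np.sinh(eta)).sum(axis=-1)
    energy = (pt * np.cosh(eta)).sum(axis=-1)
    mass_squared = energy**2 - px**2 - py**2 - pz**2
    # clip small negative values caused by floating-point rounding
    return np.sqrt(np.clip(mass_squared, 0.0, None))


# hypothetical toy jet: two real particles and one zero-padded particle
toy_jets = np.array(
    [
        [
            [0.1, 0.0, 0.5],
            [-0.1, 0.0, 0.5],
            [0.0, 0.0, 0.0],
        ]
    ]
)
toy_masses = relative_jet_mass_sketch(toy_jets)
print(toy_masses)
```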

Dataset preparation¶

To prepare the dataset for machine learning applications, we can use the jetnet.datasets.JetNet class itself, which inherits from the torch.utils.data.Dataset class.

We'll also use the class to normalise the features to have zero means and unit standard deviations, and transform the jet type feature to be one-hot-encoded.

In [ ]:

from jetnet.datasets import JetNet
from jetnet.datasets.normalisations import FeaturewiseLinear

import numpy as np
from sklearn.preprocessing import OneHotEncoder


# function to one hot encode the jet type and leave the rest of the features as is
def OneHotEncodeType(x: np.ndarray):
    enc = OneHotEncoder(categories=[[0, 1]])
    type_encoded = enc.fit_transform(x[..., 0].reshape(-1, 1)).toarray()
    other_features = x[..., 1:].reshape(-1, 3)
    return np.concatenate((type_encoded, other_features), axis=-1).reshape(*x.shape[:-1], -1)


data_args = {
    "jet_type": ["g", "t"],  # gluon and top quark jets
    "data_dir": "datasets/jetnet",
    # these are the default particle features, written here to be explicit
    "particle_features": ["etarel", "phirel", "ptrel", "mask"],
    "num_particles": 10,  # we retain only the 10 highest pT particles for this demo
    "jet_features": ["type", "pt", "eta", "mass"],
    # we don't want to normalise the 'mask' feature so we set that to False
    "particle_normalisation": FeaturewiseLinear(
        normal=True, normalise_features=[True, True, True, False]
    ),
    # pass our function as a transform to be applied to the jet features
    "jet_transform": OneHotEncodeType,
}

jets_train = JetNet(**data_args, split="train")
jets_valid = JetNet(**data_args, split="valid")
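
To make the effect of the normalisation arguments concrete, here is a NumPy sketch of feature-wise standardisation that skips the features flagged False. It illustrates the idea only; the library's FeaturewiseLinear class may compute its statistics differently (for example in how it treats zero-padded particles). The function name standardise_features_sketch and the toy data are hypothetical.

```python
import numpy as np


def standardise_features_sketch(x: np.ndarray, normalise_features: list) -> np.ndarray:
    """Shift and scale selected features to zero mean and unit standard deviation.

    `x` has shape (num_jets, num_particles, num_features); the statistics are
    computed per feature over all jets and particles.
    """
    flat = x.reshape(-1, x.shape[-1])
    out = x.astype(float).copy()
    for i, normalise in enumerate(normalise_features):
        if normalise:
            out[..., i] = (x[..., i] - flat[:, i].mean()) / flat[:, i].std()
    return out


rng = np.random.default_rng(0)
toy = rng.normal(loc=2.0, scale=3.0, size=(100, 10, 4))
toy[..., 3] = 1.0  # hypothetical mask column: every particle is real

toy_normed = standardise_features_sketch(toy, [True, True, True, False])
print(toy_normed.reshape(-1, 4).mean(axis=0))
print(toy_normed.reshape(-1, 4).std(axis=0))
```

Leaving the mask untouched keeps it usable as a 0/1 indicator of real versus padded particles after normalisation.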

We can look at one of our datasets to confirm everything is as we expect:

In [ ]:

jets_train

We can also look directly at the data itself (note that the features have been normalised and the jet type has been one-hot-encoded):

In [ ]:

# named *_feats to avoid shadowing the jet_features function imported from jetnet.utils
particle_feats, jet_feats = jets_train[0]
print(f"Particle features ({data_args['particle_features']}):\n\t{particle_feats}")
print(f"\nJet features ({data_args['jet_features']}):\n\t{jet_feats}")

We can now feed this into a PyTorch DataLoader and start training!
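
A minimal sketch of that last step is shown below. To keep it runnable without downloading anything, it uses a stand-in TensorDataset of random numbers with the same per-item layout as the dataset prepared above (10 particles with 4 features, and 2 one-hot type entries plus 3 jet features); in practice the JetNet dataset object would take its place. The names toy_dataset and toy_loader, the batch size, and the dataset size are all illustrative assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# stand-in for the prepared dataset: each item is (particle features, jet features)
toy_dataset = TensorDataset(torch.randn(256, 10, 4), torch.randn(256, 5))

toy_loader = DataLoader(toy_dataset, batch_size=64, shuffle=True)

for particle_batch, jet_batch in toy_loader:
    # a training step would go here
    print(particle_batch.shape, jet_batch.shape)
```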

Next things you can try are:

Repeating this with the Top Quark Tagging (jetnet.datasets.TopTagging) and Quark Gluon (jetnet.datasets.QuarkGluon) datasets
Training an ML model (tutorial coming soon...)
Evaluating generative models (jetnet.evaluation)
