Repo: github.com/jet-net/JetNet
Docs: jetnet.readthedocs.io
Paper: 2106.11535
JetNet: Python package with easy-to-access datasets, standardised evaluation metrics, and more utilities for improving accessibility and reproducibility in ML + HEP.
We'll use the jetnet.datasets.JetNet.getData function to download and directly access the dataset.
First, we can check which particle and jet features are available in this dataset:
from jetnet.datasets import JetNet
print(f"Particle features: {JetNet.all_particle_features}")
print(f"Jet features: {JetNet.all_jet_features}")
Next, let's load the data:
data_args = {
    "jet_type": ["g", "t", "w"],  # gluon, top quark, and W boson jets
    "data_dir": "datasets/jetnet",
    # only selecting the kinematic features
    "particle_features": ["etarel", "phirel", "ptrel"],
    "num_particles": 30,
    "jet_features": ["type", "pt", "eta", "mass"],
}
particle_data, jet_data = JetNet.getData(**data_args)
Let's look at some of the data:
print(f"Particle features of the 10 highest pT particles in the first jet\n{data_args['particle_features']}\n{particle_data[0, :10]}")
print(f"\nJet features of first jet\n{data_args['jet_features']}\n{jet_data[0]}")
We can also visualise these jets as images:
from jetnet.utils import to_image
import matplotlib.pyplot as plt
num_images = 5
num_types = len(data_args["jet_type"])
im_size = 25 # number of pixels in height and width
maxR = 0.4 # max radius in (eta, phi) away from the jet axis
cm = plt.cm.jet.copy()
cm.set_under(color="white")
plt.rcParams.update({"font.size": 16})
fig, axes = plt.subplots(
    nrows=num_types,
    ncols=num_images,
    figsize=(40, 8 * num_types),
    gridspec_kw={"wspace": 0.25},
)
# get the index of each jet type using the JetNet.jet_types array
type_indices = {jet_type: JetNet.jet_types.index(jet_type) for jet_type in data_args["jet_type"]}
for j in range(num_types):
    jet_type = data_args["jet_type"][j]
    type_selector = jet_data[:, 0] == type_indices[jet_type]  # select jets based on jet_type feat
    # label each row of images with its jet type
    axes[j][0].annotate(
        jet_type,
        xy=(0, -1),
        xytext=(-axes[j][0].yaxis.labelpad - 15, 0),
        xycoords=axes[j][0].yaxis.label,
        textcoords="offset points",
        ha="right",
        va="center",
        fontsize=24,
    )
    for i in range(num_images):
        im = axes[j][i].imshow(
            to_image(particle_data[type_selector][i], im_size, maxR=maxR),
            cmap=cm,
            interpolation="nearest",
            vmin=1e-8,
            extent=[-maxR, maxR, -maxR, maxR],
            vmax=0.05,
        )
        axes[j][i].tick_params(which="both", bottom=False, top=False, left=False, right=False)
        # raw strings keep the LaTeX backslashes from being read as escape sequences
        axes[j][i].set_xlabel(r"$\phi^{rel}$")
        axes[j][i].set_ylabel(r"$\eta^{rel}$")
        axes[j][i].set_title(f"Jet {i + 1}")
cbar = fig.colorbar(im, ax=axes.ravel().tolist(), fraction=0.01)
cbar.set_label("$p_T^{rel}$")
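To build some intuition for what a jet image is, here is a minimal NumPy sketch that bins one jet's particles on an (etarel, phirel) grid and weights each pixel by ptrel. It is an illustration only, not the implementation of jetnet.utils.to_image; the helper name jet_to_image_sketch and the toy particle values are hypothetical.

```python
import numpy as np


def jet_to_image_sketch(jet, im_size=25, max_r=0.4):
    """Bin one jet into an (im_size, im_size) pT-weighted image.

    `jet` is assumed to have shape (num_particles, 3) with columns
    (etarel, phirel, ptrel), as selected in data_args above.
    """
    edges = np.linspace(-max_r, max_r, im_size + 1)
    image, _, _ = np.histogram2d(
        jet[:, 0], jet[:, 1], bins=(edges, edges), weights=jet[:, 2]
    )
    return image


# three hypothetical particles; the last one lies outside max_r and is dropped
toy_jet = np.array(
    [
        [0.00, 0.00, 0.5],
        [0.10, -0.10, 0.3],
        [0.90, 0.00, 0.2],
    ]
)
image = jet_to_image_sketch(toy_jet)
print(image.shape, image.sum())
```

Particles outside the chosen radius do not contribute, so the pixel sum can be smaller than the total ptrel of the jet.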
We can also calculate and plot their overall features:
from jetnet.utils import jet_features
import numpy as np
fig = plt.figure(figsize=(12, 12))
plt.ticklabel_format(axis="y", scilimits=(0, 0), useMathText=True)
for j in range(num_types):
    jet_type = data_args["jet_type"][j]
    type_selector = jet_data[:, 0] == type_indices[jet_type]  # select jets based on jet_type feat
    jet_masses = jet_features(particle_data[type_selector][:50000])["mass"]
    _ = plt.hist(jet_masses, bins=np.linspace(0, 0.2, 100), histtype="step", label=jet_type)
plt.xlabel("Jet $m/p_{T}$")
plt.ylabel("# Jets")
plt.legend(loc=1, prop={"size": 18})
plt.title("Relative Jet Masses")
plt.show()
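For reference, a relative jet mass of this kind can be sketched by summing the particles' four-momenta, treating each particle as massless. The version below is a rough illustration and may differ from what jetnet.utils.jet_features does internally; the name relative_jet_mass_sketch is hypothetical, and plugging the relative coordinates (etarel, phirel, ptrel) directly into the four-momentum formulae is an assumption.

```python
import numpy as np


def relative_jet_mass_sketch(jets):
    """Invariant mass per jet from massless particles.

    `jets` is assumed to have shape (num_jets, num_particles, 3) with
    columns (etarel, phirel, ptrel); zero-padded particles have ptrel = 0
    and therefore contribute nothing to the sums.
    """
    eta, phi, pt = jets[..., 0], jets[..., 1], jets[..., 2]
    px = (pt * np.cos(phi)).sum(axis=-1)
    py = (pt * np.sin(phi)).sum(axis=-1)
    pz = (pt * np.sinh(eta)).sum(axis=-1)
    energy = (pt * np.cosh(eta)).sum(axis=-1)
    mass_sq = energy**2 - px**2 - py**2 - pz**2
    # clip tiny negative values caused by floating-point rounding
    return np.sqrt(np.clip(mass_sq, 0.0, None))


toy_jets = np.array(
    [
        [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]],  # one particle plus zero padding
        [[0.0, 0.1, 0.5], [0.0, -0.1, 0.5]],  # two particles split in phi
    ]
)
masses = relative_jet_mass_sketch(toy_jets)
print(masses)
```

A jet made of a single massless particle has zero mass, while any angular spread between particles yields a non-zero value.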
To prepare the dataset for machine learning applications, we can use the jetnet.datasets.JetNet class itself, which inherits from the torch.utils.data.Dataset class.
We'll also use the class to normalise the features to have zero means and unit standard deviations, and transform the jet type feature to be one-hot-encoded.
from jetnet.datasets import JetNet
from jetnet.datasets.normalisations import FeaturewiseLinear
import numpy as np
from sklearn.preprocessing import OneHotEncoder
# function to one hot encode the jet type and leave the rest of the features as is
def OneHotEncodeType(x: np.ndarray):
    enc = OneHotEncoder(categories=[[0, 1]])
    type_encoded = enc.fit_transform(x[..., 0].reshape(-1, 1)).toarray()
    other_features = x[..., 1:].reshape(-1, 3)
    return np.concatenate((type_encoded, other_features), axis=-1).reshape(*x.shape[:-1], -1)
data_args = {
    "jet_type": ["g", "t"],  # gluon and top quark jets
    "data_dir": "datasets/jetnet",
    # these are the default particle features, written here to be explicit
    "particle_features": ["etarel", "phirel", "ptrel", "mask"],
    "num_particles": 10,  # we retain only the 10 highest pT particles for this demo
    "jet_features": ["type", "pt", "eta", "mass"],
    # we don't want to normalise the 'mask' feature so we set that to False
    "particle_normalisation": FeaturewiseLinear(normal=True, normalise_features=[True, True, True, False]),
    # pass our function as a transform to be applied to the jet features
    "jet_transform": OneHotEncodeType,
}
jets_train = JetNet(**data_args, split="train")
jets_valid = JetNet(**data_args, split="valid")
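To make the normalisation step concrete, here is a minimal NumPy sketch of featurewise standardisation that leaves chosen features (such as the mask) untouched. It is only an illustration of the idea and is not the FeaturewiseLinear implementation, which may, for example, treat padded particles differently; the name standardise_features_sketch and the random toy data are hypothetical.

```python
import numpy as np


def standardise_features_sketch(particles, normalise_features):
    """Scale selected features to zero mean and unit standard deviation.

    `particles` is assumed to have shape (num_jets, num_particles, num_features);
    `normalise_features` holds one bool per feature, and False leaves
    that feature unchanged.
    """
    flat = particles.reshape(-1, particles.shape[-1])
    means = flat.mean(axis=0)
    stds = flat.std(axis=0)
    out = particles.astype(float).copy()
    for i, do_norm in enumerate(normalise_features):
        if do_norm and stds[i] > 0:  # skip constant features to avoid dividing by zero
            out[..., i] = (particles[..., i] - means[i]) / stds[i]
    return out


rng = np.random.default_rng(0)
toy_particles = rng.normal(loc=2.0, scale=3.0, size=(100, 10, 4))
toy_particles[..., 3] = 1.0  # stand-in for the 'mask' feature
normalised = standardise_features_sketch(toy_particles, [True, True, True, False])
print(normalised[..., :3].mean(), normalised[..., :3].std())
```

In practice the statistics should be derived from the training split and reused for validation, so that both splits are scaled identically.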
We can look at one of our datasets to confirm everything is as we expect:
jets_train
We can also look directly at the data itself. Note that the features have been normalised and the jet type has been one-hot-encoded:
particle_features, jet_features = jets_train[0]
print(f"Particle features ({data_args['particle_features']}):\n\t{particle_features}")
print(f"\nJet features ({data_args['jet_features']}):\n\t{jet_features}")
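To see what the one-hot encoding does to the jet features, here is a small NumPy-only sketch of the same idea as OneHotEncodeType, without scikit-learn. The name one_hot_type_sketch and the toy values are hypothetical, and the sketch assumes the type column holds integer labels from 0 to num_types - 1.

```python
import numpy as np


def one_hot_type_sketch(jet_feats, num_types=2):
    """Replace the leading type column with `num_types` one-hot columns.

    Assumes jet_feats[..., 0] contains integer labels in [0, num_types).
    """
    type_index = jet_feats[..., 0].astype(int)
    one_hot = np.eye(num_types)[type_index]
    return np.concatenate((one_hot, jet_feats[..., 1:]), axis=-1)


# two hypothetical jets with features (type, pt, eta, mass)
toy_jet_feats = np.array(
    [
        [0.0, 1.2, 0.1, 0.05],
        [1.0, 0.9, -0.3, 0.12],
    ]
)
encoded = one_hot_type_sketch(toy_jet_feats)
print(encoded)
```

Each jet gains one extra column per additional class, which is why the transformed jet features are longer than the list in data_args["jet_features"].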
Next things you can try are the Top Tagging (jetnet.datasets.TopTagging) and Quark Gluon (jetnet.datasets.QuarkGluon) datasets, and the evaluation metrics (jetnet.evaluation).