iminuit
The iminuit package provides Python bindings for the C++ Minuit2 library maintained at CERN, which is the de facto standard fitting engine in HEP. The package has no external dependency apart from NumPy.
iminuit gives the user the full power of the engine internals. That can be really useful, but it often demands a certain level of expertise. It is hence no wonder that many fitting libraries build atop it; that is in particular the case in astronomy.
Note: feel free to complement the introduction below with the several great tutorials available in the GitHub repository. What is shown here is just an appetiser.
Minimisation of a function:
from iminuit import Minuit
def fcn(x, y, z):
    return (x - 2) ** 2 + (y - 3) ** 2 + (z - 4) ** 2
fcn.errordef = Minuit.LEAST_SQUARES
m = Minuit(fcn, x=0, y=0, z=0)
m.migrad() # run optimiser
m.hesse() # run covariance estimator
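The fit results can then be read off the Minuit object; as a quick check of the example above (these are standard iminuit accessors):
print(m.values)  # fitted parameter values, here close to (2, 3, 4)
print(m.errors)  # parabolic uncertainties from HESSE
print(m.fval)    # value of fcn at the minimum
print(m.valid)   # whether the minimisation converged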
Let's look at a little sample of track information generated with a toy.
Acknowledgements
This mini-tutorial is kindly provided by Hans Dembinski (TU Dortmund), with minor modifications.
import uproot
f = uproot.open("data/sample_tracks.root")
event = f["event"]
event.show()
Legend :-):
trk_len: number of reconstructed tracks in the event
mc_trk_len: number of generated (MC) tracks
trk_px, trk_py, trk_pz: components of the track momenta
Get the content of branches as arrays (don't do this for large trees, as you will exhaust the computer's memory):
trk_len = event["trk_len"].array()
mc_trk_len = event["mc_trk_len"].array()
#trk_mom = event.arrays(["trk_px", "trk_py", "trk_pz"], library="pd")
trk_px = event["trk_px"].array(library="pd")
# first ten entries; trk_len is a flat array (an awkward Array by default), which behaves like a NumPy array here
trk_len[:10]
import numpy as np
print(f"{np.sum(trk_len == 0)} events with zero tracks")
import matplotlib.pyplot as plt
plt.hist(trk_len);
# first ten entries; trk_px is jagged (a variable number of tracks per event), read here via pandas
trk_px[:10]
plt.hist(trk_px);
Let's fit the $p_x$ distribution with a normal distribution to extract the parameters $\mu$ and $\sigma$.
import boost_histogram as bh
xaxis = bh.axis.Regular(50, -2, 2)
h_px = bh.Histogram(xaxis)
h_px.fill(trk_px)
# get data from before
px_axis = h_px.axes[0]
cx = px_axis.centers
dx = px_axis.widths
xe = px_axis.edges
n = h_px.view()
plt.errorbar(cx, n, n**0.5, dx, fmt="o", label="data");
# scipy has efficient and correct implementations for most statistical distributions
from scipy.stats import norm, poisson
n_total = np.sum(n)
def score(mu, sigma):
    cdf = norm(mu, sigma).cdf
    lambdas = n_total * (cdf(xe[1:]) - cdf(xe[:-1]))
    probs = poisson.pmf(n, lambdas)
    return -2 * np.sum(np.log(probs + 1e-100))  # avoid taking log of zero
from matplotlib.ticker import LogLocator
mus = np.linspace(-1, 1, 20)
sigmas = np.linspace(1e-10, 2, 20)
g_mu, g_sigma = np.meshgrid(mus, sigmas)
g_score = np.vectorize(score)(g_mu, g_sigma)
plt.contourf(g_mu, g_sigma, g_score)
plt.xlabel(r"$\mu$")
plt.ylabel(r"$\sigma$")
plt.colorbar();
m = Minuit(score, mu=0, sigma=1)
m.errordef = Minuit.LEAST_SQUARES
m.migrad()
mu, sigma = m.values
s_mu, s_sigma = m.errors
plt.errorbar(cx, n, n ** 0.5, dx, fmt="o", label="data");
plt.plot(cx, norm(mu, sigma).pdf(cx) * n_total * dx, label="fit")
plt.title(rf"$\mu = {mu:.3f} \pm {s_mu:.3f}, \quad \sigma = {sigma:.3f} \pm {s_sigma:.3f}$")
plt.legend();
Check whether the fit is good:
cdf = norm(mu, sigma).cdf
n_pred = (cdf(xe[1:]) - cdf(xe[:-1])) * n_total
n_sigma = n_pred ** 0.5 # for Poisson-distributed data
pull = (n - n_pred) / n_sigma
plt.errorbar(cx, pull, np.ones_like(pull), fmt="o")
plt.axhline(0, ls="--")
# degrees of freedom: number of fitted bins minus number of fitted parameters
n_dof = len(n) - 2 # need to subtract two fitted parameters
chi2_obs = np.sum(pull ** 2)
print(f"chi2/ndof = {chi2_obs} / {n_dof} = {chi2_obs / n_dof}")
from scipy.stats import chi2
chance_prob = 1 - chi2(n_dof).cdf(chi2_obs)
print(f"{chance_prob}")
HEP analyses typically involve far more sophisticated fitting work. It is often about building a data model and performing (un)binned maximum-likelihood fits to describe experimental distributions. Fits are also often performed to extract decay-mode yields in addition to the model parameters.
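Yields can for instance be fitted directly with an extended unbinned maximum-likelihood fit. Below is a minimal sketch using cost.ExtendedUnbinnedNLL; the toy dataset and the parameter names ns and nb (signal and background yields) are made up for illustration:
# sketch of an extended unbinned maximum-likelihood fit on toy data (illustrative only)
import numpy as np
from iminuit import Minuit, cost
from scipy.stats import norm, uniform
rng = np.random.default_rng(2)
xr = (-1.0, 1.0)  # fit range
toy = np.append(rng.normal(0.0, 0.1, size=400), rng.uniform(*xr, size=1000))
def scaled_pdf(x, ns, nb, mu, sigma):
    # return the total expected yield and the density scaled to that yield
    return ns + nb, ns * norm.pdf(x, mu, sigma) + nb * uniform.pdf(x, xr[0], xr[1] - xr[0])
m = Minuit(cost.ExtendedUnbinnedNLL(toy, scaled_pdf), ns=300, nb=900, mu=0, sigma=0.2)
m.limits["ns"] = (0, None)
m.limits["nb"] = (0, None)
m.limits["sigma"] = (0, None)
m.migrad()
print(m.values["ns"], m.errors["ns"])  # fitted signal yield and its uncertainty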
The iminuit package has functionality to perform so-called simultaneous fits, where the models fitted to two or more datasets share at least one parameter.
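This works by adding cost functions together: parameters with the same name are automatically shared. A minimal sketch with hypothetical toy datasets:
# sketch of a simultaneous fit: two toy datasets share the parameter "mu"
import numpy as np
from iminuit import Minuit, cost
from scipy.stats import norm
rng = np.random.default_rng(0)
data1 = rng.normal(0.5, 0.1, size=500)  # toy dataset 1
data2 = rng.normal(0.5, 0.2, size=800)  # toy dataset 2: same mean, different width
def pdf1(x, mu, sigma1):
    return norm.pdf(x, mu, sigma1)
def pdf2(x, mu, sigma2):
    return norm.pdf(x, mu, sigma2)
# adding cost functions builds a combined cost; "mu" is shared by name
c = cost.UnbinnedNLL(data1, pdf1) + cost.UnbinnedNLL(data2, pdf2)
m = Minuit(c, mu=0.0, sigma1=0.05, sigma2=0.1)
m.limits["sigma1"] = (0, None)
m.limits["sigma2"] = (0, None)
m.migrad()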
But it is not meant to be a package providing commonly-used goodies and a "framework" for (complex) model building. Imagine how cumbersome it would be to describe verbosely the model below for an amplitude analysis ...!
(Taken straight from one of the excellent iminuit tutorials.)
The example below is an unbinned maximum-likelihood fit: one sums the logarithm of the model density evaluated at each data point (times -1 to turn maximisation into minimisation). The model must be provided as a normalised probability density (vectorised over the data); here it is a mixture of a normal "signal" on top of a uniform "background", with $z$ the signal fraction and $\mu$, $\sigma$ the parameters of the normal.
from iminuit import cost
from scipy.stats import norm, uniform
xrange = -1, 1
rng = np.random.default_rng(1)
xdata = rng.normal(0, 0.1, size=400)
xdata = np.append(xdata, rng.uniform(*xrange, size=1000))
def model_pdf(x, z, mu, sigma):
    return (z * norm.pdf(x, mu, sigma) +
            (1 - z) * uniform.pdf(x, xrange[0], xrange[1] - xrange[0]))
c = cost.UnbinnedNLL(xdata, model_pdf)
m = Minuit(c, z=0.4, mu=0, sigma=0.2)
m.limits["sigma"] = (0, None)
m.limits["z"] = (0, 1)
m.limits["mu"] = (-1, 1)
m.migrad()
n, xe = np.histogram(xdata, bins=50, range=xrange)
cx = 0.5 * (xe[1:] + xe[:-1])
dx = np.diff(xe)
plt.errorbar(cx, n, n ** 0.5, fmt="ok")
xm = np.linspace(xe[0], xe[-1])
plt.plot(xm, model_pdf(xm, *[p.value for p in m.init_params]) * len(xdata) * dx[0],
         ls=":", label="init")
plt.plot(xm, model_pdf(xm, *m.values) * len(xdata) * dx[0], label="fit")
plt.legend();
Even though the example above is just a simple unbinned maximum-likelihood fit, the code is, honestly, rather verbose. Much of it could clearly be simplified for the general user. Now imagine that the (realistic example!) function to fit is not a uniform-plus-normal mixture but rather this:
The ROOT framework provides RooFit as a model fitting library. The PyHEP ecosystem in particular has the zfit package, a model fitting library based on TensorFlow and optimised for simple and direct manipulation of probability density functions ...