This notebook will provide you with the tools to do sensitivity estimates which can be used for search region optimization or sensitivity projections.
In searches for new physics we want to know how significant a potential deviation from our Standard Model (SM) expectation is. We do this with a hypothesis test in which we try to exclude the SM ("background-only") hypothesis. For this we use a so-called p-value $p_0$, abstractly defined by:
$$p_0 = \int\limits_{t_\mathrm{obs}}^{\infty}p(t|H_0)\mathrm{d}t$$where $t$ is a test statistic (a number we calculate from our data observations) and $p(t|H_0)$ is the probability distribution of $t$ under the assumption of our null hypothesis $H_0$, in this case the background-only hypothesis. This p-value is then typically converted into a number of standard deviations $z$, the significance ("number of sigmas"), via the inverse of the cumulative standard normal distribution $\Phi$:
$$z = \Phi^{-1}(1 - p)$$The typical convention for particle physics is to speak of evidence when $z>3$ and of an observation when $z>5$.
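As a quick numeric illustration of these thresholds (a small sketch using scipy, where `norm.sf` computes $1-\Phi$):

```python
from scipy import stats

# One-sided p-values corresponding to the conventional significance thresholds
p_evidence = stats.norm.sf(3)     # z = 3 ("evidence"), ~1.35e-3
p_observation = stats.norm.sf(5)  # z = 5 ("observation"), ~2.87e-7
```

So a $5\sigma$ observation corresponds to a probability of less than 3 in 10 million to see such a deviation (or a larger one) from background fluctuations alone.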
So what do we use for $t$? We want something that discriminates well between our null hypothesis and an alternative hypothesis that we have in mind. When we try to discover new physics, our null hypothesis is the absence of a signal and the alternative hypothesis its presence. We can parametrize this by a signal strength parameter $\mu$. The test statistics used in almost all LHC searches are based on the profile likelihood ratio
$$\Lambda_\mu = \frac{L(\mu, \hat{\hat{\theta}})}{L(\hat{\mu}, \hat{\theta})}$$where $\theta$ are the other parameters of our model that are not part of the test, the so-called nuisance parameters. In contrast, the parameter that we want to test, $\mu$, is called our parameter of interest (POI). The nuisance parameters include all fit parameters, like normalization factors and parameters describing uncertainties. $L(\mu, \hat{\hat{\theta}})$ is the likelihood function maximized under the condition that our parameter of interest takes the value $\mu$, and $L(\hat{\mu}, \hat{\theta})$ is the unconditionally maximized likelihood. So roughly speaking, we are calculating the fraction of the maximum possible likelihood that we can reach under our test condition. If it is high, that speaks for our hypothesis; if it is low, against it. The test statistic $t_\mu$ is then defined as
$$t_\mu = -2\ln\Lambda_\mu$$giving us a test statistic where high values speak against the null hypothesis.
All that's left now is to know the distribution $p(t_\mu|H_0)$. Wilks' theorem tells us that the distribution of $t_\mu$ is asymptotically (for large sample sizes) a chi-square distribution. For the discovery p-value we use a slightly modified version of the test statistic, called $q_0$, where $\hat{\mu}$ is required to be $\geq 0$ ($q_0 = 0$ for $\hat{\mu} < 0$). For $q_0$ the significance in the asymptotic limit collapses to a very simple formula:
$$z = \sqrt{q_0}$$or equivalently $p_0 = 1 - \Phi(\sqrt{q_0})$. The asymptotic limit often matches quite well even for fairly small sample sizes, but it should be kept in mind that this is an approximation. Alternatively, one can evaluate $p(t_\mu|H_0)$ by Monte Carlo sampling ("toys").
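As a small sketch of this conversion, assuming a hypothetical observed value $q_0 = 4$:

```python
from scipy import stats

q0 = 4.0               # hypothetical observed value of the test statistic
z = q0 ** 0.5          # asymptotic significance z = sqrt(q0) -> 2.0
p0 = stats.norm.sf(z)  # corresponding p-value 1 - Phi(z), ~0.0228
```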
Now, sadly, not all searches find evidence for new physics. What we can still do in such a case is try to exclude models by rejecting the hypothesis that a signal is present. That usually means we test against $\mu=1$ or some other value $>0$. The rest of the procedure is very similar, with one small detail worth mentioning: in high energy physics it is very common to use a quantity called $CL_s$ instead of the plain p-value. It is defined by
$$CL_s = \frac{CL_{s+b}}{CL_{b}}$$where $CL_{s+b}$ is the p-value for rejecting the signal + background hypothesis (what would be the "normal" p-value) and $CL_{b}$ is the p-value for rejecting the background-only hypothesis, but now using the test statistic for $\mu=1$ (so this is different from $p_0$!). We won't go into further detail on how to calculate those p-values: pyhf
has the formulas included and does it automatically for us. The asymptotic distributions for all the different variants are described in the paper "Asymptotic formulae for likelihood-based tests of new physics" (arXiv:1007.1727).
Just a qualitative explanation of why we use $CL_s$ instead of the plain p-value: we want to avoid excluding signals in cases where we don't have sensitivity but observe a downward fluctuation of the data. In these cases $CL_{s+b}$ and $CL_b$ will be very similar and consequently lead to a large value of $CL_s$, telling us the signal is not excluded. In case our observations exactly match the background expectation, $CL_b = 0.5$ in the asymptotic limit, so on average we get twice as high "p-values" with $CL_s$.
The typical convention for particle physics is to speak of an exclusion of a signal if $CL_s < 0.05$. That's usually what is meant by "limit at 95% confidence level".
Let's start with a simple case where we only want to count the number of events in a certain search region. We assume a certain number of expected background events b
, expected signal events s
and a total uncertainty on the expected background delta_b
($\sigma_b$).
The likelihood function for this can be formulated as a primary measurement of n
events and a control ("auxiliary") measurement of m
events that constrains our background parameter within the uncertainty. So, a product of 2 Poisson distributions:
$$L(s, b) = \mathrm{Pois}(n \,|\, s + b)\,\mathrm{Pois}(m \,|\, \tau b)$$The parameter $\tau$ can be given in terms of $\sigma_b$ by asking the question "How many more events do I have to measure in the control region to get the relative uncertainty $\sigma_b / b$?". Measuring $m = \tau b$ events determines $b$ with an absolute uncertainty of $\sqrt{\tau b}/\tau = \sqrt{b/\tau}$; setting this equal to $\sigma_b$ gives
$\tau = \frac{b}{\sigma_b^2}$
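To make this concrete, here is a minimal standalone sketch of that two-Poisson likelihood (an illustrative helper, not part of pyhf), using a continuous Poisson term so that non-integer auxiliary data like $m = \tau b$ is allowed. The numbers anticipate the example below ($s=7$, $b=5$, $\sigma_b=2$):

```python
import numpy as np
from scipy.special import gammaln

tau = 5 / 2 ** 2   # tau = b / sigma_b^2 for b = 5, sigma_b = 2
n_obs = 12         # hypothetical primary measurement (here: s + b)
m_obs = tau * 5    # hypothetical auxiliary measurement (here: tau * b)

def log_pois(k, lam):
    # log of a "continuous" Poisson term, k*ln(lam) - lam - ln(Gamma(k + 1)),
    # so non-integer values of k are allowed
    return k * np.log(lam) - lam - gammaln(k + 1.0)

def log_likelihood(s, b):
    # primary region: Pois(n | s + b); auxiliary region: Pois(m | tau * b)
    return log_pois(n_obs, s + b) + log_pois(m_obs, tau * b)
```

With the data set to the nominal expectation, the log-likelihood is maximized exactly at $s=7$, $b=5$.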
Equivalently, we can replace $b$ by $\gamma b$ and $s$ by $\mu s$ to fit normalization factors (initialized to 1) and keep $s$ and $b$ fixed to our expectation.
$$L'(\mu, \gamma) = L(\mu s, \gamma b)$$pyhf
has a convenience function to create the specification for such a model: pyhf.simplemodels.hepdata_like
. It also works for arbitrarily many bins, but for now let's go with one bin and 5 expected background events, 7 expected signal events and an uncertainty of 2 on the expected background events:
import pyhf
from scipy import stats
import numpy as np
s = 7
b = 5
delta_b = 2
model = pyhf.simplemodels.hepdata_like(
    signal_data=[s], bkg_data=[b], bkg_uncerts=[delta_b]
)
The model comes with a "parameter of interest" (POI) called mu
that is our signal strength:
model.config.poi_name
In addition, we have one nuisance parameter, the constrained background normalization $\gamma$, called uncorr_bkguncrt
here:
model.config.par_order
Its initial value should be 1
gamma_initial = 1
So the expected data in our model scales with mu
. For mu=1
we get 5 + 1 * 7 = 12
model.expected_actualdata([1, gamma_initial])
for mu=2
we get 5 + 2 * 7 = 19
model.expected_actualdata([2, gamma_initial])
The auxiliary data corresponds to $\tau b$ in the formula above:
model.config.auxdata
It's given by our background uncertainty delta_b
:
b ** 2 / (delta_b ** 2)
To get the p-value for rejection of the background only hypothesis, we call pyhf.infer.hypotest
with the test value 0 of our POI $\mu$ using the q0
test statistic.
We want to know which p-value we would get if we observed an excess of events of precisely the expected signal size, so we plug in s + b
for the data:
pvalue = pyhf.infer.hypotest(
    poi_test=0,
    data=[s + b] + model.config.auxdata,
    pdf=model,
    test_stat="q0"
)
pvalue
We can convert this into a significance (number of standard deviations) using the inverse of the cumulative standard normal distribution $\Phi$
$$z = \Phi^{-1}(1 - p)$$The function scipy.stats.norm.isf
("inverse survival function") calculates $\Phi^{-1}(1 - p)$ in a numerically stable way (also for small p-values).
def pvalue_to_significance(pvalue):
    return stats.norm.isf(pvalue)
pvalue_to_significance(pvalue)
That would not count as "Evidence" yet.
Equivalently we can test for exclusion and calculate $CL_s$. For that we use 1 as the test value for $\mu$ and the qtilde
test statistic.
We want to know if we could exclude a signal if we observed no more events than our background expectation, so we set our data to b
:
CLs = pyhf.infer.hypotest(
    poi_test=1,
    data=[b] + model.config.auxdata,
    pdf=model,
    test_stat="qtilde"
)
CLs
For this simple number counting model we don't necessarily need pyhf
to calculate the expected significance or CLs: the maximum likelihood estimates can be calculated exactly. Have a look at numbercounting.py
for some functions to help you with that:
from numbercounting import z0_exp, cls_exp
z0_exp(s, b, delta_b)
cls_exp(s, b, delta_b)
These functions support numpy arrays, so you can easily test millions of different values!
N = 1_000_000
s_random = 20 * np.random.rand(N)
b_random = 20 * np.random.rand(N)
db_rel_random = np.random.rand(N)
%time z0_exp(s_random, b_random, db_rel_random * b_random)
%time cls_exp(s_random, b_random, db_rel_random * b_random)
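For reference, a widely used closed-form expression for such fast estimates is the asymptotic median ("Asimov") discovery significance for a counting experiment with background uncertainty. The following standalone function is an illustration of that formula, not necessarily what z0_exp implements internally:

```python
import numpy as np

def z0_asimov(s, b, sigma_b):
    # Asymptotic median discovery significance for a counting experiment
    # with background b known to an uncertainty sigma_b
    s, b, sigma_b = map(np.asarray, (s, b, sigma_b))
    var = sigma_b ** 2
    term1 = (s + b) * np.log((s + b) * (b + var) / (b ** 2 + (s + b) * var))
    term2 = b ** 2 / var * np.log(1.0 + var * s / (b * (b + var)))
    return np.sqrt(2.0 * (term1 - term2))

print(z0_asimov(7, 5, 2))  # significance for the example numbers above, ~1.83
```

In the limit $\sigma_b \to 0$ this reduces to the familiar $\sqrt{2\,((s+b)\ln(1+s/b) - s)}$.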
import numpy as np
import matplotlib.pyplot as plt
Often an analysis does not only quote limits on a particular assumed signal strength, but gives a limit on the signal strength itself. This is especially interesting for single-bin (cut & count) search regions, since it is quite model independent (everybody can simulate their own model and calculate the number of excess events in a certain search region to determine whether it is excluded by such a limit).
To do that, we need to invert the hypothesis test, i.e. find the value of the signal strength for which we can exclude it at $CL_s<0.05$.
So, let's do a scan!
Ok, before we do so, let me introduce another concept: in addition to the expected $CL_s$ values we usually also show the expected 1 and 2 $\sigma$ bands to get a feeling for the range in which we expect to actually observe $CL_s$ values when we run the analysis on real (and therefore fluctuating) data. pyhf.infer.hypotest
can return the [-2, -1, 0, 1, 2]
$\sigma$ bounds on $CL_s$ as a second return value if we pass return_expected_set=True
.
For example:
pyhf.infer.hypotest(1, [b] + model.config.auxdata, model, return_expected_set=True, test_stat="qtilde")
So the first return value is the observed $CL_s$ value (which is in our case the same as the expected one, since we plugged the exact background expectation into our model) and the second return value is a list of 5 $CL_s$ values for the [-2, -1, 0, 1, 2]
$\sigma$ bounds. So in our case the third value in that list is the same as our "observed" $CL_s$ value.
Now for the actual scan - let's find the lowest value of $\mu$ (and the 1 and 2 $\sigma$ bands) that is excluded with $CL_s<0.05$. Let's scan with 31 points between 0 and 3:
mu_scan = np.linspace(0, 3, 31)
results = [
    pyhf.infer.hypotest(mu, [b] + model.config.auxdata, model, return_expected_set=True, test_stat="qtilde")
    for mu in mu_scan
]
# for this example we only need the expected band (second return value)
# let's also convert this to a numpy array, such that we can slice it column-wise
results = np.array([r[1] for r in results])
This is often visualized with interpolated lines and a yellow and green band for the 1 and 2 sigma bounds (a "Brazil plot"):
def plot_scan(scan_parameters, results, exclusion_level=0.05, ax=None):
    ax = ax or plt.gca()
    ax.axhline(exclusion_level, linestyle="--", color="red")
    ax.plot(scan_parameters, results[:, 2], "--", color="black", label="Expected")
    ax.fill_between(
        scan_parameters, results[:, 1], results[:, 3], alpha=0.5, color="green", label="Expected +/- 1 σ"
    )
    ax.fill_between(
        scan_parameters, results[:, 0], results[:, 4], alpha=0.5, color="yellow", label="Expected +/- 2 σ"
    )
    return ax
ax = plot_scan(mu_scan, results)
ax.set_xlabel(r"Signal strength $\mu$")
ax.set_ylabel("$CL_s$")
ax.legend()
By looking where the red line crosses the expected line and the error bands we can conclude that the minimum signal strength we expect to exclude in case of no excess events is slightly below 1. We would expect that limit to fluctuate between $\approx$ 0.6 and 1.4 at $1\sigma$ level and 0.5 to 2 at $2\sigma$ level.
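If you prefer to compute the crossing points rather than read them off the plot, a simple linear interpolation over the scan does the job. This is a hypothetical helper (not a pyhf function), assuming the mu_scan and results arrays from above:

```python
import numpy as np

def cls_crossing(mu_values, cls_values, level=0.05):
    # np.interp expects increasing x values; the CLs curve decreases with mu,
    # so reverse both arrays before interpolating
    return np.interp(level, np.asarray(cls_values)[::-1], np.asarray(mu_values)[::-1])
```

For example, cls_crossing(mu_scan, results[:, 2]) gives the expected upper limit, and applying it to the other columns gives the band boundaries.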
pyhf
also has a convenience function to run the scan and interpolate this for us, so we don't need to read it from the plot with a ruler ;)
pyhf.infer.intervals.upperlimit([b] + model.config.auxdata, model, mu_scan)
(Note: a scan value of exactly 0 can be problematic, so one would start at a tiny value like 1e-10
, since the $\lambda$ parameter of a Poisson distribution has to be strictly greater than 0.)
import json
import mplhep as hep
Now, often we are not only interested in one particular signal, but we might have a certain class of signal models in mind and ask ourselves which parameters of that model are excluded.
What we can do is run an upper limit scan for each parameter and look for which parameters the excluded signal strength is below 1!
To get something that looks realistic, I prepared a simplified version of a fit to 9 bins using the published data from the ATLAS SUSY 1L Wh analysis. If you are interested in how the procedure of simplifying the model works, have a look at dump_signal_grid.ipynb.
Let's load it! We have the following background expectations for each bin:
with open("example_background.json") as f:
    b_9bins = json.load(f)
b_9bins
And we have a long list of signal models for a SUSY process with a chargino/neutralino pair ($\tilde{\chi}_1^{\pm}/\tilde{\chi}_2^{0}$) that each decay into the lightest neutralino $\tilde{\chi}_1^{0}$ while emitting a W boson and a Higgs boson.
The model parameters are the $\tilde{\chi}_1^{\pm}/\tilde{\chi}_2^{0}$ mass (assumed to be the same) and the $\tilde{\chi}_1^{0}$ mass. We call them x
and y
in the following.
with open("example_signals.json") as f:
    signals_9bins = json.load(f)
signals_9bins
Let's plot a few random signal models against the background expectation for all 9 bins:
hep.histplot(b_9bins, label="background", color="black")
for i in np.random.permutation(len(signals_9bins))[:3]:
    signal = signals_9bins[i]
    hep.histplot(signal["data"], label=f"i={i} x={signal['x']} y={signal['y']}", linestyle="--")
plt.legend()
By running the previous cell a few times you can see that the distributions of the different signals are quite different, which makes a multi-bin search very powerful!
A little background information: those 9 bins originally came from 3 different, statistically independent signal regions. The first 3 bins are optimized for signals with low masses, the next 3 bins for signals with medium masses and the last 3 bins for signals with high masses. Here we don't care about the x-axis though, since we just want to do a statistical analysis.
So let's start with a 1D scan of all points with a neutralino mass (y
) of 150 GeV:
# filter out all signals with y=150 and sort by x
signals_150 = sorted([i for i in signals_9bins if i["y"] == 150], key=lambda k: k["x"])
We will run an upper limit scan on the signal strength for each of these signal points (so in some sense we are actually already doing a 2D scan).
We will again use the pyhf.simplemodels.hepdata_like
function and completely neglect any uncertainties:
limits = []
for signal in signals_150:
    model_s = pyhf.simplemodels.hepdata_like(
        signal_data=list(signal["data"]), bkg_data=list(b_9bins), bkg_uncerts=[0] * 9
    )
    limit_obs, limit_exp = pyhf.infer.intervals.upperlimit(
        list(b_9bins) + model_s.config.auxdata,
        model_s,
        np.linspace(0, 5, 51)
    )
    # for now, we are just looking at the expected limit, since we didn't input any real data
    limits.append(limit_exp)
limits = np.array(limits)
ax = plot_scan([i["x"] for i in signals_150], limits, exclusion_level=1)
ax.set_xlabel(r"$\tilde{\chi}_1^{\pm}/\tilde{\chi}_2^{0}$ mass [GeV]")
ax.set_ylabel("Upper limit on Signal strength")
ax.legend()
So the point at 300 GeV is at the boundary of exclusion and then the expected exclusion reaches up to around 850 GeV for these signals.
As you have seen, this example has 2 parameters, so we can run a 2D scan over both parameters. For this we don't do an upper limit scan for each point, but just calculate $CL_s$ for $\mu=1$ and look at the contour of $CL_s=0.05$ in the grid of the 2 parameters.
x = []
y = []
cls = []
for signal in signals_9bins:
    model_s = pyhf.simplemodels.hepdata_like(
        signal_data=list(signal["data"]), bkg_data=list(b_9bins), bkg_uncerts=[0] * 9
    )
    cls_obs, cls_exp = pyhf.infer.hypotest(
        1,
        list(b_9bins) + model_s.config.auxdata,
        model_s,
        test_stat="qtilde",
        return_expected_set=True
    )
    x.append(signal["x"])
    y.append(signal["y"])
    cls.append(cls_exp)
x = np.array(x)
y = np.array(y)
cls = np.array(cls)
For better interpolation, we convert the $CL_s$ values to significances, since they vary much more linearly and are therefore easier to interpolate between:
z = pvalue_to_significance(cls)
fig, ax = plt.subplots(ncols=2, figsize=(12, 3))
ax[0].plot(sorted(cls[:, 2]))
ax[0].set_title("Sorted CLs values")
ax[1].plot(sorted(z[:, 2]))
ax[1].set_title("Sorted Significance values")
The significance level corresponding to a p-value of 0.05
is
level = pvalue_to_significance(0.05)
level
You'll have the number 1.64
in your head after some time working on new physics searches :)
So, let's draw the contour in the 2D grid of our signal mass parameters. For such grids we don't overdo it and leave out the 2 sigma band. We use the tricontour
functions of matplotlib
to do a triangulation of our points in 3D (x
, y
, z
) space and draw contours along that interpolated hill.
Reminder: the columns of expected $CL_s$ values are for [-2, -1, 0, 1, 2]
sigma.
opt = dict(levels=[level], colors=["black"])
plt.tricontour(x, y, z[:, 2], linestyles="dashed", **opt)
plt.tricontour(x, y, z[:, 1], linestyles="dotted", **opt)
plt.tricontour(x, y, z[:, 3], linestyles="dotted", **opt)
plt.tricontourf(x, y, z[:, 2], levels=100, cmap="Spectral_r")
plt.scatter(x, y, c=z[:, 2], marker='o', edgecolor="black", cmap="Spectral_r")
plt.colorbar(label="Significance of CLs")
plt.xlabel(r"$\tilde{\chi}_1^{\pm}/\tilde{\chi}_2^{0}$ mass [GeV]")
plt.ylabel(r"$\tilde{\chi}_1^{0}$ mass [GeV]")