In this tutorial, we will generate some sample distributions, cluster them and select benchmark points.
NOTE FOR CONTRIBUTORS: Always clear all output before committing (Cell > All Output > Clear)!
This tutorial uses flavio and matplotlib (if you installed ClusterKinG with the plotting option, you already have it). Install both with:
pip3 install --user flavio matplotlib
Some Jupyter notebook magic:
# Show plots in Jupyter notebooks
%matplotlib inline
# Reload modules whenever they change
# (for development purposes)
%load_ext autoreload
%autoreload 2
# Make clusterking package available even without installation
import sys
sys.path = ["../../"] + sys.path
Import clusterking with a short name. This is all you usually have to do once clusterking is installed:
import clusterking as ck
In the first step, we generate distributions for different parameter values.
For this, there are two classes, Scanner and WilsonScanner, with the latter focusing on sampling in the space of Wilson coefficients. The Wilson coefficients are implemented using the Wilson package (https://wilson-eft.github.io/), which supports a variety of bases and EFTs and matches the coefficients to user-specified scales. In this example we use the flavio basis (https://wcxf.github.io/assets/pdf/WET.flavio.pdf) at a scale of 5 GeV.
s = ck.scan.WilsonScanner(scale=5, eft="WET", basis="flavio")
Now we set up the function/distribution that we want to consider. Here we look at the branching ratio of $B\to D \,\tau\, \bar\nu_\tau$ with respect to $q^2$. The differential branching ratio is taken from the flavio package (https://flav-io.github.io/). The $q^2$ binning is chosen to have 9 bins between $3.2 \,\text{GeV}^2$ and $11.6\,\text{GeV}^2$:
import flavio
import numpy as np
def dBrdq2(w, q):
    return flavio.np_prediction("dBR/dq2(B+->Dtaunu)", w, q)
s.set_dfunction(
    dBrdq2,
    binning=np.linspace(3.2, 11.6, 10),
    normalize=True,
    xvar="q2"  # only sets the name of the variable on the x axis
)
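As a quick sanity check on the binning (a standalone numpy sketch, independent of ClusterKinG): np.linspace(3.2, 11.6, 10) returns 10 bin edges, i.e. 9 bins of equal width.

```python
import numpy as np

# 10 equidistant bin edges between 3.2 and 11.6 GeV^2 -> 9 bins
edges = np.linspace(3.2, 11.6, 10)

print(len(edges))       # 10 edges
print(len(edges) - 1)   # 9 bins
print(edges[1] - edges[0])  # common bin width, (11.6 - 3.2) / 9
```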
Now we specify the grid of Wilson coefficients that are subsequently sampled. Using the example of $B\to D \tau \bar\nu_\tau$, we sample the coefficients CVL_bctaunutau, CSL_bctaunutau and CT_bctaunutau with 10 points between $-1$ and $1$.
s.set_spoints_equidist(
    {
        "CVL_bctaunutau": (-1, 1, 10),
        "CSL_bctaunutau": (-1, 1, 10),
        "CT_bctaunutau": (-1, 1, 10),
    }
)
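With 10 values per coefficient in three dimensions, this grid contains $10^3 = 1000$ sample points in total. A minimal standalone sketch of such an equidistant grid (illustrative only, not ClusterKinG's internal implementation):

```python
import itertools
import numpy as np

# Equidistant grid: 10 values per coefficient,
# Cartesian product over the 3 coefficients
values = np.linspace(-1, 1, 10)
grid = list(itertools.product(values, values, values))

print(len(grid))  # 1000 sample points
```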
Imaginary parts of the Wilson coefficients can be sampled as well; they are specified with the prefix im_, e.g. im_CVL_bctaunutau.
To compute the kinematic distributions from the Wilson coefficients sampled above, we need a Data instance:
d = ck.Data()
Computing the kinematic distributions is done using the run() method. This might take some time.
r = s.run(d)
r.write()
Scanner.run uses all cores on your machine. You can specify a different number using the no_workers option, or disable multiprocessing completely by setting no_workers=1.
The results are saved in our data object d
.
At the heart of it is a dataframe, d.df
. Let's have a look:
d.df.head()
We can also already take a quick look at the resulting distributions by randomly selecting a few of them.
d.plot_dist(nlines=10);
More plots will be introduced in subsequent tutorials, but you can also try running d.plot_dist_box()
(box plots), d.plot_dist_minmax()
(plot spread of bin contents), ...
Different clustering algorithms are available in the cluster subpackage of ClusterKinG. They are implemented as subclasses of the Cluster class; by subclassing Cluster (or any of the derived classes) yourself, you can easily implement your own clustering algorithm.
In this example, we will use a hierarchical clustering algorithm to group similar distributions together.
The Cluster class, or here its subclass HierarchyCluster, is first initialized (the data object is passed later to its run method):
c = ck.cluster.HierarchyCluster()
First, we have to specify the metric we want to use to measure the distance between different distributions. If no argument is specified, a Euclidean metric is used (which is equivalent to a $\chi^2$ metric if we have flat, uncorrelated relative errors on each bin):
c.set_metric("euclidean")
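To make the metric concrete: for two binned distributions, the Euclidean distance is the square root of the summed squared bin differences. A standalone numpy sketch (the toy distributions below are made up):

```python
import numpy as np

# Two toy normalized 9-bin distributions (made-up numbers)
d1 = np.array([0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.03, 0.02])
d2 = np.array([0.04, 0.11, 0.14, 0.21, 0.19, 0.16, 0.09, 0.04, 0.02])

# Euclidean distance between the binned distributions
dist = np.sqrt(np.sum((d1 - d2) ** 2))
print(dist)
```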
Next, we set the maximal distance max_d between the individual clusters:
c.set_max_d(0.2)
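max_d acts as a distance cutoff when the hierarchy is flattened into clusters. A rough standalone illustration with scipy, on which hierarchical clustering implementations commonly build (the toy data and the "complete" linkage choice are our assumptions for this sketch, not necessarily ClusterKinG's defaults):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four toy "distributions": two tight groups far apart (made-up data)
points = np.array([[0.0, 0.0], [0.05, 0.0], [1.0, 1.0], [1.05, 1.0]])

# Build the hierarchy, then cut it at distance 0.2
Z = linkage(points, method="complete", metric="euclidean")
labels = fcluster(Z, t=0.2, criterion="distance")

print(labels)  # the two tight groups end up in two separate clusters
```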
r = c.run(d)
Now we add the information about the clusters to the dataframe created above:
r.write()
So when we now look at the dataframe again, we see a new column cluster
with the cluster number:
d.df.head()
Of course, there are also plenty of methods to plot the distributions that belong to the clusters, e.g.
d.plot_dist_box();
We can also plot clusters vs parameters, e.g.
d.plot_clusters_scatter(["CSL_bctaunutau", "CVL_bctaunutau", "CT_bctaunutau"]);
Again, more on plots in the following tutorials.
In a similar way we can determine the benchmark points representing the individual clusters. Initializing a benchmark point object
b = ck.Benchmark()
and again choosing a metric (the Euclidean metric is the default)
b.set_metric()
the benchmark points can be computed
r = b.run(d)
and written in the dataframe:
r.write()
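A benchmark point is the cluster member that best represents the whole cluster. One natural choice, which we assume here purely for illustration, is the member with the smallest summed distance to all other members (the medoid), using the Euclidean metric:

```python
import numpy as np

# A toy cluster of binned distributions (made-up data)
cluster = np.array([
    [0.20, 0.50, 0.30],
    [0.25, 0.45, 0.30],
    [0.10, 0.60, 0.30],
])

# Pairwise Euclidean distances between all members
dists = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=-1)

# The medoid minimizes the summed distance to all other members
bpoint_index = int(np.argmin(dists.sum(axis=1)))
print(bpoint_index)  # index of the most representative member
```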
Let's take a look and notice the new column bpoint
at the end of the data frame:
d.df.head()
Now most plots will also show the distributions that correspond to the benchmark points:
d.plot_dist_box();
Now it's time to write out the results for later use.
d.write("output/tutorial_basics.sql", overwrite="overwrite")
This will not only write out the data itself, but also a lot of associated metadata that makes it easy to reconstruct later what the data actually represents. This metadata was accumulated in the attribute d.md over all of the steps above:
d.md