In this tutorial, we will generate some sample distributions, cluster them and select benchmark points.
NOTE FOR CONTRIBUTORS: Always clear all output before committing (Cell > All Output > Clear)!
This tutorial uses flavio and matplotlib (if you installed ClusterKinG with the plotting option, you already have it). Install both with:
pip3 install --user flavio matplotlib
Some Jupyter notebook magic:
# Show plots in Jupyter notebooks
%matplotlib inline
# Reload modules whenever they change
# (for development purposes)
%load_ext autoreload
%autoreload 2
# Make clusterking package available even without installation
import sys
sys.path = ["../../"] + sys.path
Import clusterking with a short name. This is all you usually have to do once clusterking is installed:
import clusterking as ck
In the first step, we generate distributions for different parameter values.
For this, there are two classes, Scanner and WilsonScanner, with the latter focusing on sampling in the space of Wilson coefficients. The Wilson coefficients are implemented using the Wilson package (https://wilson-eft.github.io/), which supports a variety of bases and EFTs and matches the coefficients to user-specified scales. In this example we use the flavio basis (https://wcxf.github.io/assets/pdf/WET.flavio.pdf) at a scale of 5 GeV.
s = ck.scan.WilsonScanner(scale=5, eft="WET", basis="flavio")
Now we set up the function/distribution that we want to consider. Here we look at the branching ratio of $B\to D \,\tau\, \bar\nu_\tau$ with respect to $q^2$. The differential branching ratio is taken from the flavio package (https://flav-io.github.io/). The $q^2$ binning is chosen to have 9 bins between $3.2 \,\text{GeV}^2$ and $11.6\,\text{GeV}^2$:
import flavio
import numpy as np
def dBrdq2(w, q):
    return flavio.np_prediction("dBR/dq2(B+->Dtaunu)", w, q)
s.set_dfunction(
    dBrdq2,
    binning=np.linspace(3.2, 11.6, 10),
    normalize=True,
    xvar="q2"  # only sets the name of the variable on the x axis
)
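As a quick sanity check on the binning (a standalone numpy sketch, independent of ClusterKinG): np.linspace(3.2, 11.6, 10) returns 10 bin edges, i.e. 9 bins of equal width.

```python
import numpy as np

# 10 equidistant bin edges between 3.2 and 11.6 GeV^2 -> 9 bins
edges = np.linspace(3.2, 11.6, 10)

print(len(edges))       # 10 edges
print(len(edges) - 1)   # 9 bins
print(edges[1] - edges[0])  # common bin width, (11.6 - 3.2) / 9
```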
Now we specify the grid of Wilson coefficients that are subsequently sampled. Using the example of $B\to D \tau \bar\nu_\tau$, we sample the coefficients CVL_bctaunutau, CSL_bctaunutau and CT_bctaunutau with 10 points between $-1$ and $1$.
s.set_spoints_equidist(
    {
        "CVL_bctaunutau": (-1, 1, 10),
        "CSL_bctaunutau": (-1, 1, 10),
        "CT_bctaunutau": (-1, 1, 10),
    }
)
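With 10 values per coefficient in three dimensions, this grid contains $10^3 = 1000$ sample points in total. A minimal standalone sketch of such an equidistant grid (illustrative only, not ClusterKinG's internal implementation):

```python
import itertools
import numpy as np

# Equidistant grid: 10 values per coefficient,
# Cartesian product over the 3 coefficients
values = np.linspace(-1, 1, 10)
grid = list(itertools.product(values, values, values))

print(len(grid))  # 1000 sample points
```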
Imaginary parts of the Wilson coefficients can be sampled as well; they are specified with the prefix im_, e.g. im_CVL_bctaunutau.
To compute the kinematic distributions from the Wilson coefficients sampled above, we need a Data instance:
d = ck.Data()
Computing the kinematic distributions is done using the run() method. This might take some time.
r = s.run(d)
r.write()
Scanner.run uses all cores on your machine. You can specify a different number using the no_workers option, or disable multiprocessing completely by setting no_workers=1.
The results are saved in our data object d
.
At the heart of it is a dataframe, d.df
. Let's have a look:
d.df.head()
We can also already take a quick look at the resulting distributions by randomly selecting a few of them.
d.plot_dist(nlines=10);
More plots will be introduced in subsequent tutorials, but you can also try running d.plot_dist_box()
(box plots), d.plot_dist_minmax()
(plot spread of bin contents), ...
Different clustering algorithms are available in the cluster subpackage of ClusterKinG. They are implemented as subclasses of the Cluster class; by subclassing Cluster (or any of the derived classes) yourself, you can easily implement your own clustering algorithm.
In this example, we will use a hierarchical clustering algorithm to group similar distributions together.
The Cluster class, or here its subclass HierarchyCluster, is first initialized (the data object is passed later to its run method):
c = ck.cluster.HierarchyCluster()
First, we have to specify the metric we want to use to measure the distance between different distributions. If no argument is specified, a Euclidean metric is used (which is equivalent to a $\chi^2$ metric if we have flat, uncorrelated relative errors on each bin):
c.set_metric("euclidean")
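To make the metric concrete: for two binned distributions, the Euclidean distance is the square root of the summed squared bin differences. A standalone numpy sketch (the toy distributions below are made up):

```python
import numpy as np

# Two toy normalized 9-bin distributions (made-up numbers)
d1 = np.array([0.05, 0.10, 0.15, 0.20, 0.20, 0.15, 0.10, 0.03, 0.02])
d2 = np.array([0.04, 0.11, 0.14, 0.21, 0.19, 0.16, 0.09, 0.04, 0.02])

# Euclidean distance between the binned distributions
dist = np.sqrt(np.sum((d1 - d2) ** 2))
print(dist)
```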
Next, we set the maximal distance max_d between the individual clusters:
c.set_max_d(0.2)
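max_d acts as a distance cutoff when the hierarchy is flattened into clusters. A rough standalone illustration with scipy, on which hierarchical clustering implementations commonly build (the toy data and the "complete" linkage choice are our assumptions for this sketch, not necessarily ClusterKinG's defaults):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four toy "distributions": two tight groups far apart (made-up data)
points = np.array([[0.0, 0.0], [0.05, 0.0], [1.0, 1.0], [1.05, 1.0]])

# Build the hierarchy, then cut it at distance 0.2
Z = linkage(points, method="complete", metric="euclidean")
labels = fcluster(Z, t=0.2, criterion="distance")

print(labels)  # the two tight groups end up in two separate clusters
```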
r = c.run(d)
Now we add the information about the clusters to the dataframe created above:
r.write()
So when we now look at the dataframe again, we see a new column cluster
with the cluster number:
d.df.head()
Of course, there are also plenty of methods to plot the distributions that belong to the clusters, e.g.
d.plot_dist_box();
We can also plot clusters vs parameters, e.g.
d.plot_clusters_scatter(["CSL_bctaunutau", "CVL_bctaunutau", "CT_bctaunutau"]);
Again, more on plots in the following tutorials.
In a similar way we can determine the benchmark points representing the individual clusters. Initializing a benchmark point object
b = ck.Benchmark()
and again choosing a metric (the Euclidean metric is the default)
b.set_metric()
the benchmark points can be computed
r = b.run(d)
and written in the dataframe:
r.write()
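A benchmark point is the cluster member that best represents the whole cluster. One natural choice, which we assume here purely for illustration, is the member with the smallest summed distance to all other members (the medoid), using the Euclidean metric:

```python
import numpy as np

# A toy cluster of binned distributions (made-up data)
cluster = np.array([
    [0.20, 0.50, 0.30],
    [0.25, 0.45, 0.30],
    [0.10, 0.60, 0.30],
])

# Pairwise Euclidean distances between all members
dists = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=-1)

# The medoid minimizes the summed distance to all other members
bpoint_index = int(np.argmin(dists.sum(axis=1)))
print(bpoint_index)  # index of the most representative member
```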
Let's take a look and notice the new column bpoint
at the end of the data frame:
d.df.head()
Now most plots will also show the distributions that correspond to the benchmark points:
d.plot_dist_box();
Now it's time to write out the results for later use.
d.write("output/tutorial_basics.sql", overwrite="overwrite")
This will not only write out the data itself, but also a lot of associated metadata that makes it easy to reconstruct later what the data actually represents. This metadata was accumulated in the attribute d.md over all of the steps above:
d.md