rf403_weightedevts¶

'DATA AND CATEGORIES' RooFit tutorial macro #403

Using weights in unbinned datasets

Author: Clemens Lange, Wouter Verkerke (C version)
This notebook tutorial was automatically generated with ROOTBOOK-izer from the macro found in the ROOT repository on Wednesday, April 17, 2024 at 11:18 AM.

In [1]:

from __future__ import print_function
import ROOT

Create observable and unweighted dataset¶

Declare observable

In [2]:

x = ROOT.RooRealVar("x", "x", -10, 10)
x.setBins(40)

Construction a uniform pdf

In [3]:

p0 = ROOT.RooPolynomial("px", "px", x)

Sample 1000 events from pdf

In [4]:

data = p0.generate({x}, 1000)

Calculate weight and make dataset weighted¶

Construct formula to calculate (fake) weight for events

In [5]:

wFunc = ROOT.RooFormulaVar("w", "event weight", "(x*x+10)", [x])

Add column with variable w to previously generated dataset

In [6]:

w = data.addColumn(wFunc)

Dataset d is now a dataset with two observable (x,w) with 1000 entries

In [7]:

data.Print()

RooDataSet::pxData[x,w] = 1000 entries

Instruct dataset wdata in interpret w as event weight rather than as observable

In [8]:

wdata = ROOT.RooDataSet(data.GetName(), data.GetTitle(), data, data.get(), "", w.GetName())

Dataset d is now a dataset with one observable (x) with 1000 entries and a sum of weights of ~430K

In [9]:

wdata.Print()

RooDataSet::pxData[x,weight:w] = 1000 entries (43238.9 weighted)

Unbinned ML fit to weighted data¶

Construction quadratic polynomial pdf for fitting

In [10]:

a0 = ROOT.RooRealVar("a0", "a0", 1)
a1 = ROOT.RooRealVar("a1", "a1", 0, -1, 1)
a2 = ROOT.RooRealVar("a2", "a2", 1, 0, 10)
p2 = ROOT.RooPolynomial("p2", "p2", x, [a0, a1, a2], 0)

Fit quadratic polynomial to weighted data

NOTE: A plain Maximum likelihood fit to weighted data does in general NOT result in correct error estimates, individual event weights represent Poisson statistics themselves.

Fit with 'wrong' errors

In [11]:

r_ml_wgt = p2.fitTo(wdata, Save=True, PrintLevel=-1)

[#1] INFO:Fitting -- RooAbsPdf::fitTo(p2_over_p2_Int[x]) fixing normalization set for coefficient determination to observables in data
[#1] INFO:Fitting -- using CPU computation library compiled with -mavx2
[#0] WARNING:InputArguments -- RooAbsPdf::fitTo(p2): WARNING: a likelihood fit is requested of what appears to be weighted data.
       While the estimated values of the parameters will always be calculated taking the weights into account,
       there are multiple ways to estimate the errors of the parameters. You are advised to make an
       explicit choice for the error calculation:
           - Either provide SumW2Error(true), to calculate a sum-of-weights-corrected HESSE error matrix
             (error will be proportional to the number of events in MC).
           - Or provide SumW2Error(false), to return errors from original HESSE error matrix
             (which will be proportional to the sum of the weights, i.e., a dataset with <sum of weights> events).
           - Or provide AsymptoticError(true), to use the asymptotically correct expression
             (for details see https://arxiv.org/abs/1911.01303)."
[#1] INFO:Fitting -- RooAddition::defaultErrorLevel(nll_p2_over_p2_Int[x]_pxData) Summation contains a RooNLLVar, using its error level
[#1] INFO:Minimization -- RooAbsMinimizerFcn::setOptimizeConst: activating const optimization
[#1] INFO:Minimization -- RooAbsMinimizerFcn::setOptimizeConst: deactivating const optimization

A first order correction to estimated parameter errors in an (unbinned) ML fit can be obtained by calculating the covariance matrix as

V' = V C-1 V

where V is the covariance matrix calculated from a fit to -logL = - sum [ w_i log f(x_i) ] and C is the covariance matrix calculated from -logL' = -sum [ w_i^2 log f(x_i) ] (i.e. the weights are applied squared)

A fit in self mode can be performed as follows:

In [12]:

r_ml_wgt_corr = p2.fitTo(wdata, Save=True, SumW2Error=True, PrintLevel=-1)

[#1] INFO:Fitting -- RooAbsPdf::fitTo(p2_over_p2_Int[x]) fixing normalization set for coefficient determination to observables in data
[#1] INFO:Fitting -- RooAddition::defaultErrorLevel(nll_p2_over_p2_Int[x]_pxData) Summation contains a RooNLLVar, using its error level
[#1] INFO:Minimization -- RooAbsMinimizerFcn::setOptimizeConst: activating const optimization
[#1] INFO:Fitting -- RooAbsPdf::fitTo(p2) Calculating sum-of-weights-squared correction matrix for covariance matrix
[#1] INFO:Minimization -- RooAbsMinimizerFcn::setOptimizeConst: deactivating const optimization

Plot weighted data and fit result¶

Construct plot frame

In [13]:

frame = x.frame(Title="Unbinned ML fit, chi^2 fit to weighted data")

Plot data using sum-of-weights-squared error rather than Poisson errors

In [14]:

wdata.plotOn(frame, DataError="SumW2")

Out[14]:

<cppyy.gbl.RooPlot object at 0x9d97ed0>

Overlay result of 2nd order polynomial fit to weighted data

In [15]:

p2.plotOn(frame)

Out[15]:

<cppyy.gbl.RooPlot object at 0x9d97ed0>

ML fit of pdf to equivalent unweighted dataset¶

Construct a pdf with the same shape as p0 after weighting

In [16]:

genPdf = ROOT.RooGenericPdf("genPdf", "x*x+10", [x])

Sample a dataset with the same number of events as data

In [17]:

data2 = genPdf.generate({x}, 1000)

[#1] INFO:NumericIntegration -- RooRealIntegral::init(genPdf_Int[x]) using numeric integrator RooIntegrator1D to calculate Int(x)
[#1] INFO:NumericIntegration -- RooRealIntegral::init(genPdf_Int[x]) using numeric integrator RooIntegrator1D to calculate Int(x)

Sample a dataset with the same number of weights as data

In [18]:

data3 = genPdf.generate({x}, 43000)

[#1] INFO:NumericIntegration -- RooRealIntegral::init(genPdf_Int[x]) using numeric integrator RooIntegrator1D to calculate Int(x)
[#1] INFO:NumericIntegration -- RooRealIntegral::init(genPdf_Int[x]) using numeric integrator RooIntegrator1D to calculate Int(x)

Fit the 2nd order polynomial to both unweighted datasets and save the results for comparison

In [19]:

r_ml_unw10 = p2.fitTo(data2, Save=True, PrintLevel=-1)
r_ml_unw43 = p2.fitTo(data3, Save=True, PrintLevel=-1)

[#1] INFO:Fitting -- RooAbsPdf::fitTo(p2_over_p2_Int[x]) fixing normalization set for coefficient determination to observables in data
[#1] INFO:Fitting -- RooAddition::defaultErrorLevel(nll_p2_over_p2_Int[x]_genPdfData) Summation contains a RooNLLVar, using its error level
[#1] INFO:Minimization -- RooAbsMinimizerFcn::setOptimizeConst: activating const optimization
[#1] INFO:Minimization -- RooAbsMinimizerFcn::setOptimizeConst: deactivating const optimization
[#1] INFO:Fitting -- RooAbsPdf::fitTo(p2_over_p2_Int[x]) fixing normalization set for coefficient determination to observables in data
[#1] INFO:Fitting -- RooAddition::defaultErrorLevel(nll_p2_over_p2_Int[x]_genPdfData) Summation contains a RooNLLVar, using its error level
[#1] INFO:Minimization -- RooAbsMinimizerFcn::setOptimizeConst: activating const optimization
[#1] INFO:Minimization -- RooAbsMinimizerFcn::setOptimizeConst: deactivating const optimization

Chis2 fit of pdf to binned weighted dataset¶

Construct binned clone of unbinned weighted dataset

In [20]:

binnedData = wdata.binnedClone()
binnedData.Print("v")

DataStore pxData_binned (Generated From px_binned)
  Contains 40 entries
  Observables: 
    1)  x = 9.75  L(-10 - 10) B(40)  "x"
Binned Dataset pxData_binned (Generated From px_binned)
  Contains 40 bins with a total weight of 43238.9
  Observables:     1)  x = 9.75  L(-10 - 10) B(40)  "x"

Perform chi2 fit to binned weighted dataset using sum-of-weights errors

NB: Within the usual approximations of a chi2 fit, chi2 fit to weighted data using sum-of-weights-squared errors does give correct error estimates

In [21]:

chi2 = p2.createChi2(binnedData, ROOT.RooFit.DataError("SumW2"))
m = ROOT.RooMinimizer(chi2)
m.migrad()
m.hesse()

Out[21]:

Minuit2Minimizer: Minimize with max-calls 1000 convergence for edm < 1 strategy 1
Minuit2Minimizer : Valid minimum - status = 0
FVAL  = 31.3747451817532266
Edm   = 4.13511525432537654e-08
Nfcn  = 29
a1	  = -0.009989	 +/-  0.0262975	(limited)
a2	  = 0.106373	 +/-  0.0101849	(limited)

Info in <Minuit2>: MnSeedGenerator Computing seed using NumericalGradient calculator
Info in <Minuit2>: MnSeedGenerator Initial state: FCN =       32.33995323 Edm =      0.8968537843 NCalls =      7
Info in <Minuit2>: MnSeedGenerator Initial state  
  Minimum value : 32.33995323
  Edm           : 0.8968537843
  Internal parameters:	[   -0.00121057482     -1.372915994]	
  Internal gradient  :	[      27.17902731     -203.7916428]	
  Internal covariance matrix:
[[  0.00060495515              0]
 [              0  7.5618996e-05]]]
Info in <Minuit2>: VariableMetricBuilder Start iterating until Edm is < 0.002 with call limit = 1000
Info in <Minuit2>: VariableMetricBuilder    0 - FCN =       32.33995323 Edm =      0.8968537843 NCalls =      7
Info in <Minuit2>: VariableMetricBuilder    1 - FCN =       31.37772394 Edm =    0.002313012125 NCalls =     13
Info in <Minuit2>: VariableMetricBuilder    2 - FCN =       31.37474518 Edm =   3.566880129e-08 NCalls =     19
Info in <Minuit2>: VariableMetricBuilder After Hessian
Info in <Minuit2>: VariableMetricBuilder    3 - FCN =       31.37474518 Edm =   4.135115254e-08 NCalls =     29
Info in <Minuit2>: Minuit2Minimizer::Hesse Using max-calls 1000
Info in <Minuit2>: Minuit2Minimizer::Hesse Hesse is valid - matrix is accurate

Plot chi^2 fit result on frame as well

In [22]:

r_chi2_wgt = m.save()
p2.plotOn(frame, LineStyle="--", LineColor="r")

Out[22]:

<cppyy.gbl.RooPlot object at 0x9d97ed0>

Compare fit results of chi2, L fits to (un)weighted data¶

Note that ML fit on 1Kevt of weighted data is closer to result of ML fit on 43Kevt of unweighted data than to 1Kevt of unweighted data, the reference chi^2 fit with SumW2 error gives a result closer to that of an unbinned ML fit to 1Kevt of unweighted data.

In [23]:

print("==> ML Fit results on 1K unweighted events")
r_ml_unw10.Print()
print("==> ML Fit results on 43K unweighted events")
r_ml_unw43.Print()
print("==> ML Fit results on 1K weighted events with a summed weight of 43K")
r_ml_wgt.Print()
print("==> Corrected ML Fit results on 1K weighted events with a summed weight of 43K")
r_ml_wgt_corr.Print()
print("==> Chi2 Fit results on 1K weighted events with a summed weight of 43K")
r_chi2_wgt.Print()

c = ROOT.TCanvas("rf403_weightedevts", "rf403_weightedevts", 600, 600)
ROOT.gPad.SetLeftMargin(0.15)
frame.GetYaxis().SetTitleOffset(1.8)
frame.Draw()

c.SaveAs("rf403_weightedevts.png")

==> ML Fit results on 1K unweighted events
==> ML Fit results on 43K unweighted events
==> ML Fit results on 1K weighted events with a summed weight of 43K
==> Corrected ML Fit results on 1K weighted events with a summed weight of 43K
==> Chi2 Fit results on 1K weighted events with a summed weight of 43K

  RooFitResult: minimized FCN value: 2766.49, estimated distance to minimum: 0.000399952
                covariance matrix quality: Full, accurate covariance matrix
                Status : MINIMIZE=0 HESSE=0 

    Floating Parameter    FinalValue +/-  Error   
  --------------------  --------------------------
                    a1    8.9483e-03 +/-  2.70e-02
                    a2    1.0177e-01 +/-  1.69e-02


  RooFitResult: minimized FCN value: 118892, estimated distance to minimum: 0.000206627
                covariance matrix quality: Full, accurate covariance matrix
                Status : MINIMIZE=0 HESSE=0 

    Floating Parameter    FinalValue +/-  Error   
  --------------------  --------------------------
                    a1   -1.2106e-03 +/-  4.02e-03
                    a2    9.7573e-02 +/-  2.37e-03


  RooFitResult: minimized FCN value: 119682, estimated distance to minimum: 1.25398e-05
                covariance matrix quality: Full, accurate covariance matrix
                Status : MINIMIZE=0 HESSE=0 

    Floating Parameter    FinalValue +/-  Error   
  --------------------  --------------------------
                    a1   -4.8713e-03 +/-  4.03e-03
                    a2    9.8645e-02 +/-  2.41e-03


  RooFitResult: minimized FCN value: 119682, estimated distance to minimum: 79498.5
                covariance matrix quality: Full, accurate covariance matrix
                Status : MINIMIZE=0 HESSE=0 HESSE=0 

    Floating Parameter    FinalValue +/-  Error   
  --------------------  --------------------------
                    a1   -4.8565e-03 +/-  3.00e-02
                    a2    9.8652e-02 +/-  2.99e-02


  RooFitResult: minimized FCN value: 31.3747, estimated distance to minimum: 4.135e-08
                covariance matrix quality: Full, accurate covariance matrix
                Status : MIGRAD=0 HESSE=0 

    Floating Parameter    FinalValue +/-  Error   
  --------------------  --------------------------
                    a1   -9.9890e-03 +/-  2.63e-02
                    a2    1.0637e-01 +/-  1.02e-02

Info in <TCanvas::Print>: png file rf403_weightedevts.png has been created

Draw all canvases

In [24]:

from ROOT import gROOT 
gROOT.GetListOfCanvases().Draw()