This is an exercise showing a simple analysis exploring the SM Higgs decays to tau tau in the tau mu final state: https://opendata.web.cern.ch/record/12350. The analysis aims to explore the kinematics of H --> tau tau and compare them with that of a prominent SM background process with the same final state.
The analysis consists of two parts:
!wget --progress=dot:giga http://opendata.cern.ch/record/12351/files/GluGluToHToTauTau.root
//Get the ROOT file containing the H -> tautau signal events.
//CMS OD record: https://opendata.web.cern.ch/record/12351
!wget --progress=dot:giga https://www.dropbox.com/s/mcvb4mis1qhcjvy/DYJetsToLLsubset.root
// Get the ROOT file containing the Z -> tau tau background events:
//CMS OD record: https://opendata.web.cern.ch/record/12353
(More information on cern.ch/adl)
LHC data analyses are usually performed using complex analysis frameworks written in general purpose languages like C++ and python. But this method has a steep learning curve, as even the simplest tasks could be coded in a complicated way, and it is not straightforward to understand the code, make changes or additions. However there is another emerging alternative which allows to decouple physics content from the technical code and write analyses with a simple, self-describing syntax. Analysis Description Language (ADL) is a HEP-specific analysis language developed with this purpose.
A HEP analysis includes 3 main parts:
ADL consists of blocks separating object, variable and event selection definitions for a clear separation of analysis components. Blocks have a keyword-expression structure. Keywords specify analysis concepts and operations. Syntax includes mathematical and logical operations, comparison and optimization operators, reducers, 4-vector algebra and HEP-specific functions (dφ, dR, …).
ADL is designed with the goal to be self-describing, so especially for simple cases like in this example, one does not need to read syntax rules to understand an ADL description. However if you are interested, the set of syntax rules can be found here.
Once an analysis is written it needs to be run on events. This is achieved by CutLang , the runtime interpreter who reads and understands the ADL syntax and runs it on events. CutLang is also a framework which aturomatically handles many tedious tasks as reading input events, writing output histograms, etc. CutLang can be run on various environments such as linux, mac, conda, docker, jupyter, etc.
In case you are interested to learn more on CutLang, please see the CutLang github
Writing the analysis with ADL: In the following cell, part of the analysis is written using the ADL syntax. However there are some parts missing. Please follow the instructions in the comments to complete the missing parts. If you feel adventurous, you could modify the object or event selections, add new variables or new histograms.
Running the analysis with CutLang: Executing the cell will run the analysis on both the signal (SMHiggsToZZTo4L.root) and background (ZZTo2e2mu.root) events. The run parameters are given in the first line of the cell:
NOTE: When running jupyter/binder via direct link, if your run does not complete due to memory issues, please reduce the number of events via the "events" parameter.
Analysis output: Running the analysis will produce two outputs:
Try to understand the object selections, Higgs reconstruction algorithm and event selections in the analysis. Check the resulting cutflows in both baseline and twojets selections.
%%cutlang file=GluGluToHToTauTau.root;DYJetsToLLsubset.root filetype=CMSNANO adlfile=Htautau events=200000;250000 verbose=20000
# CMS Open Data outreach analysis of H -> tau tau using CMS 2012 data and MC
# Link to open data record: http://opendata.web.cern.ch/record/12350
# Can be used with the reduced NanoAOD ntuples listed in the open data record, e.g. http://opendata.cern.ch/record/12351
# Runs with CutLang: https://github.com/unelg/CutLang
info analysis
title "CMS Open Data Outreach analysis: Analysis of Higgs boson decays to two tau leptons using data and simulation of events at the CMS detector from 2012"
experiment CMS
id "Open Data record: 12350"
publication "J. High Energy Phys. 05 (2014) 104"
sqrtS 7 # TeV, 8.TeV"
lumi 4.9 # fb-1
arXiv "1401.5041"
hepdata "https://www.hepdata.net/record/ins1749379"
doi "10.1007/JHEP05(2014)104"
# OBJECT SELECTIONS
object goodMuons
take Muon
select abs(Eta(Muon)) < 2.1
select pt(Muon) > 17
select tightId(Muon) == 1.0
object goodTaus
take Tau
select q(Tau) != 0
select abs(eta(Tau)) < 2.3
select pt(Tau) > 20
select idDecayMode(Tau) == 1
select idIsoTight(Tau) == 1
select idAntiEleTight(Tau) == 1
select idAntiMuTight(Tau) == 1
object goodJets
take Jet
select puId(Jet) > 0
select abs(eta(Jet)) < 4.7
select pt(Jet) > 10
# Find the best mu-tau pair that reconstructs the best Higgs candidate.
# We use negative indices -1 and -2 for muon and tau, because we don't initially know which mu and tau to select from the collections/
# Loop over all muon-tau pairs:
object HMuTaus : comb( goodMuons[-1] goodTaus[-2] ) alias aHMuTaus
# Select all mu-tau pairs with dR(mu,tau) > 0.5:
select dR( goodMuons[-1], goodTaus[-2] ) > 0.5
# Select the mu-tau pair which has the mu with maximum pT:
# pT(goodMuons) indicates a loop over goodMuons pTs.
select max(pT(goodMuons)) - pT(goodMuons[-1]) ~= 0
# Among the surviving pairs, select the mu-tau pair which has the tau with minimum relative isolation:
select min(relIsoall(goodTaus)) - relIsoall(goodTaus[-2]) ~= 0
# EVENT VARIABLES
# We can write the definitions of event variables here and use the variable names directly in the selection regions below.
define bestMu = daughters(HMuTaus[0], goodMuons[-1])
define bestTau = daughters(HMuTaus[0], goodTaus[-2])
define mutau = HMuTaus[0]
define dRmutau = dR(bestMu, bestTau)
define MTtau = Sqrt( 2*pT(bestTau) * MET*(1-cos(phi(METLV[0]) - phi(bestTau) )))
define MTmu = Sqrt( 2*pT(bestMu) * MET*(1-cos(phi(METLV[0]) - phi(bestMu) )))
define jj = goodJets[0] + goodJets[1]
define mjj = m(jj)
define pTjj = pT(jj)
define detajj = dEta(goodJets[0], goodJets[1])
# EVENT SELECTIONS
region baseline
select ALL
select HLT_IsoMu17_eta2p1_LooseIsoPFTau20 == 1
select size(goodMuons) > 0
select size(goodTaus) > 0
select size(HMuTaus) > 0
# Histograms in the baseline region:
histo hmutaum , "mu+tau mass (GeV)", 30, 20, 140, m(mutau)
histo hmutaupt , "mu+tau pT (GeV)", 30, 0, 60, pT(mutau)
histo hmupt , "muon pT (GeV)", 30, 17, 70, pT(bestMu)
histo htaupt , "tau pT (GeV)", 30, 17, 70, pT(bestTau)
histo hmueta , "muon eta", 30, -2.1, 2.1, eta(bestMu)
histo htaueta , "tau eta", 30, -2.3, 2.3, eta(bestTau)
histo hmuphi , "muon phi", 30, -3.14, 3.14, eta(bestMu)
histo htauphi , "tau phi", 30, -3.14, 3.14, eta(bestTau)
histo hmuiso , "muon iso", 30, -3.14, 3.14, pfRelIso03all(bestMu)
histo htauiso , "tau iso", 30, -3.14, 3.14, relIsoall(bestTau)
histo hmuq , "muon charge", 2, -2, 2, q(bestMu)
histo htauq , "tau charge", 2, -2, 2, q(bestTau)
histo hmet, "MET (GeV)", 30, 0, 60, MET
histo hmetphi, "MET phi", 30, -3.14, 3.14, phi(METLV[0])
histo hmumass , "muon mass (GeV)", 30, 0.0, 0.2, m(bestMu)
histo htaumass , "tau mass (GeV)", 30, 0.0, 2.0, m(bestTau)
histo hmuMT , "muon+MET transverse mass (GeV)", 30, 0.0, 100, MTmu
histo htauMT , "tau+MET transverse mass (GeV)", 30, 0.0, 100, MTtau
histo htaudecaymode , "tau decay mode", 11, 0, 11, decayMode(bestTau)
region twojets
baseline
select size(goodJets) >= 2
# histograms in the twojets region
histo hj1pt , "jet1 pT (GeV)", 30, 30, 70, pT(goodJets[0])
histo hj2pt , "jet1 pT (GeV)", 30, 30, 70, pT(goodJets[1])
histo hj1eta , "jet1 eta", 30, -4.7, 4.7, eta(goodJets[0])
histo hj2eta , "jet2 eta", 30, -4.7, 4.7, eta(goodJets[1])
histo hj1phi , "jet1 phi", 30, -3.14, 3.14, phi(goodJets[0])
histo hj2phi , "jet2 phi", 30, -3.14, 3.14, phi(goodJets[1])
histo hj1m , "jet1 mass (GeV)", 30, 30, 70, m(goodJets[0])
histo hj2m , "jet2 mass (GeV)", 30, 30, 70, m(goodJets[1])
histo hj1btag , "jet1 btag", 30, 0, 1, bTag(goodJets[0])
histo hj2btag , "jet2 btag", 30, 0, 1, bTag(goodJets[1])
histo hnjets , "number of jets", 5, 0, 5, size(goodJets)
histo hmjj , "jj inv. mass (GeV)", 30, 0, 400, mjj
histo hptjj , "jj pT (GeV)", 30, 0, 200, pTjj
histo hjdeta , "deta j1, j2", 30, -9.4, 9.4, detajj
Now let's make some plots using the ROOT package in python (which is widely used at CERN). Instructions are shown within comments in the following cells.
What to do:
%%python
# Let's start with importing the needed modules
from ROOT import gStyle, TFile, TH1, TH1D, TH2D, TCanvas, TLegend, TColor
# Now let's set some ROOT styling parameters:
# You do not need to know what they mean, but can directly use these settings
gStyle.SetOptStat(0)
gStyle.SetPalette(1)
gStyle.SetTextFont(42)
gStyle.SetTitleStyle(0000)
gStyle.SetTitleBorderSize(0)
gStyle.SetTitleFont(42)
gStyle.SetTitleFontSize(0.055)
gStyle.SetTitleFont(42, "xyz")
gStyle.SetTitleSize(0.5, "xyz")
gStyle.SetLabelFont(42, "xyz")
gStyle.SetLabelSize(0.45, "xyz")
%%python
# Let's open the signal (Htautau) and background (Ztautau) files produced by CutLang:
fs = TFile("histoOut-Htautau-GluGluToHToTauTau.root")
fb = TFile("histoOut-Htautau-DYJetsToLLsubset.root")
%%python
# We can see what is inside the signal file:
fs.ls()
# There should be a directory (TDirectoryFile) per selection region, e.g. for "baseline" and "twojets":
%%python
# Let's check out what is inside "baseline":
fs.cd("baseline")
fs.ls()
%%python
# Now let's draw some histograms.
# We will compare signal and background distributions for different variables.
# You can try this with different histograms and different regions.
# Which histogram would you like to draw? You can change the histogram name.
hname = "hmutaum"
# In which region would you like to draw? You can change the region name.
region = "baseline"
# Get the histograms from the file:
hsg = fs.Get(region+"/"+hname)
hbg = fb.Get(region+"/"+hname)
%%python
# This cell formats the histograms: scaling, lines, colors, axes titles, etc..
# You do not need to learn the commands here unless you are really curious.
# Otherwise just execute the cell.
# Our purpose in this exercise is to compare the shapes of signal and background distributions.
# To do this comparison best, the area integral under histograms being compared should be the same.
# Therefore we scale the hisgograms so that the area integral under the histograms equals 1.
hsg.Scale(1./hsg.Integral())
hbg.Scale(1./hbg.Integral())
if hsg.GetMaximum() > hbg.GetMaximum():
hbg.SetMaximum(hsg.GetMaximum()*1.1)
# Histogram style settings:
hsg.SetLineWidth(2)
hbg.SetLineWidth(2)
# Set the colors:
# Color numbers can be retrived from https://root.cern.ch/doc/master/classTColor.html
# (check for color wheel)
hbg.SetFillColor(400-7) # kYellow - 7
hsg.SetLineColor(600) # kBlue
hbg.SetLineColor(400+2) # kYellow + 2
# Titles, labels.
# It is enough to set such variables ONLY FOR THE FIRST HISTOGRAM YOU WILL DRAW
# i.e., the one you will call by .Draw(). The rest you will draw by .Draw("same") will only
# contribute with the historam curve.
#hbg.SetTitle("")
hbg.SetTitle("")
hbg.GetXaxis().SetTitle(hsg.GetTitle())
hbg.GetXaxis().SetTitleOffset(1.25)
hbg.GetXaxis().SetTitleSize(0.05)
hbg.GetXaxis().SetLabelSize(0.045)
hbg.GetXaxis().SetNdivisions(8, 5, 0)
hbg.GetYaxis().SetTitle("number of events")
hbg.GetYaxis().SetTitleOffset(1.4)
hbg.GetYaxis().SetTitleSize(0.05)
hbg.GetYaxis().SetLabelSize(0.045)
# Make a generically usable plot legend
l = TLegend(0.65, 0.75, 0.88, 0.87)
l.SetBorderSize(0)
l.SetFillStyle(0000)
l.AddEntry(hsg,"H->tautau", "l")
l.AddEntry(hbg,"Z->tautau", "f")
%%python %jsroot on
# Now we make a canvas and draw our histograms
c = TCanvas("c", "c", 620, 500)
c.SetBottomMargin(0.15)
c.SetLeftMargin(0.15)
c.SetRightMargin(0.15)
hbg.Draw("H")
hbg.Draw("Esame")
hsg.Draw("Hsame")
hsg.Draw("same")
l.Draw("same")
c.Draw()
# Don't worry about the error that appears below!