With RDataFrame, you can read your dataset, add new columns with processed values and finally use Snapshot
to save the resulting data to a ROOT file in TTree format.
import ROOT
df = ROOT.RDataFrame("dataset","data/example_file.root")
df1 = df.Define("c","a+b")
out_treename = "outtree"
out_filename = "outtree.root"
out_columns = ["a","b","c"]
snapdf = df1.Snapshot(out_treename, out_filename, out_columns)
We can now check that the dataset was correctly stored in a file:
%%bash
rootls -lt outtree.root
The result of a Snapshot is itself an RDataFrame that can be used for further processing:
snapdf.Display().Print()
Filters applied to the dataset can be given a name. The Report
method will gather information about filter efficiency and show the data flow between subsequent cuts on the original dataset.
df = ROOT.RDataFrame("sig_tree", "https://root.cern/files/Higgs_data.root")
filter1 = df.Filter("lepton_eta > 0", "Lepton eta cut")
filter2 = filter1.Filter("lepton_phi < 1", "Lepton phi cut")
rep = df.Report()
rep.Print()
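To make the bookkeeping behind such a report concrete, here is a minimal plain-Python sketch of a cut flow. It does not use ROOT: the event values, the `cut_flow` helper and the printed format are assumptions made up for illustration. For each named cut it counts how many entries reached the cut, how many passed, and derives the per-cut and cumulative efficiencies.

```python
# Plain-Python sketch of the bookkeeping behind a cut-flow report.
# The event values below are invented for illustration only.
events = [
    {"lepton_eta": 0.5, "lepton_phi": 0.2},
    {"lepton_eta": -1.2, "lepton_phi": 0.9},
    {"lepton_eta": 1.1, "lepton_phi": 2.5},
    {"lepton_eta": 0.3, "lepton_phi": -0.7},
]

# Named cuts, applied in order, mirroring the two Filter calls above
cuts = [
    ("Lepton eta cut", lambda ev: ev["lepton_eta"] > 0),
    ("Lepton phi cut", lambda ev: ev["lepton_phi"] < 1),
]

def cut_flow(events, cuts):
    """Return (name, n_all, n_pass, eff, cumulative_eff) for each cut.

    n_all is the number of entries reaching the cut, n_pass the number
    surviving it; efficiencies are percentages.
    """
    n_total = len(events)
    surviving = events
    rows = []
    for name, predicate in cuts:
        n_all = len(surviving)
        surviving = [ev for ev in surviving if predicate(ev)]
        n_pass = len(surviving)
        eff = 100.0 * n_pass / n_all if n_all else 0.0
        cumulative = 100.0 * n_pass / n_total if n_total else 0.0
        rows.append((name, n_all, n_pass, eff, cumulative))
    return rows

report_rows = cut_flow(events, cuts)
for name, n_all, n_pass, eff, cumulative in report_rows:
    print(f"{name}: pass={n_pass} all={n_all} eff={eff:.2f}% cumulative eff={cumulative:.2f}%")
```

Each cut only sees the entries that survived the previous one, which is why the per-cut efficiency and the cumulative efficiency differ from the second cut onwards.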
We often want to perform complex operations from Python, but plain Python code tends to be slow and is not thread-safe.
Instead, you can inject C++ functions that do the work inside the event loop at runtime.
This mechanism relies on cling, the C++ interpreter shipped with ROOT, which makes this possible in a single line of code.
Let's start by defining a function that converts the RDataFrame entry numbers (stored in the special column "rdfentry_") from unsigned long long to float.
%%cpp
float asfloat(unsigned long long entrynumber){
    return entrynumber;
}
Then let's define another function that takes a float value and computes its square.
%%cpp
float square(float val){
    return val * val;
}
And now let's use these functions with RDataFrame!
We start by creating an empty RDataFrame with 100 consecutive entries and defining new columns on it:
# Create a new RDataFrame from scratch with 100 consecutive entries
df = ROOT.RDataFrame(100)
# Create a new column using the previously declared C++ functions
df1 = df.Define("a", "asfloat(rdfentry_)")
df2 = df1.Define("b", "square(a)")
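As a sanity check on what the two Define calls should produce, the same values can be sketched in plain Python. No ROOT is involved here: the Python functions below only mirror the C++ ones by name, and the sketch assumes the entry numbers run from 0 to 99, as in a single-threaded event loop.

```python
# Plain-Python sketch of the values the columns "a" and "b" should hold.
# These functions are stand-ins for the C++ ones, not the injected code itself.
def asfloat(entrynumber):
    # Convert an integer entry number to a floating-point value
    return float(entrynumber)

def square(val):
    # Return the square of the input value
    return val * val

entries = range(100)  # stands in for the entry numbers 0..99
a = [asfloat(i) for i in entries]
b = [square(x) for x in a]

# Python floats are double precision, while the C++ functions use float;
# every value here is a small integer, so both representations are exact.
print(a[:5])
print(b[:5])
```

The points drawn by the graph should therefore lie on the parabola b = a * a.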
We can now plot the values of the columns in a graph:
# Show the two columns created in a graph
c = ROOT.TCanvas()
graph = df2.Graph("a","b")
graph.SetMarkerStyle(20)
graph.SetMarkerSize(0.5)
graph.SetMarkerColor(ROOT.kBlue)
graph.SetTitle("My graph")
graph.Draw("AP")
c.Draw()
RDataFrame can transparently perform multi-threaded event loops to speed up the execution of its actions.
Users have to call ROOT::EnableImplicitMT() before constructing the RDataFrame object to indicate that it should take advantage of a pool of worker threads.
Each worker thread processes a distinct subset of entries, and their partial results are merged before returning the final values to the user.
RDataFrame operations such as Histo1D or Snapshot are guaranteed to work correctly in multi-threaded event loops.
User-defined expressions, such as strings or lambdas passed to Filter, Define, Foreach, Reduce or Aggregate, have to be thread-safe, i.e. it should be possible to call them concurrently from different threads.
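The split-and-merge pattern described above can be sketched in plain Python with a thread pool. This is only an illustration of the idea, not how RDataFrame is implemented internally: the helpers `partial_sum` and `parallel_sum` and the `n_muon` values are hypothetical. Each worker sums a distinct chunk of entries, and the partial results are merged at the end. The worker function touches no shared mutable state, which is what makes it safe to call concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Depends only on its input and modifies nothing shared,
    # so concurrent calls cannot interfere with each other
    return sum(chunk)

def parallel_sum(values, n_workers=4):
    """Sum values by splitting them into disjoint chunks, one task per chunk."""
    # Ceiling division so that every entry lands in exactly one chunk
    chunk_size = max(1, (len(values) + n_workers - 1) // n_workers)
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # Merge the per-chunk results into the final value
    return sum(partials)

# Made-up stand-in for a column of per-event muon counts
n_muon = [i % 5 for i in range(1000)]
total = parallel_sum(n_muon)
print(total)
```

A function that instead accumulated into a shared global counter without locking would not be thread-safe, because updates from different threads could interleave.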
%%time
# Get a first baseline measurement
treename = "Events"
filename = "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"
df = ROOT.RDataFrame(treename, filename)
df.Sum("nMuon").GetValue()
%%time
# Activate multithreading capabilities
# By default takes all available cores on the machine
ROOT.EnableImplicitMT()
treename = "Events"
filename = "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"
df = ROOT.RDataFrame(treename, filename)
df.Sum("nMuon").GetValue()
# Disable implicit multithreading when done
ROOT.DisableImplicitMT()