servicex.yaml includes an access token. Thus it is perfectly fine to deliver data to a university cluster or a laptop for small tests.
Download the configuration file (servicex.yaml) from the ServiceX website and copy it to your home or working directory
servicex.af.uchicago.edu is limited to ATLAS users, as it provides access to ATLAS event data
The ServiceX Client library is a Python library that lets users communicate with the ServiceX backend (or server) to make delivery requests and handle the outputs
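As a quick sanity check before the first request, the sketch below looks for servicex.yaml in the working directory and then in the home directory, the two locations named above. The helper name find_servicex_yaml is made up for this illustration and is not part of the servicex package; the client's own lookup may differ in detail.

```python
from pathlib import Path
from typing import Optional, Sequence


def find_servicex_yaml(search_dirs: Optional[Sequence[Path]] = None) -> Optional[Path]:
    """Return the first servicex.yaml found in search_dirs, or None.

    Hypothetical helper for illustration; not part of the servicex package.
    """
    if search_dirs is None:
        # Assumption: check the working directory first, then the home directory.
        search_dirs = [Path.cwd(), Path.home()]
    for directory in search_dirs:
        candidate = Path(directory) / "servicex.yaml"
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    found = find_servicex_yaml()
    print(found if found else "servicex.yaml not found; download it from the ServiceX website")
```

Because the file carries an access token, it is best kept out of shared repositories.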
The most fundamental components of a ServiceX request
Design goal of the new ServiceX Client library
Installation
pip install servicex==3.0.0.alpha.18
# !pip install servicex==3.0.0.alpha.18
!pip list | grep servicex
servicex 3.0.0a18
Prerequisites: the configuration file (servicex.yaml) downloaded from the ServiceX webpage and the servicex package installed
import servicex
spec = {
"Sample":[{
"Name": "UprootRaw_PyHEP",
"Dataset": servicex.dataset.Rucio("user.kchoi.pyhep2024.test_dataset"),
"Query": servicex.query.UprootRaw({"treename": "nominal", "filter_name": "el_pt"})
}]
}
spec
The Query is sent to transformers and run on all files in the given Rucio dataset
The UprootRaw query takes "treename" to set the TTree in flat ROOT ntuples and "filter_name" to select branches in the given tree
o = servicex.deliver(spec)
len(o['UprootRaw_PyHEP'])
3
print(f"Sample.Name: {o.keys()}\n")
print(f"Fileset: {type(o['UprootRaw_PyHEP'])}\n")
print(f"First file: {(o['UprootRaw_PyHEP'][0])}\n")
Sample.Name: dict_keys(['UprootRaw_PyHEP'])
Fileset: <class 'list'>
First file: /Users/kc43627/Work/data/servicex_cache/c9a57bae-b2c3-4432-93cd-253763e42ead/root___192.170.240.145__root___fax.mwt2.org_1094__pnfs_uchicago.edu_atlaslocalgroupdisk_rucio_user_mgeyik_a0_3c_user.mgeyik.30183079._000006.out.root
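The printout above suggests that the delivered object behaves like a dictionary keyed by Sample Name, with a list of local file paths as each value. A minimal sketch of walking such a structure follows; the dictionary here is a stand-in with placeholder paths rather than a real deliver() result, and summarize_delivery is a hypothetical helper.

```python
from typing import Dict, List


def summarize_delivery(results: Dict[str, List[str]]) -> Dict[str, int]:
    """Map each sample name to the number of files delivered for it."""
    return {name: len(files) for name, files in results.items()}


# Stand-in for the object returned by deliver(); the paths are placeholders.
example_results = {
    "UprootRaw_PyHEP": [
        "/tmp/servicex_cache/file_1.root",
        "/tmp/servicex_cache/file_2.root",
        "/tmp/servicex_cache/file_3.root",
    ]
}

for name, count in summarize_delivery(example_results).items():
    print(f"{name}: {count} file(s)")
```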
import uproot
# Read the branch while the file is still open, then display the array
with uproot.open(o['UprootRaw_PyHEP'][0]) as f:
    el_pt = f['nominal']['el_pt'].array()
el_pt
[[3.86e+04, 3.6e+04], [3.44e+04, 1.91e+04], [], [5.98e+04, 5.76e+04], [6.84e+04, 2.4e+04], [3.5e+04], [], [], [1.34e+05, 4.36e+04], [], ..., [6.46e+04, 3.66e+04, 2.78e+04], [], [1.74e+04], [], [5.42e+04], [3.81e+04, 1.26e+04], [], [3.53e+04], [6.17e+04]]
--------------------------------
type: 11543 * var * float32
Let us go through the kinds of Dataset and Query supported by ServiceX
servicex.dataset.Rucio.__init__
<function servicex.dataset_identifier.RucioDatasetIdentifier.__init__(self, dataset: str, num_files: Optional[int] = None)>
servicex.dataset.FileList.__init__
<function servicex.dataset_identifier.FileListDataset.__init__(self, files: Union[List[str], str])>
servicex.dataset.CERNOpenData.__init__
<function servicex.dataset_identifier.CERNOpenDataDatasetIdentifier.__init__(self, dataset: int, num_files: Optional[int] = None)>
UprootRaw({"treename": "nominal", "filter_name": "el_pt"})
servicex.query.plugins
[EntryPoint(name='FuncADL_ATLASr21', value='servicex.func_adl.func_adl_dataset:FuncADLQuery_ATLASr21', group='servicex.query'),
 EntryPoint(name='FuncADL_ATLASr22', value='servicex.func_adl.func_adl_dataset:FuncADLQuery_ATLASr22', group='servicex.query'),
 EntryPoint(name='FuncADL_ATLASxAOD', value='servicex.func_adl.func_adl_dataset:FuncADLQuery_ATLASxAOD', group='servicex.query'),
 EntryPoint(name='FuncADL_CMS', value='servicex.func_adl.func_adl_dataset:FuncADLQuery_CMS', group='servicex.query'),
 EntryPoint(name='FuncADL_Uproot', value='servicex.func_adl.func_adl_dataset:FuncADLQuery_Uproot', group='servicex.query'),
 EntryPoint(name='PythonFunction', value='servicex.python_dataset:PythonQuery', group='servicex.query'),
 EntryPoint(name='UprootRaw', value='servicex.uproot_raw.uproot_raw:UprootRawQuery', group='servicex.query')]
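UprootRaw Query
Passes its arguments to the uproot.tree.arrays() function
A dictionary with a treename key selects branches from a tree
A dictionary with a copy_histograms key copies objects from the file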
query = [
{
'treename': 'reco',
'filter_name': ['/mu.*/', 'runNumber', 'lbn', 'jet_pt_*'],
'cut':'(count_nonzero(jet_pt_NOSYS>40e3, axis=1)>=4)'
},
{
'copy_histograms': ['CutBookkeeper*', '/cflow.*/', 'metadata', 'listOfSystematics']
}
]
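To make the cut string in the first dictionary concrete: it keeps events that have at least four jets with jet_pt_NOSYS above 40e3. The plain-Python sketch below mimics that per-event logic on a toy list of lists; it is only an analogy for what the transformer evaluates on jagged arrays, and passes_cut and the toy values are made up for illustration.

```python
from typing import List


def passes_cut(jet_pts: List[float], threshold: float = 40e3, min_jets: int = 4) -> bool:
    """Mimic count_nonzero(jet_pt > threshold, axis=1) >= min_jets for one event."""
    n_above = sum(1 for pt in jet_pts if pt > threshold)
    return n_above >= min_jets


# Toy events: each inner list holds the jet pT values of one event.
events = [
    [95e3, 72e3, 51e3, 44e3, 12e3],  # four jets above threshold: kept
    [88e3, 41e3, 39e3],              # two jets above threshold: dropped
    [],                              # no jets: dropped
]

selected = [index for index, jets in enumerate(events) if passes_cut(jets)]
print(selected)
```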
query_UprootRaw = servicex.query.UprootRaw({"treename": "nominal", "filter_name": "el_pt"})
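The filter_name list in the longer query above mixes three pattern styles: an exact name (runNumber), a glob wildcard (jet_pt_*), and a regular expression wrapped in slashes (/mu.*/). The sketch below approximates how uproot resolves such patterns against branch names. It is a simplified re-implementation for illustration (regex flags after the closing slash are not handled), the helper names are hypothetical, and the branch list is invented.

```python
import fnmatch
import re
from typing import Iterable, List


def matches_filter(name: str, pattern: str) -> bool:
    """Approximate uproot-style name matching for a single pattern."""
    # "/.../" is treated as a regular expression anchored at the start of the name.
    if len(pattern) >= 2 and pattern.startswith("/") and pattern.endswith("/"):
        return re.match(pattern[1:-1], name) is not None
    # Patterns containing wildcard characters are treated as case-sensitive globs.
    if any(ch in pattern for ch in "*?["):
        return fnmatch.fnmatchcase(name, pattern)
    # Anything else must match the branch name exactly.
    return name == pattern


def select_branches(names: Iterable[str], patterns: Iterable[str]) -> List[str]:
    """Keep the names that match at least one pattern, preserving order."""
    patterns = list(patterns)
    return [n for n in names if any(matches_filter(n, p) for p in patterns)]


# Invented branch names, used only to exercise the patterns.
branches = ["mu_pt", "mu_eta", "runNumber", "lbn", "jet_pt_NOSYS", "jet_eta", "el_pt"]
patterns = ["/mu.*/", "runNumber", "lbn", "jet_pt_*"]
kept = select_branches(branches, patterns)
print(kept)
```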
FuncADL_Uproot Query
With Select() for column selection or Where() for filtering, more sophisticated queries can be built
FromTree() method to set a tree name in a query
query_FuncADL = servicex.query.FuncADL_Uproot().FromTree('nominal').Select(lambda e: {'el_pt': e['el_pt']})
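For readers new to the functional style, Select() plays the role of a projection and Where() the role of a filter over events. The sketch below is a plain-Python analogy on toy event records; it does not use func_adl, and the select and where helpers as well as the toy data are made up for illustration.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


def where(events: List[Event], predicate: Callable[[Event], bool]) -> List[Event]:
    """Keep only the events for which the predicate is true (filtering)."""
    return [event for event in events if predicate(event)]


def select(events: List[Event], projection: Callable[[Event], Event]) -> List[Event]:
    """Apply the projection to every event (column selection)."""
    return [projection(event) for event in events]


# Toy events with two columns each.
events = [
    {"el_pt": [38.6e3, 36.0e3], "el_eta": [0.1, -1.2]},
    {"el_pt": [], "el_eta": []},
    {"el_pt": [59.8e3], "el_eta": [2.0]},
]

with_electrons = where(events, lambda e: len(e["el_pt"]) > 0)
el_pt_only = select(with_electrons, lambda e: {"el_pt": e["el_pt"]})
print(el_pt_only)
```

In the real query the lambdas are translated and run on the transformer side rather than evaluated locally, so this analogy covers only the semantics.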
PythonFunction Query
uproot, awkward, vector can be imported (limited by the transformer image)
def run_query(input_filenames=None):
    # Imports live inside the function because it runs on the transformer
    import uproot
    with uproot.open({input_filenames: "nominal"}) as o:
        br = o.arrays("el_pt")
    return br
query_PythonFunction = servicex.query.PythonFunction().with_uproot_function(run_query)
el_pt_NOSYS
!
spec_multiple = {
"Sample":[
{
"Name": "UprootRaw_PyHEP",
"Dataset": servicex.dataset.Rucio("user.kchoi.pyhep2024.test_dataset"),
"Query": query_UprootRaw
},
{
"Name": "FuncADL_Uproot_PyHEP",
"Dataset": servicex.dataset.Rucio("user.kchoi.pyhep2024.test_dataset"),
"Query": query_FuncADL
},
{
"Name": "PythonFunction_PyHEP",
"Dataset": servicex.dataset.Rucio("user.kchoi.pyhep2024.test_dataset"),
"Query": query_PythonFunction
}
]
}
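When several samples share one dataset, as in spec_multiple, the Sample list can also be generated instead of typed out. The sketch below shows one way to do that with a hypothetical build_spec helper; plain strings stand in for the servicex Dataset and Query objects so that the example runs without the servicex package.

```python
from typing import Any, Dict, List


def build_spec(dataset: Any, queries: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Build a spec with one Sample entry per (name, query) pair on a shared dataset."""
    return {
        "Sample": [
            {"Name": name, "Dataset": dataset, "Query": query}
            for name, query in queries.items()
        ]
    }


# Placeholders: in real use these would be servicex.dataset.Rucio(...) and query objects.
dataset = "user.kchoi.pyhep2024.test_dataset"
queries = {
    "UprootRaw_PyHEP": "query_UprootRaw",
    "FuncADL_Uproot_PyHEP": "query_FuncADL",
    "PythonFunction_PyHEP": "query_PythonFunction",
}

spec_sketch = build_spec(dataset, queries)
print([sample["Name"] for sample in spec_sketch["Sample"]])
```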
The Sample block is a list of dictionaries, each with a Dataset-Query pair
Each Dataset-Query pair is submitted as its own transformation request
o_multiple = servicex.deliver(spec_multiple)
servicex-databinder and significantly improve the user interface to allow a seamless experience with YAML
%%writefile -a config_UprootRaw.yaml
Sample:
  - Name: Uproot_UprootRaw_YAML
    Dataset: !Rucio user.kchoi.pyhep2024.test_dataset
    Query: !UprootRaw |
      {"treename":"nominal", "filter_name": "el_pt"}
Writing config_UprootRaw.yaml
from servicex.dataset import Rucio
from servicex.query import UprootRaw
from servicex import deliver
spec = {
"Sample":[{
"Name": "UprootRaw_PyHEP",
"Dataset": Rucio("user.kchoi.pyhep2024.test_dataset"),
"Query": UprootRaw({"treename": "nominal", "filter_name": "el_pt"})
}]
}
from servicex import deliver
o_yaml = deliver("config_UprootRaw.yaml")
# o_py = deliver(spec)
YAML syntax
Dataset tags: !Rucio, !FileList, !CERNOpenData
Query tags: !UprootRaw, !FuncADL_Uproot, !PythonFunction
The pipe (|) after a query tag is the YAML literal block operator; it allows a multi-line string to be interpreted properly
Definition:
  - &DEF_ggH_input "root://eospublic.cern.ch//eos/opendata/atlas/OutreachDatasets\
      /2020-01-22/4lep/MC/mc_345060.ggH125_ZZ4lep.4lep.root"
  - &DEF_query1 !PythonFunction |
      def run_query(input_filenames=None):
          import uproot
          with uproot.open({input_filenames:"nominal"}) as o:
              br = o.arrays("mu_pt")
          return br
  - &DEF_query2 !FuncADL_Uproot |
      FromTree('mini').Select(lambda e: {'lep_pt': e['lep_pt']}).Where(lambda e: e['lep_pt'] > 1000)
General:
  OutputFormat: parquet
  Delivery: SignedURLs
Sample:
  - Name: ttH
    Dataset: !Rucio user.kchoi.fcnc_tHq_ML.ttH.v11
    Query: *DEF_query1
    NFiles: 5
    # IgnoreLocalCache: False
  - Name: ttZ
    Dataset: !Rucio user.kchoi.fcnc_tHq_ML.ttZ.v11
    Query: *DEF_query1
    NFiles: 3
  - Name: ggH
    Dataset: !FileList *DEF_ggH_input
    Query: *DEF_query2
spec_typo = {
"Sample":[{
"Name": "UprootRaw_PyHEP",
"Dataset": Rucio("user.kchoi.pyhep2024.test_dataset"),
"Query": UprootRaw({"treename": "nominal", "filter_name": "el_pta"})
}]
}
o = deliver(spec_typo)
[07/01/24 00:13:17] WARNING Transform "UprootRaw_PyHEP" completed with failures: 3/3 files failed query_core.py:215
WARNING More information of 'UprootRaw_PyHEP' HERE query_core.py:226
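Since a typo in a branch name only surfaces as a warning, it can help to check the delivered result before opening any files. The sketch below flags samples that came back without files. It assumes that a sample whose files all failed appears with an empty file list or is missing from the result; that behaviour is an assumption and may depend on the client version. The helper name samples_without_files is hypothetical.

```python
from typing import Dict, Iterable, List


def samples_without_files(results: Dict[str, List[str]], expected_names: Iterable[str]) -> List[str]:
    """Return the expected sample names that have no delivered files."""
    # Assumption: a fully failed sample is either absent or maps to an empty list.
    return [name for name in expected_names if not results.get(name)]


# Stand-in results: one healthy sample and one that delivered nothing.
example_results = {
    "Good_Sample": ["/tmp/servicex_cache/file_1.root"],
    "UprootRaw_PyHEP": [],
}

missing = samples_without_files(example_results, ["Good_Sample", "UprootRaw_PyHEP", "Never_Requested"])
print(missing)
```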
Client library
ServiceX