This notebook describes new features added in version 0.53 to help users measure time spent in the compiler.
from numba.core import config
# This is needed for recording LLVM pass timings used
# later in this notebook.
config.LLVM_PASS_TIMINGS = 1
Numba 0.53 implements a simple event API that lets users track specific compilation events. This section describes how to use it to record compilation events within a region of code.
We will measure the following function.
import numpy as np
import numba
assert numba.version_info.short >= (0, 53)
from numba import njit
# The function to record
@njit
def foo(x):
    out = [4, 5, 6, 7]
    out.sort()
    return np.linalg.svd(x), out
To record specific events for later processing, use install_recorder():
from numba.core.event import install_recorder
# Record all compilation events
with install_recorder("numba:compile") as rec:
    foo(np.ones((2, 2)))
The above stores the recorded numba:compile events as a list in rec.buffer. Each item is a 2-tuple of (timestamp, Event).
rec.buffer[:3]
[(1616078243.82092, Event(numba:compile, EventStatus.START, data: dict)),
 (1616078243.910625, Event(numba:compile, EventStatus.START, data: dict)),
 (1616078243.939735, Event(numba:compile, EventStatus.START, data: dict))]
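Note the three START events in a row: compiling foo() triggers nested compilations, so several START entries can accumulate before any matching END arrives. A minimal sketch with stand-in (timestamp, status) pairs (real entries hold Event objects, not strings) illustrates the interleaving:

```python
# Stand-in buffer: (timestamp, status) pairs in place of the real
# (timestamp, Event) tuples found in rec.buffer.
buffer = [
    (0.00, "START"),  # outer compilation begins
    (0.01, "START"),  # a nested compilation it triggers
    (0.05, "END"),    # the nested compilation finishes first
    (0.09, "END"),    # the outer compilation finishes last
]
n_start = sum(1 for _, status in buffer if status == "START")
n_end = sum(1 for _, status in buffer if status == "END")
print(n_start, n_end)  # → 2 2
```

Because compilations nest, matching a START to its END requires a key (the helper below uses the dispatcher and type signature) rather than simple adjacency.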
The following prepares the data for plotting:
def organize_events(rec):
    """Combine START and END events.
    """
    events = []
    temp = {}
    for ts, ev in rec.buffer:
        ed = ev.data
        k = ed['dispatcher'], ed['args']
        # If it's the start event
        if ev.is_start:
            # record the dispatcher name, type signature and start time
            temp[k] = ev, ts
        # Else, it's the end event
        else:
            # save 4-tuple of (start_event, end_event, start_time, end_time)
            ev_start, tt = temp[k]
            events.append((ev_start, ev, tt, ts))
    return events
def group_data(events):
    """Group the event timings into a simple column format in a dict.
    Returns a dict with 3 keys: "start_times", "end_times", and "events".
    """
    names = []
    start_times = []
    end_times = []
    for i, (s, e, ts, te) in enumerate(events):
        names.append(f"{s.data['dispatcher'].py_func.__name__}{s.data['args']}-[{i}]")
        start_times.append(ts)
        end_times.append(te)
    # Rebase timestamps to the first compilation event
    base = np.min(start_times)
    start_times = np.asarray(start_times) - base
    end_times = np.asarray(end_times) - base
    source = {
        "start_times": start_times,
        "end_times": end_times,
        "events": names,
    }
    return source
source = group_data(organize_events(rec))
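As a quick sanity check of the grouped format, here is the kind of computation a plot would perform, run on synthetic timestamps (the event names are hypothetical, not real recorder output):

```python
# Synthetic "source" dict in the same shape group_data() returns.
source_demo = {
    "start_times": [0.0, 0.5, 0.6],
    "end_times":   [0.4, 2.0, 0.9],
    "events":      ["foo-[0]", "bar-[1]", "baz-[2]"],
}
# Per-event elapsed times, i.e. the bar lengths in the rendered chart.
durations = [e - s for s, e in zip(source_demo["start_times"],
                                   source_demo["end_times"])]
longest = source_demo["events"][durations.index(max(durations))]
print(longest)  # → bar-[1]
```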
We will use bokeh to render the compilation events.
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
output_notebook()
p = figure(y_range=source['events'],
           plot_width=800, plot_height=600, toolbar_location=None,
           title="Compilation events")
p.hbar(y="events", left='start_times', right='end_times', height=.8, source=source)
p.xaxis.axis_label = "Seconds since first event"
show(p)
The plot above shows all the compilation events triggered for foo(). We can see that it triggers the compilation of 19 functions.
The event API is used internally to record time spent in the compiler and LLVM.
# Print out the timers for compiler_lock
md = foo.get_metadata(foo.signatures[0])
timers = md['timers']
print(f"Timers: {timers}")
print()
print(f"Total time spent in LLVM: {timers['llvm_lock']:.2f}s")
print(f"Total time spent in the compiler (includes LLVM time): {timers['compiler_lock']:.2f}s")
print(f"Percent of time spent in LLVM: {timers['llvm_lock'] / timers['compiler_lock'] * 100:.1f}%")
Timers: {'llvm_lock': 1.1270671279999966, 'compiler_lock': 2.46969849}

Total time spent in LLVM: 1.13s
Total time spent in the compiler (includes LLVM time): 2.47s
Percent of time spent in LLVM: 45.6%
As we can see, about half of the time is spent in LLVM.
To understand why some functions take a long time in LLVM, we can look at the LLVM pass timings to see which compiler passes are the most expensive.
This feature is turned off by default because of the overhead incurred in LLVM when recording the pass timings. To turn it on, set numba.core.config.LLVM_PASS_TIMINGS to a truthy value, as done at the start of this notebook.
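Numba config variables can also be set through environment variables with a NUMBA_ prefix, so pass timings can be enabled without editing code, e.g. by exporting NUMBA_LLVM_PASS_TIMINGS=1 in the shell. A sketch of doing the same from Python (this must run before Numba is first imported, since the config module reads the environment at import time):

```python
import os

# Setting the environment variable before the first `import numba` has the
# same effect as `config.LLVM_PASS_TIMINGS = 1` at the top of this notebook.
os.environ["NUMBA_LLVM_PASS_TIMINGS"] = "1"
print(os.environ["NUMBA_LLVM_PASS_TIMINGS"])  # → 1
```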
The pass timings are recorded in the metadata for each compiled version of the function.
md = foo.get_metadata(foo.signatures[0])
# Get a summary of the LLVM pass timings
print(md['llvm_pass_timings'])
Printing pass timings for JITCodeLibrary('foo')
Total time: 0.1698
== #0 Function passes on '_ZN8__main__7foo$241E5ArrayIdLi2E1C7mutable7alignedE'
 Percent: 2.1%
 Total 0.0035s
 Top timings:
   0.0026s ( 74.4%) Detect Dead Lanes #17
   0.0005s ( 14.1%) Early CSE #34
   0.0002s (  7.1%) Prune NRT refops #31
   0.0001s (  1.9%) Simplify the CFG #39
   0.0000s (  1.0%) Normalize NRT refops #35
== #1 Function passes on '.dtor.list.int64'
 Percent: 0.1%
 Total 0.0001s
 Top timings:
   0.0000s ( 29.3%) Early CSE #34
   0.0000s ( 15.5%) Simplify the CFG #39
   0.0000s ( 11.7%) Prune NRT refops #31
   0.0000s ( 10.0%) Post-Dominator Tree Construction #36
   0.0000s (  8.6%) Dominator Tree Construction #178
== #2 Function passes on '_ZN7cpython8__main__7foo$241E5ArrayIdLi2E1C7mutable7alignedE'
 Percent: 0.6%
 Total 0.0011s
 Top timings:
   0.0005s ( 48.6%) SROA #31
   0.0003s ( 23.5%) Early CSE #35
   0.0001s (  9.6%) Prune NRT refops #32
   0.0001s (  8.7%) Simplify the CFG #40
   0.0000s (  2.5%) Post-Dominator Tree Construction #37
== #3 Function passes on 'cfunc._ZN8__main__7foo$241E5ArrayIdLi2E1C7mutable7alignedE'
 Percent: 0.2%
 Total 0.0004s
 Top timings:
   0.0001s ( 32.6%) SROA #32
   0.0001s ( 31.3%) Early CSE #36
   0.0001s ( 14.0%) Simplify the CFG #41
   0.0000s ( 10.1%) Prune NRT refops #33
   0.0000s (  2.8%) Normalize NRT refops #36
== #4 Module passes (cheap optimization for refprune)
 Percent: 5.6%
 Total 0.0095s
 Top timings:
   0.0063s ( 66.1%) Combine redundant instructions
   0.0008s (  8.2%) Function Integration/Inlining
   0.0007s (  6.9%) Prune NRT refops #3
   0.0002s (  2.4%) Post-Dominator Tree Construction #3
   0.0002s (  1.9%) Rotate Loops
== #5 Module passes (full optimization)
 Percent: 46.6%
 Total 0.0791s
 Top timings:
   0.0067s (  8.5%) Induction Variable Simplification
   0.0050s (  6.3%) Global Value Numbering
   0.0036s (  4.6%) SLP Vectorizer
   0.0033s (  4.2%) Combine redundant instructions #4
   0.0031s (  3.9%) Combine redundant instructions #7
== #6 Finalize object
 Percent: 44.8%
 Total 0.0761s
 Top timings:
   0.0291s ( 38.3%) X86 DAG->DAG Instruction Selection #19
   0.0077s ( 10.1%) Greedy Register Allocator #18
   0.0063s (  8.3%) Loop Strength Reduction #11
   0.0027s (  3.6%) Induction Variable Users #10
   0.0026s (  3.4%) Machine Instruction Scheduler #21
There are multiple sections in the LLVM timing information. Each represents a single run of a bundle of LLVM passes for a specific purpose, and the name of each bundle hints at that purpose. Under == #0 Function passes on '_ZN8__main__7foo$241E5ArrayIdLi2E1C7mutable7alignedE' are the timings for running lightweight pre-optimization passes on foo(). These passes took 2.1% of all pass execution time. Under == #6 Finalize object are the timings for emitting the machine code. These passes took 44.8% of all pass execution time. Top timings is a descending list of the 5 most expensive passes in each bundle. In this case, X86 DAG->DAG Instruction Selection is the most expensive pass.
Under == #5 Module passes (full optimization) are the timings for the full optimization. At this stage, all referenced functions are linked into the same LLVM module for aggressive optimizations, including function inlining, loop optimizations, and SIMD vectorization.
It's also possible to process the timings programmatically; see the API reference for details.
pass_timings = md['llvm_pass_timings']
type(pass_timings)
numba.misc.llvm_pass_timings.PassTimingsCollection
Get pass-bundles in descending order of execution time:
bundles = pass_timings.list_longest_first()
print(f"Longest pass bundle is: {bundles[0].name!r}")
print("Top 3 passes:")
for info in bundles[0].timings.list_top(3):
    print(f"- {info.pass_name!r} pass took {info.wall_time:.3f}s")
Longest pass bundle is: 'Module passes (full optimization)'
Top 3 passes:
- 'Induction Variable Simplification' pass took 0.007s
- 'Global Value Numbering' pass took 0.005s
- 'SLP Vectorizer' pass took 0.004s
Total seconds spent in LLVM passes:
pass_timings.get_total_time()
0.1698
Note that this is only a small fraction of the time shown earlier as measured with the event API. The reason is that the LLVM pass timings are per-function statistics, while the event API covered all triggered compilations. In the following section, we combine both techniques to gather the LLVM pass timings for every triggered compilation.
The event API keeps extra data, including the dispatcher and the type signature for that compilation.
ev = rec.buffer[0][1]
ev
Event(numba:compile, EventStatus.START, data: dict)
ev.data
{'dispatcher': CPUDispatcher(<function foo at 0x11cb2bca0>), 'args': (array(float64, 2d, C),), 'return_type': None}
We can gather all the LLVM pass timings in all the recorded compilation events.
all_pass_timings = {}
for ts, ev in rec.buffer:
    # Use only the START event
    if ev.is_start:
        # Get the extra data in the event
        extra = ev.data
        # Get the dispatcher for the event
        dispatcher = extra['dispatcher']
        md = dispatcher.get_metadata(extra['args'])
        # Get the LLVM pass timings for this dispatcher
        timings = md['llvm_pass_timings']
        all_pass_timings[ev] = timings
Get the timings in descending order:
unordered_timings = []
total_time_in_passes = 0.
for ev, timings in all_pass_timings.items():
    curtime = timings.get_total_time()
    total_time_in_passes += curtime
    unordered_timings.append((curtime, ev.data['dispatcher'].__qualname__))
ordered_timings = sorted(unordered_timings, key=lambda x: x[0], reverse=True)
for run_time, func_name in ordered_timings:
    print(f"{run_time / total_time_in_passes * 100:4.1f}% in LLVM passes for {func_name!r}")
total_time_in_passes = sum(x for x, _ in ordered_timings)
print(f'total time in LLVM passes: {total_time_in_passes}s')
21.5% in LLVM passes for 'foo'
11.3% in LLVM passes for 'ol_list_sort.<locals>.impl'
11.2% in LLVM passes for 'ol_list_sort.<locals>.impl'
10.5% in LLVM passes for 'make_quicksort_impl.<locals>.run_quicksort'
10.2% in LLVM passes for 'make_quicksort_impl.<locals>.run_quicksort'
 7.0% in LLVM passes for 'svd_impl.<locals>.svd_impl'
 5.9% in LLVM passes for 'make_quicksort_impl.<locals>.partition'
 5.7% in LLVM passes for 'make_quicksort_impl.<locals>.partition'
 3.4% in LLVM passes for 'make_quicksort_impl.<locals>.insertion_sort'
 3.3% in LLVM passes for 'make_quicksort_impl.<locals>.insertion_sort'
 2.6% in LLVM passes for 'ol_copy_to_fortran_order.<locals>.impl'
 1.6% in LLVM passes for '_check_finite_matrix'
 1.0% in LLVM passes for '_handle_err_maybe_convergence_problem'
 1.0% in LLVM passes for '_dummy_liveness_func'
 0.9% in LLVM passes for 'make_quicksort_impl.<locals>.make_res'
 0.8% in LLVM passes for 'make_quicksort_impl.<locals>.make_res'
 0.6% in LLVM passes for 'gt'
 0.6% in LLVM passes for 'make_quicksort_impl.<locals>.default_lt'
 0.5% in LLVM passes for 'make_quicksort_impl.<locals>.GET'
 0.5% in LLVM passes for 'make_quicksort_impl.<locals>.GET'
total time in LLVM passes: 0.7901000000000001s
Not all of the time reported by the llvm_lock timer is spent in LLVM passes; the remainder is spent in other LLVM API calls.
total_llvm_time = foo.get_metadata(foo.signatures[0])['timers']['llvm_lock']
print(f"Percent of llvm_lock time not accounted for in pass timings: {(total_llvm_time - total_time_in_passes) / total_llvm_time * 100:.1f}%")
Percent of llvm_lock time not accounted for in pass timings: 29.9%