This notebook describes new features added in version 0.53 to help users measure time spent in the compiler.
from numba.core import config
# This is needed for recording LLVM pass timings used
# later in this notebook.
config.LLVM_PASS_TIMINGS = 1
Numba 0.53 implements a simple event API that lets users track specific compilation events. This section describes how to use it to record compilation events within a region of code.
We will measure the following function.
import numpy as np
import numba
assert numba.version_info.short >= (0, 53)
from numba import njit
# The function to record
@njit
def foo(x):
    out = [4, 5, 6, 7]
    out.sort()
    return np.linalg.svd(x), out
To record specific events for later processing, use install_recorder():
from numba.core.event import install_recorder
# Record all compilation events
with install_recorder("numba:compile") as rec:
    foo(np.ones((2, 2)))
The above stores the recorded numba:compile events as a list in rec.buffer. Each item is a 2-tuple of (timestamp, Event).
rec.buffer[:3]
[(1616078243.82092, Event(numba:compile, EventStatus.START, data: dict)),
 (1616078243.910625, Event(numba:compile, EventStatus.START, data: dict)),
 (1616078243.939735, Event(numba:compile, EventStatus.START, data: dict))]
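Note the three START events in a row: compiling foo() triggers nested compilations, so several START entries can accumulate before any matching END arrives. A minimal sketch with stand-in (timestamp, status) pairs (real entries hold Event objects, not strings) illustrates the interleaving:

```python
# Stand-in buffer: (timestamp, status) pairs in place of the real
# (timestamp, Event) tuples found in rec.buffer.
buffer = [
    (0.00, "START"),  # outer compilation begins
    (0.01, "START"),  # a nested compilation it triggers
    (0.05, "END"),    # the nested compilation finishes first
    (0.09, "END"),    # the outer compilation finishes last
]
n_start = sum(1 for _, status in buffer if status == "START")
n_end = sum(1 for _, status in buffer if status == "END")
print(n_start, n_end)  # → 2 2
```

Because compilations nest, matching a START to its END requires a key (the helper below uses the dispatcher and type signature) rather than simple adjacency.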
The following prepares the data for plotting:
def organize_events(rec):
    """Combine START and END events.
    """
    events = []
    temp = {}
    for ts, ev in rec.buffer:
        ed = ev.data
        k = ed['dispatcher'], ed['args']
        # If it's the start event
        if ev.is_start:
            # record the dispatcher name, type signature and start time
            temp[k] = ev, ts
        # Else, it's the end event
        else:
            # save 4-tuple of (start_event, end_event, start_time, end_time)
            ev_start, tt = temp[k]
            events.append((ev_start, ev, tt, ts))
    return events
def group_data(events):
    """Group the event timings into a simple column format in a dict.
    Returns a dict with 3 keys: "start_times", "end_times", and "events".
    """
    names = []
    start_times = []
    end_times = []
    for i, (s, e, ts, te) in enumerate(events):
        names.append(f"{s.data['dispatcher'].py_func.__name__}{s.data['args']}-[{i}]")
        start_times.append(ts)
        end_times.append(te)
    # Rebase timestamps to the first compilation event
    base = np.min(start_times)
    start_times = np.asarray(start_times) - base
    end_times = np.asarray(end_times) - base
    source = {
        "start_times": start_times,
        "end_times": end_times,
        "events": names,
    }
    return source
source = group_data(organize_events(rec))
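As a quick sanity check of the grouped format, here is the kind of computation a plot would perform, run on synthetic timestamps (the event names are hypothetical, not real recorder output):

```python
# Synthetic "source" dict in the same shape group_data() returns.
source_demo = {
    "start_times": [0.0, 0.5, 0.6],
    "end_times":   [0.4, 2.0, 0.9],
    "events":      ["foo-[0]", "bar-[1]", "baz-[2]"],
}
# Per-event elapsed times, i.e. the bar lengths in the rendered chart.
durations = [e - s for s, e in zip(source_demo["start_times"],
                                   source_demo["end_times"])]
longest = source_demo["events"][durations.index(max(durations))]
print(longest)  # → bar-[1]
```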
We will use bokeh to render the compilation events.
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
output_notebook()
p = figure(y_range=source['events'],
           plot_width=800, plot_height=600, toolbar_location=None,
           title="Compilation events")
p.hbar(y="events", left='start_times', right='end_times', height=.8, source=source)
p.xaxis.axis_label = "Seconds since first event"
show(p)
The plot above shows all the compilation events triggered for foo(). We can see that it triggers the compilation of 19 functions.
The event API is used internally to record time spent in the compiler and LLVM.
# Print out the timers for compiler_lock
md = foo.get_metadata(foo.signatures[0])
timers = md['timers']
print(f"Timers: {timers}")
print()
print(f"Total time spent in LLVM: {timers['llvm_lock']:.2f}s")
print(f"Total time spent in the compiler (includes LLVM time): {timers['compiler_lock']:.2f}s")
print(f"Percent of time spent in LLVM: {timers['llvm_lock'] / timers['compiler_lock'] * 100:.1f}%")
Timers: {'llvm_lock': 1.1270671279999966, 'compiler_lock': 2.46969849}

Total time spent in LLVM: 1.13s
Total time spent in the compiler (includes LLVM time): 2.47s
Percent of time spent in LLVM: 45.6%
As we can see, about half of the time is spent in LLVM.
To understand why some functions take a long time in LLVM, we can look at the LLVM pass timings to see which compiler passes are the most expensive.
This feature is turned off by default because of the overhead incurred in LLVM when recording the pass timings. To turn it on, set numba.core.config.LLVM_PASS_TIMINGS to a truthy value, as done at the start of this notebook.
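Numba config variables can also be set through environment variables with a NUMBA_ prefix, so pass timings can be enabled without editing code, e.g. by exporting NUMBA_LLVM_PASS_TIMINGS=1 in the shell. A sketch of doing the same from Python (this must run before Numba is first imported, since the config module reads the environment at import time):

```python
import os

# Setting the environment variable before the first `import numba` has the
# same effect as `config.LLVM_PASS_TIMINGS = 1` at the top of this notebook.
os.environ["NUMBA_LLVM_PASS_TIMINGS"] = "1"
print(os.environ["NUMBA_LLVM_PASS_TIMINGS"])  # → 1
```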
The pass timings are recorded in the metadata for each compiled version of the function.
md = foo.get_metadata(foo.signatures[0])
# Get a summary of the LLVM pass timings
print(md['llvm_pass_timings'])
Printing pass timings for JITCodeLibrary('foo')
Total time: 0.1698
== #0 Function passes on '_ZN8__main__7foo$241E5ArrayIdLi2E1C7mutable7alignedE'
 Percent: 2.1%
 Total 0.0035s
 Top timings:
   0.0026s ( 74.4%) Detect Dead Lanes #17
   0.0005s ( 14.1%) Early CSE #34
   0.0002s (  7.1%) Prune NRT refops #31
   0.0001s (  1.9%) Simplify the CFG #39
   0.0000s (  1.0%) Normalize NRT refops #35
== #1 Function passes on '.dtor.list.int64'
 Percent: 0.1%
 Total 0.0001s
 Top timings:
   0.0000s ( 29.3%) Early CSE #34
   0.0000s ( 15.5%) Simplify the CFG #39
   0.0000s ( 11.7%) Prune NRT refops #31
   0.0000s ( 10.0%) Post-Dominator Tree Construction #36
   0.0000s (  8.6%) Dominator Tree Construction #178
== #2 Function passes on '_ZN7cpython8__main__7foo$241E5ArrayIdLi2E1C7mutable7alignedE'
 Percent: 0.6%
 Total 0.0011s
 Top timings:
   0.0005s ( 48.6%) SROA #31
   0.0003s ( 23.5%) Early CSE #35
   0.0001s (  9.6%) Prune NRT refops #32
   0.0001s (  8.7%) Simplify the CFG #40
   0.0000s (  2.5%) Post-Dominator Tree Construction #37
== #3 Function passes on 'cfunc._ZN8__main__7foo$241E5ArrayIdLi2E1C7mutable7alignedE'
 Percent: 0.2%
 Total 0.0004s
 Top timings:
   0.0001s ( 32.6%) SROA #32
   0.0001s ( 31.3%) Early CSE #36
   0.0001s ( 14.0%) Simplify the CFG #41
   0.0000s ( 10.1%) Prune NRT refops #33
   0.0000s (  2.8%) Normalize NRT refops #36
== #4 Module passes (cheap optimization for refprune)
 Percent: 5.6%
 Total 0.0095s
 Top timings:
   0.0063s ( 66.1%) Combine redundant instructions
   0.0008s (  8.2%) Function Integration/Inlining
   0.0007s (  6.9%) Prune NRT refops #3
   0.0002s (  2.4%) Post-Dominator Tree Construction #3
   0.0002s (  1.9%) Rotate Loops
== #5 Module passes (full optimization)
 Percent: 46.6%
 Total 0.0791s
 Top timings:
   0.0067s (  8.5%) Induction Variable Simplification
   0.0050s (  6.3%) Global Value Numbering
   0.0036s (  4.6%) SLP Vectorizer
   0.0033s (  4.2%) Combine redundant instructions #4
   0.0031s (  3.9%) Combine redundant instructions #7
== #6 Finalize object
 Percent: 44.8%
 Total 0.0761s
 Top timings:
   0.0291s ( 38.3%) X86 DAG->DAG Instruction Selection #19
   0.0077s ( 10.1%) Greedy Register Allocator #18
   0.0063s (  8.3%) Loop Strength Reduction #11
   0.0027s (  3.6%) Induction Variable Users #10
   0.0026s (  3.4%) Machine Instruction Scheduler #21
There are multiple sections in the LLVM timing information. Each represents a single run of a bundle of LLVM passes for a specific purpose, and the name of each bundle hints at that purpose. Under == #0 Function passes on '_ZN8__main__7foo$241E5ArrayIdLi2E1C7mutable7alignedE' are the timings for running lightweight pre-optimization passes on foo(). These passes took 2.1% of all pass execution time. Under == #6 Finalize object are the timings for emitting the machine code. These passes took 44.8% of all pass execution time. Top timings is a descending list of the 5 most expensive passes in each bundle. In this case, X86 DAG->DAG Instruction Selection is the most expensive pass.
Under == #5 Module passes (full optimization) are the timings for the full optimization. At this stage, all referenced functions are linked into the same LLVM module for aggressive optimizations, including function inlining, loop optimizations, and SIMD vectorization.
It's also possible to process the timings programmatically; see the API reference for details.
pass_timings = md['llvm_pass_timings']
type(pass_timings)
numba.misc.llvm_pass_timings.PassTimingsCollection
Get pass-bundles in descending order of execution time:
bundles = pass_timings.list_longest_first()
print(f"Longest pass bundle is: {bundles[0].name!r}")
print("Top 3 passes:")
for info in bundles[0].timings.list_top(3):
    print(f"- {info.pass_name!r} pass took {info.wall_time:.3f}s")
Longest pass bundle is: 'Module passes (full optimization)'
Top 3 passes:
- 'Induction Variable Simplification' pass took 0.007s
- 'Global Value Numbering' pass took 0.005s
- 'SLP Vectorizer' pass took 0.004s
Total seconds spent in LLVM passes:
pass_timings.get_total_time()
0.1698
Note that this is only a small fraction of the time shown earlier as measured with the event API. The reason is that the LLVM pass timings are per-function statistics, while the event API covered all triggered compilations. In the following section, we combine both techniques to gather the LLVM pass timings for every triggered compilation.
The event API keeps extra data, including the dispatcher and the type signature for that compilation.
ev = rec.buffer[0][1]
ev
Event(numba:compile, EventStatus.START, data: dict)
ev.data
{'dispatcher': CPUDispatcher(<function foo at 0x11cb2bca0>), 'args': (array(float64, 2d, C),), 'return_type': None}
We can gather all the LLVM pass timings in all the recorded compilation events.
all_pass_timings = {}
for ts, ev in rec.buffer:
    # Use only the START event
    if ev.is_start:
        # Get the extra data in the event
        extra = ev.data
        # Get the dispatcher for the event
        dispatcher = extra['dispatcher']
        md = dispatcher.get_metadata(extra['args'])
        # Get the LLVM pass timings for this dispatcher
        timings = md['llvm_pass_timings']
        all_pass_timings[ev] = timings
Get the timings in descending order:
unordered_timings = []
total_time_in_passes = 0.
for ev, timings in all_pass_timings.items():
    curtime = timings.get_total_time()
    total_time_in_passes += curtime
    unordered_timings.append((curtime, ev.data['dispatcher'].__qualname__))
ordered_timings = sorted(unordered_timings, key=lambda x: x[0], reverse=True)
for run_time, func_name in ordered_timings:
    print(f"{run_time / total_time_in_passes * 100:4.1f}% in LLVM passes for {func_name!r}")
total_time_in_passes = sum(x for x, _ in ordered_timings)
print(f'total time in LLVM passes: {total_time_in_passes}s')
21.5% in LLVM passes for 'foo'
11.3% in LLVM passes for 'ol_list_sort.<locals>.impl'
11.2% in LLVM passes for 'ol_list_sort.<locals>.impl'
10.5% in LLVM passes for 'make_quicksort_impl.<locals>.run_quicksort'
10.2% in LLVM passes for 'make_quicksort_impl.<locals>.run_quicksort'
 7.0% in LLVM passes for 'svd_impl.<locals>.svd_impl'
 5.9% in LLVM passes for 'make_quicksort_impl.<locals>.partition'
 5.7% in LLVM passes for 'make_quicksort_impl.<locals>.partition'
 3.4% in LLVM passes for 'make_quicksort_impl.<locals>.insertion_sort'
 3.3% in LLVM passes for 'make_quicksort_impl.<locals>.insertion_sort'
 2.6% in LLVM passes for 'ol_copy_to_fortran_order.<locals>.impl'
 1.6% in LLVM passes for '_check_finite_matrix'
 1.0% in LLVM passes for '_handle_err_maybe_convergence_problem'
 1.0% in LLVM passes for '_dummy_liveness_func'
 0.9% in LLVM passes for 'make_quicksort_impl.<locals>.make_res'
 0.8% in LLVM passes for 'make_quicksort_impl.<locals>.make_res'
 0.6% in LLVM passes for 'gt'
 0.6% in LLVM passes for 'make_quicksort_impl.<locals>.default_lt'
 0.5% in LLVM passes for 'make_quicksort_impl.<locals>.GET'
 0.5% in LLVM passes for 'make_quicksort_impl.<locals>.GET'
total time in LLVM passes: 0.7901000000000001s
Not all of the time reported by the llvm_lock timer is spent in LLVM passes; the remainder is spent in other LLVM API calls.
total_llvm_time = foo.get_metadata(foo.signatures[0])['timers']['llvm_lock']
print(f"Percent of llvm_lock time not accounted for in pass timings: {(total_llvm_time - total_time_in_passes) / total_llvm_time * 100:.1f}%")
Percent of llvm_lock time not accounted for in pass timings: 29.9%