The purpose of this tutorial is twofold: to illustrate the performance optimizations applied to the code generated by an `Operator`, and to describe the options available to steer this process. As we shall see, most optimizations are applied automatically, as they are known to systematically improve performance. Others, whose impact varies across different `Operator`s, must instead be enabled through specific flags.
An `Operator` has several preset optimization levels; the fundamental ones are `noop` and `advanced`. With `noop`, no performance optimizations are introduced. With `advanced`, several flop-reducing and data-locality optimization passes are applied. Examples of flop-reducing passes are common sub-expression elimination and factorization; examples of data-locality passes are loop fusion and cache blocking. Optimization levels in Devito are conceptually akin to the `-O2`, `-O3`, ... flags in classic C/C++/Fortran compilers.
An optimization pass may provide knobs, or options, for fine-grained tuning. As explained in the next sections, some of these options are given at compile-time, others at run-time.
**** Remark ****
Parallelism -- both shared-memory (e.g., OpenMP) and distributed-memory (MPI) -- is disabled by default and is not controlled via the optimization level. In this tutorial we will also show how to enable OpenMP parallelism (you'll see it's trivial!). Another mini-guide about parallelism in Devito and related aspects is available here.
********
The optimization level may be changed in various ways:

- globally, through the `DEVITO_OPT` environment variable. For example, to disable all optimizations on all `Operator`s, one could run with

  DEVITO_OPT=noop python ...

- programmatically, through the `configuration` dictionary:

  from devito import configuration
  configuration['opt'] = 'noop'

- locally, through an `Operator` keyword argument:

  Operator(..., opt='noop')

Local takes precedence over programmatic, and programmatic takes precedence over global.
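The precedence rule can be sketched in a few lines of plain Python (a conceptual model for illustration, not Devito's actual implementation):

```python
import os

def resolve_opt(local=None, programmatic=None, default='advanced'):
    """Return the effective optimization level: local beats
    programmatic, which beats the global environment variable."""
    if local is not None:          # Operator(..., opt=...)
        return local
    if programmatic is not None:   # configuration['opt'] = ...
        return programmatic
    return os.environ.get('DEVITO_OPT', default)  # DEVITO_OPT=...

# Local wins over programmatic, programmatic wins over global
assert resolve_opt(local='noop', programmatic='advanced') == 'noop'
assert resolve_opt(programmatic='noop') == 'noop'
```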
The optimization options, instead, may only be changed locally. The syntax to specify an option is

Operator(..., opt=('advanced', {<optimization options>}))

A concrete example (you can ignore its meaning for now) is

Operator(..., opt=('advanced', {'blocklevels': 2}))

That is, options are to be specified together with the optimization level (`advanced`).
By default, all `Operator`s are run with the optimization level set to `advanced`. So this

Operator(Eq(...))

is equivalent to

Operator(Eq(...), opt='advanced')

and obviously also to

Operator(Eq(...), opt=('advanced', {}))

In virtually all scenarios, regardless of application and underlying architecture, this ensures very good performance -- but not necessarily the very best.
The following functions will be used throughout the notebook for printing generated code.
from examples.performance import unidiff_output, print_kernel
The following cell is only needed for Continuous Integration. It is, however, also an example of how "programmatic takes precedence over global" (see the API section).
from devito import configuration
configuration['language'] = 'C'
configuration['platform'] = 'bdw' # Optimize for an Intel Broadwell
configuration['opt-options']['par-collapse-ncores'] = 1 # Maximize use of loop collapsing
Throughout the notebook we will generate `Operator`s for the following time-marching `Eq`.
from devito import Eq, Grid, Operator, Function, TimeFunction, sin
grid = Grid(shape=(80, 80, 80))
f = Function(name='f', grid=grid)
u = TimeFunction(name='u', grid=grid, space_order=4)
eq = Eq(u.forward, f**2*sin(f)*u.dy.dy)
Despite its simplicity, this `Eq` is all we need to showcase the key components of the Devito optimization engine.
There are several ways to enable OpenMP parallelism. The one we use here consists of supplying an option to an `Operator`. The next cell illustrates the difference between two `Operator`s generated with the `noop` optimization level, but with OpenMP enabled in the latter.
op0 = Operator(eq, opt=('noop'))
op0_omp = Operator(eq, opt=('noop', {'openmp': True}))
# print(op0)
# print(unidiff_output(str(op0), str(op0_omp))) # Uncomment to print out the diff only
print_kernel(op0_omp)
for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2)) { /* Begin section0 */ START_TIMER(section0) #pragma omp parallel num_threads(nthreads) { #pragma omp for collapse(1) schedule(dynamic,1) for (int x = x_m; x <= x_M; x += 1) { for (int y = y_m; y <= y_M; y += 1) { for (int z = z_m; z <= z_M; z += 1) { u[t1][x + 4][y + 4][z + 4] = (8.33333333e-2F*(8.33333333e-2F*u[t0][x + 4][y][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 1][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y)/h_y - 6.66666667e-1F*(8.33333333e-2F*u[t0][x + 4][y + 1][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 2][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 5][z + 4]/h_y)/h_y + 6.66666667e-1F*(8.33333333e-2F*u[t0][x + 4][y + 3][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 6][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 7][z + 4]/h_y)/h_y - 8.33333333e-2F*(8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 7][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 8][z + 4]/h_y)/h_y)*sin(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2); } } } } STOP_TIMER(section0,timers) /* End section0 */ }
The OpenMP-ized `op0_omp` `Operator` includes:

- the `"omp.h"` header file;
- a `#pragma omp parallel num_threads(nthreads)` directive;
- a `#pragma omp for collapse(...) schedule(dynamic,1)` directive.

More complex `Operator`s will have more directives, more types of directives, different iteration scheduling strategies based on heuristics and empirical tuning (e.g., `static` instead of `dynamic`), etc.
The reason for `collapse(1)`, rather than `collapse(3)`, boils down to using `opt=('noop', ...)`; had the default `advanced` mode been used instead, we would see the latter clause.
We note how the OpenMP pass introduces a new symbol, `nthreads`. This allows users to explicitly control the number of threads with which an `Operator` is run.
op0_omp.apply(time_M=0) # Picks up `nthreads` from the standard environment variable OMP_NUM_THREADS
op0_omp.apply(time_M=0, nthreads=2) # Runs with 2 threads per parallel loop
A few optimization options are available for this pass (though not on all platforms -- see here); in our experience, the default values do a fine job:

- `par-collapse-ncores`: use a collapse clause only if the number of available physical cores is greater than this value (default=4).
- `par-collapse-work`: use a collapse clause only if the trip count of the collapsable loops is statically known to exceed this value (default=100).
- `par-chunk-nonaffine`: a coefficient to adjust the chunk size in non-affine parallel loops. The larger the coefficient, the smaller the chunk size (default=3).
- `par-dynamic-work`: use dynamic scheduling if the operation count per loop iteration exceeds this value; otherwise, use static scheduling (default=10).
- `par-nested`: use nested parallelism if the number of hyperthreads per core is greater than this value (default=2).

So, for instance, we could switch to static scheduling by constructing the following `Operator`:
op0_b0_omp = Operator(eq, opt=('noop', {'openmp': True, 'par-dynamic-work': 100}))
print_kernel(op0_b0_omp)
for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2)) { /* Begin section0 */ START_TIMER(section0) #pragma omp parallel num_threads(nthreads) { #pragma omp for collapse(1) schedule(static,1) for (int x = x_m; x <= x_M; x += 1) { for (int y = y_m; y <= y_M; y += 1) { for (int z = z_m; z <= z_M; z += 1) { u[t1][x + 4][y + 4][z + 4] = (8.33333333e-2F*(8.33333333e-2F*u[t0][x + 4][y][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 1][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y)/h_y - 6.66666667e-1F*(8.33333333e-2F*u[t0][x + 4][y + 1][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 2][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 5][z + 4]/h_y)/h_y + 6.66666667e-1F*(8.33333333e-2F*u[t0][x + 4][y + 3][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 6][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 7][z + 4]/h_y)/h_y - 8.33333333e-2F*(8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 7][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 8][z + 4]/h_y)/h_y)*sin(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2); } } } } STOP_TIMER(section0,timers) /* End section0 */ }
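The `par-dynamic-work` heuristic above boils down to a one-line decision; here is a toy model (`pick_schedule` is a hypothetical helper for illustration, not part of Devito, and the operation count used below is nominal):

```python
def pick_schedule(ops_per_iteration, par_dynamic_work=10):
    """Use dynamic scheduling only when each loop iteration carries
    enough work to amortize the runtime scheduling overhead."""
    return 'dynamic' if ops_per_iteration > par_dynamic_work else 'static'

# With the default threshold (10), a moderately heavy iteration gets
# dynamic scheduling; raising the threshold to 100, as done for
# op0_b0_omp above, flips the decision to static
assert pick_schedule(30) == 'dynamic'
assert pick_schedule(30, par_dynamic_work=100) == 'static'
```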
The `advanced` mode

The default optimization level in Devito is `advanced`. This mode performs several compilation passes to optimize the `Operator` for computation (number of flops), working set size, and data locality. In the next paragraphs we'll dissect the `advanced` mode and analyze, one by one, some of its key passes.
The next cell creates a new `Operator` that adds loop blocking to what we had in `op0_omp`.
op1_omp = Operator(eq, opt=('blocking', {'openmp': True}))
# print(op1_omp) # Uncomment to see the *whole* generated code
print_kernel(op1_omp)
void bf0(struct dataobj *restrict f_vec, const float h_y, struct dataobj *restrict u_vec, const int t0, const int t1, const int x0_blk0_size, const int x_M, const int x_m, const int y0_blk0_size, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads) { float (*restrict f)[f_vec->size[1]][f_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[f_vec->size[1]][f_vec->size[2]]) f_vec->data; float (*restrict u)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]]) u_vec->data; if (x0_blk0_size == 0 || y0_blk0_size == 0) { return; } #pragma omp parallel num_threads(nthreads) { #pragma omp for collapse(2) schedule(dynamic,1) for (int x0_blk0 = x_m; x0_blk0 <= x_M; x0_blk0 += x0_blk0_size) { for (int y0_blk0 = y_m; y0_blk0 <= y_M; y0_blk0 += y0_blk0_size) { for (int x = x0_blk0; x <= x0_blk0 + x0_blk0_size - 1; x += 1) { for (int y = y0_blk0; y <= y0_blk0 + y0_blk0_size - 1; y += 1) { for (int z = z_m; z <= z_M; z += 1) { u[t1][x + 4][y + 4][z + 4] = (8.33333333e-2F*(8.33333333e-2F*u[t0][x + 4][y][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 1][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y)/h_y - 6.66666667e-1F*(8.33333333e-2F*u[t0][x + 4][y + 1][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 2][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 5][z + 4]/h_y)/h_y + 6.66666667e-1F*(8.33333333e-2F*u[t0][x + 4][y + 3][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 6][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 7][z + 4]/h_y)/h_y - 8.33333333e-2F*(8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 7][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 8][z + 4]/h_y)/h_y)*sin(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2); } } } } } } }
**** Remark ****
`'blocking'` is not an optimization level -- rather, it identifies a specific compilation pass. In other words, the `advanced` mode defines an ordered sequence of passes, and `blocking` is one such pass.
********
The `blocking` pass creates additional loops over blocks. In this simple `Operator` there's just one loop nest, so only one pair of additional loops is created. In more complex `Operator`s, several loop nests may individually be blocked, whereas others may be left unblocked -- this is decided by the Devito compiler according to certain heuristics. The size of a block is represented by the symbols `x0_blk0_size` and `y0_blk0_size`, which are runtime parameters akin to `nthreads`.
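To see what blocked iteration means independently of the generated C, here is a toy Python version of 2D loop blocking (a sketch of the concept only; the generated code above additionally handles remainder regions and parallelism):

```python
def blocked_indices(x_M, y_M, xb, yb):
    """Traverse a (x_M+1) x (y_M+1) iteration space in xb-by-yb blocks,
    mirroring the x0_blk0/y0_blk0 loops in the generated code."""
    visited = []
    for x0 in range(0, x_M + 1, xb):          # loop over x blocks
        for y0 in range(0, y_M + 1, yb):      # loop over y blocks
            for x in range(x0, min(x0 + xb, x_M + 1)):   # within a block
                for y in range(y0, min(y0 + yb, y_M + 1)):
                    visited.append((x, y))
    return visited

# Blocking only reorders the iterations; every point is visited exactly once
pts = blocked_indices(9, 9, 4, 4)
assert sorted(pts) == [(x, y) for x in range(10) for y in range(10)]
```

The reordering is what improves data locality: consecutive iterations stay within a small tile that fits in cache, instead of streaming across an entire row of the grid.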
By default, Devito applies 2D blocking and sets the default block shape to 8x8. There are two ways to set a different block shape:

- passing an explicit value, e.g.

  op1_omp.apply(..., x0_blk0_size=24)

- letting the autotuner pick one, e.g.

  op1_omp.apply(..., autotune='aggressive')
Loop blocking also provides two optimization options:

- `blockinner={False, True}` -- to enable 3D (or, in general, nD, n>2) blocking;
- `blocklevels={int}` -- to enable hierarchical blocking, exploiting multiple levels of the cache hierarchy.

In the example below, we construct an `Operator` with six-dimensional loop blocking: the first three loops iterate over outer blocks, whereas the second three loops iterate over inner blocks within an outer block.
op1_omp_6D = Operator(eq, opt=('blocking', {'blockinner': True, 'blocklevels': 2, 'openmp': True}))
# print(op1_omp_6D) # Uncomment to see the *whole* generated code
print_kernel(op1_omp_6D)
void bf0(struct dataobj *restrict f_vec, const float h_y, struct dataobj *restrict u_vec, const int t0, const int t1, const int x0_blk0_size, const int x0_blk1_size, const int x_M, const int x_m, const int y0_blk0_size, const int y0_blk1_size, const int y_M, const int y_m, const int z0_blk0_size, const int z0_blk1_size, const int z_M, const int z_m, const int nthreads) { float (*restrict f)[f_vec->size[1]][f_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[f_vec->size[1]][f_vec->size[2]]) f_vec->data; float (*restrict u)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]]) u_vec->data; if (x0_blk0_size == 0 || y0_blk0_size == 0 || z0_blk0_size == 0) { return; } #pragma omp parallel num_threads(nthreads) { #pragma omp for collapse(3) schedule(dynamic,1) for (int x0_blk0 = x_m; x0_blk0 <= x_M; x0_blk0 += x0_blk0_size) { for (int y0_blk0 = y_m; y0_blk0 <= y_M; y0_blk0 += y0_blk0_size) { for (int z0_blk0 = z_m; z0_blk0 <= z_M; z0_blk0 += z0_blk0_size) { for (int x0_blk1 = x0_blk0; x0_blk1 <= x0_blk0 + x0_blk0_size - 1; x0_blk1 += x0_blk1_size) { for (int y0_blk1 = y0_blk0; y0_blk1 <= y0_blk0 + y0_blk0_size - 1; y0_blk1 += y0_blk1_size) { for (int z0_blk1 = z0_blk0; z0_blk1 <= z0_blk0 + z0_blk0_size - 1; z0_blk1 += z0_blk1_size) { for (int x = x0_blk1; x <= x0_blk1 + x0_blk1_size - 1; x += 1) { for (int y = y0_blk1; y <= y0_blk1 + y0_blk1_size - 1; y += 1) { for (int z = z0_blk1; z <= z0_blk1 + z0_blk1_size - 1; z += 1) { u[t1][x + 4][y + 4][z + 4] = (8.33333333e-2F*(8.33333333e-2F*u[t0][x + 4][y][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 1][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y)/h_y - 6.66666667e-1F*(8.33333333e-2F*u[t0][x + 4][y + 1][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 2][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 5][z + 4]/h_y)/h_y + 
6.66666667e-1F*(8.33333333e-2F*u[t0][x + 4][y + 3][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 6][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 7][z + 4]/h_y)/h_y - 8.33333333e-2F*(8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 7][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 8][z + 4]/h_y)/h_y)*sin(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2); } } } } } } } } } } }
Devito enforces SIMD vectorization through OpenMP pragmas.
op2_omp = Operator(eq, opt=('blocking', 'simd', {'openmp': True}))
# print(op2_omp) # Uncomment to see the generated code
# print(unidiff_output(str(op1_omp), str(op2_omp))) # Uncomment to print out the diff only
The `advanced` mode has a code motion pass. In explicit PDE solvers, this is most commonly used to lift expensive time-invariant sub-expressions out of the inner loops. The pass is, however, quite general, in that it is not restricted to time-invariance -- any sub-expression invariant with respect to a subset of `Dimension`s is a code motion candidate. In our running example, `sin(f)` gets hoisted out of the inner loops, since it is determined to be an expensive invariant sub-expression. In other words, the compiler trades the redundant computation of `sin(f)` for additional storage (the `r0[...]` array).
op3_omp = Operator(eq, opt=('lift', {'openmp': True}))
print_kernel(op3_omp)
float (*r0)[y_size][z_size]; posix_memalign((void**)&r0, 64, sizeof(float[x_size][y_size][z_size])); /* Begin section0 */ START_TIMER(section0) #pragma omp parallel num_threads(nthreads) { #pragma omp for collapse(3) schedule(static,1) for (int x = x_m; x <= x_M; x += 1) { for (int y = y_m; y <= y_M; y += 1) { for (int z = z_m; z <= z_M; z += 1) { r0[x][y][z] = sin(f[x + 1][y + 1][z + 1]); } } } } STOP_TIMER(section0,timers) /* End section0 */ for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2)) { /* Begin section1 */ START_TIMER(section1) #pragma omp parallel num_threads(nthreads) { #pragma omp for collapse(1) schedule(dynamic,1) for (int x = x_m; x <= x_M; x += 1) { for (int y = y_m; y <= y_M; y += 1) { for (int z = z_m; z <= z_M; z += 1) { u[t1][x + 4][y + 4][z + 4] = (8.33333333e-2F*(8.33333333e-2F*u[t0][x + 4][y][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 1][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y)/h_y - 6.66666667e-1F*(8.33333333e-2F*u[t0][x + 4][y + 1][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 2][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 5][z + 4]/h_y)/h_y + 6.66666667e-1F*(8.33333333e-2F*u[t0][x + 4][y + 3][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 6][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 7][z + 4]/h_y)/h_y - 8.33333333e-2F*(8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 7][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 8][z + 4]/h_y)/h_y)*pow(f[x + 1][y + 1][z + 1], 2)*r0[x][y][z]; } } } } STOP_TIMER(section1,timers) /* End section1 */ } free(r0);
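The structure of the transformation -- compute `sin(f)` once, reuse it at every time step -- can be sketched with NumPy (a conceptual model of the lift pass, not the generated code):

```python
import numpy as np

def without_lift(f, nt):
    # sin(f) is recomputed at every time iteration
    acc = np.zeros_like(f)
    for _ in range(nt):
        acc += np.sin(f) * f**2
    return acc

def with_lift(f, nt):
    r0 = np.sin(f)          # hoisted: time-invariant sub-expression
    acc = np.zeros_like(f)
    for _ in range(nt):
        acc += r0 * f**2    # nt-1 redundant sin evaluations avoided
    return acc

f = np.random.rand(8, 8, 8).astype(np.float32)
assert np.allclose(without_lift(f, 4), with_lift(f, 4))
```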
Among the simpler flop-reducing transformations applied by the `advanced` mode we find:

- common sub-expression elimination (`cse`);
- factorization (`factorize`);
- optimization of powers (`opt-pows`).

The cell below shows how the computation of `u` changes by incrementally applying these three passes. First of all, we observe how the reciprocal of the symbolic spacing `h_y` gets assigned to a temporary, `r0`, as it appears in several sub-expressions. This is the effect of CSE.
op4_omp = Operator(eq, opt=('cse', {'openmp': True}))
print(unidiff_output(str(op0_omp), str(op4_omp)))
--- +++ @@ -42,7 +42,8 @@ { for (int z = z_m; z <= z_M; z += 1) { - u[t1][x + 4][y + 4][z + 4] = (8.33333333e-2F*(8.33333333e-2F*u[t0][x + 4][y][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 1][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y)/h_y - 6.66666667e-1F*(8.33333333e-2F*u[t0][x + 4][y + 1][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 2][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 5][z + 4]/h_y)/h_y + 6.66666667e-1F*(8.33333333e-2F*u[t0][x + 4][y + 3][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 6][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 7][z + 4]/h_y)/h_y - 8.33333333e-2F*(8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 7][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 8][z + 4]/h_y)/h_y)*sin(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2); + float r0 = 1.0F/h_y; + u[t1][x + 4][y + 4][z + 4] = (8.33333333e-2F*r0*(8.33333333e-2F*r0*u[t0][x + 4][y][z + 4] - 6.66666667e-1F*r0*u[t0][x + 4][y + 1][z + 4] + 6.66666667e-1F*r0*u[t0][x + 4][y + 3][z + 4] - 8.33333333e-2F*r0*u[t0][x + 4][y + 4][z + 4]) - 6.66666667e-1F*r0*(8.33333333e-2F*r0*u[t0][x + 4][y + 1][z + 4] - 6.66666667e-1F*r0*u[t0][x + 4][y + 2][z + 4] + 6.66666667e-1F*r0*u[t0][x + 4][y + 4][z + 4] - 8.33333333e-2F*r0*u[t0][x + 4][y + 5][z + 4]) + 6.66666667e-1F*r0*(8.33333333e-2F*r0*u[t0][x + 4][y + 3][z + 4] - 6.66666667e-1F*r0*u[t0][x + 4][y + 4][z + 4] + 6.66666667e-1F*r0*u[t0][x + 4][y + 6][z + 4] - 8.33333333e-2F*r0*u[t0][x + 4][y + 7][z + 4]) - 8.33333333e-2F*r0*(8.33333333e-2F*r0*u[t0][x + 4][y + 4][z + 4] - 6.66666667e-1F*r0*u[t0][x + 4][y + 5][z + 4] + 6.66666667e-1F*r0*u[t0][x + 4][y + 7][z + 4] - 8.33333333e-2F*r0*u[t0][x + 4][y + 8][z + 4]))*sin(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2); } } }
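The arithmetic equivalence of this CSE step is easy to check in isolation (plain Python, with made-up input values):

```python
import math

h_y = 0.1
u0, u1, u2, u3 = 1.0, 2.0, 3.0, 4.0

# Before CSE: four divisions by h_y
before = (8.33333333e-2*u0/h_y - 6.66666667e-1*u1/h_y
          + 6.66666667e-1*u2/h_y - 8.33333333e-2*u3/h_y)

# After CSE: one reciprocal, then only multiplications
r0 = 1.0/h_y
after = (8.33333333e-2*r0*u0 - 6.66666667e-1*r0*u1
         + 6.66666667e-1*r0*u2 - 8.33333333e-2*r0*u3)

assert math.isclose(before, after, rel_tol=1e-9)
```

The win is twofold: the reciprocal is computed once rather than four times, and a floating-point division is typically far costlier than a multiplication.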
The factorization pass collects `r0` to reduce the number of multiplications.
op5_omp = Operator(eq, opt=('cse', 'factorize', {'openmp': True}))
print(unidiff_output(str(op4_omp), str(op5_omp)))
--- +++ @@ -43,7 +43,7 @@ for (int z = z_m; z <= z_M; z += 1) { float r0 = 1.0F/h_y; - u[t1][x + 4][y + 4][z + 4] = (8.33333333e-2F*r0*(8.33333333e-2F*r0*u[t0][x + 4][y][z + 4] - 6.66666667e-1F*r0*u[t0][x + 4][y + 1][z + 4] + 6.66666667e-1F*r0*u[t0][x + 4][y + 3][z + 4] - 8.33333333e-2F*r0*u[t0][x + 4][y + 4][z + 4]) - 6.66666667e-1F*r0*(8.33333333e-2F*r0*u[t0][x + 4][y + 1][z + 4] - 6.66666667e-1F*r0*u[t0][x + 4][y + 2][z + 4] + 6.66666667e-1F*r0*u[t0][x + 4][y + 4][z + 4] - 8.33333333e-2F*r0*u[t0][x + 4][y + 5][z + 4]) + 6.66666667e-1F*r0*(8.33333333e-2F*r0*u[t0][x + 4][y + 3][z + 4] - 6.66666667e-1F*r0*u[t0][x + 4][y + 4][z + 4] + 6.66666667e-1F*r0*u[t0][x + 4][y + 6][z + 4] - 8.33333333e-2F*r0*u[t0][x + 4][y + 7][z + 4]) - 8.33333333e-2F*r0*(8.33333333e-2F*r0*u[t0][x + 4][y + 4][z + 4] - 6.66666667e-1F*r0*u[t0][x + 4][y + 5][z + 4] + 6.66666667e-1F*r0*u[t0][x + 4][y + 7][z + 4] - 8.33333333e-2F*r0*u[t0][x + 4][y + 8][z + 4]))*sin(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2); + u[t1][x + 4][y + 4][z + 4] = pow(r0, 2)*(6.94444444e-3F*(u[t0][x + 4][y][z + 4] + u[t0][x + 4][y + 8][z + 4]) + 4.44444444e-1F*(u[t0][x + 4][y + 2][z + 4] + u[t0][x + 4][y + 6][z + 4]) + 1.11111111e-1F*(-u[t0][x + 4][y + 1][z + 4] + u[t0][x + 4][y + 3][z + 4] + u[t0][x + 4][y + 5][z + 4] - u[t0][x + 4][y + 7][z + 4]) - 9.02777778e-1F*u[t0][x + 4][y + 4][z + 4])*sin(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2); } } }
Finally, `opt-pows` turns costly `pow` calls into multiplications.
op6_omp = Operator(eq, opt=('cse', 'factorize', 'opt-pows', {'openmp': True}))
print(unidiff_output(str(op5_omp), str(op6_omp)))
--- +++ @@ -43,7 +43,7 @@ for (int z = z_m; z <= z_M; z += 1) { float r0 = 1.0F/h_y; - u[t1][x + 4][y + 4][z + 4] = pow(r0, 2)*(6.94444444e-3F*(u[t0][x + 4][y][z + 4] + u[t0][x + 4][y + 8][z + 4]) + 4.44444444e-1F*(u[t0][x + 4][y + 2][z + 4] + u[t0][x + 4][y + 6][z + 4]) + 1.11111111e-1F*(-u[t0][x + 4][y + 1][z + 4] + u[t0][x + 4][y + 3][z + 4] + u[t0][x + 4][y + 5][z + 4] - u[t0][x + 4][y + 7][z + 4]) - 9.02777778e-1F*u[t0][x + 4][y + 4][z + 4])*sin(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2); + u[t1][x + 4][y + 4][z + 4] = (r0*r0)*(f[x + 1][y + 1][z + 1]*f[x + 1][y + 1][z + 1])*(6.94444444e-3F*(u[t0][x + 4][y][z + 4] + u[t0][x + 4][y + 8][z + 4]) + 4.44444444e-1F*(u[t0][x + 4][y + 2][z + 4] + u[t0][x + 4][y + 6][z + 4]) + 1.11111111e-1F*(-u[t0][x + 4][y + 1][z + 4] + u[t0][x + 4][y + 3][z + 4] + u[t0][x + 4][y + 5][z + 4] - u[t0][x + 4][y + 7][z + 4]) - 9.02777778e-1F*u[t0][x + 4][y + 4][z + 4])*sin(f[x + 1][y + 1][z + 1]); } } }
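The rewrite exploits the identity `x**n == x*x*...*x` for small integer exponents; a quick numerical check (illustrative only -- the generic `pow` library call is much costlier than repeated multiplication):

```python
import math

x = 1.234567
# pow(x, 2) -> x*x, pow(x, 3) -> x*x*x, as in the diff above
assert math.isclose(pow(x, 2), x*x, rel_tol=1e-12)
assert math.isclose(pow(x, 3), x*x*x, rel_tol=1e-12)
```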
This is perhaps the most advanced among the optimization passes in Devito. CIRE [1] searches for redundancies across consecutive loop iterations. These are often induced by a mixture of nested, high-order, partial derivatives. The underlying idea is very simple. Consider:
r0 = a[i-1] + a[i] + a[i+1]
At `i=1`, we have
r0 = a[0] + a[1] + a[2]
At `i=2`, we have
r0 = a[1] + a[2] + a[3]
So the sub-expression `a[1] + a[2]` is computed twice, by two consecutive iterations.
What makes CIRE complicated is the generalization to arbitrary expressions, the presence of multiple dimensions, the scheduling strategy due to the trade-off between redundant compute and working set, and the co-existence with other optimizations (e.g., blocking, vectorization). All these aspects won't be treated here. What we will show instead is the effect of CIRE on our running example, and the optimization options at our disposal to drive the detection and scheduling of the captured redundancies.
In our running example, some cross-iteration redundancies are induced by the nested first-order derivatives along `y`. As we see below, these redundancies are captured and assigned to the two-dimensional temporary `r0`.
Note: the name `cire-sops` means "apply CIRE to sum-of-product expressions". A sum-of-product is what taking a derivative via finite differences produces.
op7_omp = Operator(eq, opt=('cire-sops', {'openmp': True}))
print_kernel(op7_omp)
# print(unidiff_output(str(op7_omp), str(op0_omp))) # Uncomment to print out the diff only
float **pr0; posix_memalign((void**)&pr0, 64, sizeof(float*)*nthreads); #pragma omp parallel num_threads(nthreads) { const int tid = omp_get_thread_num(); posix_memalign((void**)&pr0[tid], 64, sizeof(float[y_size + 4][z_size])); } for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2)) { /* Begin section0 */ START_TIMER(section0) #pragma omp parallel num_threads(nthreads) { const int tid = omp_get_thread_num(); float (*restrict r0)[z_size] __attribute__ ((aligned (64))) = (float (*)[z_size]) pr0[tid]; #pragma omp for collapse(1) schedule(dynamic,1) for (int x = x_m; x <= x_M; x += 1) { for (int y = y_m - 2; y <= y_M + 2; y += 1) { for (int z = z_m; z <= z_M; z += 1) { r0[y + 2][z] = 8.33333333e-2F*u[t0][x + 4][y + 2][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 6][z + 4]/h_y; } } for (int y = y_m; y <= y_M; y += 1) { for (int z = z_m; z <= z_M; z += 1) { u[t1][x + 4][y + 4][z + 4] = (8.33333333e-2F*r0[y][z]/h_y - 6.66666667e-1F*r0[y + 1][z]/h_y + 6.66666667e-1F*r0[y + 3][z]/h_y - 8.33333333e-2F*r0[y + 4][z]/h_y)*sin(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2); } } } } STOP_TIMER(section0,timers) /* End section0 */ } #pragma omp parallel num_threads(nthreads) { const int tid = omp_get_thread_num(); free(pr0[tid]); } free(pr0);
We note that, since there are no redundancies along `x`, the compiler is smart enough to figure out that `r0` and `u` can safely be computed within the same `x` loop. This is nice -- not only is the reuse distance decreased, but a grid-sized temporary is also avoided.
The `min-storage` option

Let's now consider a variant of our running example:
eq = Eq(u.forward, f**2*sin(f)*(u.dy.dy + u.dx.dx))
op7_b0_omp = Operator(eq, opt=('cire-sops', {'openmp': True}))
print_kernel(op7_b0_omp)
float (*r0)[y_size + 4][z_size]; posix_memalign((void**)&r0, 64, sizeof(float[x_size + 4][y_size + 4][z_size])); float (*r1)[y_size + 4][z_size]; posix_memalign((void**)&r1, 64, sizeof(float[x_size + 4][y_size + 4][z_size])); for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2)) { /* Begin section0 */ START_TIMER(section0) #pragma omp parallel num_threads(nthreads) { #pragma omp for collapse(3) schedule(dynamic,1) for (int x = x_m - 2; x <= x_M + 2; x += 1) { for (int y = y_m - 2; y <= y_M + 2; y += 1) { for (int z = z_m; z <= z_M; z += 1) { r0[x + 2][y + 2][z] = 8.33333333e-2F*u[t0][x + 2][y + 4][z + 4]/h_x - 6.66666667e-1F*u[t0][x + 3][y + 4][z + 4]/h_x + 6.66666667e-1F*u[t0][x + 5][y + 4][z + 4]/h_x - 8.33333333e-2F*u[t0][x + 6][y + 4][z + 4]/h_x; r1[x + 2][y + 2][z] = 8.33333333e-2F*u[t0][x + 4][y + 2][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 6][z + 4]/h_y; } } } } #pragma omp parallel num_threads(nthreads) { #pragma omp for collapse(3) schedule(dynamic,1) for (int x = x_m; x <= x_M; x += 1) { for (int y = y_m; y <= y_M; y += 1) { for (int z = z_m; z <= z_M; z += 1) { u[t1][x + 4][y + 4][z + 4] = (8.33333333e-2F*r1[x + 2][y][z]/h_y - 6.66666667e-1F*r1[x + 2][y + 1][z]/h_y + 6.66666667e-1F*r1[x + 2][y + 3][z]/h_y - 8.33333333e-2F*r1[x + 2][y + 4][z]/h_y + 8.33333333e-2F*r0[x][y + 2][z]/h_x - 6.66666667e-1F*r0[x + 1][y + 2][z]/h_x + 6.66666667e-1F*r0[x + 3][y + 2][z]/h_x - 8.33333333e-2F*r0[x + 4][y + 2][z]/h_x)*sin(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2); } } } } STOP_TIMER(section0,timers) /* End section0 */ } free(r0); free(r1);
As expected, there are now two temporaries, one stemming from `u.dy.dy` and the other from `u.dx.dx`. A key difference with respect to `op7_omp` is that both are now grid-sized temporaries. This might seem odd at first -- why should the `u.dy.dy` temporary, that is `r1`, be three-dimensional when we already know it could be two-dimensional? This is merely a compiler heuristic: by adding an extra dimension to `r1`, both temporaries can be scheduled within the same loop nest, thus augmenting data reuse and potentially enabling further cross-expression optimizations. We can disable this heuristic through the `min-storage` option.
op7_b1_omp = Operator(eq, opt=('cire-sops', {'openmp': True, 'min-storage': True}))
print_kernel(op7_b1_omp)
float (*r0)[y_size][z_size]; posix_memalign((void**)&r0, 64, sizeof(float[x_size + 4][y_size][z_size])); float **pr1; posix_memalign((void**)&pr1, 64, sizeof(float*)*nthreads); #pragma omp parallel num_threads(nthreads) { const int tid = omp_get_thread_num(); posix_memalign((void**)&pr1[tid], 64, sizeof(float[y_size + 4][z_size])); } for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2)) { /* Begin section0 */ START_TIMER(section0) #pragma omp parallel num_threads(nthreads) { #pragma omp for collapse(3) schedule(dynamic,1) for (int x = x_m - 2; x <= x_M + 2; x += 1) { for (int y = y_m; y <= y_M; y += 1) { for (int z = z_m; z <= z_M; z += 1) { r0[x + 2][y][z] = 8.33333333e-2F*u[t0][x + 2][y + 4][z + 4]/h_x - 6.66666667e-1F*u[t0][x + 3][y + 4][z + 4]/h_x + 6.66666667e-1F*u[t0][x + 5][y + 4][z + 4]/h_x - 8.33333333e-2F*u[t0][x + 6][y + 4][z + 4]/h_x; } } } } #pragma omp parallel num_threads(nthreads) { const int tid = omp_get_thread_num(); float (*restrict r1)[z_size] __attribute__ ((aligned (64))) = (float (*)[z_size]) pr1[tid]; #pragma omp for collapse(1) schedule(dynamic,1) for (int x = x_m; x <= x_M; x += 1) { for (int y = y_m - 2; y <= y_M + 2; y += 1) { for (int z = z_m; z <= z_M; z += 1) { r1[y + 2][z] = 8.33333333e-2F*u[t0][x + 4][y + 2][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 6][z + 4]/h_y; } } for (int y = y_m; y <= y_M; y += 1) { for (int z = z_m; z <= z_M; z += 1) { u[t1][x + 4][y + 4][z + 4] = (8.33333333e-2F*r1[y][z]/h_y - 6.66666667e-1F*r1[y + 1][z]/h_y + 6.66666667e-1F*r1[y + 3][z]/h_y - 8.33333333e-2F*r1[y + 4][z]/h_y + 8.33333333e-2F*r0[x][y][z]/h_x - 6.66666667e-1F*r0[x + 1][y][z]/h_x + 6.66666667e-1F*r0[x + 3][y][z]/h_x - 8.33333333e-2F*r0[x + 4][y][z]/h_x)*sin(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2); } } } } STOP_TIMER(section0,timers) /* End section0 */ } 
#pragma omp parallel num_threads(nthreads) { const int tid = omp_get_thread_num(); free(pr1[tid]); } free(r0); free(pr1);
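The memory trade-off at stake is easy to quantify for our 80^3 grid (back-of-the-envelope figures: halo padding and the per-thread replication of the 2D temporary are ignored):

```python
bytes_per_float = 4  # single precision, as in the generated code

# Grid-sized (3D) temporary, as scheduled by default to enable fusion
full_3d = 80 * 80 * 80 * bytes_per_float   # bytes for one x-y-z temporary

# Slice-sized (2D) temporary, as obtained with min-storage=True
slice_2d = 80 * 80 * bytes_per_float       # bytes for one y-z slice

assert full_3d == 2_048_000                # ~2 MB per 3D temporary
assert full_3d // slice_2d == 80           # storage reduced by the x extent
```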
The `cire-mincost-sops` option

So far so good -- we've seen that Devito can capture and schedule cross-iteration redundancies. But what if we actually do not want certain redundancies to be captured? There are a few reasons we may want that -- for example, the tensor temporaries may require too much extra memory, and we would rather avoid that. To this end, we can tell Devito what the minimum cost of a sub-expression should be for it to become a CIRE candidate. The cost is an integer number based on the operation count and the type of operations:

- `+` and `*` have a cost of 1;
- `/` whose divisor is a constant expression has a cost of 1;
- `/` whose divisor is not a constant expression has a cost of 25;
- a power with exponent `n` has a cost of `n-1` (i.e., the number of `*` it will be converted into);
- transcendental functions (`sin`, `cos`, etc.) have a cost of 100.

The `cire-mincost-sops` option can be used to control the minimum cost of CIRE candidates.
eq = Eq(u.forward, f**2*sin(f)*u.dy.dy) # Back to original running example
op8_omp = Operator(eq, opt=('cire-sops', {'openmp': True, 'cire-mincost-sops': 31}))
print_kernel(op8_omp)
for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2)) { /* Begin section0 */ START_TIMER(section0) #pragma omp parallel num_threads(nthreads) { #pragma omp for collapse(1) schedule(dynamic,1) for (int x = x_m; x <= x_M; x += 1) { for (int y = y_m; y <= y_M; y += 1) { for (int z = z_m; z <= z_M; z += 1) { u[t1][x + 4][y + 4][z + 4] = (8.33333333e-2F*(8.33333333e-2F*u[t0][x + 4][y][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 1][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y)/h_y - 6.66666667e-1F*(8.33333333e-2F*u[t0][x + 4][y + 1][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 2][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 5][z + 4]/h_y)/h_y + 6.66666667e-1F*(8.33333333e-2F*u[t0][x + 4][y + 3][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 6][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 7][z + 4]/h_y)/h_y - 8.33333333e-2F*(8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 7][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 8][z + 4]/h_y)/h_y)*sin(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2); } } } } STOP_TIMER(section0,timers) /* End section0 */ }
We observe how setting cire-mincost-sops=31 makes the tensor temporary produced by op7_omp disappear. 30 is indeed the largest cire-mincost-sops value at which the targeted sub-expression still qualifies as an optimization candidate.
Let's count the operations in the targeted sub-expression:

8.33333333e-2F*u[t0][x + 4][y + 2][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 6][z + 4]/h_y

- + (or -) => 3
- / by h_y, a constant symbol => 4
- * with operands (1/h_y, u) => 8

So in total we have 15 operations. For a sub-expression to be optimizable away by CIRE, the resulting saving in operation count must be at least twice the cire-mincost-sops value OR there must be no increase in working set size. Here there is an increase in working set size -- the new tensor temporary -- and at the same time the threshold is set to 31, so the compiler decides not to optimize away the sub-expression. In short, the rationale is that the saving in operation count does not justify the introduction of a new tensor temporary.
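The tally above can be reproduced with a few lines of plain Python. This is only a toy bookkeeping exercise following the cost rules listed earlier, not Devito's actual cost model; all names and constants below are made up for illustration.

```python
# A toy tally (not Devito's implementation) of the cost rules above,
# applied to the targeted sub-expression: four terms of the form
# coeff*u[...]/h_y, combined by three additions/subtractions.

COST_ADD = 1        # + and -
COST_DIV_CONST = 1  # / by a constant symbol such as h_y
COST_MUL = 1        # *

n_terms = 4
n_adds = n_terms - 1   # + (or -) => 3
n_divs = n_terms       # / by h_y => 4
n_muls = 2 * n_terms   # * with operands (1/h_y, u) => 8

total = n_adds * COST_ADD + n_divs * COST_DIV_CONST + n_muls * COST_MUL
print(total)  # 15, matching the count in the text
```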
Next, we try again with a smaller cire-mincost-sops.
op8_b1_omp = Operator(eq, opt=('cire-sops', {'openmp': True, 'cire-mincost-sops': 30}))
print_kernel(op8_b1_omp)
float **pr0; posix_memalign((void**)&pr0, 64, sizeof(float*)*nthreads); #pragma omp parallel num_threads(nthreads) { const int tid = omp_get_thread_num(); posix_memalign((void**)&pr0[tid], 64, sizeof(float[y_size + 4][z_size])); } for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2)) { /* Begin section0 */ START_TIMER(section0) #pragma omp parallel num_threads(nthreads) { const int tid = omp_get_thread_num(); float (*restrict r0)[z_size] __attribute__ ((aligned (64))) = (float (*)[z_size]) pr0[tid]; #pragma omp for collapse(1) schedule(dynamic,1) for (int x = x_m; x <= x_M; x += 1) { for (int y = y_m - 2; y <= y_M + 2; y += 1) { for (int z = z_m; z <= z_M; z += 1) { r0[y + 2][z] = 8.33333333e-2F*u[t0][x + 4][y + 2][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 6][z + 4]/h_y; } } for (int y = y_m; y <= y_M; y += 1) { for (int z = z_m; z <= z_M; z += 1) { u[t1][x + 4][y + 4][z + 4] = (8.33333333e-2F*r0[y][z]/h_y - 6.66666667e-1F*r0[y + 1][z]/h_y + 6.66666667e-1F*r0[y + 3][z]/h_y - 8.33333333e-2F*r0[y + 4][z]/h_y)*sin(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2); } } } } STOP_TIMER(section0,timers) /* End section0 */ } #pragma omp parallel num_threads(nthreads) { const int tid = omp_get_thread_num(); free(pr0[tid]); } free(pr0);
The cire-mincost-inv option¶

Akin to sum-of-products, cross-iteration redundancies may be searched across dimension-invariant sub-expressions, typically time-invariants. So, analogously to what we've seen before:
eq = Eq(u.forward, f**2*sin(f)*u.dy.dy) # Back to original running example
op11_omp = Operator(eq, opt=('lift', {'openmp': True}))
# print_kernel(op11_omp) # Uncomment to see the generated code
For convenience, the lift pass triggers CIRE for dimension-invariant sub-expressions. As seen before, this leads to producing one tensor temporary. By setting a larger value for cire-mincost-inv, we avoid allocating a grid-sized temporary, in exchange for computing a transcendental function, sin, at each iteration.
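The trade-off can be sketched in plain NumPy (an illustration of the idea, not Devito-generated code): lifting sin(f) pays for one grid-sized temporary but evaluates the transcendental function only once, whereas the unlifted version recomputes it at every timestep.

```python
import numpy as np

# Plain-NumPy illustration of lifting a time-invariant sub-expression.
f = np.random.rand(8, 8, 8).astype(np.float32)
nt = 4  # number of timesteps

# Lifted: one extra grid-sized array, sin computed a single time
r0 = np.sin(f)
acc_lifted = sum(r0 * (t + 1) for t in range(nt))

# Not lifted: no extra array, sin recomputed at every timestep
acc_inline = sum(np.sin(f) * (t + 1) for t in range(nt))

assert np.allclose(acc_lifted, acc_inline)
```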
op12_omp = Operator(eq, opt=('lift', {'openmp': True, 'cire-mincost-inv': 51}))
print_kernel(op12_omp)
for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2)) { /* Begin section0 */ START_TIMER(section0) #pragma omp parallel num_threads(nthreads) { #pragma omp for collapse(1) schedule(dynamic,1) for (int x = x_m; x <= x_M; x += 1) { for (int y = y_m; y <= y_M; y += 1) { for (int z = z_m; z <= z_M; z += 1) { u[t1][x + 4][y + 4][z + 4] = (8.33333333e-2F*(8.33333333e-2F*u[t0][x + 4][y][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 1][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y)/h_y - 6.66666667e-1F*(8.33333333e-2F*u[t0][x + 4][y + 1][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 2][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 5][z + 4]/h_y)/h_y + 6.66666667e-1F*(8.33333333e-2F*u[t0][x + 4][y + 3][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 4][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 6][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 7][z + 4]/h_y)/h_y - 8.33333333e-2F*(8.33333333e-2F*u[t0][x + 4][y + 4][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 7][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 8][z + 4]/h_y)/h_y)*sin(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2); } } } } STOP_TIMER(section0,timers) /* End section0 */ }
The cire-maxpar option¶

Sometimes it's possible to trade storage for parallelism (i.e., for more parallel dimensions). For this, Devito provides the cire-maxpar option, which is unset by default.
Let's see what happens when we switch it on
op13_omp = Operator(eq, opt=('cire-sops', {'openmp': True, 'cire-maxpar': True}))
print_kernel(op13_omp)
float (*r0)[y_size + 4][z_size]; posix_memalign((void**)&r0, 64, sizeof(float[x_size][y_size + 4][z_size])); for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2)) { /* Begin section0 */ START_TIMER(section0) #pragma omp parallel num_threads(nthreads) { #pragma omp for collapse(1) schedule(dynamic,1) for (int x = x_m; x <= x_M; x += 1) { for (int y = y_m - 2; y <= y_M + 2; y += 1) { for (int z = z_m; z <= z_M; z += 1) { r0[x][y + 2][z] = 8.33333333e-2F*u[t0][x + 4][y + 2][z + 4]/h_y - 6.66666667e-1F*u[t0][x + 4][y + 3][z + 4]/h_y + 6.66666667e-1F*u[t0][x + 4][y + 5][z + 4]/h_y - 8.33333333e-2F*u[t0][x + 4][y + 6][z + 4]/h_y; } } } } #pragma omp parallel num_threads(nthreads) { #pragma omp for collapse(3) schedule(dynamic,1) for (int x = x_m; x <= x_M; x += 1) { for (int y = y_m; y <= y_M; y += 1) { for (int z = z_m; z <= z_M; z += 1) { u[t1][x + 4][y + 4][z + 4] = (8.33333333e-2F*r0[x][y][z]/h_y - 6.66666667e-1F*r0[x][y + 1][z]/h_y + 6.66666667e-1F*r0[x][y + 3][z]/h_y - 8.33333333e-2F*r0[x][y + 4][z]/h_y)*sin(f[x + 1][y + 1][z + 1])*pow(f[x + 1][y + 1][z + 1], 2); } } } } STOP_TIMER(section0,timers) /* End section0 */ } free(r0);
The generated code uses a three-dimensional temporary that gets written and subsequently read in two separate x-y-z loop nests. Now, both loop nests can safely be OpenMP-collapsed, which is vital when running on GPUs.
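The essence of this trade can be mimicked in plain NumPy (illustrative only; the arrays and stencils below are made up): a full-size temporary decouples the producer and consumer sweeps so both are fully parallel, whereas a slice temporary forces the two stages to be fused along the outer dimension.

```python
import numpy as np

# Plain-NumPy sketch of the storage-vs-parallelism trade behind cire-maxpar.
u = np.random.rand(10, 12).astype(np.float32)

# cire-maxpar=True analogue: a full-size temporary r; producing r and
# consuming it are two independent sweeps, both fully parallelizable.
r = u[:, :-1] + u[:, 1:]
out_full = r[:-1, :] + r[1:, :]

# cire-maxpar=False analogue: a small per-row temporary; the two stages
# are fused along the outer dimension, which stays sequential per worker.
out_slice = np.empty_like(out_full)
for x in range(out_slice.shape[0]):
    row0 = u[x, :-1] + u[x, 1:]          # slice temporary for row x
    row1 = u[x + 1, :-1] + u[x + 1, 1:]  # ...and for row x+1
    out_slice[x] = row0 + row1

assert np.allclose(out_full, out_slice)
```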
The advanced mode¶

The advanced mode triggers all of the passes we've seen so far... and in fact, many more! Some of them, however, aren't visible in our running example (e.g., all of the MPI-related optimizations). These will be treated in a future notebook.
Obviously, all of the optimization options (e.g., cire-mincost-sops, cire-maxpar, blocklevels) are applicable and composable in any arbitrary way.
op14_omp = Operator(eq, openmp=True)
print(op14_omp)
# op14_b0_omp = Operator(eq, opt=('advanced', {'min-storage': True}))
#define _POSIX_C_SOURCE 200809L #define START_TIMER(S) struct timeval start_ ## S , end_ ## S ; gettimeofday(&start_ ## S , NULL); #define STOP_TIMER(S,T) gettimeofday(&end_ ## S, NULL); T->S += (double)(end_ ## S .tv_sec-start_ ## S.tv_sec)+(double)(end_ ## S .tv_usec-start_ ## S .tv_usec)/1000000; #include "stdlib.h" #include "math.h" #include "sys/time.h" #include "xmmintrin.h" #include "pmmintrin.h" #include "omp.h" struct dataobj { void *restrict data; int * size; int * npsize; int * dsize; int * hsize; int * hofs; int * oofs; } ; struct profiler { double section0; double section1; } ; void bf0(struct dataobj *restrict f_vec, const float h_y, float *restrict r0_vec, struct dataobj *restrict u_vec, const int x_size, const int y0_blk0_size, const int y_size, const int z_size, const int t0, const int t1, const int x0_blk0_size, const int x_M, const int x_m, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads, float * *restrict pr1_vec); int Kernel(struct dataobj *restrict f_vec, const float h_y, struct dataobj *restrict u_vec, const int x_size, const int y0_blk0_size, const int y_size, const int z_size, const int time_M, const int time_m, const int x0_blk0_size, const int x_M, const int x_m, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads, struct profiler * timers) { float (*restrict f)[f_vec->size[1]][f_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[f_vec->size[1]][f_vec->size[2]]) f_vec->data; float (*r0)[y_size][z_size]; posix_memalign((void**)&r0, 64, sizeof(float[x_size][y_size][z_size])); float **pr1; posix_memalign((void**)&pr1, 64, sizeof(float*)*nthreads); #pragma omp parallel num_threads(nthreads) { const int tid = omp_get_thread_num(); posix_memalign((void**)&pr1[tid], 64, sizeof(float[y0_blk0_size + 4][z_size])); } /* Flush denormal numbers to zero in hardware */ _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON); /* Begin section0 */ 
START_TIMER(section0) #pragma omp parallel num_threads(nthreads) { #pragma omp for collapse(2) schedule(static,1) for (int x = x_m; x <= x_M; x += 1) { for (int y = y_m; y <= y_M; y += 1) { #pragma omp simd aligned(f:32) for (int z = z_m; z <= z_M; z += 1) { r0[x][y][z] = sin(f[x + 1][y + 1][z + 1]); } } } } STOP_TIMER(section0,timers) /* End section0 */ for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2)) { /* Begin section1 */ START_TIMER(section1) bf0(f_vec,h_y,(float *)r0,u_vec,x_size,y0_blk0_size,y_size,z_size,t0,t1,x0_blk0_size,x_M - (x_M - x_m + 1)%(x0_blk0_size),x_m,y_M - (y_M - y_m + 1)%(y0_blk0_size),y_m,z_M,z_m,nthreads,(float * *)pr1); bf0(f_vec,h_y,(float *)r0,u_vec,x_size,(y_M - y_m + 1)%(y0_blk0_size),y_size,z_size,t0,t1,x0_blk0_size,x_M - (x_M - x_m + 1)%(x0_blk0_size),x_m,y_M,y_M - (y_M - y_m + 1)%(y0_blk0_size) + 1,z_M,z_m,nthreads,(float * *)pr1); bf0(f_vec,h_y,(float *)r0,u_vec,x_size,y0_blk0_size,y_size,z_size,t0,t1,(x_M - x_m + 1)%(x0_blk0_size),x_M,x_M - (x_M - x_m + 1)%(x0_blk0_size) + 1,y_M - (y_M - y_m + 1)%(y0_blk0_size),y_m,z_M,z_m,nthreads,(float * *)pr1); bf0(f_vec,h_y,(float *)r0,u_vec,x_size,(y_M - y_m + 1)%(y0_blk0_size),y_size,z_size,t0,t1,(x_M - x_m + 1)%(x0_blk0_size),x_M,x_M - (x_M - x_m + 1)%(x0_blk0_size) + 1,y_M,y_M - (y_M - y_m + 1)%(y0_blk0_size) + 1,z_M,z_m,nthreads,(float * *)pr1); STOP_TIMER(section1,timers) /* End section1 */ } #pragma omp parallel num_threads(nthreads) { const int tid = omp_get_thread_num(); free(pr1[tid]); } free(r0); free(pr1); return 0; } void bf0(struct dataobj *restrict f_vec, const float h_y, float *restrict r0_vec, struct dataobj *restrict u_vec, const int x_size, const int y0_blk0_size, const int y_size, const int z_size, const int t0, const int t1, const int x0_blk0_size, const int x_M, const int x_m, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads, float * *restrict pr1_vec) { float 
(*restrict f)[f_vec->size[1]][f_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[f_vec->size[1]][f_vec->size[2]]) f_vec->data; float (*restrict r0)[y_size][z_size] __attribute__ ((aligned (64))) = (float (*)[y_size][z_size]) r0_vec; float (*restrict u)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]]) u_vec->data; float **pr1 = (float**) pr1_vec; if (x0_blk0_size == 0 || y0_blk0_size == 0) { return; } #pragma omp parallel num_threads(nthreads) { const int tid = omp_get_thread_num(); float (*restrict r1)[z_size] __attribute__ ((aligned (64))) = (float (*)[z_size]) pr1[tid]; #pragma omp for collapse(2) schedule(dynamic,1) for (int x0_blk0 = x_m; x0_blk0 <= x_M; x0_blk0 += x0_blk0_size) { for (int y0_blk0 = y_m; y0_blk0 <= y_M; y0_blk0 += y0_blk0_size) { for (int x = x0_blk0; x <= x0_blk0 + x0_blk0_size - 1; x += 1) { for (int y = y0_blk0 - 2, ys = 0; y <= y0_blk0 + y0_blk0_size + 1; y += 1, ys += 1) { #pragma omp simd aligned(u:32) for (int z = z_m; z <= z_M; z += 1) { r1[ys][z] = (8.33333333e-2F*(u[t0][x + 4][y + 2][z + 4] - u[t0][x + 4][y + 6][z + 4]) + 6.66666667e-1F*(-u[t0][x + 4][y + 3][z + 4] + u[t0][x + 4][y + 5][z + 4]))/h_y; } } for (int y = y0_blk0, ys = 0; y <= y0_blk0 + y0_blk0_size - 1; y += 1, ys += 1) { #pragma omp simd aligned(f,u:32) for (int z = z_m; z <= z_M; z += 1) { u[t1][x + 4][y + 4][z + 4] = (f[x + 1][y + 1][z + 1]*f[x + 1][y + 1][z + 1])*(8.33333333e-2F*(r1[ys][z] - r1[ys + 4][z]) + 6.66666667e-1F*(-r1[ys + 1][z] + r1[ys + 3][z]))*r0[x][y][z]/h_y; } } } } } } }
A crucial observation here is that CIRE is applied on top of loop blocking -- the r1
temporary is computed within the same block as u
, which in turn requires a few additional iterations at the block edge along y
to be performed (the first y
loop starts at y0_blk0 - 2
, while the second one at y0_blk0
). Further, the size of r1
is now a function of the block shape. Altogether, this implements what in the literature is often referred to as "overlapped tiling" (or "overlapped blocking"): data reuse across consecutive loop nests is obtained by cross-loop blocking, which in turn requires a certain degree of redundant computation at the block edge. Clearly, there's a tension between the block shape and the amount of redundant computation. For example, a small block shape guarantees a small(er) working set, and thus better data reuse, but also requires more redundant computation.
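Below is a minimal one-dimensional sketch of overlapped tiling in plain NumPy (illustrative only, not the generated C): each block redundantly recomputes a one-point halo of the temporary so that the consumer stencil never reads outside its own block.

```python
import numpy as np

# 1D overlapped tiling: compute a temporary per block with a halo, then
# apply the consumer stencil entirely within the block.
u = np.random.rand(16).astype(np.float32)
block, halo = 4, 1

# Reference: temporary over the full axis, then the consumer stencil
r_full = u[:-1] + u[1:]           # 15 entries
ref = r_full[:-1] + r_full[1:]    # 14 entries

out = np.empty(14, dtype=np.float32)
for b0 in range(0, 14, block):
    b1 = min(b0 + block, 14)
    # temporary over [b0, b1 + halo): one redundant entry per block edge
    r_blk = u[b0:b1 + halo] + u[b0 + 1:b1 + halo + 1]
    out[b0:b1] = r_blk[:b1 - b0] + r_blk[1:b1 - b0 + 1]

assert np.allclose(out, ref)
```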
The cire-rotate option¶

So far we've seen two ways to compute the tensor temporaries: over the entire grid, or block by block within the blocked loop nest. There is a third way, supported in Devito through the cire-rotate option: computing a temporary within a small rotating window of values, updated as the sequential dimension advances.
Let's jump straight into an example
op15_omp = Operator(eq, opt=('advanced', {'openmp': True, 'cire-rotate': True}))
print_kernel(op15_omp)
void bf0(struct dataobj *restrict f_vec, const float h_y, float *restrict r0_vec, struct dataobj *restrict u_vec, const int x_size, const int y_size, const int z_size, const int t0, const int t1, const int x0_blk0_size, const int x_M, const int x_m, const int y0_blk0_size, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads, float * *restrict pr1_vec) { float (*restrict f)[f_vec->size[1]][f_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[f_vec->size[1]][f_vec->size[2]]) f_vec->data; float (*restrict r0)[y_size][z_size] __attribute__ ((aligned (64))) = (float (*)[y_size][z_size]) r0_vec; float (*restrict u)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]]) u_vec->data; float **pr1 = (float**) pr1_vec; if (x0_blk0_size == 0 || y0_blk0_size == 0) { return; } #pragma omp parallel num_threads(nthreads) { const int tid = omp_get_thread_num(); float (*restrict r1)[z_size] __attribute__ ((aligned (64))) = (float (*)[z_size]) pr1[tid]; #pragma omp for collapse(2) schedule(dynamic,1) for (int x0_blk0 = x_m; x0_blk0 <= x_M; x0_blk0 += x0_blk0_size) { for (int y0_blk0 = y_m; y0_blk0 <= y_M; y0_blk0 += y0_blk0_size) { for (int x = x0_blk0; x <= x0_blk0 + x0_blk0_size - 1; x += 1) { for (int y = y0_blk0, ys = 0, yr0 = (ys)%(5), yr1 = (ys + 3)%(5), yr2 = (ys + 4)%(5), yr3 = (ys + 1)%(5), yii = -2; y <= y0_blk0 + y0_blk0_size - 1; y += 1, ys += 1, yr0 = (ys)%(5), yr1 = (ys + 3)%(5), yr2 = (ys + 4)%(5), yr3 = (ys + 1)%(5), yii = 2) { for (int yy = yii, ysi = (yy + ys + 2)%(5); yy <= 2; yy += 1, ysi = (yy + ys + 2)%(5)) { #pragma omp simd aligned(u:32) for (int z = z_m; z <= z_M; z += 1) { r1[ysi][z] = (8.33333333e-2F*(u[t0][x + 4][yy + y + 2][z + 4] - u[t0][x + 4][yy + y + 6][z + 4]) + 6.66666667e-1F*(-u[t0][x + 4][yy + y + 3][z + 4] + u[t0][x + 4][yy + y + 5][z + 4]))/h_y; } } #pragma omp simd aligned(f,u:32) for (int z = z_m; z <= z_M; z += 1) { u[t1][x 
+ 4][y + 4][z + 4] = (f[x + 1][y + 1][z + 1]*f[x + 1][y + 1][z + 1])*(8.33333333e-2F*(r1[yr0][z] - r1[yr2][z]) + 6.66666667e-1F*(r1[yr1][z] - r1[yr3][z]))*r0[x][y][z]/h_y; } } } } } } }
There are two key things to notice here:

- The r1 temporary is now a pointer to a two-dimensional array of size [5][z_size]. It's obtained via casting of pr1[tid], which in turn is defined as
print(op15_omp.body[1].header[4])
#pragma omp parallel num_threads(nthreads) { const int tid = omp_get_thread_num(); posix_memalign((void**)&pr1[tid], 64, sizeof(float[5][z_size])); }
- Within the y loop there are several iteration variables, some of which (yr0, yr1, ...) employ modulo increment to cyclically produce the indices 0 through 4.

In essence, with cire-rotate, instead of computing an entire slice of y values, at each y iteration we only keep track of the values that are strictly necessary to evaluate u at y -- five values in this case. This results in a working set reduction, at the price of turning one parallel loop (y) into a sequential one.
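The modulo indexing scheme can be distilled into a short plain-Python sketch (the names win and out are illustrative, not Devito identifiers): a five-slot circular buffer replaces a full-length temporary, with indices taken modulo 5 just like yr0, yr1, ... in the generated code.

```python
# Rotating-window analogue of cire-rotate along one dimension.
n = 12
src = list(range(100, 100 + n))

# Reference: full temporary r[y] = 2*src[y], consumed at offsets 0 and 4
r_full = [2 * v for v in src]
ref = [r_full[y] + r_full[y + 4] for y in range(n - 4)]

win = [None] * 5  # five-slot circular buffer over the y dimension
out = []
for y in range(-4, n - 4):
    # compute only the newest window entry; the older four are reused
    win[(y + 4) % 5] = 2 * src[y + 4]
    if y >= 0:
        out.append(win[y % 5] + win[(y + 4) % 5])

assert out == ref
```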
The cire-maxalias option¶

Let's consider the following variation of our running example, in which the outer y derivative now comprises several terms, rather than just u.dy.
eq = Eq(u.forward, (2*f*f*u.dy).dy)
op16_omp = Operator(eq, opt=('advanced', {'openmp': True}))
print_kernel(op16_omp)
void bf0(struct dataobj *restrict f_vec, const float h_y, struct dataobj *restrict u_vec, const int y0_blk0_size, const int z_size, const int t0, const int t1, const int x0_blk0_size, const int x_M, const int x_m, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads, float * *restrict pr0_vec) { float (*restrict f)[f_vec->size[1]][f_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[f_vec->size[1]][f_vec->size[2]]) f_vec->data; float (*restrict u)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]]) u_vec->data; float **pr0 = (float**) pr0_vec; if (x0_blk0_size == 0 || y0_blk0_size == 0) { return; } #pragma omp parallel num_threads(nthreads) { const int tid = omp_get_thread_num(); float (*restrict r0)[z_size] __attribute__ ((aligned (64))) = (float (*)[z_size]) pr0[tid]; #pragma omp for collapse(2) schedule(dynamic,1) for (int x0_blk0 = x_m; x0_blk0 <= x_M; x0_blk0 += x0_blk0_size) { for (int y0_blk0 = y_m; y0_blk0 <= y_M; y0_blk0 += y0_blk0_size) { for (int x = x0_blk0; x <= x0_blk0 + x0_blk0_size - 1; x += 1) { for (int y = y0_blk0 - 1, ys = 0; y <= y0_blk0 + y0_blk0_size - 1; y += 1, ys += 1) { #pragma omp simd aligned(u:32) for (int z = z_m; z <= z_M; z += 1) { r0[ys][z] = (8.33333333e-2F*(u[t0][x + 4][y + 3][z + 4] - u[t0][x + 4][y + 7][z + 4]) + 6.66666667e-1F*(-u[t0][x + 4][y + 4][z + 4] + u[t0][x + 4][y + 6][z + 4]))/h_y; } } for (int y = y0_blk0, ys = 0; y <= y0_blk0 + y0_blk0_size - 1; y += 1, ys += 1) { #pragma omp simd aligned(f,u:32) for (int z = z_m; z <= z_M; z += 1) { float r1 = 1.0F/h_y; u[t1][x + 4][y + 4][z + 4] = r1*(2*(f[x + 1][y + 2][z + 1]*f[x + 1][y + 2][z + 1])*r0[ys + 1][z]) + (-2*r1*r0[ys][z])*(f[x + 1][y + 1][z + 1]*f[x + 1][y + 1][z + 1]); } } } } } } }
By paying close attention to the generated code, we see that the r0
temporary only stores the u.dy
component, while the 2*f*f
term is left intact in the loop nest computing u
. This is due to a heuristic applied by the Devito compiler: in a derivative, only the sub-expressions representing additions -- and therefore inner derivatives -- are used as CIRE candidates. The logic behind this heuristic is that of minimizing the number of required temporaries at the price of potentially leaving some redundancies on the table. One can disable this heuristic via the cire-maxalias
option.
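The effect of the heuristic can be mimicked in plain NumPy (an illustration, not Devito; the toy derivative d and the array names are made up): either only the inner derivative is stored in the temporary, or the f-dependent factor is absorbed into it.

```python
import numpy as np

# Plain-NumPy illustration of what cire-maxalias changes: which part of
# the expression is stored in the temporary.
f = np.random.rand(20)
u = np.random.rand(20)
h = 0.1

def d(a):
    # toy first-order forward difference
    return (a[1:] - a[:-1]) / h

# Default heuristic analogue: the temporary holds only the inner
# derivative; the f-dependent factor stays in the consumer loop.
r0 = d(u)
out_thin = d(2 * f[:-1] * f[:-1] * r0)

# cire-maxalias=True analogue: the f-dependent factor is absorbed into
# a "fatter" temporary, leaving fewer flops in the consumer loop.
r1 = 2 * f[:-1] * f[:-1] * d(u)
out_fat = d(r1)

assert np.allclose(out_thin, out_fat)
```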
op16_b0_omp = Operator(eq, opt=('advanced', {'openmp': True, 'cire-maxalias': True}))
print_kernel(op16_b0_omp)
void bf0(struct dataobj *restrict f_vec, const float h_y, struct dataobj *restrict u_vec, const int y0_blk0_size, const int z_size, const int t0, const int t1, const int x0_blk0_size, const int x_M, const int x_m, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads, float * *restrict pr0_vec) { float (*restrict f)[f_vec->size[1]][f_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[f_vec->size[1]][f_vec->size[2]]) f_vec->data; float (*restrict u)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]]) u_vec->data; float **pr0 = (float**) pr0_vec; if (x0_blk0_size == 0 || y0_blk0_size == 0) { return; } #pragma omp parallel num_threads(nthreads) { const int tid = omp_get_thread_num(); float (*restrict r0)[z_size] __attribute__ ((aligned (64))) = (float (*)[z_size]) pr0[tid]; #pragma omp for collapse(2) schedule(dynamic,1) for (int x0_blk0 = x_m; x0_blk0 <= x_M; x0_blk0 += x0_blk0_size) { for (int y0_blk0 = y_m; y0_blk0 <= y_M; y0_blk0 += y0_blk0_size) { for (int x = x0_blk0; x <= x0_blk0 + x0_blk0_size - 1; x += 1) { for (int y = y0_blk0 - 1, ys = 0; y <= y0_blk0 + y0_blk0_size - 1; y += 1, ys += 1) { #pragma omp simd aligned(f,u:32) for (int z = z_m; z <= z_M; z += 1) { r0[ys][z] = (f[x + 1][y + 2][z + 1]*f[x + 1][y + 2][z + 1])*(8.33333333e-2F*(u[t0][x + 4][y + 3][z + 4] - u[t0][x + 4][y + 7][z + 4]) + 6.66666667e-1F*(-u[t0][x + 4][y + 4][z + 4] + u[t0][x + 4][y + 6][z + 4]))/((h_y*h_y)); } } for (int y = y0_blk0, ys = 0; y <= y0_blk0 + y0_blk0_size - 1; y += 1, ys += 1) { #pragma omp simd aligned(u:32) for (int z = z_m; z <= z_M; z += 1) { u[t1][x + 4][y + 4][z + 4] = 2*(-r0[ys][z] + r0[ys + 1][z]); } } } } } } }
Now we have a "fatter" temporary, and that's good -- but if, instead, we had had an Operator
such as the one below, the gain from a reduced operation count might be outweighed by the presence of more temporaries, which means larger working set and increased memory traffic.
eq = Eq(u.forward, (2*f*f*u.dy).dy + (3*f*u.dy).dy)
op16_b1_omp = Operator(eq, opt=('advanced', {'openmp': True}))
op16_b2_omp = Operator(eq, opt=('advanced', {'openmp': True, 'cire-maxalias': True}))
# print_kernel(op16_b1_omp) # Uncomment to see generated code with one temporary but more flops
# print_kernel(op16_b2_omp) # Uncomment to see generated code with two temporaries but fewer flops
The advanced-fsg mode¶

The alternative advanced-fsg optimization level applies the same passes as advanced, but in a different order. The key difference is that advanced-fsg does not generate overlapped blocking code across CIRE-generated loop nests.
eq = Eq(u.forward, f**2*sin(f)*u.dy.dy) # Back to original running example
op17_omp = Operator(eq, opt=('advanced-fsg', {'openmp': True}))
print(op17_omp)
#define _POSIX_C_SOURCE 200809L #define START_TIMER(S) struct timeval start_ ## S , end_ ## S ; gettimeofday(&start_ ## S , NULL); #define STOP_TIMER(S,T) gettimeofday(&end_ ## S, NULL); T->S += (double)(end_ ## S .tv_sec-start_ ## S.tv_sec)+(double)(end_ ## S .tv_usec-start_ ## S .tv_usec)/1000000; #include "stdlib.h" #include "math.h" #include "sys/time.h" #include "xmmintrin.h" #include "pmmintrin.h" #include "omp.h" struct dataobj { void *restrict data; int * size; int * npsize; int * dsize; int * hsize; int * hofs; int * oofs; } ; struct profiler { double section0; double section1; } ; void bf0(struct dataobj *restrict f_vec, const float h_y, float *restrict r0_vec, struct dataobj *restrict u_vec, const int x_size, const int y_size, const int z_size, const int t0, const int t1, const int x0_blk0_size, const int x_M, const int x_m, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads, float * *restrict pr1_vec); int Kernel(struct dataobj *restrict f_vec, const float h_y, struct dataobj *restrict u_vec, const int x_size, const int y_size, const int z_size, const int time_M, const int time_m, const int x0_blk0_size, const int x_M, const int x_m, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads, struct profiler * timers) { float (*restrict f)[f_vec->size[1]][f_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[f_vec->size[1]][f_vec->size[2]]) f_vec->data; float (*r0)[y_size][z_size]; posix_memalign((void**)&r0, 64, sizeof(float[x_size][y_size][z_size])); float **pr1; posix_memalign((void**)&pr1, 64, sizeof(float*)*nthreads); #pragma omp parallel num_threads(nthreads) { const int tid = omp_get_thread_num(); posix_memalign((void**)&pr1[tid], 64, sizeof(float[y_size + 4][z_size])); } /* Flush denormal numbers to zero in hardware */ _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON); /* Begin section0 */ START_TIMER(section0) #pragma omp parallel 
num_threads(nthreads) { #pragma omp for collapse(2) schedule(static,1) for (int x = x_m; x <= x_M; x += 1) { for (int y = y_m; y <= y_M; y += 1) { #pragma omp simd aligned(f:32) for (int z = z_m; z <= z_M; z += 1) { r0[x][y][z] = sin(f[x + 1][y + 1][z + 1]); } } } } STOP_TIMER(section0,timers) /* End section0 */ for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2)) { /* Begin section1 */ START_TIMER(section1) bf0(f_vec,h_y,(float *)r0,u_vec,x_size,y_size,z_size,t0,t1,x0_blk0_size,x_M - (x_M - x_m + 1)%(x0_blk0_size),x_m,y_M,y_m,z_M,z_m,nthreads,(float * *)pr1); bf0(f_vec,h_y,(float *)r0,u_vec,x_size,y_size,z_size,t0,t1,(x_M - x_m + 1)%(x0_blk0_size),x_M,x_M - (x_M - x_m + 1)%(x0_blk0_size) + 1,y_M,y_m,z_M,z_m,nthreads,(float * *)pr1); STOP_TIMER(section1,timers) /* End section1 */ } #pragma omp parallel num_threads(nthreads) { const int tid = omp_get_thread_num(); free(pr1[tid]); } free(r0); free(pr1); return 0; } void bf0(struct dataobj *restrict f_vec, const float h_y, float *restrict r0_vec, struct dataobj *restrict u_vec, const int x_size, const int y_size, const int z_size, const int t0, const int t1, const int x0_blk0_size, const int x_M, const int x_m, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads, float * *restrict pr1_vec) { float (*restrict f)[f_vec->size[1]][f_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[f_vec->size[1]][f_vec->size[2]]) f_vec->data; float (*restrict r0)[y_size][z_size] __attribute__ ((aligned (64))) = (float (*)[y_size][z_size]) r0_vec; float (*restrict u)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]]) u_vec->data; float **pr1 = (float**) pr1_vec; if (x0_blk0_size == 0) { return; } #pragma omp parallel num_threads(nthreads) { const int tid = omp_get_thread_num(); float (*restrict r1)[z_size] __attribute__ ((aligned (64))) = 
(float (*)[z_size]) pr1[tid]; #pragma omp for collapse(1) schedule(dynamic,1) for (int x0_blk0 = x_m; x0_blk0 <= x_M; x0_blk0 += x0_blk0_size) { for (int x = x0_blk0; x <= x0_blk0 + x0_blk0_size - 1; x += 1) { for (int y = y_m - 2; y <= y_M + 2; y += 1) { #pragma omp simd aligned(u:32) for (int z = z_m; z <= z_M; z += 1) { r1[y + 2][z] = (8.33333333e-2F*(u[t0][x + 4][y + 2][z + 4] - u[t0][x + 4][y + 6][z + 4]) + 6.66666667e-1F*(-u[t0][x + 4][y + 3][z + 4] + u[t0][x + 4][y + 5][z + 4]))/h_y; } } for (int y = y_m; y <= y_M; y += 1) { #pragma omp simd aligned(f,u:32) for (int z = z_m; z <= z_M; z += 1) { u[t1][x + 4][y + 4][z + 4] = (f[x + 1][y + 1][z + 1]*f[x + 1][y + 1][z + 1])*(8.33333333e-2F*(r1[y][z] - r1[y + 4][z]) + 6.66666667e-1F*(-r1[y + 1][z] + r1[y + 3][z]))*r0[x][y][z]/h_y; } } } } } }
The x
loop here is still shared by the two loop nests, but the y
one isn't. Analogously, if we consider the alternative eq
already used in op7_b0_omp
, we get two completely separate, and therefore individually blocked, loop nests.
eq = Eq(u.forward, f**2*sin(f)*(u.dy.dy + u.dx.dx))
op17_b0 = Operator(eq, opt=('advanced-fsg', {'openmp': True}))
print(op17_b0)
```c
#define _POSIX_C_SOURCE 200809L
#define START_TIMER(S) struct timeval start_ ## S , end_ ## S ; gettimeofday(&start_ ## S , NULL);
#define STOP_TIMER(S,T) gettimeofday(&end_ ## S, NULL); T->S += (double)(end_ ## S .tv_sec-start_ ## S.tv_sec)+(double)(end_ ## S .tv_usec-start_ ## S .tv_usec)/1000000;

#include "stdlib.h"
#include "math.h"
#include "sys/time.h"
#include "xmmintrin.h"
#include "pmmintrin.h"
#include "omp.h"

struct dataobj
{
  void *restrict data;
  int * size;
  int * npsize;
  int * dsize;
  int * hsize;
  int * hofs;
  int * oofs;
} ;

struct profiler
{
  double section0;
  double section1;
} ;

void bf0(const float h_x, const float h_y, float *restrict r1_vec, float *restrict r2_vec, struct dataobj *restrict u_vec, const int t0, const int x0_blk0_size, const int x_M, const int x_m, const int x_size, const int y0_blk0_size, const int y_M, const int y_m, const int y_size, const int z_M, const int z_m, const int z_size, const int nthreads);
void bf1(struct dataobj *restrict f_vec, const float h_x, const float h_y, float *restrict r0_vec, float *restrict r1_vec, float *restrict r2_vec, struct dataobj *restrict u_vec, const int x_size, const int y_size, const int z_size, const int t1, const int x1_blk0_size, const int x_M, const int x_m, const int y1_blk0_size, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads);

int Kernel(struct dataobj *restrict f_vec, const float h_x, const float h_y, struct dataobj *restrict u_vec, const int x_size, const int y_size, const int z_size, const int time_M, const int time_m, const int x0_blk0_size, const int x1_blk0_size, const int x_M, const int x_m, const int y0_blk0_size, const int y1_blk0_size, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads, struct profiler * timers)
{
  float (*restrict f)[f_vec->size[1]][f_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[f_vec->size[1]][f_vec->size[2]]) f_vec->data;

  float (*r0)[y_size][z_size];
  posix_memalign((void**)&r0, 64, sizeof(float[x_size][y_size][z_size]));
  float (*r1)[y_size + 4][z_size];
  posix_memalign((void**)&r1, 64, sizeof(float[x_size + 4][y_size + 4][z_size]));
  float (*r2)[y_size + 4][z_size];
  posix_memalign((void**)&r2, 64, sizeof(float[x_size + 4][y_size + 4][z_size]));

  /* Flush denormal numbers to zero in hardware */
  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

  /* Begin section0 */
  START_TIMER(section0)
  #pragma omp parallel num_threads(nthreads)
  {
    #pragma omp for collapse(2) schedule(static,1)
    for (int x = x_m; x <= x_M; x += 1)
    {
      for (int y = y_m; y <= y_M; y += 1)
      {
        #pragma omp simd aligned(f:32)
        for (int z = z_m; z <= z_M; z += 1)
        {
          r0[x][y][z] = sin(f[x + 1][y + 1][z + 1]);
        }
      }
    }
  }
  STOP_TIMER(section0,timers)
  /* End section0 */

  for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))
  {
    /* Begin section1 */
    START_TIMER(section1)
    bf0(h_x,h_y,(float *)r1,(float *)r2,u_vec,t0,x0_blk0_size,x_M - (x_M - x_m + 5)%(x0_blk0_size) + 2,x_m - 2,x_size,y0_blk0_size,y_M - (y_M - y_m + 5)%(y0_blk0_size) + 2,y_m - 2,y_size,z_M,z_m,z_size,nthreads);
    bf0(h_x,h_y,(float *)r1,(float *)r2,u_vec,t0,x0_blk0_size,x_M - (x_M - x_m + 5)%(x0_blk0_size) + 2,x_m - 2,x_size,(y_M - y_m + 5)%(y0_blk0_size),y_M + 2,y_M - (y_M - y_m + 5)%(y0_blk0_size) + 3,y_size,z_M,z_m,z_size,nthreads);
    bf0(h_x,h_y,(float *)r1,(float *)r2,u_vec,t0,(x_M - x_m + 5)%(x0_blk0_size),x_M + 2,x_M - (x_M - x_m + 5)%(x0_blk0_size) + 3,x_size,y0_blk0_size,y_M - (y_M - y_m + 5)%(y0_blk0_size) + 2,y_m - 2,y_size,z_M,z_m,z_size,nthreads);
    bf0(h_x,h_y,(float *)r1,(float *)r2,u_vec,t0,(x_M - x_m + 5)%(x0_blk0_size),x_M + 2,x_M - (x_M - x_m + 5)%(x0_blk0_size) + 3,x_size,(y_M - y_m + 5)%(y0_blk0_size),y_M + 2,y_M - (y_M - y_m + 5)%(y0_blk0_size) + 3,y_size,z_M,z_m,z_size,nthreads);
    bf1(f_vec,h_x,h_y,(float *)r0,(float *)r1,(float *)r2,u_vec,x_size,y_size,z_size,t1,x1_blk0_size,x_M - (x_M - x_m + 1)%(x1_blk0_size),x_m,y1_blk0_size,y_M - (y_M - y_m + 1)%(y1_blk0_size),y_m,z_M,z_m,nthreads);
    bf1(f_vec,h_x,h_y,(float *)r0,(float *)r1,(float *)r2,u_vec,x_size,y_size,z_size,t1,x1_blk0_size,x_M - (x_M - x_m + 1)%(x1_blk0_size),x_m,(y_M - y_m + 1)%(y1_blk0_size),y_M,y_M - (y_M - y_m + 1)%(y1_blk0_size) + 1,z_M,z_m,nthreads);
    bf1(f_vec,h_x,h_y,(float *)r0,(float *)r1,(float *)r2,u_vec,x_size,y_size,z_size,t1,(x_M - x_m + 1)%(x1_blk0_size),x_M,x_M - (x_M - x_m + 1)%(x1_blk0_size) + 1,y1_blk0_size,y_M - (y_M - y_m + 1)%(y1_blk0_size),y_m,z_M,z_m,nthreads);
    bf1(f_vec,h_x,h_y,(float *)r0,(float *)r1,(float *)r2,u_vec,x_size,y_size,z_size,t1,(x_M - x_m + 1)%(x1_blk0_size),x_M,x_M - (x_M - x_m + 1)%(x1_blk0_size) + 1,(y_M - y_m + 1)%(y1_blk0_size),y_M,y_M - (y_M - y_m + 1)%(y1_blk0_size) + 1,z_M,z_m,nthreads);
    STOP_TIMER(section1,timers)
    /* End section1 */
  }

  free(r0);
  free(r1);
  free(r2);
  return 0;
}

void bf0(const float h_x, const float h_y, float *restrict r1_vec, float *restrict r2_vec, struct dataobj *restrict u_vec, const int t0, const int x0_blk0_size, const int x_M, const int x_m, const int x_size, const int y0_blk0_size, const int y_M, const int y_m, const int y_size, const int z_M, const int z_m, const int z_size, const int nthreads)
{
  float (*restrict r1)[y_size + 4][z_size] __attribute__ ((aligned (64))) = (float (*)[y_size + 4][z_size]) r1_vec;
  float (*restrict r2)[y_size + 4][z_size] __attribute__ ((aligned (64))) = (float (*)[y_size + 4][z_size]) r2_vec;
  float (*restrict u)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]]) u_vec->data;

  if (x0_blk0_size == 0 || y0_blk0_size == 0)
  {
    return;
  }

  #pragma omp parallel num_threads(nthreads)
  {
    #pragma omp for collapse(2) schedule(dynamic,1)
    for (int x0_blk0 = x_m; x0_blk0 <= x_M; x0_blk0 += x0_blk0_size)
    {
      for (int y0_blk0 = y_m; y0_blk0 <= y_M; y0_blk0 += y0_blk0_size)
      {
        for (int x = x0_blk0; x <= x0_blk0 + x0_blk0_size - 1; x += 1)
        {
          for (int y = y0_blk0; y <= y0_blk0 + y0_blk0_size - 1; y += 1)
          {
            #pragma omp simd aligned(u:32)
            for (int z = z_m; z <= z_M; z += 1)
            {
              r1[x + 2][y + 2][z] = (8.33333333e-2F*(u[t0][x + 2][y + 4][z + 4] - u[t0][x + 6][y + 4][z + 4]) + 6.66666667e-1F*(-u[t0][x + 3][y + 4][z + 4] + u[t0][x + 5][y + 4][z + 4]))/h_x;
              r2[x + 2][y + 2][z] = (8.33333333e-2F*(u[t0][x + 4][y + 2][z + 4] - u[t0][x + 4][y + 6][z + 4]) + 6.66666667e-1F*(-u[t0][x + 4][y + 3][z + 4] + u[t0][x + 4][y + 5][z + 4]))/h_y;
            }
          }
        }
      }
    }
  }
}

void bf1(struct dataobj *restrict f_vec, const float h_x, const float h_y, float *restrict r0_vec, float *restrict r1_vec, float *restrict r2_vec, struct dataobj *restrict u_vec, const int x_size, const int y_size, const int z_size, const int t1, const int x1_blk0_size, const int x_M, const int x_m, const int y1_blk0_size, const int y_M, const int y_m, const int z_M, const int z_m, const int nthreads)
{
  float (*restrict f)[f_vec->size[1]][f_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[f_vec->size[1]][f_vec->size[2]]) f_vec->data;
  float (*restrict r0)[y_size][z_size] __attribute__ ((aligned (64))) = (float (*)[y_size][z_size]) r0_vec;
  float (*restrict r1)[y_size + 4][z_size] __attribute__ ((aligned (64))) = (float (*)[y_size + 4][z_size]) r1_vec;
  float (*restrict r2)[y_size + 4][z_size] __attribute__ ((aligned (64))) = (float (*)[y_size + 4][z_size]) r2_vec;
  float (*restrict u)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]][u_vec->size[3]]) u_vec->data;

  if (x1_blk0_size == 0 || y1_blk0_size == 0)
  {
    return;
  }

  #pragma omp parallel num_threads(nthreads)
  {
    #pragma omp for collapse(2) schedule(dynamic,1)
    for (int x1_blk0 = x_m; x1_blk0 <= x_M; x1_blk0 += x1_blk0_size)
    {
      for (int y1_blk0 = y_m; y1_blk0 <= y_M; y1_blk0 += y1_blk0_size)
      {
        for (int x = x1_blk0; x <= x1_blk0 + x1_blk0_size - 1; x += 1)
        {
          for (int y = y1_blk0; y <= y1_blk0 + y1_blk0_size - 1; y += 1)
          {
            #pragma omp simd aligned(f,u:32)
            for (int z = z_m; z <= z_M; z += 1)
            {
              u[t1][x + 4][y + 4][z + 4] = (f[x + 1][y + 1][z + 1]*f[x + 1][y + 1][z + 1])*((8.33333333e-2F*(r2[x + 2][y][z] - r2[x + 2][y + 4][z]) + 6.66666667e-1F*(-r2[x + 2][y + 1][z] + r2[x + 2][y + 3][z]))/h_y + (8.33333333e-2F*(r1[x][y + 2][z] - r1[x + 4][y + 2][z]) + 6.66666667e-1F*(-r1[x + 1][y + 2][z] + r1[x + 3][y + 2][z]))/h_x)*r0[x][y][z];
            }
          }
        }
      }
    }
  }
}
```
[1] Fabio Luporini, Mathias Louboutin, Michael Lange, Navjot Kukreja, Philipp Witte, Jan Hückelheim, Charles Yount, Paul H. J. Kelly, Felix J. Herrmann, and Gerard J. Gorman. 2020. Architecture and Performance of Devito, a System for Automated Stencil Computation. ACM Trans. Math. Softw. 46, 1, Article 6 (April 2020), 28 pages. DOI:https://doi.org/10.1145/3374916