In the following section, we demonstrate single and nested loop generation in C using NRPy+.
In loop.py, the following functions are implemented for loop generation:
loop(idx_var, lower_bound, upper_bound, increment, pragma, padding=" ", interior="", tile_size="")

idx_var     : index variable for the loop (idx_var[0] ⇒ outermost loop, idx_var[N - 1] ⇒ innermost loop)
lower_bound : lower bound for idx_var or idx_var[i]
upper_bound : upper bound for idx_var or idx_var[i]
increment   : increment for idx_var or idx_var[i]
pragma      : OpenMP pragma for idx_var or idx_var[i]
padding     : (optional) padding before a line (tab stop)
interior    : (optional) interior of the loop
tile_size   : (optional) tile size for cache blocking

simple_loop(options, interior)

options  : options for loop generation
interior : interior of the loop

from loop import loop, simple_loop  # Import NRPy+ module for loop generation
# 'AllPoints': loop over all points on a numerical grid, including ghost zones
print(simple_loop('AllPoints', '// <INTERIOR>'))
#pragma omp parallel for
for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++) {
  for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++) {
    for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0++) {
      // <INTERIOR>
    } // END LOOP: for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0++)
  } // END LOOP: for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++)
} // END LOOP: for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++)
# 'InteriorPoints': loop over the interior of a numerical grid, i.e. exclude ghost zones
print(simple_loop('InteriorPoints', '// <INTERIOR>'))
#pragma omp parallel for
for (int i2 = NGHOSTS; i2 < NGHOSTS+Nxx2; i2++) {
  for (int i1 = NGHOSTS; i1 < NGHOSTS+Nxx1; i1++) {
    for (int i0 = NGHOSTS; i0 < NGHOSTS+Nxx0; i0++) {
      // <INTERIOR>
    } // END LOOP: for (int i0 = NGHOSTS; i0 < NGHOSTS+Nxx0; i0++)
  } // END LOOP: for (int i1 = NGHOSTS; i1 < NGHOSTS+Nxx1; i1++)
} // END LOOP: for (int i2 = NGHOSTS; i2 < NGHOSTS+Nxx2; i2++)
# 'Read_xxs': read the xx[3][:] 1D coordinate arrays, as some interior dependency exists
print(simple_loop('AllPoints Read_xxs', '// <INTERIOR>'))
#pragma omp parallel for
for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++) {
  const REAL xx2 = xx[2][i2];
  for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++) {
    const REAL xx1 = xx[1][i1];
    for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0++) {
      const REAL xx0 = xx[0][i0];
      // <INTERIOR>
    } // END LOOP: for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0++)
  } // END LOOP: for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++)
} // END LOOP: for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++)
# 'enable_rfm_precompute': enable pre-computation of reference metric
print(simple_loop('AllPoints enable_rfm_precompute', '// <INTERIOR>'))
#pragma omp parallel for
for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++) {
  #include "rfm_files/rfm_struct__read2.h"
  for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++) {
    #include "rfm_files/rfm_struct__read1.h"
    for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0++) {
      #include "rfm_files/rfm_struct__read0.h"
      // <INTERIOR>
    } // END LOOP: for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0++)
  } // END LOOP: for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++)
} // END LOOP: for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++)
# 'enable_SIMD': enable SIMD support (https://en.wikipedia.org/wiki/SIMD)
print(simple_loop('AllPoints enable_rfm_precompute enable_SIMD', '// <INTERIOR>'))
#pragma omp parallel for
for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++) {
  #include "rfm_files/rfm_struct__SIMD_outer_read2.h"
  for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++) {
    #include "rfm_files/rfm_struct__SIMD_outer_read1.h"
    for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0 += SIMD_width) {
      #include "rfm_files/rfm_struct__SIMD_inner_read0.h"
      // <INTERIOR>
    } // END LOOP: for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0 += SIMD_width)
  } // END LOOP: for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++)
} // END LOOP: for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++)
# 'DisableOpenMP': disable OpenMP parallelization (https://en.wikipedia.org/wiki/OpenMP)
print(simple_loop('AllPoints DisableOpenMP', '// <INTERIOR>'))
for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++) {
  for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++) {
    for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0++) {
      // <INTERIOR>
    } // END LOOP: for (int i0 = 0; i0 < Nxx_plus_2NGHOSTS0; i0++)
  } // END LOOP: for (int i1 = 0; i1 < Nxx_plus_2NGHOSTS1; i1++)
} // END LOOP: for (int i2 = 0; i2 < Nxx_plus_2NGHOSTS2; i2++)
# Generate a one-dimensional loop over i from 0 to N, stepping by 1, with specified loop body
print(loop('i', '0', 'N', '1', '', interior='// <INTERIOR>'))
for (int i = 0; i < N; i++) {
  // <INTERIOR>
} // END LOOP: for (int i = 0; i < N; i++)
# Generate a two-dimensional loop over i from 0 to Nx, stepping by 1,
# and j from 0 to Ny, stepping by 1, with specified loop body
print(loop(['i', 'j'], ['0', '0'], ['Nx', 'Ny'], ['1', '1'], ['', ''], interior='// <INTERIOR>'))
for (int i = 0; i < Nx; i++) {
  for (int j = 0; j < Ny; j++) {
    // <INTERIOR>
  } // END LOOP: for (int j = 0; j < Ny; j++)
} // END LOOP: for (int i = 0; i < Nx; i++)
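In the examples above, the pragma argument was passed as an empty string. It can instead carry an explicit OpenMP directive, which is emitted immediately before the corresponding for statement (as in the simple_loop outputs above). The following is a hedged sketch of that usage; the array a and its update statement are illustrative placeholders, and the exact whitespace handling of the pragma string may vary between NRPy+ versions.
# Sketch: attach an OpenMP pragma to a one-dimensional loop
print(loop('i', '0', 'N', '1', '#pragma omp parallel for', interior='a[i] = 2.0*a[i];'))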
Goal: Reproduce the following output using the loop generation infrastructure in NRPy+ (solution)
for (int n = 0; n < (Nt - 1); n++) {
u[n][0] = u[n][Nx] = 0;
for (int k = 1; k < (Nx - 1); k++) {
u[n + 1][k] = u[n][k] + r*(u[n][k + 1] - 2*u[n][k] + u[n][k - 1]);
} // END LOOP: for (int k = 1; k < (Nx - 1); k++)
for (int k = 0; k < Nx; k++) {
u[n][k] = u[n + 1][k];
} // END LOOP: for (int k = 0; k < Nx; k++)
} // END LOOP: for (int n = 0; n < (Nt - 1); n++)
# Write Solution Here
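One possible solution, given here as a hedged sketch: generate the two inner loops over k first, then splice them, together with the boundary assignment, into the interior of the outer loop over n. The names u, r, Nt, and Nx come from the target output above; the whitespace of the generated code may differ slightly from the target depending on the padding argument.
# Hedged sketch of one possible solution:
# build the two inner k loops, then pass them (prefixed by the boundary
# condition u[n][0] = u[n][Nx] = 0;) as the interior of the outer n loop.
update = loop('k', '1', '(Nx - 1)', '1', '',
              interior='u[n + 1][k] = u[n][k] + r*(u[n][k + 1] - 2*u[n][k] + u[n][k - 1]);')
copy   = loop('k', '0', 'Nx', '1', '',
              interior='u[n][k] = u[n + 1][k];')
print(loop('n', '0', '(Nt - 1)', '1', '',
           interior='u[n][0] = u[n][Nx] = 0;\n' + update + copy))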
In the following section, we demonstrate cache blocking (loop tiling) using NRPy+. The advantage of cache blocking is improved spatial locality: the data structure is partitioned into blocks, or tiles, each small enough to fit inside the cache. By reusing the cached subset of the data rather than repeatedly fetching from main memory, we minimize the number of cache misses and reduce memory bandwidth pressure (source). The following example of matrix-vector multiplication demonstrates cache blocking using NRPy+.
print('// Untiled Loop')
print(loop(['i', 'j'], ['0', '0'], ['N', 'N'], ['1', '1'], ['', ''], interior='c[i] += a[i][j] * b[j];'))
print('// Tiled Loop\n#define MIN(x, y) (((x) < (y)) ? (x) : (y))')
print(loop(['i', 'j'], ['0', '0'], ['N', 'N'], ['1', '1'], ['', ''], interior='c[i] += a[i][j] * b[j];', tile_size=['2', '2']))
// Untiled Loop
for (int i = 0; i < N; i++) {
  for (int j = 0; j < N; j++) {
    c[i] += a[i][j] * b[j];
  } // END LOOP: for (int j = 0; j < N; j++)
} // END LOOP: for (int i = 0; i < N; i++)

// Tiled Loop
#define MIN(x, y) (((x) < (y)) ? (x) : (y))
for (int iB = 0; iB < N; iB += 2) {
  for (int jB = 0; jB < N; jB += 2) {
    for (int i = iB; i < MIN(N, iB + 2); i++) {
      for (int j = jB; j < MIN(N, jB + 2); j++) {
        c[i] += a[i][j] * b[j];
      } // END LOOP: for (int j = jB; j < MIN(N, jB + 2); j++)
    } // END LOOP: for (int i = iB; i < MIN(N, iB + 2); i++)
  } // END LOOP: for (int jB = 0; jB < N; jB += 2)
} // END LOOP: for (int iB = 0; iB < N; iB += 2)
Order of Memory Access (Untiled): (a[0][0], b[0]), (a[0][1], b[1]), (a[0][2], b[2]), (a[0][3], b[3]), ...
Order of Memory Access (Tiled): (a[0][0], b[0]), (a[0][1], b[1]), (a[1][0], b[0]), (a[1][1], b[1]), ...
We should remark that cache blocking is not a valid loop optimization whenever the order of memory access affects the resulting output. That potential issue does not arise in our matrix-vector multiplication example: for each fixed i, the tiled loop still accumulates the terms of c[i] in the same order of j. Moreover, the optimal block (tile) size depends on the CPU architecture, and hence experimentation is required to determine it, as sketched below.
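As a hedged sketch of such experimentation, the same kernel can be emitted for several candidate tile sizes and each variant compiled and timed on the target machine; the candidate sizes below are arbitrary, and the loop function is the one imported earlier in this notebook.
# Sketch: generate the tiled matrix-vector kernel for several candidate tile sizes,
# to be compiled and timed externally on the target CPU.
for ts in ['2', '4', '8', '16']:
    print('// tile_size = ' + ts + ' x ' + ts)
    print(loop(['i', 'j'], ['0', '0'], ['N', 'N'], ['1', '1'], ['', ''],
               interior='c[i] += a[i][j] * b[j];', tile_size=[ts, ts]))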
The following code cell converts this Jupyter notebook into a proper, clickable LaTeX-formatted PDF file. After the cell is successfully run, the generated PDF may be found in the root NRPy+ tutorial directory, with filename Tutorial-Loop_Generation_Cache_Blocking.pdf. (Note that clicking on this link may not work; you may need to open the PDF file through another means.)
import cmdline_helper as cmd # NRPy+: Multi-platform Python command-line interface
cmd.output_Jupyter_notebook_to_LaTeXed_PDF("Tutorial-Loop_Generation_Cache_Blocking")
Created Tutorial-Loop_Generation_Cache_Blocking.tex, and compiled LaTeX file to PDF file Tutorial-Loop_Generation_Cache_Blocking.pdf