Understanding an OpenMP offloading example

Hello, I have this C++ code with OpenMP offloading directives. The two-level nested loop is intentional: it reproduces a pattern from a larger piece of code in which I noticed the same behavior described below.

#include <stdio.h>

#define NX 102400

int main(void)
{
    for(int k=0; k < 5; k++) {
        #pragma omp target teams distribute
        for (int j = 0; j < NX; j++) {
            #pragma omp parallel for
            for (int i = 0; i < NX; i++) {
                // intentionally empty: only launch/runtime overhead is of interest
            }
        }
    }

    return 0;
}

So, after inspecting the nsys output and timeline on the target GPU (an NVIDIA A100), I noticed two memory operations before each kernel launch whose origin I don't fully understand: a 40-byte device memset and a 600-byte pageable host-to-device memcpy.

Another thing that caught my attention is the kernel's execution time, which is quite long even though the kernel body is practically empty.

nvkernel_main_F1L7_2
Begins: 1.02093s
Ends: 1.02362s (+2.684 ms)
grid:  <<<108, 1, 1>>>
block: <<<1024, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 8,824 bytes
Registers Per Thread: 64
Local Memory Per Thread: 0 bytes
Local Memory Total: 392,822,784 bytes
Shared Memory executed: 16,384 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 50 %
Launched from thread: 17566
Latency: 12.259 μs
Correlation ID: 81
Stream: Stream 16
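
For reference, a timeline like the one above (including the per-launch memset and memcpy rows) can be captured with Nsight Systems along these lines; the options shown here are just one reasonable choice, not necessarily the exact command I ran:

# trace CUDA API calls and GPU activity, save the report, and print summary tables
$ nsys profile --trace=cuda --stats=true -o report ./a.out

# regenerate the summary tables from the saved report later
$ nsys stats report.nsys-rep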

Below I also include the compilation output, which may be useful.

$ nvc++ -mp=gpu -Minfo -gpu=ptxinfo main.c
nvc++-Warning-CUDA_HOME has been deprecated. Please, use NVHPC_CUDA_HOME instead.
main:
      7, #omp target teams distribute
          7, Generating "nvkernel_main_F1L7_2" GPU kernel
          9, Loop parallelized across teams, schedule(static)
         11, #omp parallel
           11, Loop parallelized across threads, schedule(static)
ptxas info    : 78 bytes gmem
ptxas info    : Function properties for nvkernel_main_F1L7_2_F1L11_4
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for __hxdTargetTeamsBegin
    56 bytes stack frame, 52 bytes spill stores, 52 bytes spill loads
ptxas info    : Compiling entry function 'nvkernel_main_F1L7_2' for 'sm_80'
ptxas info    : Function properties for nvkernel_main_F1L7_2
    616 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 368 bytes cmem[0]

My final goal is to understand how these values are calculated, and to be able to estimate them if I change the kernel configuration or use a different GPU, for example.
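
For example, I would like to be able to predict how the grid and block sizes change if I set the team and thread counts explicitly. The clauses below are standard OpenMP and the numbers are arbitrary; this is only meant to illustrate the kind of configuration change I have in mind:

    #pragma omp target teams distribute num_teams(216) thread_limit(256)
    for (int j = 0; j < NX; j++) {
        #pragma omp parallel for
        for (int i = 0; i < NX; i++) {
            // empty body, as in the original example
        }
    }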


Hi Rommel,

When you split the “teams distribute” from the “parallel for”, the inner loop becomes a separate routine that each team calls, and that call accounts for the overhead you’re seeing. The call is required because the inner loop could include things like metadirectives, which can dynamically change how it is parallelized. It also needs some configuration data passed in, which is the ~600-byte copy. While this pattern is more flexible in the kinds of constructs you can use inside the target region, that flexibility comes at the cost of higher overhead.

I’d suggest you try using the “loop” construct instead. While more restrictive, it has much lower overhead.

    for(int k=0; k < 5; k++) {
        #pragma omp target teams loop
        for (int j = 0; j < NX; j++) {
            #pragma omp loop
            for (int i = 0; i < NX; i++) {
            }
        }
    }

Another option is to combine “teams distribute parallel for” on the outer loop and collapse the inner loop:

    for(int k=0; k < 5; k++) {
        #pragma omp target teams distribute parallel for collapse(2)
        for (int j = 0; j < NX; j++) {
            for (int i = 0; i < NX; i++) {
            }
        }
    }

Collapse can be used with “loop” as well:

    for(int k=0; k < 5; k++) {
        #pragma omp target teams loop collapse(2)
        for (int j = 0; j < NX; j++) {
            for (int i = 0; i < NX; i++) {
            }
        }
    }

Hope this helps,
Mat
