Understanding an OpenMP offloading example

Hello, I have this C++ code with OpenMP offloading directives. The two-level nested loop is intentional: it reproduces a pattern from a larger piece of code in which I noticed the same behavior described below.

#include <stdio.h>

#define NX 102400

int main(void)
{
    for(int k=0; k < 5; k++) {
        #pragma omp target teams distribute
        for (int j = 0; j < NX; j++) {
            #pragma omp parallel for
            for (int i = 0; i < NX; i++) {
                // intentionally empty: only launch/runtime overhead is of interest
            }
        }
    }

    return 0;
}

So, after inspecting the nsys output and timeline on the target GPU (an NVIDIA A100), I noticed two memory operations before each kernel launch whose origin I don't fully understand: a 40-byte device memset and a 600-byte pageable host-to-device memcpy.

Another thing that caught my attention is the kernel's execution time, which is quite long even though the kernel body is practically empty.

nvkernel_main_F1L7_2
Begins: 1.02093s
Ends: 1.02362s (+2.684 ms)
grid:  <<<108, 1, 1>>>
block: <<<1024, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 8,824 bytes
Registers Per Thread: 64
Local Memory Per Thread: 0 bytes
Local Memory Total: 392,822,784 bytes
Shared Memory executed: 16,384 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 50 %
Launched from thread: 17566
Latency: 12.259 μs
Correlation ID: 81
Stream: Stream 16
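
For reference, a timeline like the one above (including the per-launch memset and memcpy rows) can be captured with Nsight Systems along these lines; the options shown here are just one reasonable choice, not necessarily the exact command I ran:

# trace CUDA API calls and GPU activity, save the report, and print summary tables
$ nsys profile --trace=cuda --stats=true -o report ./a.out

# regenerate the summary tables from the saved report later
$ nsys stats report.nsys-rep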

Below I also include the compilation output, which may be useful.

$ nvc++ -mp=gpu -Minfo -gpu=ptxinfo main.c
nvc++-Warning-CUDA_HOME has been deprecated. Please, use NVHPC_CUDA_HOME instead.
main:
      7, #omp target teams distribute
          7, Generating "nvkernel_main_F1L7_2" GPU kernel
          9, Loop parallelized across teams, schedule(static)
         11, #omp parallel
           11, Loop parallelized across threads, schedule(static)
ptxas info    : 78 bytes gmem
ptxas info    : Function properties for nvkernel_main_F1L7_2_F1L11_4
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for __hxdTargetTeamsBegin
    56 bytes stack frame, 52 bytes spill stores, 52 bytes spill loads
ptxas info    : Compiling entry function 'nvkernel_main_F1L7_2' for 'sm_80'
ptxas info    : Function properties for nvkernel_main_F1L7_2
    616 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 64 registers, 368 bytes cmem[0]

My final goal is to understand how these values are calculated, and to be able to estimate them if I change the kernel configuration or use a different GPU, for example.
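
For example, I would like to be able to predict how the grid and block sizes change if I set the team and thread counts explicitly. The clauses below are standard OpenMP and the numbers are arbitrary; this is only meant to illustrate the kind of configuration change I have in mind:

    #pragma omp target teams distribute num_teams(216) thread_limit(256)
    for (int j = 0; j < NX; j++) {
        #pragma omp parallel for
        for (int i = 0; i < NX; i++) {
            // empty body, as in the original example
        }
    }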


Hi Rommel,

When you split the “teams distribute” from the “parallel for”, the inner loop becomes a separate routine that each team calls, and that call accounts for the overhead you’re seeing. The call is required because the inner loop could include things like metadirectives, which can dynamically change how it is parallelized. It also needs some configuration data passed in, which is the ~600-byte copy. While this pattern is more flexible in the kinds of constructs you can use inside the target region, that flexibility comes at the cost of higher overhead.

I’d suggest you try using the “loop” construct instead. While more restrictive, it has much lower overhead.

    for(int k=0; k < 5; k++) {
        #pragma omp target teams loop
        for (int j = 0; j < NX; j++) {
            #pragma omp loop
            for (int i = 0; i < NX; i++) {
            }
        }
    }

Another option is to combine “teams distribute parallel for” on the outer loop and collapse the inner loop:

    for(int k=0; k < 5; k++) {
        #pragma omp target teams distribute parallel for collapse(2)
        for (int j = 0; j < NX; j++) {
            for (int i = 0; i < NX; i++) {
            }
        }
    }

Collapse can be used with “loop” as well:

    for(int k=0; k < 5; k++) {
        #pragma omp target teams loop collapse(2)
        for (int j = 0; j < NX; j++) {
            for (int i = 0; i < NX; i++) {
            }
        }
    }

Hope this helps,
Mat
