Hello, I have this C++ code with OpenMP offloading directives. The 2-level nested loop is intentional: it reproduces a pattern from a larger piece of code where I noticed the same behavior described below.
#include <stdio.h>
#define NX 102400
int main(void)
{
    for (int k = 0; k < 5; k++) {
        #pragma omp target teams distribute
        for (int j = 0; j < NX; j++) {
            #pragma omp parallel for
            for (int i = 0; i < NX; i++) {
            }
        }
    }
    return 0;
}
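For context, the timeline discussed below was collected with nsys, roughly along these lines (the exact options are from memory, so take this as an assumption rather than the literal invocation):

nsys profile -o report ./a.out
nsys stats report.nsys-rep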
After inspecting the nsys output and timeline on the target GPU (an NVIDIA A100), I noticed two memory operations before each kernel launch whose origin I don't fully understand: a device memset of 40 bytes and a pageable HtoD memcpy of 600 bytes.
Another thing that caught my attention is the kernel's execution time, which is quite long even though the kernel body is essentially empty.
nvkernel_main_F1L7_2
Begins: 1.02093 s
Ends: 1.02362 s (+2.684 ms)
grid: <<<108, 1, 1>>>
block: <<<1024, 1, 1>>>
Launch Type: Regular
Static Shared Memory: 0 bytes
Dynamic Shared Memory: 8,824 bytes
Registers Per Thread: 64
Local Memory Per Thread: 0 bytes
Local Memory Total: 392,822,784 bytes
Shared Memory executed: 16,384 bytes
Shared Memory Bank Size: 4 B
Theoretical occupancy: 50 %
Launched from thread: 17566
Latency: 12.259 μs
Correlation ID: 81
Stream: Stream 16
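One value I tried to reproduce myself is the 50 % theoretical occupancy, which seems to follow from the register usage alone. Below is my back-of-the-envelope check as a small C program; the per-SM limits for the A100 (65,536 registers, 2,048 resident threads) are assumptions on my part and would need to change for another GPU:

#include <stdio.h>

int main(void)
{
    /* Assumed per-SM limits for an A100 (sm_80); adjust for other GPUs. */
    const int regs_per_sm        = 65536;  /* 32-bit registers per SM */
    const int max_threads_per_sm = 2048;   /* max resident threads per SM */

    /* Values reported by ptxas/nsys for nvkernel_main_F1L7_2. */
    const int regs_per_thread = 64;
    const int block_size      = 1024;

    /* Blocks that fit per SM if registers are the only limiting factor. */
    const int blocks_by_regs   = regs_per_sm / (regs_per_thread * block_size); /* 1 */
    const int resident_threads = blocks_by_regs * block_size;                  /* 1024 */

    printf("theoretical occupancy ~ %.0f %%\n",
           100.0 * resident_threads / max_threads_per_sm);                     /* 50 % */
    return 0;
}

If that reasoning is correct, a block of 1,024 threads at 64 registers each consumes the whole register file of an SM, so only one block (1,024 of the 2,048 possible resident threads) fits, which matches the reported 50 %. I would still like to confirm whether this is the right way to think about it.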
Below is the compilation output, in case it is useful.
$ nvc++ -mp=gpu -Minfo -gpu=ptxinfo main.c
nvc++-Warning-CUDA_HOME has been deprecated. Please, use NVHPC_CUDA_HOME instead.
main:
7, #omp target teams distribute
7, Generating "nvkernel_main_F1L7_2" GPU kernel
9, Loop parallelized across teams, schedule(static)
11, #omp parallel
11, Loop parallelized across threads, schedule(static)
ptxas info : 78 bytes gmem
ptxas info : Function properties for nvkernel_main_F1L7_2_F1L11_4
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for __hxdTargetTeamsBegin
56 bytes stack frame, 52 bytes spill stores, 52 bytes spill loads
ptxas info : Compiling entry function 'nvkernel_main_F1L7_2' for 'sm_80'
ptxas info : Function properties for nvkernel_main_F1L7_2
616 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 64 registers, 368 bytes cmem[0]
My end goal is to understand how these values are calculated and to be able to estimate them if I change the kernel configuration or run on a different GPU, for example.
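For example, to see how the kernel configuration influences these numbers, I was planning to constrain the launch shape explicitly with the standard num_teams/thread_limit clauses, along the lines of the sketch below (whether nvc++ maps them one-to-one onto grid and block size is exactly the kind of thing I would like to confirm):

#include <stdio.h>
#define NX 102400
int main(void)
{
    for (int k = 0; k < 5; k++) {
        /* Explicitly request 108 teams of at most 128 threads;
           the runtime is still allowed to pick smaller values. */
        #pragma omp target teams distribute num_teams(108) thread_limit(128)
        for (int j = 0; j < NX; j++) {
            #pragma omp parallel for
            for (int i = 0; i < NX; i++) {
            }
        }
    }
    return 0;
}

The values 108 and 128 are just placeholders I picked to match the SM count and a smaller block size; they are not meant as a recommendation.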