Question about compiler optimization in OpenMP offloading

Dear Nvidia developers,

I’m trying to understand the compiler’s behaviour in my current code, in order to attempt a sort of “ninja optimization” by changing the number of teams and threads. I’m using NVHPC 21.9.

Starting from the following simple OpenMP offload subroutine:

  subroutine add2s2_omp(a,b,c1,n)
  real a(n),b(n)
!$omp target teams loop
  do i=1,n
     a(i) = a(i) + c1*b(i)  ! body assumed from the routine name and the map(tofrom:a,b) message
  enddo
  end

The compiler output says:

1960, !$omp target teams loop
1960, Generating “nvkernel_add2s2_omp__F1L1960_23” GPU kernel
Generating Tesla code
1962, Loop parallelized across teams, threads(128) ! blockidx%x threadidx%x
1960, Generating Multicore code
1962, Loop parallelized across threads
1960, Generating implicit map(tofrom:b(:),a(:))
1962, Generated vector simd code for the loop
FMA (fused multiply-add) instruction(s) generated

So I suppose the compiler is set to use 1 team and 128 threads. Using these settings, the function takes 4606168346 ns.

Now, setting the same configuration by hand:

  !$omp target teams loop num_teams(1) thread_limit(128)
The function takes 299951422797 ns, about 65 times slower! Why? Is there a misunderstanding on my part? Thanks

This is my compile command:

mpif90 -O2 -mp=gpu -Mcuda=cc80 -mcmodel=medium -Minfo=all -Mvect=levels:5 -Mpreprocess -r8

That’s probably expected, given you’re limiting the kernel to run with only a single team. The “teams” in the compiler feedback messages does not mean that only a single team is used. It means that the runtime will dynamically set the number of teams based on the value of “n” and the thread count (i.e. something like num_teams = n/128).
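Roughly, that heuristic amounts to something like the following (a sketch only; the exact rule is internal to the NVHPC runtime, and the value of n is just an example):

```fortran
program teams_estimate
  implicit none
  integer :: n, nthreads, nteams
  n = 4669440      ! loop trip count (example value)
  nthreads = 128   ! default threads per team chosen by the compiler
  ! one team (CUDA block) per chunk of nthreads iterations, rounded up
  nteams = (n + nthreads - 1) / nthreads
  print *, nteams  ! 36480 for this n
end program teams_estimate
```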

Hi Mat, so it changes the number of threads dynamically? Is it possible to know the number of threads used every time? Thanks.

so it changes the number of threads dynamically?

Threads, no; teams, yes. When targeting NVIDIA devices, teams correspond to CUDA blocks. The number of blocks used is typically determined at runtime based on the loop trip count and the number of threads in a block, or by the OMP_NUM_TEAMS environment variable. However, the number of threads in a team/block is fixed.
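For example, to override the default team count at run time via the environment (the launch line at the end is illustrative; substitute your own binary):

```shell
# The runtime honors OMP_NUM_TEAMS when set; e.g. request 512 teams (CUDA blocks)
export OMP_NUM_TEAMS=512
echo "will run with $OMP_NUM_TEAMS teams"
# then launch as usual, e.g.: ./a.out
```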

Is it possible to know the number of threads used every time?

Not sure if you mean teams here instead of threads? Again, the number of threads is fixed, either implicitly by the compiler (typically at 128) or explicitly by the user via the thread_limit clause.
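For reference, both clauses go on the directive itself, something like this (a sketch; the clause values are illustrative and the loop body is assumed):

```fortran
!$omp target teams loop num_teams(1024) thread_limit(128)
do i = 1, n
   a(i) = a(i) + c1*b(i)
enddo
```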

The number of teams can be fixed as well, but as your experiment shows, this may be detrimental to performance. To see the actual number of teams/blocks used, you’ll want to use a profiler such as Nsight Systems.
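For example (assuming Nsight Systems is installed and `./a.out` stands in for your executable):

```shell
# Profile the run; --stats=true prints a CUDA kernel summary
# that includes the grid and block dimensions for each kernel
nsys profile --stats=true -o report ./a.out
```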

Yes, sorry, I meant teams, not threads :/

Hi Mat, attached is the output of one kernel run using ncu, with n = 4669440:

Grid Size = 36,480
Block Size = 128
Threads = 4,669,440

Starting from here, how can I work out how many teams, and how many threads per team, are used? I’m using NVIDIA A100 GPUs. I note the compiler uses one CUDA thread for each element (4,669,440). Is that a good strategy? Thanks.

Teams is the total number of blocks in the grid, i.e. the “Grid Size” (36,480). The number of threads per team is the number of threads in a block, i.e. the “Block Size” (128). “Threads” is the total number of threads, i.e. Grid Size x Block Size = 36,480 x 128 = 4,669,440.