Dear Nvidia developers,
I’m trying to understand the behaviour of the compiler in my actual code, in order to do a sort of “ninja optimization” by tuning the number of teams and threads. I’m using NVHPC 21.9.
Starting from the following simple OpenMP offload subroutine:
subroutine add2s2_omp(a, b, c1, n)
implicit none
integer :: n, i
real :: a(n), b(n), c1
!$OMP TARGET TEAMS LOOP
do i = 1, n
   a(i) = a(i) + c1*b(i)
enddo
end subroutine add2s2_omp
The compiler output says:
add2s2_omp:
1960, !$omp target teams loop
1960, Generating "nvkernel_add2s2_omp__F1L1960_23" GPU kernel
Generating Tesla code
1962, Loop parallelized across teams, threads(128) ! blockidx%x threadidx%x
1960, Generating Multicore code
1962, Loop parallelized across threads
1960, Generating implicit map(tofrom:b(:),a(:))
1962, Generated vector simd code for the loop
FMA (fused multiply-add) instruction(s) generated
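To double-check what the runtime actually launches, rather than inferring it from -Minfo, the team count can be queried from inside a target region. A minimal sketch, assuming the standard omp_lib routines:

subroutine report_teams()
use omp_lib, only: omp_get_num_teams, omp_get_team_num
implicit none
integer :: nteams
nteams = -1
!$OMP TARGET TEAMS MAP(FROM: nteams)
! only team 0 writes, to avoid every team storing the same value
if (omp_get_team_num() == 0) nteams = omp_get_num_teams()
!$OMP END TARGET TEAMS
print *, 'teams actually launched:', nteams
end subroutine report_teams

(I believe NVCOMPILER_ACC_NOTIFY=1 also prints the actual kernel launch configuration, since the OpenMP offload runtime is shared with OpenACC, but I have not verified that on 21.9.)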
From this output, I suppose the compiler chose 1 team and 128 threads. With these settings the function takes 4606168346 ns (about 4.6 s).
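For reference, the timing is wall-clock time measured around the call. The numbers above come from my real code; the harness below is only a minimal sketch using omp_get_wtime, with placeholder values for n, c1 and the array contents:

program time_add2s2
use omp_lib, only: omp_get_wtime
implicit none
integer, parameter :: n = 10000000   ! placeholder problem size
real, allocatable :: a(:), b(:)
double precision :: t0, t1
allocate(a(n), b(n))
a = 1.0
b = 2.0
call add2s2_omp(a, b, 0.5, n)        ! warm-up: the first target region also pays device init
t0 = omp_get_wtime()
call add2s2_omp(a, b, 0.5, n)
t1 = omp_get_wtime()
print *, 'add2s2_omp took', (t1 - t0)*1.0d9, 'ns'
end program time_add2s2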
Now, setting the same configuration by hand:
!$OMP TARGET TEAMS LOOP NUM_TEAMS(1) THREAD_LIMIT(128)
The function takes 299951422797 ns (about 300 s), 65 times slower! Why? Am I misunderstanding something?
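For comparison, if my reading of the -Minfo output were correct, explicitly requesting many teams should match the default performance. This is the variant I would try next; the team count (n+127)/128, i.e. one team per 128 iterations, is only an illustrative guess:

!$OMP TARGET TEAMS LOOP NUM_TEAMS((n+127)/128) THREAD_LIMIT(128)
do i = 1, n
   a(i) = a(i) + c1*b(i)
enddo

If this runs at the original speed, then NUM_TEAMS(1) really does collapse the whole grid onto a single thread block.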
This is my compilation command:
mpif90 -O2 -mp=gpu -Mcuda=cc80 -mcmodel=medium -Minfo=all -Mvect=levels:5 -Mpreprocess -r8
Thanks!