Dear Nvidia developers,
I’m trying to understand the behaviour of the compiler in my actual code, in order to do a sort of “ninja optimization” by tuning the number of teams and threads. I’m using NVHPC 21.9.
Starting from the following simple OpenMP offload subroutine:
subroutine add2s2_omp(a, b, c1, n)
implicit none
integer :: n, i
real :: a(n), b(n), c1
!$OMP TARGET TEAMS LOOP
do i = 1, n
   a(i) = a(i) + c1*b(i)
enddo
end subroutine add2s2_omp
The compiler output says:
add2s2_omp:
1960, !$omp target teams loop
1960, Generating "nvkernel_add2s2_omp__F1L1960_23" GPU kernel
Generating Tesla code
1962, Loop parallelized across teams, threads(128) ! blockidx%x threadidx%x
1960, Generating Multicore code
1962, Loop parallelized across threads
1960, Generating implicit map(tofrom:b(:),a(:))
1962, Generated vector simd code for the loop
FMA (fused multiply-add) instruction(s) generated
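To double-check what the runtime actually launches, rather than inferring it from -Minfo, the team count can be queried from inside a target region. A minimal sketch, assuming the standard omp_lib routines:

subroutine report_teams()
use omp_lib, only: omp_get_num_teams, omp_get_team_num
implicit none
integer :: nteams
nteams = -1
!$OMP TARGET TEAMS MAP(FROM: nteams)
! only team 0 writes, to avoid every team storing the same value
if (omp_get_team_num() == 0) nteams = omp_get_num_teams()
!$OMP END TARGET TEAMS
print *, 'teams actually launched:', nteams
end subroutine report_teams

(I believe NVCOMPILER_ACC_NOTIFY=1 also prints the actual kernel launch configuration, since the OpenMP offload runtime is shared with OpenACC, but I have not verified that on 21.9.)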
From this output, I suppose the compiler chose 1 team and 128 threads. With these settings the function takes 4606168346 ns (about 4.6 s).
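For reference, the timing is wall-clock time measured around the call. The numbers above come from my real code; the harness below is only a minimal sketch using omp_get_wtime, with placeholder values for n, c1 and the array contents:

program time_add2s2
use omp_lib, only: omp_get_wtime
implicit none
integer, parameter :: n = 10000000   ! placeholder problem size
real, allocatable :: a(:), b(:)
double precision :: t0, t1
allocate(a(n), b(n))
a = 1.0
b = 2.0
call add2s2_omp(a, b, 0.5, n)        ! warm-up: the first target region also pays device init
t0 = omp_get_wtime()
call add2s2_omp(a, b, 0.5, n)
t1 = omp_get_wtime()
print *, 'add2s2_omp took', (t1 - t0)*1.0d9, 'ns'
end program time_add2s2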
Now, setting the same configuration by hand:
!$OMP TARGET TEAMS LOOP NUM_TEAMS(1) THREAD_LIMIT(128)
The function takes 299951422797 ns (about 300 s), 65 times slower! Why? Am I misunderstanding something?
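For comparison, if my reading of the -Minfo output were correct, explicitly requesting many teams should match the default performance. This is the variant I would try next; the team count (n+127)/128, i.e. one team per 128 iterations, is only an illustrative guess:

!$OMP TARGET TEAMS LOOP NUM_TEAMS((n+127)/128) THREAD_LIMIT(128)
do i = 1, n
   a(i) = a(i) + c1*b(i)
enddo

If this runs at the original speed, then NUM_TEAMS(1) really does collapse the whole grid onto a single thread block.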
This is my compilation command:
mpif90 -O2 -mp=gpu -Mcuda=cc80 -mcmodel=medium -Minfo=all -Mvect=levels:5 -Mpreprocess -r8
Thanks!