Target region with 25% GPU occupancy

Dear Nvidia user,

I’m working on a target region that uses just 25% of the GPU (an A100). I’ve already applied optimizations using the TEAMS LOOP and COLLAPSE directives, but I’d like to know whether it’s possible to do better:

``````
!$OMP TARGET DATA MAP(ALLOC:dudr,duds,dudt)
!$OMP&            MAP(TO:g1m1,g2m1,g3m1,g4m1,g5m1,g6m1)
!$OMP&            MAP(TO:dxm1,dxtm1,u,helm1,helm2,bm1)
!$OMP&            MAP(TOFROM:au)
!$OMP TARGET TEAMS LOOP COLLAPSE(4) PRIVATE(tmph1,tmpu1,tmpu2,tmpu3,l)
      do e=1,nelt                          ! line 610
        do k=1,lz1
          do j=1,ly1
            do i=1,lx1
              tmph1 = helm1(i,j,k,e)
              tmpu1 = 0.0
              tmpu2 = 0.0
              tmpu3 = 0.0
              do l=1,lx1                   ! line 618
                tmpu1 = tmpu1 + dxm1(i,l)*u(l,j,k,e)
                tmpu2 = tmpu2 + dxm1(j,l)*u(i,l,k,e)
                tmpu3 = tmpu3 + dxm1(k,l)*u(i,j,l,e)
              enddo
              wr = g1m1(i,j,k,e)*tmpu1
     $           + g4m1(i,j,k,e)*tmpu2
     $           + g5m1(i,j,k,e)*tmpu3

              ws = g2m1(i,j,k,e)*tmpu2
     $           + g4m1(i,j,k,e)*tmpu1
     $           + g6m1(i,j,k,e)*tmpu3

              wt = g3m1(i,j,k,e)*tmpu3
     $           + g5m1(i,j,k,e)*tmpu1
     $           + g6m1(i,j,k,e)*tmpu2

              dudr(i,j,k,e) = wr * tmph1
              duds(i,j,k,e) = ws * tmph1
              dudt(i,j,k,e) = wt * tmph1
            enddo
          enddo
        enddo
      enddo

!$OMP TARGET TEAMS LOOP COLLAPSE(4) PRIVATE(tmph1,tmpu1,tmpu2,tmpu3,l)
      do e=1,nelt                          ! line 646
        do k=1,lz1
          do j=1,ly1
            do i=1,lx1
              tmpu1 = 0.0
              tmpu2 = 0.0
              tmpu3 = 0.0
              do l=1,lx1
                tmpu1 = tmpu1 + dxtm1(i,l)*dudr(l,j,k,e)
                tmpu2 = tmpu2 + dxtm1(j,l)*duds(i,l,k,e)
                tmpu3 = tmpu3 + dxtm1(k,l)*dudt(i,j,l,e)
              enddo
              au(i,j,k,e) = tmpu1 + tmpu2 + tmpu3
            enddo
          enddo
        enddo
      enddo

!$OMP END TARGET DATA
``````

The main problem is the temporary variables in the inner loops, like tmpu1 and tmpu2, which inhibit collapsing all 5 loop levels. Is there a way to optimize this further? This is the compiler output:

``````
609, !$omp target teams loop
609, Generating "nvkernel_axhelm_omp__F1L609_1" GPU kernel
Generating Tesla code
618, Loop run sequentially
609, Generating Multicore code
609, Generating implicit map(tofrom:g6m1(:,:,:,:),dudt(:,:,:,:),u(:,:,:,:),helm1(:,:,:,:),dxm1(:,:),g3m1(:,:,:,:),g4m1(:,:,:,:),g1m1(:,:,:,:),g5m1(:,:,:,:),g2m1(:,:,:,:),dudr(:,:,:,:),duds(:,:,:,:))
610, Loop not vectorized/parallelized: too deeply nested
611, Loop not vectorized/parallelized: too deeply nested
613, Loop distributed: 3 new loops
Loop interchange produces reordered loop nest: 618,613
Loop unrolled 8 times (completely unrolled)
Generated vector simd code for the loop
618, Loop is parallelizable
645, !\$omp target teams loop
645, Generating "nvkernel_axhelm_omp__F1L645_2" GPU kernel
Generating Tesla code
653, Loop run sequentially
645, Generating Multicore code
645, Generating implicit map(tofrom:dxtm1(:,:),dudt(:,:,:,:),au(:,:,:,:),dudr(:,:,:,:),duds(:,:,:,:))
646, Loop not vectorized/parallelized: too deeply nested
647, Loop not vectorized/parallelized: too deeply nested
649, Loop distributed: 3 new loops
Loop interchange produces reordered loop nest: 653,649
Loop unrolled 8 times (completely unrolled)
Generated vector simd code for the loop
653, Loop is parallelizable

``````

And values of loop counters:

``````
nelt:          9120
lz1:             8
ly1:             8
lx1:             8
``````

Thanks.

Is this theoretical or actual occupancy? (As measured by Nsight-Compute)

Is the low occupancy due to register usage, warp stalls, or something else?

The main problem is the temporary variables in the inner loops, like tmpu1 and tmpu2, which inhibit collapsing all 5 loop levels.

The innermost loop is not perfectly nested, so it can’t be collapsed. If “lx1” were bigger (like 64 or 128), it might be beneficial to use “LOOP BIND(PARALLEL)” on it, but with a loop trip count of 8, that would make for a very small thread block.
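For illustration, here is a rough sketch of what binding the innermost loop would look like. This is hypothetical and not recommended at lx1 = 8; the REDUCTION clause is needed because the sums are carried across the bound loop, and it is worth checking with -Minfo what the compiler actually generates:

``````
!$OMP TARGET TEAMS LOOP BIND(TEAMS) COLLAPSE(4)
!$OMP& PRIVATE(tmpu1,tmpu2,tmpu3)
      do e=1,nelt
        do k=1,lz1
          do j=1,ly1
            do i=1,lx1
              tmpu1 = 0.0
              tmpu2 = 0.0
              tmpu3 = 0.0
!$OMP LOOP BIND(PARALLEL) REDUCTION(+:tmpu1,tmpu2,tmpu3)
              do l=1,lx1
                tmpu1 = tmpu1 + dxm1(i,l)*u(l,j,k,e)
                tmpu2 = tmpu2 + dxm1(j,l)*u(i,l,k,e)
                tmpu3 = tmpu3 + dxm1(k,l)*u(i,j,l,e)
              enddo
c             ... wr/ws/wt and dudr/duds/dudt updates as before ...
            enddo
          enddo
        enddo
      enddo
``````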

No idea if this will help, but I would try splitting this into three loops to help with caching.

``````
              do l=1,lx1
                tmpu1 = tmpu1 + dxtm1(i,l)*dudr(l,j,k,e)
              enddo
              do l=1,lx1
                tmpu2 = tmpu2 + dxtm1(j,l)*duds(i,l,k,e)
              enddo
              do l=1,lx1
                tmpu3 = tmpu3 + dxtm1(k,l)*dudt(i,j,l,e)
              enddo
``````

Schedule-wise, I’d also try something like:

``````
!$OMP TARGET TEAMS LOOP BIND(TEAMS)
      do e=1,nelt     ! line 646
!$OMP LOOP COLLAPSE(3) BIND(PARALLEL) PRIVATE(tmph1,tmpu1,tmpu2,tmpu3,l)
        do k=1,lz1
          do j=1,ly1
            do i=1,lx1
``````

This will have the outer loop map to the thread blocks and the inner loops map to the threads within a block. It may give you better data access and fewer warps stalled waiting on memory (assuming that’s the cause of the low occupancy).

Hi Mat

Is this theoretical or actual occupancy? (As measured by Nsight-Compute)

The actual occupancy is very low: 23%, as measured by Nsight Compute.

Is the low occupancy due to register usage, warp stalls, or something else?

It’s due to register usage. I’ve attached the Nsight Compute screenshots (1-2), but I don’t know how to reduce register usage with OMP offload. There is also very high memory usage.

This will have the outer loop map to the blocks and the inner loops map to the threads in a block. This may give you better data access and less warp stalled waiting for memory (assuming that’s the cause of the low occupancy)

I tried your optimizations with no performance improvement. I’ve attached a Nsight Compute screenshot with the optimizations applied.

Looks like the code is memory bound, so increasing the occupancy may or may not help much. The good news is that the code is almost at 90% SOL (speed-of-light) on memory utilization. Basically, your code is mostly streaming memory, with little opportunity for re-use.

Though you may look at splitting the first kernel into three, one each for computing dudr, duds, and dudt. This should reduce the register usage, but it means recomputing the tmpu’s each time. It may not be beneficial, but it’s worth an experiment.
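For example, the first of the three split kernels might look like the following sketch, built from the code above (the duds and dudt kernels would be analogous, each recomputing the tmpu sums):

``````
!$OMP TARGET TEAMS LOOP COLLAPSE(4) PRIVATE(tmpu1,tmpu2,tmpu3,l)
      do e=1,nelt
        do k=1,lz1
          do j=1,ly1
            do i=1,lx1
              tmpu1 = 0.0
              tmpu2 = 0.0
              tmpu3 = 0.0
              do l=1,lx1
                tmpu1 = tmpu1 + dxm1(i,l)*u(l,j,k,e)
                tmpu2 = tmpu2 + dxm1(j,l)*u(i,l,k,e)
                tmpu3 = tmpu3 + dxm1(k,l)*u(i,j,l,e)
              enddo
              dudr(i,j,k,e) = ( g1m1(i,j,k,e)*tmpu1
     $             + g4m1(i,j,k,e)*tmpu2
     $             + g5m1(i,j,k,e)*tmpu3 ) * helm1(i,j,k,e)
            enddo
          enddo
        enddo
      enddo
``````

The point is that each of the three kernels then keeps only one triple of g-coefficients live at a time, which is what should lower the register pressure.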

-Mat

Hi Mat,

splitting the kernel into 3 smaller kernels appears to give a performance improvement. Total kernel time drops from 295 microseconds to 148 microseconds according to Nsight Compute. Register usage drops from 123 to 116, and SM utilization improves to 61%. See attachment number 1, where the pink bar is the kernel before optimization and the blue bar is with the optimizations.

But what I don’t understand is the profiling using nsys.

This is the run with no optimizations, showing just the axhelm_omp kernels:

``````
Time(%)  Total Time (ns)  Num Calls  Average   Minimum  Maximum  StdDev   Name
89,0     70803039700      1710428    41394,0   1480     2749216  77241,0  cuStreamSynchronize
``````
``````
Time(%)  Total Time (ns)  Num Calls  Average   Minimum  Maximum  StdDev  Name
3,0      1767560701       5824       303496,0  293118   314560   8119,0  nvkernel_axhelm_omp__F1L609_1_
1,0      814545577        5824       139860,0  137248   164256   3543,0  nvkernel_axhelm_omp__F1L645_2_
1,0      623002468        4588       135789,0  133697   137664   404,0   nvkernel_axhelm_omp__F1L666_3_
``````

And the following with optimizations:

``````
Time(%)  Total Time (ns)  Num Calls  Average   Minimum  Maximum   StdDev   Name
90,0     73628166186      1722076    42755,0   1450     12755155  79225,0  cuStreamSynchronize
``````
``````
Time(%)  Total Time (ns)  Num Calls  Average   Minimum  Maximum  StdDev  Name
3,0      1794139997       5824       308059,0  305568   315424   824,0   nvkernel_axhelm_omp__F1L655_3_
2,0      1088017794       5824       186816,0  178080   196480   7324,0  nvkernel_axhelm_omp__F1L611_1_
1,0      780491825        5824       134013,0  132256   139808   630,0   nvkernel_axhelm_omp__F1L683_4_
``````

I’m a bit confused. From ncu, axhelm_omp runs in half the time, but from nsys no performance improvement is detected. What do these results mean? Thanks.

I’m a bit confused here. By “optimization” do you mean the first set is the original version and the second uses the split kernels? If so, why aren’t there 5 kernels in the second set? If you split these, I’d expect to see two additional kernels.

Also, the baseline in the profile uses the “610” kernel compared to the “683” kernel. Even assuming the line numbers moved a bit, are you sure you’re comparing the same kernels? Shouldn’t 683 be compared to 666, 655 to 645, and 611 to 609? In that case, the optimized version is slower.

Hi Mat,

I’m a bit confused here. By “optimization” do you mean the first set is the original version and the second uses the split kernels?

Exactly

If so, why aren’t there 5 kernels in the second set? If you split these, I’d expect to see two additional kernels.

I launched ncu using the following command:

``````
srun ncu -o profile -f --target-processes all --kernel-name-base=function --kernel-regex axhelm_omp --launch-skip 297 --launch-count 1 "./nek5000"
``````

So I suppose that the ncu result is the aggregation of all the target regions inside the axhelm_omp Fortran subroutine, right? If not, I don’t understand which kernel my command is profiling.

No, each target region would get mapped to a single kernel which should be displayed separately in the profile.

Do you have a single target region with multiple teams regions? If so, that would produce a single kernel.

-Mat

Ok, but is there a way to profile all the target regions inside a Fortran subroutine (axhelm_omp), with separate results for each?

And, since in my case I have 5 target regions in the subroutine:

nvkernel_axhelm_omp__F1L655_3_
nvkernel_axhelm_omp__F1L631_2_
nvkernel_axhelm_omp__F1L611_1_
nvkernel_axhelm_omp__F1L681_4_
nvkernel_axhelm_omp__F1L712_5_

If I understand correctly, the last number indicates which of the target regions inside the axhelm_omp function the kernel corresponds to? So, for example, if I want to profile the second target region, I have to refer to nvkernel_axhelm_omp__F1L631_2_, right?

If yes, I don’t understand the following kernel profiled by nsys,

where add2s2_omp has only one target region inside it. What does the “25” mean in that case?

or

nvkernel_vlsc3_omp__F1L2353_82_

82 target regions in vlsc3_omp?? It is a very small routine:

``````
!$OMP TARGET TEAMS LOOP REDUCTION(+:dt)
!$OMP& MAP(TOFROM:dt)
      do i=1,n
        dt = dt + x(i)*y(i)*b(i)
      enddo
c     t=dt
      vlsc3_omp = dt
``````

Actually, I’m not sure, but I don’t think it’s the number of kernels in the file. I believe it’s just a product of the demangled C++ name, so I just ignore it.