Target region with 25% GPU occupancy

Dear Nvidia user,

I’m working on a target region that uses just 25% of GPU occupancy (A100). I’ve already done some optimization using the TEAMS LOOP and COLLAPSE directives, but I’d like to know whether it’s possible to do better:

!$OMP  TARGET DATA MAP(ALLOC:dudr,duds,dudt)
!$OMP&             MAP(TO:g1m1,g2m1,g3m1,g4m1,g5m1,g6m1)
!$OMP&             MAP(TO:dxm1,dxtm1,u,helm1,helm2,bm1)
!$OMP&             MAP(TOFROM:au)
!$OMP TARGET TEAMS LOOP COLLAPSE(4) PRIVATE(tmph1,tmpu1,tmpu2,tmpu3,l)
      do e=1,nelt    ! line 610
          do k=1,lz1
          do j=1,ly1
          do i=1,lx1
             tmph1 = helm1(i,j,k,e)
             tmpu1 = 0.0
             tmpu2 = 0.0
             tmpu3 = 0.0
             do l=1,lx1       ! line 618
                tmpu1 = tmpu1 + dxm1(i,l)*u(l,j,k,e)
                tmpu2 = tmpu2 + dxm1(j,l)*u(i,l,k,e)
                tmpu3 = tmpu3 + dxm1(k,l)*u(i,j,l,e)
             enddo
             wr = g1m1(i,j,k,e)*tmpu1
     $          + g4m1(i,j,k,e)*tmpu2
     $          + g5m1(i,j,k,e)*tmpu3

             ws = g2m1(i,j,k,e)*tmpu2
     $          + g4m1(i,j,k,e)*tmpu1
     $          + g6m1(i,j,k,e)*tmpu3

             wt = g3m1(i,j,k,e)*tmpu3
     $          + g5m1(i,j,k,e)*tmpu1
     $          + g6m1(i,j,k,e)*tmpu2

             dudr(i,j,k,e) = wr * tmph1
             duds(i,j,k,e) = ws * tmph1
             dudt(i,j,k,e) = wt * tmph1
          enddo
          enddo
          enddo
       enddo


!$OMP TARGET TEAMS LOOP COLLAPSE(4) PRIVATE(tmph1,tmpu1,tmpu2,tmpu3,l)
       do e=1,nelt     !line 646
          do k=1,lz1
          do j=1,ly1
          do i=1,lx1
             tmpu1 = 0.0
             tmpu2 = 0.0
             tmpu3 = 0.0
             do l=1,lx1
                tmpu1 = tmpu1 + dxtm1(i,l)*dudr(l,j,k,e)
                tmpu2 = tmpu2 + dxtm1(j,l)*duds(i,l,k,e)
                tmpu3 = tmpu3 + dxtm1(k,l)*dudt(i,j,l,e)
             enddo
             au(i,j,k,e) = tmpu1 + tmpu2 + tmpu3
          enddo
          enddo
          enddo
       enddo

!$OMP END TARGET DATA

The main problem is the temporary values in the inner loops, like tmpu1 and tmpu2, which inhibit collapsing all 5 loop levels. Is there a way to optimize this further? This is the compiler output:

609, !$omp target teams loop
        609, Generating "nvkernel_axhelm_omp__F1L609_1" GPU kernel
             Generating Tesla code
          610, Loop parallelized across teams, threads(128) collapse(4) ! blockidx%x threadidx%x
          611,   ! blockidx%x threadidx%x collapsed
          612,   ! blockidx%x threadidx%x collapsed
          613,   ! blockidx%x threadidx%x collapsed
          618, Loop run sequentially
        609, Generating Multicore code
          610, Loop parallelized across threads
    609, Generating implicit map(tofrom:g6m1(:,:,:,:),dudt(:,:,:,:),u(:,:,:,:),helm1(:,:,:,:),dxm1(:,:),g3m1(:,:,:,:),g4m1(:,:,:,:),g1m1(:,:,:,:),g5m1(:,:,:,:),g2m1(:,:,:,:),dudr(:,:,:,:),duds(:,:,:,:))
    610, Loop not vectorized/parallelized: too deeply nested
    611, Loop not vectorized/parallelized: too deeply nested
    613, Loop distributed: 3 new loops
         Loop interchange produces reordered loop nest: 618,613
         Loop unrolled 8 times (completely unrolled)
         Generated vector simd code for the loop
         FMA (fused multiply-add) instruction(s) generated
    618, Loop is parallelizable
645, !$omp target teams loop
        645, Generating "nvkernel_axhelm_omp__F1L645_2" GPU kernel
             Generating Tesla code
          646, Loop parallelized across teams, threads(128) collapse(4) ! blockidx%x threadidx%x
          647,   ! blockidx%x threadidx%x collapsed
          648,   ! blockidx%x threadidx%x collapsed
          649,   ! blockidx%x threadidx%x collapsed
          653, Loop run sequentially
        645, Generating Multicore code
          646, Loop parallelized across threads
    645, Generating implicit map(tofrom:dxtm1(:,:),dudt(:,:,:,:),au(:,:,:,:),dudr(:,:,:,:),duds(:,:,:,:))
    646, Loop not vectorized/parallelized: too deeply nested
    647, Loop not vectorized/parallelized: too deeply nested
    649, Loop distributed: 3 new loops
         Loop interchange produces reordered loop nest: 653,649
         Loop unrolled 8 times (completely unrolled)
         Generated vector simd code for the loop
         FMA (fused multiply-add) instruction(s) generated
    653, Loop is parallelizable

And values of loop counters:


 nelt:          9120
  lz1:             8
  ly1:             8
  lx1:             8

Thanks.

Is this theoretical or actual occupancy? (As measured by Nsight-Compute)

Is the low occupancy due to register usage, warp stalls, or something else?

The main problem is the temporary values in the inner loops, like tmpu1 and tmpu2, which inhibit collapsing all 5 loop levels.

The innermost loop is not nested, so it can’t be collapsed. If “lx1” were bigger (like 64 or 128), it might be beneficial to use “LOOP BIND(PARALLEL)” on it, but with a loop trip count of 8, that would make for a very small thread block.

No idea if this will help, but I would try splitting this into three loops to help with caching:

             do l=1,lx1
                tmpu1 = tmpu1 + dxtm1(i,l)*dudr(l,j,k,e)
             enddo
             do l=1,lx1
                tmpu2 = tmpu2 + dxtm1(j,l)*duds(i,l,k,e)
             enddo
             do l=1,lx1
                tmpu3 = tmpu3 + dxtm1(k,l)*dudt(i,j,l,e)
             enddo

Schedule-wise, I’d also try something like:

!$OMP TARGET TEAMS LOOP BIND(TEAMS)
       do e=1,nelt     !line 646
!$OMP LOOP COLLAPSE(3) BIND(PARALLEL) PRIVATE(tmph1,tmpu1,tmpu2,tmpu3,l)
          do k=1,lz1
          do j=1,ly1
          do i=1,lx1

This will have the outer loop map to the blocks and the inner loops map to the threads within a block. This may give you better data access and fewer warps stalled waiting for memory (assuming that’s the cause of the low occupancy).
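Filled in for your second kernel, that schedule would look something like this (an untested sketch; the loop body is unchanged and it stays inside your existing TARGET DATA region, so the map clauses don’t change):

!$OMP TARGET TEAMS LOOP BIND(TEAMS)
       do e=1,nelt               ! elements -> teams (blocks)
!$OMP LOOP COLLAPSE(3) BIND(PARALLEL) PRIVATE(tmpu1,tmpu2,tmpu3,l)
          do k=1,lz1             ! (k,j,i) -> threads in a block
          do j=1,ly1
          do i=1,lx1
             tmpu1 = 0.0
             tmpu2 = 0.0
             tmpu3 = 0.0
             do l=1,lx1          ! still sequential per thread
                tmpu1 = tmpu1 + dxtm1(i,l)*dudr(l,j,k,e)
                tmpu2 = tmpu2 + dxtm1(j,l)*duds(i,l,k,e)
                tmpu3 = tmpu3 + dxtm1(k,l)*dudt(i,j,l,e)
             enddo
             au(i,j,k,e) = tmpu1 + tmpu2 + tmpu3
          enddo
          enddo
          enddo
       enddo

With lx1=ly1=lz1=8 that gives nelt blocks, with the 512 (8x8x8) collapsed iterations spread over the threads of each block.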

Hi Mat

Is this theoretical or actual occupancy? (As measured by Nsight-Compute)

The actual occupancy is very low: 23% from Nsight-Compute

Is the low occupancy due to register usage, warp stalls, or something else?

Due to register usage. Attached are the Nsight Compute screenshots (1-2). But I don’t know how to reduce register usage when using OMP offload. There is also very high memory usage.

This will have the outer loop map to the blocks and the inner loops map to the threads within a block. This may give you better data access and fewer warps stalled waiting for memory (assuming that’s the cause of the low occupancy).

I tried your optimizations with no performance improvement. Attached is the Nsight Compute screenshot with the optimizations.

Looks like the code is memory bound, so increasing the occupancy may or may not help much. The good news is that the code is almost at 90% SOL (speed of light) on memory utilization. Basically, your code is mostly streaming memory with little opportunity for re-use.

Though you may look at splitting the first kernel into 3, one each for computing dudr, duds, and dudt. This should reduce the register usage but means computing the tmpu’s each time. It may not be beneficial, but it’s worth an experiment.
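For example, the dudr part of that split might look something like this (an untested sketch based on your original loop; the duds and dudt kernels would be analogous):

!$OMP TARGET TEAMS LOOP COLLAPSE(4) PRIVATE(tmpu1,tmpu2,tmpu3,l)
      do e=1,nelt
          do k=1,lz1
          do j=1,ly1
          do i=1,lx1
             tmpu1 = 0.0
             tmpu2 = 0.0
             tmpu3 = 0.0
             do l=1,lx1
                tmpu1 = tmpu1 + dxm1(i,l)*u(l,j,k,e)
                tmpu2 = tmpu2 + dxm1(j,l)*u(i,l,k,e)
                tmpu3 = tmpu3 + dxm1(k,l)*u(i,j,l,e)
             enddo
c            dudr only; duds and dudt go in their own kernels
             dudr(i,j,k,e) = helm1(i,j,k,e)*( g1m1(i,j,k,e)*tmpu1
     $          + g4m1(i,j,k,e)*tmpu2
     $          + g5m1(i,j,k,e)*tmpu3 )
          enddo
          enddo
          enddo
       enddo

Each kernel still computes all three tmpu sums, but it only reads the g arrays it needs and writes a single output array, which is where the register savings would come from.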

-Mat

Hi Mat,

Splitting the kernel into 3 smaller kernels appears to give a performance improvement. Total kernel time drops from 295 microseconds to 148 microseconds according to Nsight Compute. Register usage drops from 123 to 116, and SM utilization improves to 61%. See attachment number 1, where the pink bar is the kernel before the optimizations and the blue bar is with the optimizations.

But what I don’t understand is the profiling output from nsys.

This is the run with no optimizations, showing just the axhelm_omp kernels:

Time(%)  Total Time (ns)  Num Calls   Average    Minimum   Maximum    StdDev               Name
 89,0      70803039700    1710428     41394,0      1480   2749216    77241,0  cuStreamSynchronize
Time(%)  Total Time (ns)  Num Calls   Average    Minimum   Maximum    StdDev               Name
 3,0       1767560701       5824   303496,0   293118   314560   8119,0  nvkernel_axhelm_omp__F1L609_1_             
 1,0        814545577       5824   139860,0   137248   164256   3543,0 nvkernel_axhelm_omp__F1L645_2_   
 1,0        623002468       4588   135789,0   133697   137664    404,0  nvkernel_axhelm_omp__F1L666_3_ 

And the following with optimizations:

Time(%)  Total Time (ns)  Num Calls   Average    Minimum   Maximum    StdDev               Name
 90,0      73628166186    1722076       42755,0          1450      12755155      79225,0  cuStreamSynchronize  
Time(%)  Total Time (ns)  Num Calls   Average    Minimum   Maximum    StdDev               Name
 3,0       1794139997       5824      308059,0        305568        315424        824,0 nvkernel_axhelm_omp__F1L655_3_   
 2,0       1088017794       5824      186816,0        178080        196480       7324,0  nvkernel_axhelm_omp__F1L611_1_  
 1,0        780491825       5824      134013,0        132256        139808        630,0  nvkernel_axhelm_omp__F1L683_4_             

I’m a bit confused. From ncu, axhelm_omp runs in half the time, but from nsys no performance improvement is detected. What do these results mean? Thanks.

I’m a bit confused here. By “optimization” do you mean the first set is the original version and the second uses the split kernels? If so, why aren’t there 5 kernels in the second set? If you split these, I’d expect to see two additional kernels.

Also, the baseline in the profile uses the “610” kernel compared to the “683” kernel. Though even assuming the line numbers moved a bit, are you sure you’re comparing the same kernel? Should 683 be compared to 666, 655 to 645, and 611 to 609? In that case, the optimized version is slower.

Hi Mat,

I’m a bit confused here. By “optimization” do you mean the first set is the original version and the second uses the split kernels?

Exactly

If so, why aren’t there 5 kernels in the second set? If you split these, I’d expect to see two additional kernels.

I launched ncu using the following command:

srun ncu -o profile -f --target-processes all --kernel-name-base=function --kernel-regex axhelm_omp --launch-skip 297 --launch-count 1 "./nek5000"

So I suppose the ncu result is an aggregation of all the target regions inside the axhelm_omp Fortran subroutine, right? If not, I don’t understand which kernel my command is profiling.

No, each target region would get mapped to a single kernel which should be displayed separately in the profile.

Do you have a single target region with multiple teams regions? If so, that would produce a single kernel.

-Mat

OK, but:

Is there a way to profile all the target regions inside a Fortran subroutine (axhelm_omp), with a separate profile for each?

And, since in my case I have 5 target regions in the subroutine:

nvkernel_axhelm_omp__F1L655_3_
nvkernel_axhelm_omp__F1L631_2_
nvkernel_axhelm_omp__F1L611_1_
nvkernel_axhelm_omp__F1L681_4_
nvkernel_axhelm_omp__F1L712_5_

If I understand correctly, does the last number indicate which target region inside the axhelm_omp function the kernel corresponds to? So, for example, if I want to profile the second target region, I should refer to nvkernel_axhelm_omp__F1L631_2_, right?

If so, I don’t understand the following kernel profiled by nsys:

nvkernel_add2s2_omp__F1L1958_25_

where add2s2_omp has only one target region inside it. What does the “25” mean in that case?

or

nvkernel_vlsc3_omp__F1L2353_82_

82 target regions in vlsc3_omp?? It is a very small routine:

!$OMP TARGET TEAMS LOOP REDUCTION(+:dt)
!$OMP& MAP(TOFROM:dt)
      do i=1,n
         dt = dt + x(i)*y(i)*b(i)
      enddo
c     t=dt
      vlsc3_omp = dt


Actually, I’m not sure, but I don’t think it’s the number of kernels in the file. I believe it’s just a product of the demangled C++ name, so I just ignore it.
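If you do want a separate report for each region, you should be able to narrow the filter in your ncu command from the subroutine name to the full kernel name, something along these lines (untested; you may need to adjust --launch-skip accordingly):

srun ncu -o profile_region2 -f --target-processes all --kernel-name-base=function --kernel-regex nvkernel_axhelm_omp__F1L631_2_ --launch-skip 297 --launch-count 1 "./nek5000"

Repeating that with each of the five names would give you one report per target region.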