Under nvfortran 25.3, -stdpar=gpu -acc=gpu -gpu=mem:separate -O3 is still slow

As discussed here:

-O3 should give full parallelism at the collapse(3) level when using -stdpar=gpu and do concurrent.

But after adding -acc=gpu -gpu=mem:separate, it becomes slow again.
Option -Minfo=accel shows similar information to the -stdpar=gpu build without -O3:

    327, Loop run sequentially 
         Loop parallelized across CUDA threads(128) ! threadidx%x
         Loop parallelized across CUDA thread blocks ! blockidx%x

So how can I make it fast with non-managed memory (-gpu=mem:separate or -gpu=nomanaged)?
Thanks!

Likely you need to add OpenACC data management directives; otherwise the compiler must implicitly copy the data for you. It does this at each DC loop, which can lead to excessive data movement.
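
For example, wrapping the loop in a structured data region keeps the arrays resident on the device for the whole region instead of being implicitly copied around the DC loop. A minimal sketch with placeholder names (a, b, c and the loop extents are stand-ins for your actual arrays):

!$ACC DATA COPYIN(a,b) COPYOUT(c)
      do concurrent (k=1:nt,j=1:ny,i=1:nx)
         c(i,j,k) = a(i,j,k) + b(i,j,k)   ! your DC loop body here
      enddo
!$ACC END DATA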

Though you should use Nsight Systems to profile the code and determine the exact performance bottleneck.

I added OpenACC data management directives, and the result data are correct. The problem is that -O3 fails to speed up a “do concurrent (k=1:nt,j=1:ny,i=1:nx)” structure. In this case, there is only one heavily loaded DC loop, shown at the bottom.

Under unified memory mode, we guessed that the NC interface does not check whether data has been transferred to the host (this may not be true). So we tried non-managed mode, and it is slow.

Besides, under unified memory with an OpenACC directive (-stdpar=gpu -acc=gpu), just adding one data transfer directive line, “!$ACC UPDATE HOST (minM, EDH, edh_insty)”, leads to a slowdown. It’s at line 379 below.

      do concurrent (k=1:nt,j=1:ny,i=1:nx)
359           d2m0  = d2m(i,j,k) - 273.15
360           t2m0  = t2m(i,j,k) - 273.15
361           sst0  = sst(i,j,k) - 273.15
362           msl0  = msl(i,j,k)/100
363           vel0  = vel(i,j,k)
367           rh=10.**(7.5* (d2m0/(237.3+d2m0)-t2m0/(237.3+t2m0))+2.)
368           z_r = 2.
373           call GetE(sst(i,j,k),t2m(i,j,k),msl0,rh,z_r,
374      * vel0, 0.00001, ME, EDH0, edh_insty0)
375           minM(i,j,k)=ME
376           EDH(i,j,k)=EDH0
377           edh_insty(i,j,k)=edh_insty0
378       enddo
379 !$ACC UPDATE HOST (minM, EDH, edh_insty)
380 

In conclusion, the GPU computation times under nvfortran 25.3 are:
16s : -stdpar=gpu
2.3s : -stdpar=gpu -O3
16s : -stdpar=gpu -O3 -acc=gpu -gpu=mem:separate
16s : -stdpar=gpu -O3 -acc=gpu
2.3s : -O3 -acc=gpu (DC loop rewritten in OpenACC form)

Thanks!

It doesn’t check whether the data has been transferred, only whether it’s present on the device. With unified memory, the data is always present. Also, it’s the CUDA driver that handles the data movement, and it only does the transfer when a page is “dirty” (i.e., modified).

Ideally, you want to minimize data movement by offloading as much compute as possible and avoiding touching the data on the host until necessary. Adding an update after the offload region will slow down the code. Though if you do need the data on the host at this point, then you’ll need to accept the cost.
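
For instance, if more device work follows, keep the results resident between steps and update the host only once at the end. A rough sketch (the second loop is hypothetical):

      do concurrent (k=1:nt,j=1:ny,i=1:nx)
!        ... first compute step; minM, EDH, edh_insty stay on the GPU
      enddo
      do concurrent (k=1:nt,j=1:ny,i=1:nx)
!        ... later compute that reuses those device copies
      enddo
!$ACC UPDATE HOST(minM, EDH, edh_insty)
!     only now does the host read the results, e.g. for output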

16s : -stdpar=gpu
2.3s : -stdpar=gpu -O3

My best guess here is that at -O3, the back-end device code generator is able to inline “GetE”. Device-side calls can be costly since they use about 150 registers, which lowers the occupancy of the kernel.
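
If inlining is the difference, one thing you could try (an untested guess on my part, not something I have confirmed with your code) is requesting the inline explicitly without -O3, for example:

    nvfortran -stdpar=gpu -Minline=GetE -Minfo=accel,inline your_source.f

-Minfo=inline should then report whether GetE actually gets inlined.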

Thanks a lot, MAT!

It seems to me that this is a compiler problem.

I guess that -O3 makes -stdpar=gpu treat “do concurrent (k=1:nt,j=1:ny,i=1:nx)” as

!$ACC PARALLEL LOOP COLLAPSE(3) GANG VECTOR 
      do k=1,nt
      do j=1,ny
      do i=1,nx

, but if -acc=gpu also appears, -O3 works on -acc=gpu instead of -stdpar=gpu, which makes -stdpar=gpu return to its default low-speed behavior under nvfortran 25.3.

-Minfo=accel shows information for this DC loop:

-stdpar=gpu -O3 (nvfortran 25.3):

205, Generating NVIDIA GPU code
    205,   ! blockidx%x threadidx%x auto-collapsed
         Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(3) ! blockidx%x threadidx%x
    216, Loop run sequentially 
         Generating implicit reduction(min:..inline)

-stdpar=gpu (nvfortran 25.3):

205, Generating NVIDIA GPU code
    205, Loop run sequentially 
         Loop parallelized across CUDA threads(128) ! threadidx%x
         Loop parallelized across CUDA thread blocks ! blockidx%x

-stdpar=gpu (nvfortran 24.7):

205, Generating NVIDIA GPU code
    205,   ! blockidx%x threadidx%x auto-collapsed
         Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(3) ! blockidx%x threadidx%x

-stdpar=gpu -O3 -acc=gpu -gpu=mem:separate (nvfortran 25.3) or
-stdpar=gpu -O3 -acc=gpu (nvfortran 25.3):

    205, Loop run sequentially 
         Loop parallelized across CUDA threads(128) ! threadidx%x
         Loop parallelized across CUDA thread blocks ! blockidx%x

It is apparent that -acc=gpu and -O3 together under 25.3 disable collapse(3). That might be an issue to be fixed. Or is there an option like “-O3=stdpar” that would make the intent clearer?

The compiler is likely determining that there’s a potential dependency, so it is not able to parallelize the loops. Fortran passes variables by reference by default. Given this, if a global (typically a module scalar) is passed into the routine, its address could potentially be taken, creating a dependency. While rare, it can happen, so the compiler must assume it does. However, when the subroutine gets inlined at -O3, its body becomes visible and the compiler “sees” that there’s no dependency.
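
As a contrived illustration (not your code): if the caller passed the module scalar z_ref as the actual argument for z below, the assignment inside the routine would also change z. Without seeing the body, the compiler has to assume something like this is possible:

      module duct_params
         real :: z_ref                  ! module scalar
      end module

      subroutine get_e_like(z, out)
         use duct_params
         real :: z, out                 ! z arrives by reference
         z_ref = 2.                     ! may alias z in the caller
         out = z
      end subroutine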

If you can provide a reproducing example, I can confirm my guess here. I’m not sure if 25.3 is being too cautious or 24.7 wasn’t being cautious enough.

Note that oftentimes this can be fixed by using the “value” attribute on read-only arguments in the subroutine’s declaration section.
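
For GetE that would look roughly like the following; I am guessing the interface from your call site, so treat the names and types as placeholders:

      subroutine GetE(sst0, t2m0, msl0, rh, z_r, vel0, tol,
     *                ME, EDH0, edh_insty0)
! value dummies are local copies, so they cannot alias caller data
      real, value :: sst0, t2m0, msl0, rh, z_r, vel0, tol
! results still returned through reference arguments
      real :: ME, EDH0, edh_insty0
! ... existing computation unchanged ...
      end subroutine GetE

Keep in mind that a routine with value dummies needs an explicit interface at the call site (for example, by putting GetE in a module or adding an interface block).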

cmin3_compile_log.zip (3.3 KB)

Here is an example. In this zip file, “cmin3.f” is the code and “cmin3_compile_log.txt” is the compilation log.

Here I used a single file, “cmin3.f”, to illustrate the different results under different compilation options. In actual use, when options like -acc=gpu are employed, statements such as “!$ACC UPDATE HOST (minM, EDH, edh_insty)” are added.

Please check it.

Thanks, MAT!