Under nvfortran 25.3, -stdpar=gpu -acc=gpu -gpu=mem:separate -O3 is still slow

As discussed here:

-O3 should give full parallelism at the collapse(3) level when using -stdpar=gpu and do concurrent.

But after adding -acc=gpu -gpu=mem:separate, it becomes slow again.
Option -Minfo=accel shows similar information to the -stdpar=gpu build without -O3:

    327, Loop run sequentially 
         Loop parallelized across CUDA threads(128) ! threadidx%x
         Loop parallelized across CUDA thread blocks ! blockidx%x

So how can I make it fast with non-managed memory (-gpu=mem:separate or -gpu=nomanaged)?
Thanks!

Likely you need to add OpenACC data management directives; otherwise the compiler must implicitly copy the data for you. It does this at each DC loop, which can lead to excessive data movement.
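
For example, wrapping the loop in a structured data region keeps the arrays resident on the device for the whole region instead of being implicitly copied around the DC loop. A minimal sketch with placeholder names (a, b, c and the loop extents are stand-ins for your actual arrays):

!$ACC DATA COPYIN(a,b) COPYOUT(c)
      do concurrent (k=1:nt,j=1:ny,i=1:nx)
         c(i,j,k) = a(i,j,k) + b(i,j,k)   ! your DC loop body here
      enddo
!$ACC END DATA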

Though you should use Nsight Systems to profile the code and determine the exact performance bottleneck.

I added OpenACC data management directives, and the result data are correct. The problem is that -O3 fails to speed up a “do concurrent (k=1:nt,j=1:ny,i=1:nx)” structure. In this case, there is only one heavily loaded DC loop, shown at the bottom.

Under unified memory mode, we guessed that the NC interface does not check whether data has been transferred to the host (this may not be true). So we tried non-managed mode, and it is slow.

Besides, under unified memory with an OpenACC directive (-stdpar=gpu -acc=gpu), just adding one data transfer directive line, “!$ACC UPDATE HOST (minM, EDH, edh_insty)”, leads to a slowdown. It’s at line 379 below.

      do concurrent (k=1:nt,j=1:ny,i=1:nx)
359           d2m0  = d2m(i,j,k) - 273.15
360           t2m0  = t2m(i,j,k) - 273.15
361           sst0  = sst(i,j,k) - 273.15
362           msl0  = msl(i,j,k)/100
363           vel0  = vel(i,j,k)
367           rh=10.**(7.5* (d2m0/(237.3+d2m0)-t2m0/(237.3+t2m0))+2.)
368           z_r = 2.
373           call GetE(sst(i,j,k),t2m(i,j,k),msl0,rh,z_r,
374      * vel0, 0.00001, ME, EDH0, edh_insty0)
375           minM(i,j,k)=ME
376           EDH(i,j,k)=EDH0
377           edh_insty(i,j,k)=edh_insty0
378       enddo
379 !$ACC UPDATE HOST (minM, EDH, edh_insty)
380 

In conclusion, the GPU computation times under nvfortran 25.3 are:
16s : -stdpar=gpu
2.3s : -stdpar=gpu -O3
16s : -stdpar=gpu -O3 -acc=gpu -gpu=mem:separate
16s : -stdpar=gpu -O3 -acc=gpu
2.3s : -O3 -acc=gpu (DC loop rewritten in OpenACC form)

Thanks!

It doesn’t check whether the data has been transferred, only whether it’s present on the device. With unified memory, the data is always present. Also, it’s the CUDA driver that handles the data movement, and it only does the transfer when a page is “dirty” (i.e., modified).

Ideally, you want to minimize data movement by offloading as much compute as possible and avoiding touching the data on the host until necessary. Adding an update after the offload region will slow down the code. Though if you do need the data on the host at this point, then you’ll need to accept the cost.
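
For instance, if more device work follows, keep the results resident between steps and update the host only once at the end. A rough sketch (the second loop is hypothetical):

      do concurrent (k=1:nt,j=1:ny,i=1:nx)
!        ... first compute step; minM, EDH, edh_insty stay on the GPU
      enddo
      do concurrent (k=1:nt,j=1:ny,i=1:nx)
!        ... later compute that reuses those device copies
      enddo
!$ACC UPDATE HOST(minM, EDH, edh_insty)
!     only now does the host read the results, e.g. for output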

16s : -stdpar=gpu
2.3s : -stdpar=gpu -O3

My best guess here is that at -O3, the back-end device code generator is able to inline “GetE”. Device-side calls can be costly since they use about 150 registers, which lowers the occupancy of the kernel.
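
If inlining is the difference, one thing you could try (an untested guess on my part, not something I have confirmed with your code) is requesting the inline explicitly without -O3, for example:

    nvfortran -stdpar=gpu -Minline=GetE -Minfo=accel,inline your_source.f

-Minfo=inline should then report whether GetE actually gets inlined.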

Thanks a lot, MAT!

It seems to me that this is a compiler problem.

I guess that -O3 makes -stdpar=gpu treat “do concurrent (k=1:nt,j=1:ny,i=1:nx)” as

!$ACC PARALLEL LOOP COLLAPSE(3) GANG VECTOR 
      do k=1,nt
      do j=1,ny
      do i=1,nx

, but if -acc=gpu also appears, -O3 works on -acc=gpu instead of -stdpar=gpu, which makes -stdpar=gpu return to its default low-speed behavior under nvfortran 25.3.

-Minfo=accel shows information for this DC loop:

-stdpar=gpu -O3 (nvfortran 25.3):

205, Generating NVIDIA GPU code
    205,   ! blockidx%x threadidx%x auto-collapsed
         Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(3) ! blockidx%x threadidx%x
    216, Loop run sequentially 
         Generating implicit reduction(min:..inline)

-stdpar=gpu (nvfortran 25.3):

205, Generating NVIDIA GPU code
    205, Loop run sequentially 
         Loop parallelized across CUDA threads(128) ! threadidx%x
         Loop parallelized across CUDA thread blocks ! blockidx%x

-stdpar=gpu (nvfortran 24.7):

205, Generating NVIDIA GPU code
    205,   ! blockidx%x threadidx%x auto-collapsed
         Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(3) ! blockidx%x threadidx%x

-stdpar=gpu -O3 -acc=gpu -gpu=mem:separate (nvfortran 25.3) or
-stdpar=gpu -O3 -acc=gpu (nvfortran 25.3):

    205, Loop run sequentially 
         Loop parallelized across CUDA threads(128) ! threadidx%x
         Loop parallelized across CUDA thread blocks ! blockidx%x

It is apparent that -acc=gpu and -O3 together under 25.3 disable collapse(3). That might be an issue to be fixed. Or is there an option like “-O3=stdpar” that would make the intent clearer?

The compiler is likely determining that there’s a potential dependency, so it is not able to parallelize the loops. Fortran passes variables by reference by default. Given this, if a global (typically a module scalar) is passed into the routine, its address could potentially be taken, creating a dependency. While rare, it can happen, so the compiler must assume it does. However, when the subroutine gets inlined at -O3, its body becomes visible and the compiler “sees” that there’s no dependency.
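
As a contrived illustration (not your code): if the caller passed the module scalar z_ref as the actual argument for z below, the assignment inside the routine would also change z. Without seeing the body, the compiler has to assume something like this is possible:

      module duct_params
         real :: z_ref                  ! module scalar
      end module

      subroutine get_e_like(z, out)
         use duct_params
         real :: z, out                 ! z arrives by reference
         z_ref = 2.                     ! may alias z in the caller
         out = z
      end subroutine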

If you can provide a reproducing example, I can confirm my guess here. I’m not sure if 25.3 is being too cautious or 24.7 wasn’t being cautious enough.

Note that oftentimes this can be fixed by using the “value” attribute on read-only arguments in the subroutine’s declaration section.
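
For GetE that would look roughly like the following; I am guessing the interface from your call site, so treat the names and types as placeholders:

      subroutine GetE(sst0, t2m0, msl0, rh, z_r, vel0, tol,
     *                ME, EDH0, edh_insty0)
! value dummies are local copies, so they cannot alias caller data
      real, value :: sst0, t2m0, msl0, rh, z_r, vel0, tol
! results still returned through reference arguments
      real :: ME, EDH0, edh_insty0
! ... existing computation unchanged ...
      end subroutine GetE

Keep in mind that a routine with value dummies needs an explicit interface at the call site (for example, by putting GetE in a module or adding an interface block).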

cmin3_compile_log.zip (3.3 KB)

Here is an example. In this zip file, “cmin3.f” is the code and “cmin3_compile_log.txt” is the compilation log.

Here I used a single file, “cmin3.f”, to illustrate the different results under different compilation options. In actual use, when options like -acc=gpu are employed, statements such as “!$ACC UPDATE HOST (minM, EDH, edh_insty)” are added.

Please check it.

Thanks, MAT!