Nvfortran implicit reduction is faster than explicit reduction?

I am learning OpenACC for Fortran and started with these slides, as I wanted to begin with an elliptic PDE solver. Slide 39, titled “Laplace Solver (OpenACC in Fortran, version 3)”, has an incomplete Fortran code listing, which I’ve completed below:

program laplace_gpu

implicit none

integer, parameter :: rows = 10000, columns = 10000, max_iterations = 10000
real, parameter    :: max_temp_error = 1.0e-6

integer :: i, j, iteration
real    :: start_time, end_time, dt, A_new(rows + 2, columns + 2), A(rows + 2, columns + 2)

A = 0.0  ! initialize the interior; boundary values are set below

do i = 2, rows + 1
    A(i, 1)           = 1.0
    A(i, columns + 2) = 2.0
end do

do j = 2, columns + 1
    A(1, j)        = 3.0
    A(rows + 2, j) = 4.0
end do

dt        = 10.0*max_temp_error
iteration = 0

call cpu_time(start_time)

!$acc data copy(A), create(A_new)
do while ( dt > max_temp_error .and. iteration <= max_iterations)
    !$acc parallel loop
    do j=2,columns+1
        do i=2,rows+1
            A_new(i,j)=0.25*(A(i+1,j)+A(i-1,j)+ A(i,j+1)+A(i,j-1) )
        enddo
    enddo
    !$acc end parallel loop
    dt=0.0
    !$acc parallel loop
    do j=2,columns+1
        do i=2,rows+1
            dt = max( abs(A_new(i,j) - A(i,j)), dt )
            A(i,j) = A_new(i,j)
        enddo
    enddo
    !$acc end parallel loop
    iteration = iteration+1
enddo
!$acc end data

call cpu_time(end_time)

print *, "Final iteration number:", iteration - 1
print *, end_time - start_time

end program laplace_gpu

I can compile this with nvfortran -acc -fast -Minfo=all -Mneginfo=all laplace_gpu.f90. The run time printed at the end is 163.9991 seconds on my older Nvidia GPU. The info printout says Generating implicit reduction(max:dt) so I suppose nvfortran is smart enough to figure out what is needed.

Note that in slide 39, the presenter has a side note: “Explicitly specify the reduction operator and variable.” However, the associated parallel region has no explicitly specified reduction, which I guess was an oversight. If I modify the second !$acc parallel loop line to be !$acc parallel loop reduction(max:dt), recompile, and run, I find that the code takes 243.6209 seconds. The only difference in the info printout is that it now says Generating reduction(max:dt) (no “implicit” word), which makes sense as the reduction is now explicit.

Why is my explicit reduction slower than my implicit reduction? I think I must be doing something wrong.

Side note: I believe this issue will also be present in version 25.9, but I’m using 25.7. I can’t use nvfortran 25.9 because I have an older Pascal GPU, and the release notes for 25.9 say “Maxwell, Pascal, and Volta GPUs are no longer supported starting with CUDA 13.0.” If I try 25.9, I just get a somewhat vague compiler error: error: -arch=compute_61 is an unsupported option.

Hi btrettel2,

Can you show how you added the reduction clauses?

When I try with 25.7, my times go from 16 seconds down to 15 seconds on an H100, so a slight improvement.

    !$acc parallel loop copy(dt) reduction(max:dt)
    do j=2,columns+1
        !$acc loop reduction(max:dt)
        do i=2,rows+1
            dt = max( abs(A_new(i,j) - A(i,j)), dt )
            A(i,j) = A_new(i,j)
        enddo
    enddo
    !$acc end parallel loop

Note that if you download the 25.9 multi-cuda package, you’ll get both CUDA 13.0 and 12.9. So if you use CUDA 12.9, you should be able to target CC61 by adding “-gpu=cuda12.9”.

Also, if you have an older CUDA SDK install, you can set the environment variable “NVHPC_CUDA_HOME” to the full path to this installation and the NVHPC compilers will use it instead.
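To make these two options concrete, here is a minimal sketch (the CUDA install path below is a placeholder, and the compile line is shown as a comment since it assumes an NVHPC 25.9 multi-cuda install; the source file name is just the example from this thread):

```shell
# Option 1: compile for CC61 using the CUDA 12.9 toolkit bundled with the
# 25.9 multi-cuda package:
# nvfortran -acc -fast -Minfo=all -gpu=cc61,cuda12.9 laplace_gpu.f90

# Option 2: point the NVHPC compilers at an existing CUDA SDK install.
# The path below is a placeholder; use your actual installation directory.
export NVHPC_CUDA_HOME=/usr/local/cuda-12.9
echo "NVHPC_CUDA_HOME is set to: $NVHPC_CUDA_HOME"
```

With option 2, any subsequent nvfortran invocation in that shell will pick up the toolkit from NVHPC_CUDA_HOME instead of the bundled default.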

It’s really CUDA 13.0 where support for these older devices was dropped. The compilers follow suit in that we’ve also dropped support, but in practice this means we no longer test these devices and won’t fix issues specific to targeting them. We haven’t removed anything, though, so I’d expect them to still work with older CUDA versions.

-Mat


@MatColgrove, I appreciate your detailed reply.

I installed the multi-cuda package and got it working, so I’m now using nvfortran 25.9.

At the bottom of this comment I’ve put three code listings. I ran nvfortran 25.9 on all of them; the timings are below:

  • laplace_gpu_1 (implicit reduction): 162.5680 s
  • laplace_gpu_2 (how I added the reduction before): 238.5115 s
  • laplace_gpu_3 (how you added the reduction): 238.5305 s

So your approach is still slower than approach 1 on my old GPU.

Can you try CC61 to see whether that causes the slowdown I’m seeing? I assume Nvidia’s newer GPUs are backwards compatible. If the problem is due to the older compute capability, that’s fine; this is mostly for learning at this point. If the older CC doesn’t reproduce the problem, then I’ll just not worry about it.

Thanks again.


laplace_gpu_1:

program laplace_gpu_1

implicit none

integer, parameter :: rows = 10000, columns = 10000, max_iterations = 10000
real, parameter    :: max_temp_error = 1.0e-6

integer :: i, j, iteration
real    :: start_time, end_time, dt, A_new(rows + 2, columns + 2), A(rows + 2, columns + 2)

A = 0.0  ! initialize the interior; boundary values are set below

do i = 2, rows + 1
    A(i, 1)           = 1.0
    A(i, columns + 2) = 2.0
end do

do j = 2, columns + 1
    A(1, j)        = 3.0
    A(rows + 2, j) = 4.0
end do

dt        = 10.0*max_temp_error
iteration = 0

call cpu_time(start_time)

!$acc data copy(A), create(A_new)
do while ( dt > max_temp_error .and. iteration <= max_iterations)
    !$acc parallel loop
    do j=2,columns+1
        do i=2,rows+1
            A_new(i,j)=0.25*(A(i+1,j)+A(i-1,j)+ A(i,j+1)+A(i,j-1) )
        enddo
    enddo
    !$acc end parallel loop
    dt=0.0
    !$acc parallel loop
    do j=2,columns+1
        do i=2,rows+1
            dt = max( abs(A_new(i,j) - A(i,j)), dt )
            A(i,j) = A_new(i,j)
        enddo
    enddo
    !$acc end parallel loop
    iteration = iteration+1
enddo
!$acc end data

call cpu_time(end_time)

print *, "laplace_gpu_1"
print *, "Final iteration number:", iteration - 1
print *, end_time - start_time

end program laplace_gpu_1

laplace_gpu_2:

program laplace_gpu_2

implicit none

integer, parameter :: rows = 10000, columns = 10000, max_iterations = 10000
real, parameter    :: max_temp_error = 1.0e-6

integer :: i, j, iteration
real    :: start_time, end_time, dt, A_new(rows + 2, columns + 2), A(rows + 2, columns + 2)

A = 0.0  ! initialize the interior; boundary values are set below

do i = 2, rows + 1
    A(i, 1)           = 1.0
    A(i, columns + 2) = 2.0
end do

do j = 2, columns + 1
    A(1, j)        = 3.0
    A(rows + 2, j) = 4.0
end do

dt        = 10.0*max_temp_error
iteration = 0

call cpu_time(start_time)

!$acc data copy(A), create(A_new)
do while ( dt > max_temp_error .and. iteration <= max_iterations)
    !$acc parallel loop
    do j=2,columns+1
        do i=2,rows+1
            A_new(i,j)=0.25*(A(i+1,j)+A(i-1,j)+ A(i,j+1)+A(i,j-1) )
        enddo
    enddo
    !$acc end parallel loop
    dt=0.0
    !$acc parallel loop reduction(max:dt)
    do j=2,columns+1
        do i=2,rows+1
            dt = max( abs(A_new(i,j) - A(i,j)), dt )
            A(i,j) = A_new(i,j)
        enddo
    enddo
    !$acc end parallel loop
    iteration = iteration+1
enddo
!$acc end data

call cpu_time(end_time)

print *, "laplace_gpu_2"
print *, "Final iteration number:", iteration - 1
print *, end_time - start_time

end program laplace_gpu_2

laplace_gpu_3:

program laplace_gpu_3

implicit none

integer, parameter :: rows = 10000, columns = 10000, max_iterations = 10000
real, parameter    :: max_temp_error = 1.0e-6

integer :: i, j, iteration
real    :: start_time, end_time, dt, A_new(rows + 2, columns + 2), A(rows + 2, columns + 2)

A = 0.0  ! initialize the interior; boundary values are set below

do i = 2, rows + 1
    A(i, 1)           = 1.0
    A(i, columns + 2) = 2.0
end do

do j = 2, columns + 1
    A(1, j)        = 3.0
    A(rows + 2, j) = 4.0
end do

dt        = 10.0*max_temp_error
iteration = 0

call cpu_time(start_time)

!$acc data copy(A), create(A_new)
do while ( dt > max_temp_error .and. iteration <= max_iterations)
    !$acc parallel loop
    do j=2,columns+1
        do i=2,rows+1
            A_new(i,j)=0.25*(A(i+1,j)+A(i-1,j)+ A(i,j+1)+A(i,j-1) )
        enddo
    enddo
    !$acc end parallel loop
    dt=0.0
    !$acc parallel loop copy(dt) reduction(max:dt)
    do j=2,columns+1
        !$acc loop reduction(max:dt)
        do i=2,rows+1
            dt = max( abs(A_new(i,j) - A(i,j)), dt )
            A(i,j) = A_new(i,j)
        enddo
    enddo
    !$acc end parallel loop
    iteration = iteration+1
enddo
!$acc end data

call cpu_time(end_time)

print *, "laplace_gpu_3"
print *, "Final iteration number:", iteration - 1
print *, end_time - start_time

end program laplace_gpu_3

Our lab still had a Titan X, so I was able to run on CC61. Here are my times:

% ./lap1.out
 laplace_gpu_1
 Final iteration number:        10000
    99.03124
% ./lap2.out
 laplace_gpu_2
 Final iteration number:        10000
    104.3110
% ./lap3.out
 laplace_gpu_3
 Final iteration number:        10000
    104.2141

So a slight slow-down, but nothing like what you’re seeing.

Also, given the roughly 60-second difference between your laplace1 time and mine, my guess is that something else is going on, like extra data movement, or that something is off with your system.

If you haven’t used it before, this might be a good time to try Nsight Systems (nsys) to profile your code. It can help you determine whether the difference is really due to the kernels, data movement, or something else.

Here’s my output from “nsys profile --stats=true” for Laplace1 and Laplace3:

Laplace1:

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)            Name
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  -------------------------
     54.6   53,725,219,034     10,001  5,371,984.7  5,377,453.0  4,846,454  5,631,641     82,102.3  laplace_gpu_1_28_gpu
     45.3   44,606,507,711     10,001  4,460,204.8  4,458,754.0  4,345,696  4,595,696     34,767.6  laplace_gpu_1_36_gpu
      0.1      118,561,544     10,001     11,855.0     11,841.0     11,329     14,305        186.8  laplace_gpu_1_36_gpu__red

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count     Avg (ns)      Med (ns)     Min (ns)    Max (ns)    StdDev (ns)           Operation
 --------  ---------------  ------  ------------  ------------  ----------  -----------  -----------  ----------------------------
     67.0      197,830,284  10,002      19,779.1         736.0         704  190,361,102  1,903,413.2  [CUDA memcpy Device-to-Host]
     30.7       90,737,123       1  90,737,123.0  90,737,123.0  90,737,123   90,737,123          0.0  [CUDA memcpy Host-to-Device]
      2.3        6,690,084  10,001         668.9         640.0         608        1,248         80.3  [CUDA memset]

Laplace3:

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)            Name
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  -------------------------
     51.8   53,750,483,592     10,001  5,374,510.9  5,379,874.0  5,023,730  5,647,692     82,039.3  laplace_gpu_3_28_gpu
     48.1   49,990,643,443     10,001  4,998,564.5  4,996,307.0  4,839,245  5,194,204     45,244.9  laplace_gpu_3_36_gpu
      0.1      118,323,460     10,001     11,831.2     11,840.0     11,296     14,241        175.8  laplace_gpu_3_36_gpu__red

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count     Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)           Operation
 --------  ---------------  ------  -------------  -------------  -----------  -----------  -----------  ----------------------------
     64.1      197,284,687  10,002       19,724.5          736.0          704  189,812,916  1,897,931.9  [CUDA memcpy Device-to-Host]
     33.7      103,669,671       1  103,669,671.0  103,669,671.0  103,669,671  103,669,671          0.0  [CUDA memcpy Host-to-Device]
      2.2        6,711,952  10,001          671.1          640.0          608        1,280         86.6  [CUDA memset]

Thanks for this info! I thought my TitanXP had become a cool paperweight with the update, but now I can use it again!