Array reductions with OpenACC to compute averages of field data

Dear all,

I was wondering if it is advisable to use array reductions to calculate, e.g., a 1D average of a 3D field, as in the Fortran example below:

program reduction_acc
  implicit none
  integer, parameter :: n = 150
  real, allocatable :: p1d(:),p(:,:,:)
  integer :: i,j,k
  allocate(p1d(n))
  allocate(p(n,n,n))
 !$acc enter data create(p1d,p)
 !$acc kernels default(present)
 p1d(:) = 0.
 p(:,:,:) = 1.
 !$acc end kernels
 !$acc parallel loop collapse(3) default(present) reduction(+:p1d)
 do k=1,n
   do j=1,n
     do i=1,n
       p1d(k) = p1d(k) + p(i,j,k)
     end do
   end do
 end do
 !$acc exit data copyout(p1d)
 print*,p1d(10)
end

I think this approach is certainly not good, because I get an out-of-memory error for quite small values of n.

My current approach avoids array reductions by using a private auxiliary scalar variable, as follows:

program reduction_acc
  implicit none
  integer, parameter :: n = 150
  real, allocatable :: p1d(:),p(:,:,:)
  real :: p1d_s
  integer :: i,j,k
  allocate(p1d(n))
  allocate(p(n,n,n))
  !$acc enter data create(p1d,p)
  !$acc kernels default(present)
  p1d(:) = 0.
  p(:,:,:) = 1.
  !$acc end kernels
  !$acc parallel loop gang default(present) private(p1d_s)
  do k=1,n
    p1d_s = 0.
    !$acc loop collapse(2) reduction(+:p1d_s)
    do j=1,n
      do i=1,n
        p1d_s = p1d_s + p(i,j,k)
      end do
    end do
    p1d(k) = p1d_s
  end do
  !$acc exit data copyout(p1d)
  print*,p1d(10)
end

Is there a good alternative here that can leverage array reductions so I do not need to define a scalar variable to compute the average?

Thanks in advance!
Pedro

Hi Pedro,

I think this approach is certainly not good, because I get an out-of-memory error for quite small values of n.

Not surprising. For a reduction, each thread needs its own private copy of the array in order to perform its partial reduction, so this can take up quite a bit of memory. Plus, there is a fair amount of overhead in performing the final reduction over all the partial results. In most cases, array reductions are not the best choice.

Is there a good alternative here that can leverage array reductions so I do not need to define a scalar variable to compute the average?

Atomics. They have some overhead as well, but it isn't too bad as long as there aren't many collisions. Performance-wise in this example, though, the inner-loop scalar reduction is the best choice.

To see what the performance impact is, I wrote an example we can use to compare four different schedules:

Case 1: Array Reduction
Case 2: Atomic update
Case 3: Atomic update with loop interchange to lessen the number of collisions
Case 4: Inner loop scalar reduction

Then I use Nsight Systems to see the performance of each kernel:

% cat test.F90
program reduction_acc
  implicit none
  integer, parameter :: n = 150
  real :: p1d_s
  real, allocatable :: p1d(:),p(:,:,:)
  integer :: i,j,k
  allocate(p1d(n))
  allocate(p(n,n,n))
 !$acc enter data create(p1d,p)
 p1d(:) = 0.
 p(:,:,:) = 1.
!$acc update device(p1d,p)

! CASE 1: Array Reduction
 !$acc parallel loop collapse(3) default(present) reduction(+:p1d)
 do k=1,n
   do j=1,n
     do i=1,n
       p1d(k) = p1d(k) + p(i,j,k)
     end do
   end do
 end do
 !$acc update self(p1d)
 print*,p1d(10)


 p1d(:) = 0.
 p(:,:,:) = 1.
!$acc update device(p1d,p)

! CASE 2: Atomic
 !$acc parallel loop collapse(3) default(present)
 do k=1,n
   do j=1,n
     do i=1,n
!$acc atomic update
       p1d(k) = p1d(k) + p(i,j,k)
     end do
   end do
 end do
 !$acc update self(p1d)
 print*,p1d(10)
 p1d(:) = 0.
 p(:,:,:) = 1.
!$acc update device(p1d,p)

! CASE 3: Atomic with loop interchange
 !$acc parallel loop collapse(3) default(present)
 do j=1,n
   do i=1,n
     do k=1,n
!$acc atomic update
       p1d(k) = p1d(k) + p(i,j,k)
     end do
   end do
 end do
 !$acc update self(p1d)
 print*,p1d(10)

 p1d(:) = 0.
 p(:,:,:) = 1.
!$acc update device(p1d,p)

! CASE 4: Scalar Reduction
  !$acc parallel loop gang default(present)
  do k=1,n
    p1d_s = 0.
    !$acc loop collapse(2) reduction(+:p1d_s)
    do j=1,n
      do i=1,n
        p1d_s = p1d_s + p(i,j,k)
      end do
    end do
    p1d(k) = p1d_s
  end do
 !$acc update self(p1d)
  print*,p1d(10)

 !$acc exit data delete(p1d,p)
end
% nvfortran -acc test.F90 -Minfo=accel -fast -V22.7
reduction_acc:
      9, Generating enter data create(p(:,:,:),p1d(:))
     12, Generating update device(p1d(:),p(:,:,:))
     15, Generating copy(p1d(:)) [if not already present]
         Generating NVIDIA GPU code
         16, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
             Generating reduction(+:p1d(:))
         17,   ! blockidx%x threadidx%x collapsed
         18,   ! blockidx%x threadidx%x collapsed
     15, Generating default present(p(1:150,1:150,1:150))
     23, Generating update self(p1d(:))
     29, Generating update device(p(:,:,:),p1d(:))
     32, Generating NVIDIA GPU code
         33, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
         34,   ! blockidx%x threadidx%x collapsed
         35,   ! blockidx%x threadidx%x collapsed
     32, Generating default present(p1d(:),p(1:150,1:150,1:150))
     41, Generating update self(p1d(:))
     45, Generating update device(p1d(:),p(:,:,:))
     48, Generating NVIDIA GPU code
         49, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
         50,   ! blockidx%x threadidx%x collapsed
         51,   ! blockidx%x threadidx%x collapsed
     48, Generating default present(p1d(:),p(1:150,1:150,1:150))
     57, Generating update self(p1d(:))
     62, Generating update device(p1d(:),p(:,:,:))
     65, Generating NVIDIA GPU code
         66, !$acc loop gang ! blockidx%x
         69, !$acc loop vector(128) collapse(2) ! threadidx%x
             Generating reduction(+:p1d_s)
         70,   ! threadidx%x collapsed
     65, Generating default present(p1d(1:150),p(1:150,1:150,1:150))
     66, Generating implicit private(p1d_s)
     69, Loop is parallelizable
     70, Loop is parallelizable
     76, Generating update self(p1d(:))
     79, Generating exit data delete(p1d(:),p(:,:,:))
% nsys profile --stats=true a.out
    22500.00
    22500.00
    22500.00
    22500.00
Generating '/tmp/nsys-report-1bef.qdstrm'
[1/8] [========================100%] report9.nsys-rep
[2/8] [========================100%] report9.sqlite
[3/8] Executing 'nvtxsum' stats report
SKIPPED: /local/home/mcolgrove/report9.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrtsum' stats report

Operating System Runtime API Statistics:

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)     Med (ns)    Min (ns)   Max (ns)    StdDev (ns)        Name
 --------  ---------------  ---------  ------------  -----------  --------  -----------  ------------  --------------
     81.0      801,665,228         23  34,855,009.9  6,459,118.0    76,151  100,244,723  43,184,942.2  poll
     17.7      175,173,497      1,216     144,057.2     37,752.5     1,226   35,705,356   1,253,597.2  ioctl
      0.4        3,957,508         53      74,670.0      5,003.0     1,193    3,621,343     496,628.7  fopen
      0.4        3,729,234         91      40,980.6     27,832.0     5,348    1,701,513     176,650.2  mmap64
      0.3        2,925,930         14     208,995.0    162,444.0    68,396    1,039,874     243,294.2  sem_timedwait
      0.1          973,346        134       7,263.8      6,501.5     2,136       24,531       3,135.0  open64
      0.1          761,445          4     190,361.3    191,960.5   183,531      193,993       4,668.5  pthread_create
      0.0          235,333          4      58,833.3     53,884.0    48,952       78,613      13,801.2  fgets
      0.0          188,950         20       9,447.5      6,917.0     2,253       35,534       8,357.1  mmap
      0.0           88,335         44       2,007.6      1,951.5     1,059        3,672         604.8  fclose
      0.0           62,374         15       4,158.3      3,865.0     2,203        8,595       1,441.3  write
      0.0           60,426         18       3,357.0      3,757.5     1,210        5,258       1,085.1  read
      0.0           44,018          6       7,336.3      7,580.0     3,851       10,085       2,203.5  open
      0.0           30,873          8       3,859.1      4,133.5     2,067        5,367       1,213.8  munmap
      0.0           29,104          5       5,820.8      5,490.0     4,854        7,948       1,241.5  fwrite
      0.0           12,374         10       1,237.4      1,203.5     1,109        1,480         115.5  fcntl
      0.0           12,246          2       6,123.0      6,123.0     2,783        9,463       4,723.5  fread
      0.0           10,948          2       5,474.0      5,474.0     3,819        7,129       2,340.5  socket
      0.0            7,982          1       7,982.0      7,982.0     7,982        7,982           0.0  connect
      0.0            7,931          1       7,931.0      7,931.0     7,931        7,931           0.0  pipe2
      0.0            2,331          1       2,331.0      2,331.0     2,331        2,331           0.0  bind
      0.0            1,140          1       1,140.0      1,140.0     1,140        1,140           0.0  listen

[5/8] Executing 'cudaapisum' stats report

CUDA API Statistics:

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)     Min (ns)    Max (ns)     StdDev (ns)           Name
 --------  ---------------  ---------  ------------  ------------  ----------  -----------  -------------  --------------------
     93.8      505,897,413         18  28,105,411.8       4,591.5         659  494,303,345  116,358,306.7  cuStreamSynchronize
      5.2       27,769,758          1  27,769,758.0  27,769,758.0  27,769,758   27,769,758            0.0  cuMemHostAlloc
      0.8        4,151,876          4   1,037,969.0     705,371.5       5,757    2,735,376    1,244,893.8  cuMemAlloc_v2
      0.2          874,402          1     874,402.0     874,402.0     874,402      874,402            0.0  cuMemAllocHost_v2
      0.0          134,166          1     134,166.0     134,166.0     134,166      134,166            0.0  cuModuleLoadDataEx
      0.0           95,150          8      11,893.8      10,890.0       7,107       20,567        4,924.6  cuMemcpyHtoDAsync_v2
      0.0           75,503          4      18,875.8      18,855.0       9,193       28,600        9,199.5  cuLaunchKernel
      0.0           51,224          4      12,806.0       5,946.0       4,270       35,062       14,881.4  cuMemcpyDtoHAsync_v2
      0.0           17,776          7       2,539.4       2,381.0       2,327        3,453          405.6  cuEventRecord
      0.0           10,840          6       1,806.7       1,603.0       1,160        3,074          758.4  cuEventSynchronize
      0.0            8,613          3       2,871.0       1,765.0         427        6,421        3,146.3  cuEventCreate

[6/8] Executing 'gpukernsum' stats report

CUDA Kernel Statistics:

 Time (%)  Total Time (ns)  Instances    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)          Name
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  -----------  --------------------
     98.6      494,294,125          1  494,294,125.0  494,294,125.0  494,294,125  494,294,125          0.0  reduction_acc_15_gpu
      1.3        6,717,316          1    6,717,316.0    6,717,316.0    6,717,316    6,717,316          0.0  reduction_acc_32_gpu
      0.1          423,388          1      423,388.0      423,388.0      423,388      423,388          0.0  reduction_acc_48_gpu
      0.0           39,200          1       39,200.0       39,200.0       39,200       39,200          0.0  reduction_acc_65_gpu

[7/8] Executing 'gpumemtimesum' stats report

CUDA Memory Operation Statistics (by time):

 Time (%)  Total Time (ns)  Count  Avg (ns)   Med (ns)   Min (ns)  Max (ns)   StdDev (ns)      Operation
 --------  ---------------  -----  ---------  ---------  --------  ---------  -----------  ------------------
     99.9        4,383,961      8  547,995.1  547,483.5     1,120  1,098,262    584,516.4  [CUDA memcpy HtoD]
      0.1            6,496      4    1,624.0    1,344.0     1,216      2,592        648.1  [CUDA memcpy DtoH]

[8/8] Executing 'gpumemsizesum' stats report

CUDA Memory Operation Statistics (by size):

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)      Operation
 ----------  -----  --------  --------  --------  --------  -----------  ------------------
     54.002      8     6.750     6.750     0.001    13.500        7.216  [CUDA memcpy HtoD]
      0.002      4     0.001     0.001     0.001     0.001        0.000  [CUDA memcpy DtoH]


Profile Results:

Case        Time in ns
     1     494,294,125
     2       6,717,316
     3         423,388
     4          39,200

As you can see, the overhead of performing the final array reduction is very high. The atomics do much better, especially after interchanging the loops, but are still costly. The inner-loop scalar reduction is the best by far.

Of course, in your real code these results may not hold true, but they are things you should consider.

Hope this helps,
Mat


Dear Mat,

Thank you very much for taking the time to make these nice examples!

Indeed, I have also used atomics; it is good to be aware of the possible performance penalty.

Pedro