Hi Pedro,
"I think this approach is certainly not good, because I get an Out of memory error for quite small values of n."
Not surprising. For array reductions, each thread needs its own private copy of the array in order to perform the partial reduction, so they can take up quite a bit of memory. Plus, there is a high amount of overhead needed to perform the final reduction over all of the partial copies. In most cases, array reductions are not the best choice.
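To get a feel for the scaling, here is a rough back-of-the-envelope sketch. It is illustrative only: it assumes one private copy of p1d per gang with a vector length of 128 over the collapsed n**3 iteration space, and the actual scheme the compiler uses is implementation-dependent.

program reduction_cost_estimate
  implicit none
  integer :: n
  real(8) :: gangs, bytes
  do n = 150, 600, 150
    ! assume one gang per 128 collapsed iterations, each holding its own copy of p1d(n)
    gangs = ceiling(real(n,8)**3 / 128.d0)
    bytes = gangs * real(n,8) * 4.d0   ! 4-byte reals per element
    print '(a,i4,a,f10.1,a)', 'n = ', n, ' ->', bytes/1024.d0**2, ' MB of partial-reduction storage'
  end do
end

Under that assumption the temporary storage grows roughly as n**4, which is why an out-of-memory error can show up at fairly modest values of n.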
"Is there a good alternative here that can leverage array reductions so I do not need to define a scalar variable to compute the average?"
Atomics. They do have overhead as well, but it's not too bad if there aren't a lot of collisions. Though, performance-wise, the inner-loop scalar reduction is the best choice for this example.
To see what the performance impact is, I wrote an example we can use to compare 4 different schedules:
Case 1: Array Reduction
Case 2: Atomic update
Case 3: Atomic update with loop interchange to lessen the number of collisions
Case 4: Inner loop scalar reduction
Then I use Nsight Systems to see the performance of each kernel:
% cat test.F90
program reduction_acc
  implicit none
  integer, parameter :: n = 150
  real :: p1d_s
  real, allocatable :: p1d(:), p(:,:,:)
  integer :: i, j, k
  allocate(p1d(n))
  allocate(p(n,n,n))
  !$acc enter data create(p1d,p)
  p1d(:) = 0.
  p(:,:,:) = 1.
  !$acc update device(p1d,p)
  ! CASE 1: Array Reduction
  !$acc parallel loop collapse(3) default(present) reduction(+:p1d)
  do k=1,n
    do j=1,n
      do i=1,n
        p1d(k) = p1d(k) + p(i,j,k)
      end do
    end do
  end do
  !$acc update self(p1d)
  print*,p1d(10)
  p1d(:) = 0.
  p(:,:,:) = 1.
  !$acc update device(p1d,p)
  ! CASE 2: Atomic
  !$acc parallel loop collapse(3) default(present)
  do k=1,n
    do j=1,n
      do i=1,n
        !$acc atomic update
        p1d(k) = p1d(k) + p(i,j,k)
      end do
    end do
  end do
  !$acc update self(p1d)
  print*,p1d(10)
  p1d(:) = 0.
  p(:,:,:) = 1.
  !$acc update device(p1d,p)
  ! CASE 3: Atomic with loop interchange
  !$acc parallel loop collapse(3) default(present)
  do j=1,n
    do i=1,n
      do k=1,n
        !$acc atomic update
        p1d(k) = p1d(k) + p(i,j,k)
      end do
    end do
  end do
  !$acc update self(p1d)
  print*,p1d(10)
  p1d(:) = 0.
  p(:,:,:) = 1.
  !$acc update device(p1d,p)
  ! CASE 4: Scalar Reduction
  !$acc parallel loop gang default(present)
  do k=1,n
    p1d_s = 0.
    !$acc loop collapse(2) reduction(+:p1d_s)
    do j=1,n
      do i=1,n
        p1d_s = p1d_s + p(i,j,k)
      end do
    end do
    p1d(k) = p1d_s
  end do
  !$acc update self(p1d)
  print*,p1d(10)
  !$acc exit data delete(p1d,p)
end
% nvfortran -acc test.F90 -Minfo=accel -fast -V22.7
reduction_acc:
9, Generating enter data create(p(:,:,:),p1d(:))
12, Generating update device(p1d(:),p(:,:,:))
15, Generating copy(p1d(:)) [if not already present]
Generating NVIDIA GPU code
16, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
Generating reduction(+:p1d(:))
17, ! blockidx%x threadidx%x collapsed
18, ! blockidx%x threadidx%x collapsed
15, Generating default present(p(1:150,1:150,1:150))
23, Generating update self(p1d(:))
29, Generating update device(p(:,:,:),p1d(:))
32, Generating NVIDIA GPU code
33, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
34, ! blockidx%x threadidx%x collapsed
35, ! blockidx%x threadidx%x collapsed
32, Generating default present(p1d(:),p(1:150,1:150,1:150))
41, Generating update self(p1d(:))
45, Generating update device(p1d(:),p(:,:,:))
48, Generating NVIDIA GPU code
49, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
50, ! blockidx%x threadidx%x collapsed
51, ! blockidx%x threadidx%x collapsed
48, Generating default present(p1d(:),p(1:150,1:150,1:150))
57, Generating update self(p1d(:))
62, Generating update device(p1d(:),p(:,:,:))
65, Generating NVIDIA GPU code
66, !$acc loop gang ! blockidx%x
69, !$acc loop vector(128) collapse(2) ! threadidx%x
Generating reduction(+:p1d_s)
70, ! threadidx%x collapsed
65, Generating default present(p1d(1:150),p(1:150,1:150,1:150))
66, Generating implicit private(p1d_s)
69, Loop is parallelizable
70, Loop is parallelizable
76, Generating update self(p1d(:))
79, Generating exit data delete(p1d(:),p(:,:,:))
% nsys profile --stats=true a.out
22500.00
22500.00
22500.00
22500.00
Generating '/tmp/nsys-report-1bef.qdstrm'
[1/8] [========================100%] report9.nsys-rep
[2/8] [========================100%] report9.sqlite
[3/8] Executing 'nvtxsum' stats report
SKIPPED: /local/home/mcolgrove/report9.sqlite does not contain NV Tools Extension (NVTX) data.
[4/8] Executing 'osrtsum' stats report
Operating System Runtime API Statistics:
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ----------- -------- ----------- ------------ --------------
81.0 801,665,228 23 34,855,009.9 6,459,118.0 76,151 100,244,723 43,184,942.2 poll
17.7 175,173,497 1,216 144,057.2 37,752.5 1,226 35,705,356 1,253,597.2 ioctl
0.4 3,957,508 53 74,670.0 5,003.0 1,193 3,621,343 496,628.7 fopen
0.4 3,729,234 91 40,980.6 27,832.0 5,348 1,701,513 176,650.2 mmap64
0.3 2,925,930 14 208,995.0 162,444.0 68,396 1,039,874 243,294.2 sem_timedwait
0.1 973,346 134 7,263.8 6,501.5 2,136 24,531 3,135.0 open64
0.1 761,445 4 190,361.3 191,960.5 183,531 193,993 4,668.5 pthread_create
0.0 235,333 4 58,833.3 53,884.0 48,952 78,613 13,801.2 fgets
0.0 188,950 20 9,447.5 6,917.0 2,253 35,534 8,357.1 mmap
0.0 88,335 44 2,007.6 1,951.5 1,059 3,672 604.8 fclose
0.0 62,374 15 4,158.3 3,865.0 2,203 8,595 1,441.3 write
0.0 60,426 18 3,357.0 3,757.5 1,210 5,258 1,085.1 read
0.0 44,018 6 7,336.3 7,580.0 3,851 10,085 2,203.5 open
0.0 30,873 8 3,859.1 4,133.5 2,067 5,367 1,213.8 munmap
0.0 29,104 5 5,820.8 5,490.0 4,854 7,948 1,241.5 fwrite
0.0 12,374 10 1,237.4 1,203.5 1,109 1,480 115.5 fcntl
0.0 12,246 2 6,123.0 6,123.0 2,783 9,463 4,723.5 fread
0.0 10,948 2 5,474.0 5,474.0 3,819 7,129 2,340.5 socket
0.0 7,982 1 7,982.0 7,982.0 7,982 7,982 0.0 connect
0.0 7,931 1 7,931.0 7,931.0 7,931 7,931 0.0 pipe2
0.0 2,331 1 2,331.0 2,331.0 2,331 2,331 0.0 bind
0.0 1,140 1 1,140.0 1,140.0 1,140 1,140 0.0 listen
[5/8] Executing 'cudaapisum' stats report
CUDA API Statistics:
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ------------ ---------- ----------- ------------- --------------------
93.8 505,897,413 18 28,105,411.8 4,591.5 659 494,303,345 116,358,306.7 cuStreamSynchronize
5.2 27,769,758 1 27,769,758.0 27,769,758.0 27,769,758 27,769,758 0.0 cuMemHostAlloc
0.8 4,151,876 4 1,037,969.0 705,371.5 5,757 2,735,376 1,244,893.8 cuMemAlloc_v2
0.2 874,402 1 874,402.0 874,402.0 874,402 874,402 0.0 cuMemAllocHost_v2
0.0 134,166 1 134,166.0 134,166.0 134,166 134,166 0.0 cuModuleLoadDataEx
0.0 95,150 8 11,893.8 10,890.0 7,107 20,567 4,924.6 cuMemcpyHtoDAsync_v2
0.0 75,503 4 18,875.8 18,855.0 9,193 28,600 9,199.5 cuLaunchKernel
0.0 51,224 4 12,806.0 5,946.0 4,270 35,062 14,881.4 cuMemcpyDtoHAsync_v2
0.0 17,776 7 2,539.4 2,381.0 2,327 3,453 405.6 cuEventRecord
0.0 10,840 6 1,806.7 1,603.0 1,160 3,074 758.4 cuEventSynchronize
0.0 8,613 3 2,871.0 1,765.0 427 6,421 3,146.3 cuEventCreate
[6/8] Executing 'gpukernsum' stats report
CUDA Kernel Statistics:
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------- ------------- ----------- ----------- ----------- --------------------
98.6 494,294,125 1 494,294,125.0 494,294,125.0 494,294,125 494,294,125 0.0 reduction_acc_15_gpu
1.3 6,717,316 1 6,717,316.0 6,717,316.0 6,717,316 6,717,316 0.0 reduction_acc_32_gpu
0.1 423,388 1 423,388.0 423,388.0 423,388 423,388 0.0 reduction_acc_48_gpu
0.0 39,200 1 39,200.0 39,200.0 39,200 39,200 0.0 reduction_acc_65_gpu
[7/8] Executing 'gpumemtimesum' stats report
CUDA Memory Operation Statistics (by time):
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- ----- --------- --------- -------- --------- ----------- ------------------
99.9 4,383,961 8 547,995.1 547,483.5 1,120 1,098,262 584,516.4 [CUDA memcpy HtoD]
0.1 6,496 4 1,624.0 1,344.0 1,216 2,592 648.1 [CUDA memcpy DtoH]
[8/8] Executing 'gpumemsizesum' stats report
CUDA Memory Operation Statistics (by size):
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
---------- ----- -------- -------- -------- -------- ----------- ------------------
54.002 8 6.750 6.750 0.001 13.500 7.216 [CUDA memcpy HtoD]
0.002 4 0.001 0.001 0.001 0.001 0.000 [CUDA memcpy DtoH]
Profile Results:
Case                               Time (ns)
1: Array Reduction               494,294,125
2: Atomic                          6,717,316
3: Atomic with loop interchange      423,388
4: Scalar Reduction                   39,200
As you can see, the overhead of performing the final reduction over all the partial copies is very high. The atomics do much better, especially when we interchange the loops so that neighboring threads update different elements of p1d instead of all colliding on the same p1d(k), but they are still slower. The inner-loop scalar reduction is the best by far.
Of course, in your real code these results may not hold true, but they are things you should consider.
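If the end goal is the per-plane average from your snippet, the same Case 4 pattern applies directly; just divide the partial sum once per k. Here's a minimal sketch. I'm guessing at the array names (pavg, psum, p are placeholders) and assuming you're averaging p(:,:,k) over i and j, so adapt it to your actual code:

program mean_per_plane
  implicit none
  integer, parameter :: n = 150
  real :: psum
  real, allocatable :: pavg(:), p(:,:,:)   ! hypothetical names for illustration
  integer :: i, j, k
  allocate(pavg(n), p(n,n,n))
  p(:,:,:) = 1.
  !$acc enter data copyin(p) create(pavg)
  !$acc parallel loop gang default(present)
  do k = 1, n
    psum = 0.
    !$acc loop collapse(2) reduction(+:psum)
    do j = 1, n
      do i = 1, n
        psum = psum + p(i,j,k)
      end do
    end do
    pavg(k) = psum / real(n*n)   ! one divide per k gives the plane average
  end do
  !$acc update self(pavg)
  print*, pavg(10)               ! 1.0 for this test data
  !$acc exit data delete(pavg,p)
end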
Hope this helps,
Mat