Significant deterioration of performance with array reduction in OpenACC

Hello,

In an application using OpenACC for GPU acceleration, I have a loop that requires a reduction on an array, and the reduction significantly degrades the speed-up. Compared to sequential execution, the accelerated part is about ten times faster with the reduction, whereas it is about fifty-five times faster without it (on a Tesla V100).
Here is the code of the loop with the reduction:

!$ACC PARALLEL NUM_GANGS(80) VECTOR_LENGTH(128) PRESENT(xv(:,:),grid(:,:,:))
!$ACC LOOP REDUCTION(+:grid)
do ip=1,Ns
   ixr = xv(ip,1)/2**(n-k)
   ix  = ixr
   ixr = ixr - ix
   iyr = xv(ip,2)/2**(n-l)
   iy  = iyr
   iyr = iyr - iy
   izr = xv(ip,3)/2**(n-m)
   iz  = izr
   izr = izr - iz
   sx1 = (1._RP-ixr)*2**k
   sx2 = ixr*2**k
   sy1 = (1._RP-iyr)*2**l
   sy2 = iyr*2**l
   sz1 = (1._RP-izr)*2**m
   sz2 = izr*2**m

   grid(ix  ,iy  ,iz  ) = grid(ix  ,iy  ,iz  ) + sx1*sy1*sz1*W
   grid(ix+1,iy  ,iz  ) = grid(ix+1,iy  ,iz  ) + sx2*sy1*sz1*W
   grid(ix  ,iy+1,iz  ) = grid(ix  ,iy+1,iz  ) + sx1*sy2*sz1*W
   grid(ix+1,iy+1,iz  ) = grid(ix+1,iy+1,iz  ) + sx2*sy2*sz1*W
   grid(ix  ,iy  ,iz+1) = grid(ix  ,iy  ,iz+1) + sx1*sy1*sz2*W
   grid(ix+1,iy  ,iz+1) = grid(ix+1,iy  ,iz+1) + sx2*sy1*sz2*W
   grid(ix  ,iy+1,iz+1) = grid(ix  ,iy+1,iz+1) + sx1*sy2*sz2*W
   grid(ix+1,iy+1,iz+1) = grid(ix+1,iy+1,iz+1) + sx2*sy2*sz2*W

end do
!$ACC END LOOP
!$ACC END PARALLEL

The number of gangs is set to the number of streaming multiprocessors on the GPU because that provided the best performance in my case. The code without the reduction is:

!$ACC PARALLEL PRESENT(xv(:,:),grid(:,:,:)) PRIVATE(grid(:,:,:))
!$ACC LOOP
do ip=1,Ns
   ixr = xv(ip,1)/2**(n-k)
   ix  = ixr
   ixr = ixr - ix
   iyr = xv(ip,2)/2**(n-l)
   iy  = iyr
   iyr = iyr - iy
   izr = xv(ip,3)/2**(n-m)
   iz  = izr
   izr = izr - iz
   sx1 = (1._RP-ixr)*2**k
   sx2 = ixr*2**k
   sy1 = (1._RP-iyr)*2**l
   sy2 = iyr*2**l
   sz1 = (1._RP-izr)*2**m
   sz2 = izr*2**m

   grid(ix  ,iy  ,iz  ) = grid(ix  ,iy  ,iz  ) + sx1*sy1*sz1*W
   grid(ix+1,iy  ,iz  ) = grid(ix+1,iy  ,iz  ) + sx2*sy1*sz1*W
   grid(ix  ,iy+1,iz  ) = grid(ix  ,iy+1,iz  ) + sx1*sy2*sz1*W
   grid(ix+1,iy+1,iz  ) = grid(ix+1,iy+1,iz  ) + sx2*sy2*sz1*W
   grid(ix  ,iy  ,iz+1) = grid(ix  ,iy  ,iz+1) + sx1*sy1*sz2*W
   grid(ix+1,iy  ,iz+1) = grid(ix+1,iy  ,iz+1) + sx2*sy1*sz2*W
   grid(ix  ,iy+1,iz+1) = grid(ix  ,iy+1,iz+1) + sx1*sy2*sz2*W
   grid(ix+1,iy+1,iz+1) = grid(ix+1,iy+1,iz+1) + sx2*sy2*sz2*W

end do
!$ACC END LOOP
!$ACC END PARALLEL

In the latter case, the acceleration is approximately the same whether the number of gangs is set to 80 or left unspecified (about five times better than with the reduction).
I have heard that array reductions in OpenACC 2.7 can hurt performance. Is this amount of deterioration always to be expected when an array reduction occurs? Or is it possible to obtain a better acceleration than the one I currently get with the reduction?

Thanks,
Clément.

Since array reduction support is new, I personally don’t have much experience with it. However, the expectation is that array reductions do incur extra overhead that can adversely affect performance, especially in this case, since “grid” is a 3D array and every vector needs its own private copy in order to collect the partial reduction. The final reduction then has to be performed on each element of the array. My guess is that most of the extra time is spent in the final reduction kernel, which you can check with the Nsight Systems profiler or by setting the environment variable “NV_ACC_TIME=1”.

In this case, you may be better off using atomics instead, which have gotten much faster. I’m not certain, but it would be worth an experiment.
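For reference, here’s a rough sketch of what the atomic version could look like, keeping your loop structure and variable names (I haven’t compiled this exact snippet, so treat it as a starting point rather than a drop-in replacement):

!$ACC PARALLEL NUM_GANGS(80) VECTOR_LENGTH(128) PRESENT(xv(:,:),grid(:,:,:))
!$ACC LOOP
do ip=1,Ns
   ! ... same computation of ix, iy, iz and sx1..sz2 as in your loop ...
   !$ACC ATOMIC UPDATE
   grid(ix  ,iy  ,iz  ) = grid(ix  ,iy  ,iz  ) + sx1*sy1*sz1*W
   !$ACC ATOMIC UPDATE
   grid(ix+1,iy  ,iz  ) = grid(ix+1,iy  ,iz  ) + sx2*sy1*sz1*W
   !$ACC ATOMIC UPDATE
   grid(ix  ,iy+1,iz  ) = grid(ix  ,iy+1,iz  ) + sx1*sy2*sz1*W
   !$ACC ATOMIC UPDATE
   grid(ix+1,iy+1,iz  ) = grid(ix+1,iy+1,iz  ) + sx2*sy2*sz1*W
   !$ACC ATOMIC UPDATE
   grid(ix  ,iy  ,iz+1) = grid(ix  ,iy  ,iz+1) + sx1*sy1*sz2*W
   !$ACC ATOMIC UPDATE
   grid(ix+1,iy  ,iz+1) = grid(ix+1,iy  ,iz+1) + sx2*sy1*sz2*W
   !$ACC ATOMIC UPDATE
   grid(ix  ,iy+1,iz+1) = grid(ix  ,iy+1,iz+1) + sx1*sy2*sz2*W
   !$ACC ATOMIC UPDATE
   grid(ix+1,iy+1,iz+1) = grid(ix+1,iy+1,iz+1) + sx2*sy2*sz2*W
end do
!$ACC END PARALLEL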

-Mat

Hello Mat,

Thank you for your answer.

I have already used the environment variable “PGI_ACC_TIME=1” before, but I cannot find the time devoted to the reductions in the information printed at execution.

In addition, I tried using atomic instructions instead of the reduction clause, but the performance was approximately the same as with the reduction. However, by rewriting the code in another form I managed to achieve better performance.

Best regards,

Clément.

Look for the kernel names ending in “_red”. These are the auto-generated final reduction kernels. Though if the reduction is in a worker/vector loop within the compute region (i.e. not on the outermost loop), then the final reduction is added to the kernel itself rather than appearing as a separate kernel.

However, by rewriting the code in another form I managed to achieve better performance.

Good to hear. One thought is if you can provide a reproducing example of the poor performance of array reductions, I can ask our engineers to take a look and see if any improvements can be done. I’m not sure if they can, but might be worth a look.
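Something small and self-contained along these lines would be plenty (the sizes and variable names below are placeholders, not taken from your application):

program array_red_repro
   ! Minimal sketch: scatter Ns unit weights onto a 3D grid with an
   ! OpenACC array reduction. Sizes and names are illustrative only.
   implicit none
   integer, parameter :: RP = kind(1.d0)
   integer, parameter :: Ns = 10000000, nx = 64, ny = 64, nz = 64
   real(RP), allocatable :: grid(:,:,:), xv(:,:)
   integer :: ip, ix, iy, iz

   allocate(grid(nx,ny,nz), xv(Ns,3))
   call random_number(xv)
   grid = 0._RP

   !$ACC DATA COPYIN(xv) COPY(grid)
   !$ACC PARALLEL LOOP REDUCTION(+:grid)
   do ip = 1, Ns
      ix = 1 + int(xv(ip,1)*(nx-1))
      iy = 1 + int(xv(ip,2)*(ny-1))
      iz = 1 + int(xv(ip,3)*(nz-1))
      grid(ix,iy,iz) = grid(ix,iy,iz) + 1._RP
   end do
   !$ACC END DATA

   print *, 'sum(grid) =', sum(grid), ' expected:', real(Ns,RP)
end program array_red_repro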

These are the only results I get with the NV_ACC_TIME=1 environment variable:

without reduction:

project  NVIDIA  devicenum=0
    time(us): 352,665
    100: compute region reached 64 times
        100: kernel launched 64 times
            grid: [80]  block: [128]
             device time(us): total=352,665 max=107,294 min=2,761 avg=5,510
            elapsed time(us): total=354,147 max=107,350 min=2,783 avg=5,533
    100: data region reached 128 times

with reduction:

project  NVIDIA  devicenum=0
    time(us): 1,514,197
    30: compute region reached 64 times
        30: kernel launched 64 times
            grid: [80]  block: [128]
             device time(us): total=1,514,197 max=120,852 min=12,760 avg=23,659
            elapsed time(us): total=1,515,777 max=120,913 min=12,782 avg=23,684
    30: data region reached 128 times

Besides, when using atomics I encountered a few problems. At compilation, I get the following errors:

nvvmCompileProgram error 9: NVVM_ERROR_COMPILATION.
Error: /tmp/pgaccXsbNtPeT579B.gpu (1131, 25): parse '@__pgi_atomicAddd_llvm' defined with type 'double (i8*, double)*'
ptxas /tmp/pgaccbsbNdr4djZyC.ptx, line 1; fatal   : Missing .version directive at start of file '/tmp/pgaccbsbNdr4djZyC.ptx'
ptxas fatal   : Ptx assembly aborted due to errors
NVFORTRAN-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code

for the code lines:

!$ACC ATOMIC UPDATE 
grid(ix  ,iy  ,iz)    =  grid(ix  ,iy  ,iz)    + sx1*sy1*sz1*W
!$ACC ATOMIC UPDATE 
grid(ix+1,iy  ,iz)    =  grid(ix+1,iy  ,iz)    + sx2*sy1*sz1*W
...

with double-precision reals W, sx1, sy1, etc., and a segmentation fault at execution. I fixed the issue by using the internal compiler flag “-Mx,231,0x1” to revert to the older atomics. Is this a lasting solution, or should I expect the meaning of the internal flag to change?

I will see if I can produce a simple reproducing example of the problem.

Best regards,
Clément.

Which compiler version are you using? We’ve fixed all the reported issues with the new atomics, so if you’re using an older version, consider updating to our 22.3 release. If the error still occurs in 22.3, I can report the issue, provided you can supply a reproducing example.

I was using version 22.1 of the nvhpc compiler. Updating to the 22.3 release indeed fixes the atomics issue.

Thanks,

Clément Guillet.
