Hello,
In an application that uses OpenACC to run on a GPU, I have a loop that requires a reduction on an array, and the reduction significantly degrades the speedup. Compared with sequential execution, the accelerated part is about ten times faster with the reduction, whereas it is about fifty-five times faster without it (on a Tesla V100).
Here is the code of the loop with the reduction:
!$ACC PARALLEL NUM_GANGS(80) VECTOR_LENGTH(128) PRESENT(xv(:,:),grid(:,:,:))
!$ACC LOOP REDUCTION(+:grid)
do ip=1,Ns
  ixr = xv(ip,1)/2**(n-k)
  ix  = ixr
  ixr = ixr - ix
  iyr = xv(ip,2)/2**(n-l)
  iy  = iyr
  iyr = iyr - iy
  izr = xv(ip,3)/2**(n-m)
  iz  = izr
  izr = izr - iz
  sx1 = (1._RP-ixr)*2**k
  sx2 = ixr        *2**k
  sy1 = (1._RP-iyr)*2**l
  sy2 = iyr        *2**l
  sz1 = (1._RP-izr)*2**m
  sz2 = izr        *2**m
  grid(ix  ,iy  ,iz  ) = grid(ix  ,iy  ,iz  ) + sx1*sy1*sz1*W
  grid(ix+1,iy  ,iz  ) = grid(ix+1,iy  ,iz  ) + sx2*sy1*sz1*W
  grid(ix  ,iy+1,iz  ) = grid(ix  ,iy+1,iz  ) + sx1*sy2*sz1*W
  grid(ix+1,iy+1,iz  ) = grid(ix+1,iy+1,iz  ) + sx2*sy2*sz1*W
  grid(ix  ,iy  ,iz+1) = grid(ix  ,iy  ,iz+1) + sx1*sy1*sz2*W
  grid(ix+1,iy  ,iz+1) = grid(ix+1,iy  ,iz+1) + sx2*sy1*sz2*W
  grid(ix  ,iy+1,iz+1) = grid(ix  ,iy+1,iz+1) + sx1*sy2*sz2*W
  grid(ix+1,iy+1,iz+1) = grid(ix+1,iy+1,iz+1) + sx2*sy2*sz2*W
end do
!$ACC END LOOP
!$ACC END PARALLEL
The number of gangs is set to the number of streaming multiprocessors of the GPU (80) because that gave the best performance in my case.
The code without the reduction is:
!$ACC PARALLEL PRESENT(xv(:,:),grid(:,:,:)) PRIVATE(grid(:,:,:))
!$ACC LOOP
do ip=1,Ns
  ixr = xv(ip,1)/2**(n-k)
  ix  = ixr
  ixr = ixr - ix
  iyr = xv(ip,2)/2**(n-l)
  iy  = iyr
  iyr = iyr - iy
  izr = xv(ip,3)/2**(n-m)
  iz  = izr
  izr = izr - iz
  sx1 = (1._RP-ixr)*2**k
  sx2 = ixr        *2**k
  sy1 = (1._RP-iyr)*2**l
  sy2 = iyr        *2**l
  sz1 = (1._RP-izr)*2**m
  sz2 = izr        *2**m
  grid(ix  ,iy  ,iz  ) = grid(ix  ,iy  ,iz  ) + sx1*sy1*sz1*W
  grid(ix+1,iy  ,iz  ) = grid(ix+1,iy  ,iz  ) + sx2*sy1*sz1*W
  grid(ix  ,iy+1,iz  ) = grid(ix  ,iy+1,iz  ) + sx1*sy2*sz1*W
  grid(ix+1,iy+1,iz  ) = grid(ix+1,iy+1,iz  ) + sx2*sy2*sz1*W
  grid(ix  ,iy  ,iz+1) = grid(ix  ,iy  ,iz+1) + sx1*sy1*sz2*W
  grid(ix+1,iy  ,iz+1) = grid(ix+1,iy  ,iz+1) + sx2*sy1*sz2*W
  grid(ix  ,iy+1,iz+1) = grid(ix  ,iy+1,iz+1) + sx1*sy2*sz2*W
  grid(ix+1,iy+1,iz+1) = grid(ix+1,iy+1,iz+1) + sx2*sy2*sz2*W
end do
!$ACC END LOOP
!$ACC END PARALLEL
In the latter case, setting the number of gangs to 80 or leaving it unspecified gives approximately the same speedup (about five times faster than with the reduction).
I have heard that array reductions in OpenACC 2.7 can hurt performance. Is a degradation of this size to be expected whenever an array reduction is used? Or can a better speedup than the one achieved with the reduction be obtained?
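For comparison, a commonly used alternative to an array reduction for this kind of scatter-add is an atomic update on each grid element. The program below is only a sketch under assumptions and has not been measured against the two versions above. The values of Ns, nc and W, the random particle positions, and the COPYIN/COPY data clauses are all illustrative, and the 2**(n-k) and 2**k grid-level scalings are left out to keep it short.

```fortran
program deposit_atomic
  implicit none
  integer, parameter :: RP = kind(1.0d0)
  integer, parameter :: Ns = 100000   ! number of particles (illustrative value)
  integer, parameter :: nc = 16       ! cells per dimension (illustrative value)
  real(RP), parameter :: W = 1.0_RP   ! weight per particle (illustrative value)
  real(RP), allocatable :: xv(:,:), grid(:,:,:)
  real(RP) :: ixr, iyr, izr, sx1, sx2, sy1, sy2, sz1, sz2
  integer :: ip, ix, iy, iz

  allocate(xv(Ns,3), grid(0:nc,0:nc,0:nc))
  call random_number(xv)
  xv = xv*nc            ! positions in [0, nc), so ix+1 <= nc stays in bounds
  grid = 0.0_RP

  ! grid is neither reduced nor private here: it is shared by all threads,
  ! and each read-modify-write on it is protected by an atomic update.
  !$ACC PARALLEL LOOP COPYIN(xv) COPY(grid)
  do ip = 1, Ns
    ! integer cell index and fractional offset in each direction
    ixr = xv(ip,1); ix = int(ixr); ixr = ixr - ix
    iyr = xv(ip,2); iy = int(iyr); iyr = iyr - iy
    izr = xv(ip,3); iz = int(izr); izr = izr - iz
    ! linear weights for the lower and upper cell in each direction
    sx1 = 1._RP - ixr; sx2 = ixr
    sy1 = 1._RP - iyr; sy2 = iyr
    sz1 = 1._RP - izr; sz2 = izr
    !$ACC ATOMIC UPDATE
    grid(ix  ,iy  ,iz  ) = grid(ix  ,iy  ,iz  ) + sx1*sy1*sz1*W
    !$ACC ATOMIC UPDATE
    grid(ix+1,iy  ,iz  ) = grid(ix+1,iy  ,iz  ) + sx2*sy1*sz1*W
    !$ACC ATOMIC UPDATE
    grid(ix  ,iy+1,iz  ) = grid(ix  ,iy+1,iz  ) + sx1*sy2*sz1*W
    !$ACC ATOMIC UPDATE
    grid(ix+1,iy+1,iz  ) = grid(ix+1,iy+1,iz  ) + sx2*sy2*sz1*W
    !$ACC ATOMIC UPDATE
    grid(ix  ,iy  ,iz+1) = grid(ix  ,iy  ,iz+1) + sx1*sy1*sz2*W
    !$ACC ATOMIC UPDATE
    grid(ix+1,iy  ,iz+1) = grid(ix+1,iy  ,iz+1) + sx2*sy1*sz2*W
    !$ACC ATOMIC UPDATE
    grid(ix  ,iy+1,iz+1) = grid(ix  ,iy+1,iz+1) + sx1*sy2*sz2*W
    !$ACC ATOMIC UPDATE
    grid(ix+1,iy+1,iz+1) = grid(ix+1,iy+1,iz+1) + sx2*sy2*sz2*W
  end do

  ! The eight weights of a particle sum to W, so the two numbers printed
  ! here should agree up to floating-point rounding.
  print *, 'deposited:', sum(grid), ' expected:', Ns*W
end program deposit_atomic
```

With the NVIDIA compiler this could be built with nvfortran -acc. Without OpenACC enabled the directives are treated as comments and the program runs sequentially, which gives a simple correctness check. Whether atomics outperform the array reduction is likely to depend on how often different particles hit the same grid cell.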
Thanks,
Clément.