I wrote a simple program to perform reduction on the GPU. I was very supersize to see that my implementation is about 4.5 - 15 times slower than intrinsic sum (see comparison). I did not expect to get maximum throughput but this poor performance I cannot explain. I’ve tried to use every trick in the book (that I know of) to make it as fast as possible. There is nothing obvious bottle-necking my kernel (or is it?).
It looks to me like either I’m doing something extremely inefficient or sum does something really, really clever. Any thoughts? Enclosed is my subroutine. Compilation details:
version: 17.4-0 64-bit
options: -fast -Mcuda=cuda8.0,cc3+
Thank you
Bart
! Kernel: reduction of X (stored either in X or S)
attributes(global) subroutine reduced(X,S)
integer :: i, idx, tx
real(wp), shared :: cache(256)
real(wp), intent(inout), device :: X(:)
real(wp), optional, intent(inout), device :: S(:)
!
idx = threadIdx%x + blockDim%x * (blockIdx%x-1)
tx = threadIdx%x
!
! load from global to shared memory
cache(tx) = 0.0_wp
if (idx<=size(X)) cache(tx) = X(idx)
call syncthreads()
!
if (tx<=128) cache(tx) = cache(tx) + cache(tx+128)
call syncthreads()
if (tx<=64) cache(tx) = cache(tx) + cache(tx+64)
call syncthreads()
!
if (tx<=32) then
cache(tx) = cache(tx) + cache(tx+32)
cache(tx) = cache(tx) + __shfl_down(cache(tx),16)
cache(tx) = cache(tx) + __shfl_down(cache(tx),8)
cache(tx) = cache(tx) + __shfl_down(cache(tx),4)
cache(tx) = cache(tx) + __shfl_down(cache(tx),2)
cache(tx) = cache(tx) + __shfl_down(cache(tx),1)
end if
!
! copy back to global memory
if (present(S)) then
if (tx==1) S(blockIdx%x) = cache(1)
else
if (tx==1) X(blockIdx%x) = cache(1) ! overrides X
end if
return
end subroutine reduced
!
! My sum
real(wp) function my_sum(X)
integer :: m, istat
integer :: tBlock, grid
real(wp), allocatable, device :: S(:)
real(wp), intent(inout), device :: X(:)
!
tBlock = 256
grid = ceiling(real(size(X))/tBlock)
! first reduction
allocate(S(grid))
call reduced<<<grid,tBlock>>>(X,S)
!
! reduce further
do while (grid>1)
m = grid
grid = ceiling(real(grid)/tBlock)
call reduced<<<grid,tBlock>>>(S(1:m))
end do
!
my_sum = S(1)
deallocate(S)
return
end function my_sum
We launch only one kernel, each thread can sum more than one element, then we do the final sum of the partial sums on the host.
It is not much slower to move 1024 elements back than to move 1 element back.
That is close to what we do. Make sure to compile with nordc to minimize the function call overhead (or inline the functions).
Here is our basic structure:
! Use a two-stage shfl
! reduction (up to 32*32 = 1024 threads per block)
! *** blockDim%x must be a power of 2 and >= 32 ***
attributes(global) subroutine partialSumShflShflR4(p, a, n)
implicit none
real, intent(out) :: p(*)
real, intent(in) :: a(n)
integer, intent(in), value :: n
real, shared :: p_s(32)
real :: sloc, qloc
integer :: i, ig, laneID, i2, iGrid
integer :: warpId, width
iGrid = gridDim%x*blockDim%x
sloc = 0.0
ig = (blockIdx%x-1)blockDim%x + threadIdx%x
! Loop over all elements
do i = ig, n, iGrid
sloc = sloc + a(i)
end do
! reduce within warp
qloc = __shfl_xor(sloc, 1)
sloc = sloc + qloc
qloc = __shfl_xor(sloc, 2)
sloc = sloc + qloc
qloc = __shfl_xor(sloc, 4)
sloc = sloc + qloc
qloc = __shfl_xor(sloc, 8)
sloc = sloc + qloc
qloc = __shfl_xor(sloc, 16)
sloc = sloc + qloc
if (BlockDim%x == 32) then
if (threadIdx%x==1) p(blockIdx%x) = sloc
else
! one thread in each warp writes to p_s(1:32), then
! reduce in first warp
warpID = (threadIdx%x-1)/32+1
laneID = iand(threadIdx%x,31)
if (laneID == 1) p_s(warpID) = sloc
call syncthreads()
! now reduce in first warp
if (warpID == 1) then
width = blockDim%x/32
sloc = p_s(threadIdx%x)
i = 1
do while (i < width)
qloc = __shfl_xor(sloc, i, width)
sloc = sloc + qloc
i = i2
end do
if (threadIdx%x == 1) p(blockIdx%x) = sloc
end if
end if
end subroutine partialSumShflShflR4