sum reduction

Hi all,

I wrote a simple program to perform reduction on the GPU. I was very supersize to see that my implementation is about 4.5 - 15 times slower than intrinsic sum (see comparison). I did not expect to get maximum throughput but this poor performance I cannot explain. I’ve tried to use every trick in the book (that I know of) to make it as fast as possible. There is nothing obvious bottle-necking my kernel (or is it?).

It looks to me like either I’m doing something extremely inefficient or sum does something really, really clever. Any thoughts? Enclosed is my subroutine. Compilation details:

  • version: 17.4-0 64-bit
  • options: -fast -Mcuda=cuda8.0,cc3+

Thank you
Bart

!  Kernel: reduction of X (stored either in X or S)		
	attributes(global) subroutine reduced(X,S)
	integer :: i, idx, tx
	real(wp), shared :: cache(256)        
	real(wp), intent(inout), device :: X(:) 
	real(wp), optional, intent(inout), device :: S(:) 
!
	      idx = threadIdx%x + blockDim%x * (blockIdx%x-1) 
	       tx = threadIdx%x 			 	 		
!
!	      load from global to shared memory
	      cache(tx) = 0.0_wp
	      if (idx<=size(X)) cache(tx) = X(idx)
	      call syncthreads()  
!
	      if (tx<=128) cache(tx) = cache(tx) + cache(tx+128) 
	      call syncthreads()
	      if  (tx<=64) cache(tx) = cache(tx) + cache(tx+64) 
	      call syncthreads()
!
	      if (tx<=32) then 
		    cache(tx) = cache(tx) + cache(tx+32) 	 
		    cache(tx) = cache(tx) + __shfl_down(cache(tx),16)
		    cache(tx) = cache(tx) + __shfl_down(cache(tx),8)
		    cache(tx) = cache(tx) + __shfl_down(cache(tx),4)
		    cache(tx) = cache(tx) + __shfl_down(cache(tx),2)
		    cache(tx) = cache(tx) + __shfl_down(cache(tx),1)
	      end if	
!
!	   copy back to global memory 
	      if (present(S)) then
	            if (tx==1) S(blockIdx%x) = cache(1)
	      else
	            if (tx==1) X(blockIdx%x) = cache(1) ! overrides X 
	      end if
	return
	end subroutine reduced
!
!    My sum
	real(wp) function my_sum(X)
	integer :: m, istat
	integer :: tBlock, grid
	real(wp), allocatable, device :: S(:)
	real(wp), intent(inout), device :: X(:)	
!
	      tBlock = 256
	      	grid = ceiling(real(size(X))/tBlock)

!	  first reduction
	      allocate(S(grid))
	      call reduced<<<grid,tBlock>>>(X,S)
!
!        reduce further 
	      do while (grid>1)
		    m = grid
		    grid = ceiling(real(grid)/tBlock)
	      	    call reduced<<<grid,tBlock>>>(S(1:m))
	      end do
!
	      my_sum = S(1)
	      deallocate(S)	      
	return
	end function my_sum

Yes, really, really clever :)

We launch only one kernel, each thread can sum more than one element, then we do the final sum of the partial sums on the host.
It is not much slower to move 1024 elements back than to move 1 element back.

Hi brentl,

thank you for your insight! That indeed is really clever.
I’ve tried to write something like that (see my new code below)

  • multiple reductions per thread,
  • only one kernel invocation,
  • no shared memory (only register)
  • fast instruction (shuffel)

I still cannot close the gap. What am I missing here?

Thank you
Bart

!       Device: reduce a warp of size 32
	attributes(device) subroutine warp_reduce(psum)
	real(wp), intent(inout) :: psum
!	
	      psum = psum + __shfl_down(psum,16)
	      psum = psum + __shfl_down(psum,8)
	      psum = psum + __shfl_down(psum,4)
	      psum = psum + __shfl_down(psum,2)
	      psum = psum + __shfl_down(psum,1)
	return
	end subroutine warp_reduce
!
!       Device: reduce a block using warps of size 32
	attributes(device) subroutine block_reduce(psum)
	integer :: tx, wx, lx, d
	real(wp), shared :: cache(*)
	real(wp), intent(inout) :: psum
!	
	      tx = threadIdx%x          ! thread index
	      wx = iand(tx,warpSize-1)  ! warp index
	      lx = rshift(tx,5)+1       ! warp lane
	       d = rshift(blockDim%x,5) ! cache size	
!    
! 	      reduce all warps
	      call warp_reduce(psum)
	      if (wx==1) cache(lx) = psum
	      call syncthreads()
!
!  	      reduce final warp
	      if (tx<=d) then
		    psum = cache(tx)		     
		    call warp_reduce(psum)
	      end if
	return
	end subroutine block_reduce
!
!       Kernel: reduction (partial sums stored in S)		
	attributes(global) subroutine reduce_0(X,S,N)
	real(wp) :: psum
	integer, intent(in) :: N
	integer :: i, tx, idx, block, grid  
	real(wp), intent(in), device :: X(:)
	real(wp), intent(inout), device :: S(:)    
! 		
		block = blockDim%x
	         grid = block * 2 * gridDim%x
	           tx = threadIdx%x	 			
	            i = tx + block * 2 * (blockIdx%x-1)  	 		
!
!	      load from global to shared memory partially reducing
	      psum = 0.0_wp
	      do while (i<=N)
		    psum = psum + X(i)
		    idx = i + block
		    if (idx<=N) psum = psum + X(idx)
		    i = i + grid
	      end do	
!
	      call block_reduce(psum)
	      if (tx==1) S(blockIdx%x) = psum
	return
	end subroutine reduce_0

That is close to what we do. Make sure to compile with nordc to minimize the function call overhead (or inline the functions).

Here is our basic structure:

! Use a two-stage shfl
! reduction (up to 32*32 = 1024 threads per block)
! *** blockDim%x must be a power of 2 and >= 32 ***

attributes(global) subroutine partialSumShflShflR4(p, a, n)
implicit none
real, intent(out) :: p(*)
real, intent(in) :: a(n)
integer, intent(in), value :: n
real, shared :: p_s(32)
real :: sloc, qloc
integer :: i, ig, laneID, i2, iGrid
integer :: warpId, width

iGrid = gridDim%x*blockDim%x
sloc = 0.0
ig = (blockIdx%x-1)blockDim%x + threadIdx%x
! Loop over all elements
do i = ig, n, iGrid
sloc = sloc + a(i)
end do
! reduce within warp
qloc = __shfl_xor(sloc, 1)
sloc = sloc + qloc
qloc = __shfl_xor(sloc, 2)
sloc = sloc + qloc
qloc = __shfl_xor(sloc, 4)
sloc = sloc + qloc
qloc = __shfl_xor(sloc, 8)
sloc = sloc + qloc
qloc = __shfl_xor(sloc, 16)
sloc = sloc + qloc
if (BlockDim%x == 32) then
if (threadIdx%x==1) p(blockIdx%x) = sloc
else
! one thread in each warp writes to p_s(1:32), then
! reduce in first warp
warpID = (threadIdx%x-1)/32+1
laneID = iand(threadIdx%x,31)
if (laneID == 1) p_s(warpID) = sloc
call syncthreads()
! now reduce in first warp
if (warpID == 1) then
width = blockDim%x/32
sloc = p_s(threadIdx%x)
i = 1
do while (i < width)
qloc = __shfl_xor(sloc, i, width)
sloc = sloc + qloc
i = i
2
end do
if (threadIdx%x == 1) p(blockIdx%x) = sloc
end if
end if
end subroutine partialSumShflShflR4

Also note, there is only 1 call to syncthreads