Global Array Reduction

naiski · June 24, 2013, 2:43pm

I want to test the performance of global reduction with atomic operations, so I took the fastest reduction kernel provided by the SDK, replaced all instances of shared memory with a global array taken as a parameter, and replaced all instances of regular addition with atomic addition.
However, global synchronization becomes a problem and the kernel spits out garbage. I understand that one solution is to split the kernel into multiple invocations and do a global synchronization with cudaThreadSynchronize or cudaDeviceSynchronize between invocations, but I’m not sure exactly where I need to split the kernel.

Here is the code: template <unsigned int blockSize>__global__ void reduce4(int *g_odata, int *g_ - Pastebin.com

Surely I need to split the kernel at line 12 after the first layer of reduction is performed but would I need to split at every __syncthreads() point after that? Before warp reduction as well?

seibert · June 24, 2013, 3:25pm

This doesn’t answer your general questions, but I wanted to quickly note:

If you split the kernel into different invocations, you DO NOT need to call cudaThreadSynchronize or cudaDeviceSynchronize between invocations. All kernels submitted to the same stream (and if you don’t specify a stream, the kernel runs on stream 0) run sequentially with a global memory barrier between them. Pretty much the only time you need to call one of the synchronization functions explicitly on the host is for debugging purposes, or if you need to wait until an asynchronous memory copy finishes.
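
To illustrate the point about stream ordering, here is a minimal sketch of a reduction split into two kernel launches on the default stream. It is not the SDK kernel from the question; the names partialSums, finalSum, and reduceOnDevice are made up for this example, and error checking is omitted for brevity. The second launch only starts after the first has finished, and the blocking cudaMemcpy at the end waits for both.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread atomically adds its input element into
// the global-memory slot that belongs to its block.
__global__ void partialSums(const int *in, int *partial, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&partial[blockIdx.x], in[i]);
}

// Hypothetical kernel: folds the per-block partial sums into one total.
__global__ void finalSum(const int *partial, int *total, int numPartials)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPartials)
        atomicAdd(total, partial[i]);
}

// Host wrapper (illustrative only, no error checking).
int reduceOnDevice(const std::vector<int> &host)
{
    const int n = static_cast<int>(host.size());
    if (n == 0)
        return 0;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    const int finalBlocks = (blocks + threads - 1) / threads;

    int *d_in = nullptr, *d_partial = nullptr, *d_total = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_partial, blocks * sizeof(int));
    cudaMalloc(&d_total, sizeof(int));

    cudaMemcpy(d_in, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_partial, 0, blocks * sizeof(int));
    cudaMemset(d_total, 0, sizeof(int));

    // Both launches go to the default stream, so finalSum does not begin
    // until partialSums has completed. No explicit synchronize call is
    // placed between them.
    partialSums<<<blocks, threads>>>(d_in, d_partial, n);
    finalSum<<<finalBlocks, threads>>>(d_partial, d_total, blocks);

    // A blocking cudaMemcpy on the default stream waits for the preceding
    // kernels before copying the result back.
    int total = 0;
    cudaMemcpy(&total, d_total, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_total);
    return total;
}
```

Calling reduceOnDevice on a std::vector<int> returns the sum of its elements as long as the total fits in an int. Funnelling every addition through global atomics like this is only meant to show the launch ordering, not to be fast.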

naiski · June 24, 2013, 5:37pm

This is very good to know, thank you.
I’m curious, does this global memory barrier apply to device functions called from within a kernel or only to global functions called from the host?

seibert · June 24, 2013, 6:10pm

This ordering is only guaranteed when launching kernels specifically. Calling device functions is a thread-level thing, so there are all the usual race condition issues.
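
As a rough sketch of what "thread-level" means here: a __device__ function runs only in the thread that calls it, so any ordering between threads still has to come from an explicit barrier. The names storeSquare, squareThenMirror, squareMirrored, and kThreads are invented for this example, a single block is assumed so that __syncthreads() is enough, and error checking is omitted.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // hypothetical block size; one block is launched

// Hypothetical helper. The call executes in the calling thread only and
// implies no barrier of any kind.
__device__ void storeSquare(int *scratch, int slot, int value)
{
    scratch[slot] = value * value;
}

__global__ void squareThenMirror(const int *in, int *scratch, int *out)
{
    int t = threadIdx.x;
    storeSquare(scratch, t, in[t]);

    // Without this barrier, the read below could see a slot that another
    // thread in the block has not written yet. __syncthreads() only covers
    // threads of the same block, which is why a single block is used here.
    __syncthreads();

    out[t] = scratch[blockDim.x - 1 - t];
}

// Host wrapper (illustrative only). Expects exactly kThreads elements and
// returns a zero-filled vector otherwise.
std::vector<int> squareMirrored(const std::vector<int> &host)
{
    std::vector<int> result(kThreads, 0);
    if (static_cast<int>(host.size()) != kThreads)
        return result;

    const size_t bytes = kThreads * sizeof(int);
    int *d_in = nullptr, *d_scratch = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_scratch, bytes);
    cudaMalloc(&d_out, bytes);

    cudaMemcpy(d_in, host.data(), bytes, cudaMemcpyHostToDevice);
    squareThenMirror<<<1, kThreads>>>(d_in, d_scratch, d_out);
    cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_scratch);
    cudaFree(d_out);
    return result;
}
```

With more than one block, the read of scratch would cross block boundaries and __syncthreads() would no longer be sufficient; that is the case where splitting into separate kernel launches comes in.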

naiski · June 24, 2013, 7:14pm

That makes sense, thanks.
