Global Array Reduction

I want to test the performance of global reduction with atomic operations, so I took the fastest reduction kernel provided by the SDK and made two changes: every use of shared memory became a global array passed in as a parameter, and every regular addition became an atomic addition.
However, global synchronization becomes a problem and the kernel produces garbage. I understand that one solution is to split the kernel into multiple invocations and synchronize globally with cudaThreadSynchronize or cudaDeviceSynchronize between them, but I’m not sure exactly where I need to split the kernel.

Here is the code: http://pastebin.com/qUk3G4LF

Surely I need to split the kernel at line 12 after the first layer of reduction is performed but would I need to split at every __syncthreads() point after that? Before warp reduction as well?

This doesn’t answer your general questions, but I wanted to quickly note:

If you split the kernel into different invocations, you DO NOT need to call cudaThreadSynchronize or cudaDeviceSynchronize between invocations. All kernels submitted to the same stream run sequentially with a global memory barrier between them (if you don’t specify a stream, the kernel runs on stream 0, the default stream). Pretty much the only time you need to call one of the synchronization functions explicitly on the host is for debugging purposes, or if you need to wait until an asynchronous memory copy finishes.
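To make that concrete, here is a rough sketch of a multi-pass reduction that relies only on same-stream ordering. It is not the SDK kernel from the pastebin link; partial_sums and reduce_on_device are made-up names, and the kernel is a deliberately naive one in which each block atomically accumulates into its own slot of a global array. Every pass is a separate launch on the default stream, and there is no explicit synchronization call between passes.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread atomically adds its element into the
// global-memory slot that belongs to its block.
__global__ void partial_sums(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&out[blockIdx.x], in[i]);
}

// Hypothetical host wrapper: repeats the pass until one value is left.
int reduce_on_device(const std::vector<int> &host_in)
{
    const int threads = 256;
    int n = static_cast<int>(host_in.size());
    if (n == 0) return 0;

    int max_blocks = (n + threads - 1) / threads;
    int *buf_a = nullptr, *buf_b = nullptr;
    cudaMalloc(&buf_a, n * sizeof(int));
    cudaMalloc(&buf_b, max_blocks * sizeof(int));
    cudaMemcpy(buf_a, host_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int *src = buf_a, *dst = buf_b;
    while (n > 1) {
        int blocks = (n + threads - 1) / threads;
        // Both operations go to the default stream, so the memset finishes
        // before the kernel starts, and this kernel finishes before the
        // next pass reads its output. No cudaDeviceSynchronize is needed.
        cudaMemsetAsync(dst, 0, blocks * sizeof(int), 0);
        partial_sums<<<blocks, threads>>>(src, dst, n);
        int *tmp = src; src = dst; dst = tmp;  // output becomes next input
        n = blocks;
    }

    int result = 0;
    // A plain cudaMemcpy is also ordered after the preceding kernels in the
    // default stream and blocks the host until the copy is done.
    cudaMemcpy(&result, src, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(buf_a);
    cudaFree(buf_b);
    return result;
}
```

Calling reduce_on_device on a vector of 100000 ones should give 100000. Error checking on the CUDA calls is left out to keep the sketch short.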

This is very good to know, thank you.
I’m curious, does this global memory barrier apply to device functions called from within a kernel or only to global functions called from the host?

This ordering is only guaranteed between kernel launches. A device function call is just ordinary code executed by the calling thread, so all the usual race conditions apply.
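A small illustration of the difference, with made-up names throughout (write_own, read_neighbor, fused_racy, write_pass, read_pass, neighbor_values). In the fused kernel the two device function calls are just consecutive statements in one thread, so nothing guarantees that a neighbor in another block has written its value yet. Splitting the work into two launches on the same stream puts the barrier between them.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

__device__ void write_own(int *data, int i) { data[i] = i; }

__device__ int read_neighbor(const int *data, int i, int n)
{
    return data[(i + 1) % n];
}

// Racy (shown for contrast, never launched below): the neighbor's write
// may not have happened yet, because a device function call does not
// synchronize anything across threads or blocks.
__global__ void fused_racy(int *data, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        write_own(data, i);
        out[i] = read_neighbor(data, i, n);
    }
}

// Safe: the same two steps as separate kernels.
__global__ void write_pass(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) write_own(data, i);
}

__global__ void read_pass(const int *data, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = read_neighbor(data, i, n);
}

// Hypothetical host wrapper that runs the safe version.
std::vector<int> neighbor_values(int n)
{
    std::vector<int> host_out;
    if (n <= 0) return host_out;
    host_out.resize(n);

    const int threads = 256;
    int blocks = (n + threads - 1) / threads;
    int *data = nullptr, *out = nullptr;
    cudaMalloc(&data, n * sizeof(int));
    cudaMalloc(&out, n * sizeof(int));

    // Same (default) stream: every write from write_pass is visible to
    // read_pass without any host-side synchronization call.
    write_pass<<<blocks, threads>>>(data, n);
    read_pass<<<blocks, threads>>>(data, out, n);

    cudaMemcpy(host_out.data(), out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(data);
    cudaFree(out);
    return host_out;
}
```

With the split version, element i of the result is always (i + 1) % n. The fused kernel may happen to produce the same values on a given run, but that is not guaranteed.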

That makes sense, thanks.