I want to test the performance of global reduction with atomic operations, so I took the fastest reduction kernel provided by the SDK, replaced all instances of shared memory with a global array taken as a parameter, and replaced all instances of regular addition with atomic addition.
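To illustrate, the substitution I made looks roughly like this (a simplified sketch of one reduction stage, not the exact pastebin code; `gdata` and `out` are placeholder names for the parameters):

```cuda
// Tree reduction with shared memory replaced by the global array 'gdata'
// and '+=' replaced by atomicAdd, as described above.
__global__ void reduceAtomic(int *gdata, int *out, unsigned int n)
{
    unsigned int tid = threadIdx.x;
    unsigned int gid = blockIdx.x * blockDim.x + tid;

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && gid + s < n)
            atomicAdd(&gdata[gid], gdata[gid + s]);
        __syncthreads(); // synchronizes only within this block, not across the grid
    }
    if (tid == 0)
        out[blockIdx.x] = gdata[gid]; // per-block partial sum
}
```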
However, global synchronization becomes a problem and the kernel spits out garbage. I understand that a solution to this problem is to split the kernel into multiple invocations and perform a global synchronization with cudaThreadSynchronize or cudaDeviceSynchronize between them, but I'm not sure exactly where I need to split the kernel.
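In other words, something like the following, where each launch handles one layer and the host synchronizes between launches (kernel and variable names are placeholders for what I think the split version would look like; assumes n is a power of two):

```cuda
#include <cuda_runtime.h>

// Each launch performs one layer of the reduction; returning control to the
// host between launches acts as the global synchronization point.
__global__ void reduceLayer(int *gdata, unsigned int stride)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < stride)
        gdata[i] += gdata[i + stride]; // disjoint elements per thread, so no atomics needed
}

void reduceOnHost(int *d_data, unsigned int n, unsigned int blockSize)
{
    for (unsigned int stride = n / 2; stride > 0; stride /= 2) {
        unsigned int blocks = (stride + blockSize - 1) / blockSize;
        reduceLayer<<<blocks, blockSize>>>(d_data, stride);
        cudaDeviceSynchronize(); // global synchronization between layers
    }
    // d_data[0] now holds the total sum
}
```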
Here is the code: http://pastebin.com/qUk3G4LF
Surely I need to split the kernel at line 12, after the first layer of reduction is performed, but would I need to split at every __syncthreads() point after that as well? Before the warp reduction, too?
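For clarity, by "warp reduction" I mean the warp-synchronous section at the end of the SDK kernel, which (from memory, approximately) looks like this:

```cuda
// Final warp-synchronous stage of the SDK reduction sample: the last 32
// threads finish without __syncthreads(), relying on warp-synchronous
// execution and a volatile pointer to shared memory.
if (tid < 32) {
    volatile int *smem = sdata;
    smem[tid] += smem[tid + 32];
    smem[tid] += smem[tid + 16];
    smem[tid] += smem[tid + 8];
    smem[tid] += smem[tid + 4];
    smem[tid] += smem[tid + 2];
    smem[tid] += smem[tid + 1];
}
```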