Reduction: shared vs. global memory

Hi everyone,

To test the difference in speed between shared and global memory in CUDA, I simply took the reduction code (from Mark Harris's paper) and replaced the shared memory array with a global one passed in as a parameter. But this does not work, and my simple question is: why not? I hope the answer is as simple as the question :D. For reference: I allocated a device pointer of the same size as the shared memory that was used, removed the shared pointer, and passed the global pointer as an additional parameter to the reduction kernel. The rest of the code is unchanged.
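For reference, the change described presumably looks something like the sketch below (kernel and variable names are illustrative, not the original code; only the scratch array's address space changes):

```cuda
// Harris-style reduction pass, using block-private shared memory.
// Each block writes one partial sum to g_out[blockIdx.x].
__global__ void reduce_shared(const float *g_in, float *g_out)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_in[i];
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}

// The described modification: identical logic, but the scratch array
// lives in global memory, sized like one block's shared array, and is
// passed in as a parameter.
__global__ void reduce_global(const float *g_in, float *g_out,
                              float *g_scratch)
{
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    g_scratch[tid] = g_in[i];   // BUG: every block writes the same words
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            g_scratch[tid] += g_scratch[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_out[blockIdx.x] = g_scratch[0];
}
```

Unlike `sdata`, which is a separate array per block, `g_scratch` is one array visible to the whole grid, so all blocks race on the same locations.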

Thanks a lot!


By definition, shared memory is private to a block, so the reduction only synchronizes the threads within a block, not across different blocks. In my opinion, the problem is that all your blocks write to the same locations in global memory.
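Concretely: if the global scratch buffer is only one block wide, every block stamps over the same words. A minimal fix (a sketch, assuming the buffer is enlarged to gridDim.x * blockDim.x floats) is to give each block its own slice:

```cuda
// Each block works in its own region of the global scratch buffer,
// so blocks no longer overwrite each other's partial sums. Within a
// block, __syncthreads() is still sufficient: it makes prior global
// memory accesses by the block's threads visible to each other.
__global__ void reduce_global_sliced(const float *g_in, float *g_out,
                                     float *g_scratch)
{
    float *my = g_scratch + blockIdx.x * blockDim.x;  // this block's slice
    unsigned int tid = threadIdx.x;
    my[tid] = g_in[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            my[tid] += my[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_out[blockIdx.x] = my[0];
}
```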


You can't just replace shared memory with global memory, because you don't have the same synchronization primitives. In particular, __syncthreads() only synchronizes the threads of a single block; it gives no guarantee that global memory writes made by other blocks have finished.
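The usual way to get a grid-wide synchronization point is simply a kernel boundary: a first pass produces one partial sum per block, and a second launch reduces those partials. A host-side sketch (names illustrative; it assumes a reduction kernel that writes one partial sum per block, n divisible by the block size, and few enough blocks for the second launch to use a single block):

```cuda
// Two launches give the grid-wide barrier that __syncthreads() cannot:
// all writes of pass 1 are visible before pass 2 starts on the same stream.
void reduce(const float *d_in, float *d_partial, float *d_result,
            int n, int threads)
{
    int blocks = n / threads;                 // assume exact division
    size_t smem = threads * sizeof(float);

    // Pass 1: one partial sum per block.
    reduce_kernel<<<blocks, threads, smem>>>(d_in, d_partial);

    // Pass 2: a single block reduces the partial sums to one value
    // (requires 'blocks' to be a valid, power-of-two block size).
    reduce_kernel<<<1, blocks, blocks * sizeof(float)>>>(d_partial, d_result);
}
```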

Thanks a lot guys.


If you just want to experiment with reduction through global memory, try the __syncblocks() construct posted previously. It's just a spin loop that idles all the threads on an MP. The construct is appealing, but it fails once the grid has more blocks than can be resident on the GPU at once.
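For reference, a __syncblocks() of this kind is typically built from an atomic counter in global memory. This is a sketch of the general idea, not the exact construct posted; it is not resettable for reuse, and it deadlocks when waiting blocks occupy all the MPs so the remaining blocks can never be scheduled, which is exactly why it fails for large grids:

```cuda
// Global arrival counter; volatile so the spin loop re-reads memory.
__device__ volatile int g_arrived = 0;

// Spin-loop global barrier. Needs atomicAdd on global memory
// (compute capability 1.1+) and requires ALL blocks of the grid to be
// resident simultaneously, otherwise thread 0 spins forever.
__device__ void syncblocks(int numBlocks)
{
    __syncthreads();                          // whole block has arrived
    if (threadIdx.x == 0) {
        atomicAdd((int *)&g_arrived, 1);      // announce this block
        while (g_arrived < numBlocks)         // busy-wait for the others
            ;
    }
    __syncthreads();                          // release the whole block
}
```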

Curious to know: has anybody tried the same construct using the atomic instructions of 1.1 hardware?