To compare the speed of shared and global memory in CUDA, I took the reduction code from Mark Harris's paper and replaced the shared memory array with a global one passed in as a kernel parameter. But this is not working, and my simple question is: why not? I hope the answer is as simple as the question :D

For reference: I allocated a device pointer of the same size as the shared memory array, removed the shared array from the kernel, and passed the global pointer as an additional parameter to the reduction kernel. The rest of the code is unchanged.
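To make the change concrete, here is a rough sketch of what the modification looks like. It is not my exact code: it assumes the first (interleaved addressing) kernel from Harris's reduction material, and the kernel name `reduceGlobalScratch`, the parameter name `scratch`, and the launch sizes (4 blocks of 256 threads) are placeholders chosen only for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical name. Modeled on the interleaved-addressing reduction kernel,
// with the shared array replaced by the global pointer `scratch`.
__global__ void reduceGlobalScratch(const int *g_idata, int *g_odata, int *scratch)
{
    // Removed: extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Was: sdata[tid] = g_idata[i];
    scratch[tid] = g_idata[i];
    __syncthreads();

    // Tree reduction, same loop as before but on the global array.
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            scratch[tid] += scratch[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 of each block writes that block's result.
    if (tid == 0) {
        g_odata[blockIdx.x] = scratch[0];
    }
}

int main()
{
    const int threads = 256;           // assumed block size
    const int blocks = 4;              // assumed grid size
    const int n = threads * blocks;

    std::vector<int> h_in(n, 1);       // all ones, so sums are easy to check
    std::vector<int> h_out(blocks, 0);

    int *d_in = nullptr, *d_out = nullptr, *d_scratch = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    // Same size as the shared array it replaces: one int per thread of a block.
    cudaMalloc(&d_scratch, threads * sizeof(int));

    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    reduceGlobalScratch<<<blocks, threads>>>(d_in, d_out, d_scratch);
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);

    for (int b = 0; b < blocks; ++b) {
        printf("block %d partial sum: %d\n", b, h_out[b]);
    }

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_scratch);
    return 0;
}
```

With an input of all ones and 256 threads per block, the original shared memory version yields a partial sum of 256 for every block, which gives a reference value to compare the printed results against.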
Thanks a lot!