shared memory reverse array example


The Dr. Dobb's CUDA tutorial has two array-reversal GPU kernels: the first does not use shared memory and the second does. Here is the link: …TMY32JVN?pgno=2

I have timed the two kernels and don’t see much difference in performance between the two. Is this normal? What might be the problem (e.g. automatic optimizations, not setting a flag in the compilation command, something wrong with my timing, etc.)? Thanks.
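For reference, here is a minimal sketch of the kind of comparison being described. It is not the tutorial's exact code: the kernel names `reverseGlobal` and `reverseShared`, the array size, and the block size are assumptions chosen for illustration, and the array length is assumed to be a multiple of the block size. Timing uses CUDA events with a warm-up launch, which avoids two common measurement mistakes (timing an asynchronous launch with a host clock, and including first-launch overhead).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Global-memory-only version: coalesced read, mirrored (descending) write.
__global__ void reverseGlobal(int *out, const int *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[n - 1 - i] = in[i];
}

// Shared-memory version: reverse each block's tile in shared memory, then
// write it to the mirrored block position so that both the global read and
// the global write run in ascending address order.
// Assumes n is a multiple of blockDim.x.
__global__ void reverseShared(int *out, const int *in, int n)
{
    extern __shared__ int tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[blockDim.x - 1 - threadIdx.x] = in[i];
    __syncthreads();
    int outBase = n - (blockIdx.x + 1) * blockDim.x;
    out[outBase + threadIdx.x] = tile[threadIdx.x];
}

// Returns 1 if out is the exact reverse of in.
static int isReversed(const int *in, const int *out, int n)
{
    for (int i = 0; i < n; ++i)
        if (out[i] != in[n - 1 - i]) return 0;
    return 1;
}

int main()
{
    const int n = 1 << 20;          // illustrative size, multiple of threads
    const int threads = 256;        // illustrative block size
    const int blocks = n / threads;
    const size_t bytes = n * sizeof(int);

    int *h_in = (int *)malloc(bytes);
    int *h_out = (int *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msGlobal = 0.0f, msShared = 0.0f;

    // Warm-up launch so one-time initialization cost is not measured.
    reverseGlobal<<<blocks, threads>>>(d_out, d_in, n);
    cudaDeviceSynchronize();

    // Time the global-memory kernel.
    cudaEventRecord(start);
    reverseGlobal<<<blocks, threads>>>(d_out, d_in, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msGlobal, start, stop);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    int okGlobal = isReversed(h_in, h_out, n);

    // Time the shared-memory kernel (dynamic shared memory: one int per thread).
    cudaEventRecord(start);
    reverseShared<<<blocks, threads, threads * sizeof(int)>>>(d_out, d_in, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msShared, start, stop);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    int okShared = isReversed(h_in, h_out, n);

    printf("global: %.3f ms (%s)\n", msGlobal, okGlobal ? "correct" : "WRONG");
    printf("shared: %.3f ms (%s)\n", msShared, okShared ? "correct" : "WRONG");

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return (okGlobal && okShared) ? 0 : 1;
}
```

A single timed launch is noisy; averaging over many launches between the two event records would give a steadier number. Error checking on the CUDA calls is omitted here for brevity.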

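Which GPU are you using? Devices of compute capability 1.2 and higher have an improved memory controller with relaxed coalescing rules, so they should be able to reverse an array efficiently without shared memory. On such a device, seeing little difference between the two kernels would be normal.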
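If you are not sure which compute capability your card has, a small query program along these lines should report it. This is only a sketch: it inspects whichever device is currently selected, and the 1.2 threshold in the check simply mirrors the figure mentioned above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int dev = 0;
    cudaDeviceProp prop;

    // Query the currently selected device and its properties.
    if (cudaGetDevice(&dev) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
        fprintf(stderr, "No usable CUDA device found\n");
        return 1;
    }

    printf("Device %d: %s, compute capability %d.%d\n",
           dev, prop.name, prop.major, prop.minor);

    // Compute capability 1.2 or higher, per the threshold discussed above.
    bool relaxedCoalescing =
        prop.major > 1 || (prop.major == 1 && prop.minor >= 2);
    printf("Compute capability >= 1.2: %s\n", relaxedCoalescing ? "yes" : "no");
    return 0;
}
```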