Shared memory for N-body problem

Hi folks,

I have an interesting question. If we look at chapter 31 of “GPU Gems 3” (Fast N-Body Simulation with CUDA), which can be found in the examples folder (…\NVIDIA GPU Computing SDK 3.2\C\src\nbody\doc) supplied with the SDK, we see that there is a way to accelerate the N-body calculation by using shared memory and splitting the N-body array into tiles. In another book, the authors used essentially the same algorithm and claimed an almost 8-fold speedup (from 1559.0 ms to 240.97 ms) for a benchmark computation. Looks promising, doesn’t it?

So I tried it: I copied the example from “GPU Gems” verbatim, but I found that there was no speedup at all. Reading directly from global memory, without shared memory, I got an execution time of 277.9 ms; with shared memory it was 279.6 ms. Varying the number of bodies N, I found that the execution time of the shared-memory code is always higher than that of the code without shared memory, and in the limit of large N the two times become almost equal.
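To make the comparison concrete, here is a minimal sketch of the two variants being compared. It is not the GPU Gems listing: the names `forcesGlobal`, `forcesTiled`, `TILE`, `SOFTENING` and `maxRelDiffBetweenVariants` are placeholders of mine, and the tiled kernel assumes it is launched with exactly `TILE` threads per block.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

#define TILE 256          // threads per block = bodies per tile (assumed value)
#define SOFTENING 1e-9f   // keeps the i == j term finite (assumed value)

// Adds the acceleration on the body at pi caused by body pj (pj.w is the mass).
__device__ float3 accumulate(float4 pi, float4 pj, float3 a)
{
    float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
    float invDist = rsqrtf(dx * dx + dy * dy + dz * dz + SOFTENING);
    float s = pj.w * invDist * invDist * invDist;
    a.x += dx * s; a.y += dy * s; a.z += dz * s;
    return a;
}

// Variant 1: every interaction reads pos[j] directly from global memory.
__global__ void forcesGlobal(const float4* pos, float3* acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = pos[i];
    float3 a = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) a = accumulate(pi, pos[j], a);
    acc[i] = a;
}

// Variant 2: each block stages TILE bodies in shared memory, then all
// threads of the block loop over that tile before loading the next one.
__global__ void forcesTiled(const float4* pos, float3* acc, int n)
{
    __shared__ float4 tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 pi = (i < n) ? pos[i] : make_float4(0.0f, 0.0f, 0.0f, 0.0f);
    float3 a = make_float3(0.0f, 0.0f, 0.0f);
    for (int base = 0; base < n; base += TILE) {
        int j = base + threadIdx.x;
        // Pad the last tile with zero-mass bodies, which contribute nothing.
        tile[threadIdx.x] = (j < n) ? pos[j] : make_float4(0.0f, 0.0f, 0.0f, 0.0f);
        __syncthreads();                      // tile fully loaded
        for (int k = 0; k < TILE; ++k) a = accumulate(pi, tile[k], a);
        __syncthreads();                      // done reading before overwrite
    }
    if (i < n) acc[i] = a;
}

// Runs both variants on the same synthetic data and returns the largest
// relative difference between their results (a sanity check, not a benchmark).
float maxRelDiffBetweenVariants(int n)
{
    std::vector<float4> h(n);
    for (int i = 0; i < n; ++i)
        h[i] = make_float4(sinf(i * 0.37f), cosf(i * 0.11f), sinf(i * 0.73f), 1.0f);

    float4* dPos; float3 *dA, *dB;
    cudaMalloc(&dPos, n * sizeof(float4));
    cudaMalloc(&dA, n * sizeof(float3));
    cudaMalloc(&dB, n * sizeof(float3));
    cudaMemcpy(dPos, h.data(), n * sizeof(float4), cudaMemcpyHostToDevice);

    int blocks = (n + TILE - 1) / TILE;
    forcesGlobal<<<blocks, TILE>>>(dPos, dA, n);
    forcesTiled<<<blocks, TILE>>>(dPos, dB, n);

    std::vector<float3> a(n), b(n);
    cudaMemcpy(a.data(), dA, n * sizeof(float3), cudaMemcpyDeviceToHost);
    cudaMemcpy(b.data(), dB, n * sizeof(float3), cudaMemcpyDeviceToHost);
    cudaFree(dPos); cudaFree(dA); cudaFree(dB);

    float worst = 0.0f;
    for (int i = 0; i < n; ++i) {
        float diff = fabsf(a[i].x - b[i].x) + fabsf(a[i].y - b[i].y) + fabsf(a[i].z - b[i].z);
        float mag  = 1.0f + fabsf(a[i].x) + fabsf(a[i].y) + fabsf(a[i].z);
        worst = fmaxf(worst, diff / mag);
    }
    return worst;
}
```

Calling `maxRelDiffBetweenVariants(n)` from a `main()` only confirms that the two kernels agree. Note that in `forcesGlobal` all threads of a warp read the same `pos[j]` in the same iteration, so the access pattern is about as cache-friendly as it gets on hardware that caches global loads.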

So my (rather general) question is: why can't I achieve any speedup by using shared memory? Why does the code from the book not give any speedup on my computer? Any ideas, hints, or suggestions are welcome!

Needless to say, playing with different tile sizes did not help. I am sure there are no errors in the code. My GPU is a Tesla C2050.

Just a wild guess: the performance figures and the algorithm optimisations described in the book correspond to an implementation on a G80 GPU, which is rather old.
Since Fermi (GF1xx) and compute capability 2.x, GPUs cache global memory accesses, which can make a huge difference. Your Tesla C2050 is one of those.
Reading the algorithm as described in the book, I don’t see anything done with shared memory that couldn’t be achieved as effectively (if not better) by the cache… So what you measured might simply be the effect of the good job the NVIDIA guys did designing their hardware, which makes many hand-tuned optimisations unnecessary for achieving good performance…
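One way to probe this guess is to time the plain global-memory kernel under different on-chip memory splits. The sketch below is only an illustration under my own assumptions: `forcesGlobal` and `timeGlobalKernelMs` are hypothetical names, the bodies are zero-filled (the kernel has no data-dependent branches, so that should not change the timing), and error checking is omitted. It uses `cudaFuncSetCacheConfig` and CUDA events from the runtime API.

```cuda
#include <cuda_runtime.h>

// Plain N-body force kernel: every pos[j] is read from global memory,
// so any reuse has to come from the hardware caches.
__global__ void forcesGlobal(const float4* pos, float3* acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = pos[i];
    float3 a = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {
        float4 pj = pos[j];
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float invDist = rsqrtf(dx * dx + dy * dy + dz * dz + 1e-9f); // softened
        float s = pj.w * invDist * invDist * invDist;
        a.x += dx * s; a.y += dy * s; a.z += dz * s;
    }
    acc[i] = a;
}

// Returns the time in milliseconds of one launch of forcesGlobal on n
// zero-filled bodies, with the requested L1/shared-memory preference.
float timeGlobalKernelMs(cudaFuncCache preference, int n)
{
    float4* dPos; float3* dAcc;
    cudaMalloc(&dPos, n * sizeof(float4));
    cudaMalloc(&dAcc, n * sizeof(float3));
    cudaMemset(dPos, 0, n * sizeof(float4));   // placeholder data

    // Ask the runtime to favour L1 or shared memory for this kernel.
    cudaFuncSetCacheConfig(forcesGlobal, preference);

    int threads = 256;                          // assumed block size
    int blocks = (n + threads - 1) / threads;
    forcesGlobal<<<blocks, threads>>>(dPos, dAcc, n);   // warm-up launch

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    forcesGlobal<<<blocks, threads>>>(dPos, dAcc, n);   // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dPos);
    cudaFree(dAcc);
    return ms;
}
```

Calling `timeGlobalKernelMs` once with `cudaFuncCachePreferL1` and once with `cudaFuncCachePreferShared` switches a compute capability 2.x device between its 48 KB L1 / 16 KB shared and 16 KB L1 / 48 KB shared configurations. As far as I know, compiling with `nvcc -Xptxas -dlcm=cg` additionally makes Fermi cache global loads in L2 only, which would be a more direct test. If the untiled kernel slows down noticeably in either experiment, that would support the cache explanation.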