I have an interesting question. Chapter 31 of “GPU Gems 3” (Fast N-Body Simulation with CUDA), which can be found in the examples folder (…\NVIDIA GPU Computing SDK 3.2\C\src\nbody\doc) supplied with the SDK, shows a way to accelerate the N-body calculation by using shared memory and splitting the N-body array into tiles. In another book, the authors used essentially the same algorithm and claimed an almost 8-fold speedup (from 1559.0 ms to 240.97 ms) on some benchmark computation. Looks promising, doesn't it? So I copied the example from “GPU Gems” verbatim, but I found no speedup at all. With direct reads from global memory (no shared memory) I got an execution time of 277.9 ms; with shared memory it was 279.6 ms. Varying the number of bodies N, I found that the execution time of the shared-memory code is always slightly higher than that of the code without shared memory, and in the limit of large N the two times become almost equal.
So my (rather general) question is: why can't I achieve any speedup by using shared memory? Why does the code from the book give no speedup on my machine? Any ideas/hints/suggestions are welcome!
Needless to say, playing with different tile sizes did not help. And there are no errors in the code. My GPU is a Tesla C2050.