As a next step I recoded the function to place the data in shared memory, as well as another version that uses global memory.
Of course global memory is the slowest; no surprise there. But I had assumed that the stack, since it is relatively large, would be placed in global memory as well. It turns out, though, that the stack version is much faster than the shared memory version, whereas I had expected shared memory to be the fastest.
So where is the stack frame stored? With 2504 bytes, I doubt that it is in registers. Can anyone enlighten me, as I was not able to find useful info in the docs or online.
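For reference, a minimal sketch of the three variants (names, sizes, and the omitted work are placeholders, not my exact code):

    #define STATE_WORDS 626   // ~2504 bytes of per-thread state

    __global__ void stackVersion()
    {
        unsigned int state[STATE_WORDS];   // per-thread "stack" array
        // ... generate numbers from state[] ...
    }

    __global__ void sharedVersion()
    {
        // one slice per thread, laid out contiguously in shared memory
        extern __shared__ unsigned int smem[];
        unsigned int *state = &smem[threadIdx.x * STATE_WORDS];
        // ... generate numbers from state[] ...
    }

    __global__ void globalVersion(unsigned int *gmem)
    {
        // gmem comes from cudaMalloc, one slice per thread
        unsigned int *state =
            &gmem[(blockIdx.x * blockDim.x + threadIdx.x) * STATE_WORDS];
        // ... generate numbers from state[] ...
    }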
Yes, it is a Fermi-based card (GTX 460). Conceptually, what you say makes complete sense, but my experiments show the following results:
Sorted from fastest to slowest, when I place the data:
on the stack (100% runtime, the baseline)
in global memory (explicitly allocated, 200% runtime)
in shared memory (300% runtime)
I can understand why the stack is faster than explicit allocation in global memory (perhaps the cache engine is able to predict the accesses better), but why is explicit allocation in shared memory so much slower?
Hm, the code does use a lot of global memory to store the results, but that is write-only; there are no other read accesses.
Any other insights for me? I’d like to understand why the card behaves the way it does…
(1) For the shared memory variant, how much re-use is there of each data item? If there is just one use, moving data through shared memory could simply increase overhead compared to leaving the data in global memory and letting the cache do its work.
(2) Have shared memory bank conflicts been ruled out? Bank conflicts can significantly increase the run time of code that makes heavy use of shared memory (see the sketch after this list).
(3) Is the code in question known to be memory bound? There may be significant differences in dynamic instruction count between the various versions you have looked at, which may explain some of the differences.
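To illustrate (2), a minimal hypothetical sketch of the kind of access pattern to check for, assuming Fermi's 32 banks of 4-byte words (the names here are placeholders, not your code):

    __global__ void bankConflictDemo(unsigned int *out)
    {
        __shared__ unsigned int buf[32 * 32];
        for (int i = threadIdx.x; i < 32 * 32; i += blockDim.x)
            buf[i] = i;                  // initialize the whole array
        __syncthreads();

        // Conflict-free: consecutive threads of a warp read consecutive
        // 4-byte words, which map to 32 different banks.
        unsigned int a = buf[threadIdx.x];

        // 32-way conflict: with a stride of 32 words, every thread of a
        // warp hits the same bank, so the 32 accesses are serialized.
        unsigned int b = buf[(threadIdx.x * 32) % (32 * 32)];

        out[threadIdx.x] = a + b;        // keep the loads from being optimized away
    }

If each thread owns a contiguous slice of shared memory, the stride between threads is the per-thread slice size, and depending on that size a warp can hit far fewer than 32 distinct banks.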
I made a mistake in this comparison: because the data I use is relatively large, I tested with only 16 threads per block in the shared memory version (much more would not fit). When I use 16 threads per block, or fewer, for all versions, the shared memory version becomes as fast as, or faster than, the other versions.
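To put numbers on it: with 2504 bytes of state per thread and Fermi's 48 KB shared memory configuration per multiprocessor, 49152 / 2504 ≈ 19 threads, so 16 threads per block is about the practical limit for the shared memory version, while the stack and global versions can run with many more threads and hide memory latency far better.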
Yes, there is heavy reuse of the data. I tested with 20 000 000 iterations, each of which accessed roughly two of the data items.
No, I have not looked at memory bank conflicts yet.
What I am doing is generating a large quantity of random numbers using the Mersenne Twister algorithm. It does, on average, two reads and two writes on the data in question per operation. The arithmetic operations in the algorithm often reuse one operand and thus should fit nicely into registers. So I am pretty sure that the algorithm is memory bound.
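For reference, the core state update looks roughly like this (a sketch of the textbook MT19937 recurrence in single-step form, not my exact code; the exact read/write counts per output depend on the formulation):

    #define MT_N 624
    #define MT_M 397

    // One step of the recurrence: reads mt[i], mt[i+1], and mt[i+M],
    // writes mt[i] back, then tempers the result purely in registers.
    __device__ unsigned int mt_next(unsigned int *mt, int *idx)
    {
        int i = *idx;
        unsigned int y = (mt[i] & 0x80000000u) | (mt[(i + 1) % MT_N] & 0x7fffffffu);
        y = mt[(i + MT_M) % MT_N] ^ (y >> 1) ^ ((y & 1u) ? 0x9908b0dfu : 0u);
        mt[i] = y;                     // one write back to the state array
        *idx = (i + 1) % MT_N;
        // tempering: register-only arithmetic, no memory traffic
        y ^= (y >> 11);
        y ^= (y << 7)  & 0x9d2c5680u;
        y ^= (y << 15) & 0xefc60000u;
        y ^= (y >> 18);
        return y;
    }

Whether mt lives on the stack, in shared memory, or in global memory is exactly the placement I am comparing.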
Probably more important: write accesses to global memory only use the shared L2 cache, while local memory (like the stack) also goes through the faster per-SM L1 cache.
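As a side note, on Fermi the 64 KB of on-chip memory per SM is split between L1 and shared memory, and you can bias it towards L1 with the runtime API, which should help the stack version further (the kernel name and launch configuration are placeholders):

    #include <cuda_runtime.h>

    __global__ void stackVersion() { /* ... kernel using a large stack array ... */ }

    int main()
    {
        // Prefer 48 KB L1 / 16 KB shared memory for this kernel.
        cudaFuncSetCacheConfig(stackVersion, cudaFuncCachePreferL1);
        stackVersion<<<64, 256>>>();
        cudaDeviceSynchronize();
        return 0;
    }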
Minor point: for the stack version, the compiler might also be able to save some instructions in the address calculation.