Which memory is used for the stack frame?

Hi,

I’m starting out with CUDA programming, and a naïve implementation just put a big block of data on the stack, so that ptxas reports:

2504 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
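
For context, the pattern is roughly this (a simplified sketch, not my actual kernel; the names are made up and the array size is chosen to match the ptxas report):

    __global__ void generate(float *out, int n)
    {
        // 626 * 4 bytes = 2504 bytes, matching the reported stack frame
        unsigned int state[626];

        for (int i = 0; i < 626; ++i)
            state[i] = i * 2654435761u;      // placeholder initialization

        unsigned int acc = 0;
        for (int i = 0; i < n; ++i)
            acc ^= state[i % 626];           // dynamic indexing keeps the
                                             // array in the stack frame
        out[blockIdx.x * blockDim.x + threadIdx.x] = (float)acc;
    }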

As a next step I recoded the function to place the data in shared memory, and I also wrote another version that uses global memory.
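
The shared-memory variant looks roughly like this (again a simplified sketch; each thread works on its own slice of the shared array):

    #define STATE_SIZE 626
    #define THREADS     16   // ~40 KB of shared memory per block at this size

    __global__ void generate_shared(float *out, int n)
    {
        __shared__ unsigned int state[THREADS][STATE_SIZE];
        unsigned int *my = state[threadIdx.x];

        for (int i = 0; i < STATE_SIZE; ++i)
            my[i] = i * 2654435761u;         // placeholder initialization

        unsigned int acc = 0;
        for (int i = 0; i < n; ++i)
            acc ^= my[i % STATE_SIZE];

        out[blockIdx.x * blockDim.x + threadIdx.x] = (float)acc;
    }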

Of course global memory is the slowest, no surprise there. But I had assumed that the stack, since it is relatively large, would be placed in global memory as well. It turns out, though, that the stack version is much faster than the shared memory version. I had expected shared memory to be the fastest.

So where is the stack frame stored? With 2504 bytes, I doubt that it is in registers. Can anyone enlighten me, as I was not able to find useful info in the docs or online.

Thanks!

~David

If you have a Fermi-based GPU, the L1 and L2 cache can make frequently used sections of global memory (like a stack) seem as fast as shared memory.
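
If the stack version is the hot path, it can also be worth biasing Fermi’s configurable on-chip split toward L1 (48 KB L1 / 16 KB shared instead of the default 16 KB L1 / 48 KB shared). A minimal sketch, assuming the runtime API and a hypothetical kernel name:

    #include <cuda_runtime.h>

    __global__ void generate(float *out, int n);   // the stack-frame version

    void setup(void)
    {
        // Request the 48 KB L1 / 16 KB shared split so more of the
        // per-thread stack frames stay resident in L1 (error checks omitted)
        cudaFuncSetCacheConfig(generate, cudaFuncCachePreferL1);
    }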

It is also possible that the shared memory version of your code has reduced the maximum occupancy significantly, making memory latency more evident.

Thanks for your response, seibert!

Yes, it is a Fermi-based card (GTX 460). Conceptually, what you say makes perfect sense, but my experiments show the following results:

Sorted from fastest to slowest, when I place the data:

  • on the stack (100% runtime)

  • in global memory (explicit; 200% runtime)

  • in shared memory (300% runtime)

I can understand why the stack is faster than explicit allocation in global memory (maybe the cache is better able to predict the accesses), but why is explicit allocation in shared memory so much slower?

Hm, the code does use a lot of global memory to store the results, but this is write-only. There are no other read accesses.

Any other insights for me? I’d like to understand why the card behaves the way it does…

(1) For the shared memory variant, how much re-use is there of each data item? If there is just one use, moving data through shared memory could simply increase overhead compared to leaving the data in global memory and letting the cache do its work.

(2) Have shared memory bank conflicts been ruled out? Bank conflicts can significantly increase the run time of code making heavy use of shared memory (see the sketch after this list).

(3) Is the code in question known to be memory bound? There may be significant differences in dynamic instruction count between the various versions you have looked at, which may explain some of the differences.
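
To make (2) concrete: on Fermi, shared memory has 32 banks of 32-bit words, and a warp serializes when several of its threads hit the same bank. A sketch of the classic pattern (hypothetical names):

    __global__ void bank_demo(const float *in, float *out)
    {
        __shared__ float tile[32][32];

        int x = threadIdx.x, y = threadIdx.y;
        tile[y][x] = in[y * 32 + x];
        __syncthreads();

        // Conflict-free: consecutive threads of a warp read consecutive
        // 32-bit words, which fall into 32 different banks
        float fast = tile[y][x];

        // 32-way conflict: consecutive threads read words 32 apart, which
        // all map to the same bank, so the warp's accesses serialize
        float slow = tile[x][y];

        out[y * 32 + x] = fast + slow;
    }

The usual fix for the conflicted case is to pad the leading dimension (e.g. tile[32][33]) so that column accesses spread across the banks.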

I made a mistake in this comparison: because the data I use is relatively large, I tested with only 16 threads per block in the shared memory version (not much more would fit). When I use 16 or fewer threads per block for all versions, the shared memory version is as fast as, or faster than, the other versions.
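
That would fit seibert’s occupancy point: with only 16 threads per block, the resident-block limit caps the SM at a small fraction of its threads. Back-of-the-envelope (assuming GF104’s limits of 8 resident blocks and 1536 resident threads per SM):

    #include <stdio.h>

    int main(void)
    {
        const int maxBlocksPerSM  = 8;      // GF104 (GTX 460) limit
        const int maxThreadsPerSM = 1536;   // GF104 limit
        const int threadsPerBlock = 16;

        int resident = maxBlocksPerSM * threadsPerBlock;   // 128 threads
        if (resident > maxThreadsPerSM) resident = maxThreadsPerSM;

        // -> ~8.3%: very few warps in flight to hide memory latency
        printf("occupancy ~= %.1f%%\n", 100.0 * resident / maxThreadsPerSM);
        return 0;
    }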

Yes, there is heavy reuse of the data. I tested with 20,000,000 iterations, each of which accessed about two of the data items.

No, I have not looked at memory bank conflicts yet.

What I am doing is generating a large quantity of random numbers using the Mersenne Twister algorithm. It does on average two reads and two writes to the data in question per operation. The arithmetic operations in the algorithm often reuse one operand and should thus fit nicely in registers, so I am pretty sure the algorithm is memory bound.
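
For reference, one state update looks roughly like this (a generic in-place MT19937 step, not my exact code; N, M and the constants are the standard MT19937 parameters):

    #define N 624
    #define M 397
    #define MATRIX_A   0x9908b0dfu
    #define UPPER_MASK 0x80000000u
    #define LOWER_MASK 0x7fffffffu

    // One incremental MT19937 step: a handful of reads and one write to
    // the state array per generated number; the caller advances
    // i = (i + 1) % N between calls
    __device__ unsigned int mt_next(unsigned int *mt, int i)
    {
        unsigned int y = (mt[i] & UPPER_MASK) | (mt[(i + 1) % N] & LOWER_MASK);
        mt[i] = mt[(i + M) % N] ^ (y >> 1) ^ ((y & 1u) ? MATRIX_A : 0u);

        // Tempering: pure register arithmetic, as noted above
        y = mt[i];
        y ^= y >> 11;
        y ^= (y << 7)  & 0x9d2c5680u;
        y ^= (y << 15) & 0xefc60000u;
        y ^= y >> 18;
        return y;
    }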

More important, probably: write accesses to global memory only use the shared L2 cache, while local memory (like the stack) also goes through the faster per-SM L1 cache.

Minor point: for the stack version, the compiler might also be able to save some instructions in the address calculation.
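
For example (hypothetical snippet): the address of a stack slot is just the per-thread frame base plus an offset, while each shared-memory access first needs the thread’s slice base computed from the thread index:

    #define STATE_SIZE 626

    // Launched with STATE_SIZE * blockDim.x words of dynamic shared memory
    __global__ void addr_demo(unsigned int *out, int i)
    {
        // Stack: address = frame base + i*4; one add, or none at all
        // when i is a compile-time constant
        unsigned int state[STATE_SIZE];

        // Shared: address = base + (threadIdx.x * STATE_SIZE + i) * 4;
        // an extra multiply-add per access unless the compiler can hoist it
        extern __shared__ unsigned int s[];

        state[i] = i;
        s[threadIdx.x * STATE_SIZE + i] = i;

        out[threadIdx.x] = state[i] + s[threadIdx.x * STATE_SIZE + i];
    }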