As a next step I recoded the function to place the data in shared memory, as well as another version that uses global memory.
Of course global memory is the slowest; no surprise there. But I had assumed that the stack, since it is relatively large, would be placed in global memory as well. It turns out, though, that the stack version is much faster than the shared memory version, whereas I had expected shared memory to be the fastest.
So where is the stack frame stored? With 2504 bytes, I doubt that it is in registers. Can anyone enlighten me, as I was not able to find useful info in the docs or online.
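For reference, a minimal sketch of the three variants (names, sizes, and the omitted work are placeholders, not my exact code):

    #define STATE_WORDS 626   // ~2504 bytes of per-thread state

    __global__ void stackVersion()
    {
        unsigned int state[STATE_WORDS];   // per-thread "stack" array
        // ... generate numbers from state[] ...
    }

    __global__ void sharedVersion()
    {
        // one slice per thread, laid out contiguously in shared memory
        extern __shared__ unsigned int smem[];
        unsigned int *state = &smem[threadIdx.x * STATE_WORDS];
        // ... generate numbers from state[] ...
    }

    __global__ void globalVersion(unsigned int *gmem)
    {
        // gmem comes from cudaMalloc, one slice per thread
        unsigned int *state =
            &gmem[(blockIdx.x * blockDim.x + threadIdx.x) * STATE_WORDS];
        // ... generate numbers from state[] ...
    }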
Yes, it is a Fermi-based card (GTX 460). Conceptually, what you say makes complete sense, but my experiments show the following results:
Sorted from fastest to slowest, when I place the data:
on the stack (100% runtime, the baseline)
in global memory (explicitly allocated, 200% runtime)
in shared memory (300% runtime)
I can understand why the stack is faster than explicit allocation in global memory (perhaps the cache engine is able to predict the accesses better), but why is explicit allocation in shared memory so much slower?
Hm, the code does use a lot of global memory to store the results, but that is write-only; there are no other read accesses.
Any other insights for me? I’d like to understand why the card behaves the way it does…
(1) For the shared memory variant, how much re-use is there of each data item? If there is just one use, moving data through shared memory could simply increase overhead compared to leaving the data in global memory and letting the cache do its work.
(2) Have shared memory bank conflicts been ruled out? Bank conflicts can significantly increase the run time of code that makes heavy use of shared memory (see the sketch after this list).
(3) Is the code in question known to be memory bound? There may be significant differences in dynamic instruction count between the various versions you have looked at, which may explain some of the differences.
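To illustrate (2), a minimal hypothetical sketch of the kind of access pattern to check for, assuming Fermi's 32 banks of 4-byte words (the names here are placeholders, not your code):

    __global__ void bankConflictDemo(unsigned int *out)
    {
        __shared__ unsigned int buf[32 * 32];
        for (int i = threadIdx.x; i < 32 * 32; i += blockDim.x)
            buf[i] = i;                  // initialize the whole array
        __syncthreads();

        // Conflict-free: consecutive threads of a warp read consecutive
        // 4-byte words, which map to 32 different banks.
        unsigned int a = buf[threadIdx.x];

        // 32-way conflict: with a stride of 32 words, every thread of a
        // warp hits the same bank, so the 32 accesses are serialized.
        unsigned int b = buf[(threadIdx.x * 32) % (32 * 32)];

        out[threadIdx.x] = a + b;        // keep the loads from being optimized away
    }

If each thread owns a contiguous slice of shared memory, the stride between threads is the per-thread slice size, and depending on that size a warp can hit far fewer than 32 distinct banks.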
I made a mistake in this comparison: because the data I use is relatively large, I tested with only 16 threads per block in the shared memory version (much more would not fit). When I use 16 threads per block, or fewer, for all versions, the shared memory version becomes as fast as, or faster than, the other versions.
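To put numbers on it: with 2504 bytes of state per thread and Fermi's 48 KB shared memory configuration per multiprocessor, 49152 / 2504 ≈ 19 threads, so 16 threads per block is about the practical limit for the shared memory version, while the stack and global versions can run with many more threads and hide memory latency far better.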
Yes, there is heavy reuse of the data. I tested with 20 000 000 iterations, each of which accessed roughly two of the data items.
No, I have not looked at memory bank conflicts yet.
What I am doing is generating a large quantity of random numbers using the Mersenne Twister algorithm. It does, on average, two reads and two writes on the data in question per operation. The arithmetic operations in the algorithm often reuse one operand and thus should fit nicely into registers. So I am pretty sure that the algorithm is memory bound.
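For reference, the core state update looks roughly like this (a sketch of the textbook MT19937 recurrence in single-step form, not my exact code; the exact read/write counts per output depend on the formulation):

    #define MT_N 624
    #define MT_M 397

    // One step of the recurrence: reads mt[i], mt[i+1], and mt[i+M],
    // writes mt[i] back, then tempers the result purely in registers.
    __device__ unsigned int mt_next(unsigned int *mt, int *idx)
    {
        int i = *idx;
        unsigned int y = (mt[i] & 0x80000000u) | (mt[(i + 1) % MT_N] & 0x7fffffffu);
        y = mt[(i + MT_M) % MT_N] ^ (y >> 1) ^ ((y & 1u) ? 0x9908b0dfu : 0u);
        mt[i] = y;                     // one write back to the state array
        *idx = (i + 1) % MT_N;
        // tempering: register-only arithmetic, no memory traffic
        y ^= (y >> 11);
        y ^= (y << 7)  & 0x9d2c5680u;
        y ^= (y << 15) & 0xefc60000u;
        y ^= (y >> 18);
        return y;
    }

Whether mt lives on the stack, in shared memory, or in global memory is exactly the placement I am comparing.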
Probably more important: write accesses to global memory only use the shared L2 cache, while local memory (like the stack) also goes through the faster per-SM L1 cache.
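As a side note, on Fermi the 64 KB of on-chip memory per SM is split between L1 and shared memory, and you can bias it towards L1 with the runtime API, which should help the stack version further (the kernel name and launch configuration are placeholders):

    #include <cuda_runtime.h>

    __global__ void stackVersion() { /* ... kernel using a large stack array ... */ }

    int main()
    {
        // Prefer 48 KB L1 / 16 KB shared memory for this kernel.
        cudaFuncSetCacheConfig(stackVersion, cudaFuncCachePreferL1);
        stackVersion<<<64, 256>>>();
        cudaDeviceSynchronize();
        return 0;
    }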
Minor point: for the stack version, the compiler might also be able to save some instructions in the address calculation.