What's the difference between CUDA stack and local memory?

Here’s what I know so far:

  1. Local memory is used to handle register spillovers. It’s located in global memory and isn’t cached, which makes it slow.
  2. I don’t know much about the CUDA stack, but I know there’s something called alloca that can allocate memory on the stack, and the stack also supports recursion.

I’m quite familiar with how stacks are used in CPU programming, but I find the GPU stack confusing.

Here are my main questions:

  1. Where is the GPU stack physically stored?
  2. Why not merge the stack and local memory? In CPU programming, the stack is used for storing intermediate variables.
  3. Which is faster, the stack or local memory?
  4. It seems like there’s hardly any mention of the GPU stack in the documentation. Did I miss something (CUDA C++ Programming Guide 12.6)?

CUDA Programming Guide - Local Memory provides a fairly detailed description of Local Memory.

In C/C++, a thread’s stack is used to handle register spills. The stack is a portion of per-thread local memory. Local memory is cached in L1.

The “CUDA stack” you reference is the C/C++ thread stack. It is a region of reserved per-thread local memory used to store automatic (local) variables. It also holds frame data for function calls and can support function-scoped dynamic allocation via alloca. See Stack-based memory allocation - Wikipedia.
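As a minimal sketch of both behaviors (the kernel and variable names are hypothetical), the code below uses an automatic variable and a recursive device function; each recursive call pushes a frame onto the calling thread’s stack, which is carved out of that thread’s local memory:

```
#include <cstdio>

// Each call pushes a frame onto the calling thread's stack, which is
// part of that thread's local memory.
__device__ int factorial(int n)
{
    if (n <= 1) return 1;
    return n * factorial(n - 1);  // recursion consumes stack space
}

__global__ void kernel(int* out)
{
    // Automatic variables live on the same per-thread stack (or in
    // registers, when the compiler can keep them there).
    int n = (threadIdx.x % 6) + 1;
    out[threadIdx.x] = factorial(n);
}

int main()
{
    int *d_out, h_out[32];
    cudaMalloc(&d_out, sizeof(h_out));
    kernel<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("thread 2 computed factorial(3) = %d\n", h_out[2]);
    cudaFree(d_out);
    return 0;
}
```

A device-side alloca would draw from this same per-thread stack region.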

The thread stack is part of the GPU local memory allocation, which resides in the global memory address space. The allocation can physically reside in device memory or system memory; for performance reasons, it is allocated in device memory.

The size of thread local memory is controlled automatically by the CUDA driver. If additional stack space is required (for example, for alloca or deep call chains), the per-thread size can be specified with cudaDeviceSetLimit(cudaLimitStackSize, sizeInBytesPerThread). Increasing this size can result in very large memory allocations, since the reservation is cudaDevAttrMultiProcessorCount × cudaDevAttrMaxThreadsPerMultiProcessor × (cudaLimitStackSize + driver stack size + rounding). This can amount to hundreds or thousands of MiB; the sketch below estimates it.
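A host-side sketch of that arithmetic (the 8 KiB value is chosen purely for illustration, and the driver’s own stack overhead and rounding are not queryable, so the printed figure is only a lower bound):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Raise the per-thread stack limit to 8 KiB (illustrative value).
    cudaDeviceSetLimit(cudaLimitStackSize, 8 * 1024);

    size_t stackBytes = 0;
    cudaDeviceGetLimit(&stackBytes, cudaLimitStackSize);

    int device = 0, smCount = 0, threadsPerSM = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);
    cudaDeviceGetAttribute(&threadsPerSM,
                           cudaDevAttrMaxThreadsPerMultiProcessor, device);

    // Lower bound on the local-memory reservation; the real figure adds
    // the driver's internal stack usage and rounding.
    size_t reservation = (size_t)smCount * threadsPerSM * stackBytes;
    printf("stack/thread = %zu B, SMs = %d, threads/SM = %d -> >= %.1f MiB\n",
           stackBytes, smCount, threadsPerSM,
           reservation / (1024.0 * 1024.0));
    return 0;
}
```

For example, on a hypothetical device with 132 SMs and 2048 resident threads per SM, an 8 KiB stack alone reserves at least 132 × 2048 × 8 KiB ≈ 2.1 GiB.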

A major difference between local memory and global memory is that local memory is interleaved at 4-byte granularity across the 32 lanes of a warp. This allows accesses to coalesce when all threads in a warp access the same local variable.
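A kernel sketch of what that means in practice (the names are hypothetical; note that a compiler may promote a statically indexed array into registers, so the runtime-dependent index below is what forces local memory placement):

```
__global__ void sumLocal(const int* start_idx, float* out)
{
    // Each thread gets its own copy of 'buf' in local memory. The copies
    // are interleaved at 4-byte granularity across the warp's 32 lanes,
    // so when every lane touches buf[i] for the same i, the 32 accesses
    // fall in consecutive 4-byte slots and coalesce.
    float buf[16];

    // Runtime-dependent indexing keeps 'buf' in local memory rather
    // than letting the compiler promote it to registers.
    int start = start_idx[threadIdx.x] & 15;
    for (int i = 0; i < 16; ++i)
        buf[(start + i) & 15] = threadIdx.x + i;

    float sum = 0.0f;
    for (int i = 0; i < 16; ++i)  // same i across the warp: coalesced
        sum += buf[i];

    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}
```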

QUESTIONS

  1. Where is the GPU stack physically stored?

Local memory is in the global virtual address space. It can be physically backed by device memory or system memory; in almost all cases it will be in device memory for optimal latency and throughput.

  2. Why not merge the stack and local memory? In CPU programming, the stack is used for storing intermediate variables.

They already are merged: the C/C++ stack is part of thread local memory.

  3. Which is faster, the stack or local memory?

They are the same: the stack is simply a region of local memory, so there is no performance difference between them.

  4. It seems like there’s hardly any mention of the GPU stack in the documentation. Did I miss something (CUDA C++ Programming Guide 12.6)?

Yes, you missed a fair bit of documentation throughout the CUDA C++ Programming Guide. There is also useful information in the PTX ISA 8.5 documentation.


Thank you so much for taking the time to provide such a detailed answer!
