What's the difference between CUDA stack and local memory?

Here’s what I know so far:

  1. Local memory is used to handle register spillovers. It’s located in global memory and isn’t cached, which makes it slow.
  2. I don’t know much about the CUDA stack, but I know there’s something called alloca that can allocate memory on the stack, and the stack also supports recursion.

I’m quite familiar with how stacks are used in CPU programming, but I find the GPU stack confusing.

Here are my main questions:

  1. Where is the GPU stack physically stored?
  2. Why not merge the stack and local memory? In CPU programming, the stack is used for storing intermediate variables.
  3. Which is faster, the stack or local memory?
  4. It seems like there’s hardly any mention of the GPU stack in the documentation. Did I miss something (CUDA C++ Programming Guide 12.6)?

CUDA Programming Guide - Local Memory provides a fairly detailed description of Local Memory.

In C/C++, a thread’s stack is used to handle register spills. The stack is a portion of per-thread local memory. Local memory is cached in L1.

The “CUDA stack” you reference is the C/C++ thread stack. It is a region of reserved per-thread local memory used to store automatic (local) variables. It also holds frame data for function calls and can support function-scoped dynamic allocation via alloca. See Stack-based memory allocation - Wikipedia.
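As a minimal sketch of both behaviors (the kernel and variable names are hypothetical), the code below uses an automatic variable and a recursive device function; each recursive call pushes a frame onto the calling thread’s stack, which is carved out of that thread’s local memory:

```
#include <cstdio>

// Each call pushes a frame onto the calling thread's stack, which is
// part of that thread's local memory.
__device__ int factorial(int n)
{
    if (n <= 1) return 1;
    return n * factorial(n - 1);  // recursion consumes stack space
}

__global__ void kernel(int* out)
{
    // Automatic variables live on the same per-thread stack (or in
    // registers, when the compiler can keep them there).
    int n = (threadIdx.x % 6) + 1;
    out[threadIdx.x] = factorial(n);
}

int main()
{
    int *d_out, h_out[32];
    cudaMalloc(&d_out, sizeof(h_out));
    kernel<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("thread 2 computed factorial(3) = %d\n", h_out[2]);
    cudaFree(d_out);
    return 0;
}
```

A device-side alloca would draw from this same per-thread stack region.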

The thread stack is part of the GPU local memory allocation, which resides in the global memory address space. The allocation can physically reside in device memory or system memory; for performance reasons, it is allocated in device memory.

The size of thread local memory is controlled automatically by the CUDA driver. If additional stack space is required (for example, for alloca or deep call chains), the per-thread size can be specified with cudaDeviceSetLimit(cudaLimitStackSize, sizeInBytesPerThread). Increasing this size can result in very large memory allocations, since the reservation is cudaDevAttrMultiProcessorCount × cudaDevAttrMaxThreadsPerMultiProcessor × (cudaLimitStackSize + driver stack size + rounding). This can amount to hundreds or thousands of MiB; the sketch below estimates it.
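A host-side sketch of that arithmetic (the 8 KiB value is chosen purely for illustration, and the driver’s own stack overhead and rounding are not queryable, so the printed figure is only a lower bound):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Raise the per-thread stack limit to 8 KiB (illustrative value).
    cudaDeviceSetLimit(cudaLimitStackSize, 8 * 1024);

    size_t stackBytes = 0;
    cudaDeviceGetLimit(&stackBytes, cudaLimitStackSize);

    int device = 0, smCount = 0, threadsPerSM = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);
    cudaDeviceGetAttribute(&threadsPerSM,
                           cudaDevAttrMaxThreadsPerMultiProcessor, device);

    // Lower bound on the local-memory reservation; the real figure adds
    // the driver's internal stack usage and rounding.
    size_t reservation = (size_t)smCount * threadsPerSM * stackBytes;
    printf("stack/thread = %zu B, SMs = %d, threads/SM = %d -> >= %.1f MiB\n",
           stackBytes, smCount, threadsPerSM,
           reservation / (1024.0 * 1024.0));
    return 0;
}
```

For example, on a hypothetical device with 132 SMs and 2048 resident threads per SM, an 8 KiB stack alone reserves at least 132 × 2048 × 8 KiB ≈ 2.1 GiB.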

A major difference between local memory and global memory is that local memory is interleaved at 4-byte granularity across the 32 lanes of a warp. This allows accesses to coalesce when all threads in a warp access the same local variable.
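A kernel sketch of what that means in practice (the names are hypothetical; note that a compiler may promote a statically indexed array into registers, so the runtime-dependent index below is what forces local memory placement):

```
__global__ void sumLocal(const int* start_idx, float* out)
{
    // Each thread gets its own copy of 'buf' in local memory. The copies
    // are interleaved at 4-byte granularity across the warp's 32 lanes,
    // so when every lane touches buf[i] for the same i, the 32 accesses
    // fall in consecutive 4-byte slots and coalesce.
    float buf[16];

    // Runtime-dependent indexing keeps 'buf' in local memory rather
    // than letting the compiler promote it to registers.
    int start = start_idx[threadIdx.x] & 15;
    for (int i = 0; i < 16; ++i)
        buf[(start + i) & 15] = threadIdx.x + i;

    float sum = 0.0f;
    for (int i = 0; i < 16; ++i)  // same i across the warp: coalesced
        sum += buf[i];

    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}
```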

QUESTIONS

  1. Where is the GPU stack physically stored?

Local memory is in the global virtual address space. It can be physically backed by device memory or system memory; in almost all cases it will be in device memory for optimal latency and throughput.

  2. Why not merge the stack and local memory? In CPU programming, the stack is used for storing intermediate variables.

They already are merged: the C/C++ stack is part of thread local memory.

  3. Which is faster, the stack or local memory?

They are the same: the stack is simply a region of local memory, so there is no performance difference between them.

  4. It seems like there’s hardly any mention of the GPU stack in the documentation. Did I miss something (CUDA C++ Programming Guide 12.6)?

Yes, you missed a fair bit of documentation throughout the CUDA C++ Programming Guide. There is also useful information in the PTX ISA 8.5 documentation.


Thank you so much for taking the time to provide such a detailed answer!
