How does the compiler lay out local variables in local memory

(I originally posted this to the wrong forum topic. I think this is a general CUDA question; please let me know if I still don’t have the right topic.)

Suppose I have a bunch of local variable declarations inside a kernel:

double a1;
double a2;

double a30;

There are enough of them that they can't all go in registers. (Aside: when we say there are N registers available per thread, or some such, what are the units of N? Bytes, 4-byte words, 8-byte words, …?) So most, if not all, of these doubles will have to go to local memory.

How do the compiler and/or runtime lay these out in local memory?

Ideally, I’d like it to lay out all of the a1’s for all of the threads next to each other, followed by all of the a2’s for all of the threads next to each other, etc. That way, when the threads all access their a1 variables, the accesses will be coalesced. (Well, assuming 8-byte accesses count as properly coalesced; or do they have to be 4-byte accesses?)

Is this what the compiler will do? Or will the runtime allocate the local memory? Or do I have to be very explicit about how to lay stuff out in local memory?



Yes, the compiler does what you are describing.

Local memory in CUDA is a logical space (just like global memory is a logical space). Both are physically backed by GPU DRAM. Registers are another type of physical resource, but they don’t represent the “backing” for variables: they are ephemeral, i.e. temporary in nature, and they don’t inherently embody any form of addressing.

Local variables are physically backed by GPU DRAM. Where a given variable “lives” at any particular instant can only be discovered by inspecting a specific piece of compiled code. It’s possible that a local variable never “touches” GPU DRAM, because the compiler had no need to make that happen. However, if you have enough local variables defined (and in use), some of them will likely be resident in GPU DRAM from time to time.
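As a concrete illustration, here is a hypothetical kernel (names and sizes are made up for this example) where the compiler generally cannot keep everything in registers: indexing a local array with a runtime value typically forces the array into the logical local space.

```
// Hypothetical example: "a" stands in for the a1..a30 declarations
// in the question. Because the final read uses a runtime index k,
// the compiler usually cannot resolve which element is needed and
// places the array in local memory rather than registers.
__global__ void local_demo(const double *in, double *out, int k)
{
    double a[30];
    int t = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < 30; ++i)
        a[i] = in[t * 30 + i];

    // Runtime index: defeats register allocation for the whole array.
    out[t] = a[k % 30];
}
```

Whether any particular variable actually ends up in DRAM is, as noted above, something you can only confirm by inspecting the generated code.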

There are caching effects here as well (just as there would be for logical global space accesses), and these caching effects vary by GPU architecture. But for this discussion we can ignore the caches.

When a local variable is resident in GPU DRAM and your code accesses it, the memory controller will generate DRAM access cycles to retrieve the data, just as you’d expect. However, the question you are raising is about the storage pattern. The compiler arranges the storage pattern (i.e. the physical addresses of variables in memory) such that adjacent threads, reading the same local variable, will access DRAM in a coalesced fashion.

When the compiler must write a local variable that is occupying a register out to GPU DRAM, this is referred to as a spill store. Likewise, when it loads a local variable from GPU DRAM into a register, this is a spill load. Spill loads/stores are something you can inspect at compile time (for example, by passing -Xptxas -v to nvcc, which reports spill stores and spill loads per kernel), so we can surmise that most of the decision making about local memory storage patterns and register usage is made at compile time. These are generally not runtime decisions or things that require runtime intervention.
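For example (assuming your kernel lives in a file called my_kernel.cu; the filename is made up here), you can ask ptxas for its verbose resource report:

```shell
# -Xptxas -v makes ptxas print, per kernel, the register count and
# the spill traffic ("... bytes spill stores, ... bytes spill loads").
nvcc -O3 -Xptxas -v -c my_kernel.cu
```

Nonzero spill numbers in that report tell you that some of your local variables are, at least some of the time, resident in the logical local space rather than in registers.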

To confirm these statements, or to inspect this behavior in detail, you would probably use the CUDA binary utilities. I’m not going to give a tutorial here, but running cuobjdump -sass my_exe will display the compiled SASS code. The SASS instruction that loads register data from the logical local space in DRAM is typically LDL, and the instruction that stores register data to the local space is typically STL.

Summary documentation can be found here; scroll down to the “Local Memory” section.