How does the compiler lay out local variables in local memory?

Suppose I have a bunch of local variable declarations inside a kernel:

double a1;
double a2;
...
double a30;
There are enough of them that they can’t all go in registers. (Aside: when we say there are N registers available per thread, or some such, what are the units of N? Is that bytes, 4-byte words, 8-byte words, …?) So, most if not all of these doubles will have to go to local memory.

How does the compiler and/or runtime lay these out in local memory?

Ideally, I’d like it to lay out all of the a1’s for all of the threads next to each other, followed by all of the a2’s for all of the threads next to each other, etc. That way, when the threads all access their a1 variables, the accesses will be coalesced. (Well, assuming 8-byte strides count as properly coalesced; or do they have to be 4-byte strides?)

Is this what the compiler will do? Or will the runtime allocate the local memory? Or do I have to be very explicit about how to lay stuff out in local memory?



(Edited: apologies: I posted this to the wrong forum. Sorry about that.)

Hi Rob,

How does the compiler and/or runtime lay these out in local memory?

See the “local memory” section in 5.3.2 of the CUDA Programming Guide (Programming Guide :: CUDA Toolkit Documentation):

Local memory accesses only occur for some automatic variables as mentioned in Variable Memory Space Specifiers. Automatic variables that the compiler is likely to place in local memory are:

  • Arrays for which it cannot determine that they are indexed with constant quantities,
  • Large structures or arrays that would consume too much register space,
  • Any variable if the kernel uses more registers than available (this is also known as register spilling).
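As an illustration of the first bullet, here is a hypothetical kernel (not from the guide) with an array indexed by a runtime-dependent value, which is a typical candidate for local memory:

```cuda
__global__ void demo(const int *idx, double *out, int n)
{
    // The compiler cannot prove this array is indexed with constant
    // quantities, so it is likely to place it in .local memory rather
    // than registers.
    double scratch[32];
    for (int i = 0; i < 32; ++i)
        scratch[i] = i * 0.5;

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = scratch[idx[tid] & 31];  // non-constant index
}
```

If the index were a compile-time constant (say, `scratch[3]`), the compiler could promote the array elements to registers instead.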

Inspection of the PTX assembly code (obtained by compiling with the -ptx or -keep option) will tell if a variable has been placed in local memory during the first compilation phases, as it will be declared using the .local mnemonic and accessed using the ld.local and st.local mnemonics. Even if it has not, subsequent compilation phases might still decide otherwise if they find it consumes too much register space for the targeted architecture: inspection of the cubin object using cuobjdump will tell if this is the case. Also, the compiler reports total local memory usage per kernel (lmem) when compiling with the --ptxas-options=-v option. Note that some mathematical functions have implementation paths that might access local memory.
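For example, the inspection workflow described above might look like this (assuming the kernel lives in a file called kernel.cu; the flag spellings are as in the quoted text):

```shell
# First compilation phase: emit PTX and look for local-memory traffic.
nvcc -ptx kernel.cu -o kernel.ptx
grep -E '\.local|ld\.local|st\.local' kernel.ptx

# Final phase: per-kernel resource usage, including lmem, from ptxas.
nvcc --ptxas-options=-v -c kernel.cu

# Or inspect the generated machine code in the cubin directly.
cuobjdump -sass kernel.o
```

If the grep turns up nothing but ptxas still reports nonzero lmem, the spill decision was made in the later compilation phase, as the quoted text notes.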

The local memory space resides in device memory, so local memory accesses have the same high latency and low bandwidth as global memory accesses and are subject to the same requirements for memory coalescing as described in Device Memory Accesses. Local memory is however organized such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (e.g., same index in an array variable, same member in a structure variable).