How does the compiler lay out local variables in local memory?

rknop · April 29, 2021, 5:57pm

Suppose I have a bunch of local variable declarations inside a kernel:

double a1;
double a2;
…
double a30;

There are enough of them that they can’t all go in registers. (Aside: when we say there are N registers available per thread, or some such, what are the units of N? Is that bytes, 4-byte words, 8-byte words…?) So most, if not all, of these doubles will have to go to local memory.

How does the compiler and/or runtime lay these out in local memory?

Ideally, I’d like it to lay out all of the a1’s for all of the threads next to each other, followed by all of the a2’s for all of the threads next to each other, etc. That way, when the threads all access their a1 variables, the accesses will be coalesced. (Well, if 8-byte offsets count as properly coalesced; or do they have to be 4-byte offsets?)

Is this what the compiler will do? Or will the runtime allocate the local memory? Or do I have to be very explicit about how to lay stuff out in local memory?

Thanks,

-Rob

(Edited: apologies: I posted this to the wrong forum. Sorry about that.)

MatColgrove · April 30, 2021, 3:00pm

Hi Rob,

How does the compiler and/or runtime lay these out in local memory?

See the “Local Memory” section in 5.3.2 of the CUDA C++ Programming Guide:

Local memory accesses only occur for some automatic variables as mentioned in Variable Memory Space Specifiers. Automatic variables that the compiler is likely to place in local memory are:

Arrays for which it cannot determine that they are indexed with constant quantities,

Large structures or arrays that would consume too much register space,

Any variable if the kernel uses more registers than available (this is also known as register spilling).

Inspection of the PTX assembly code (obtained by compiling with the -ptx or -keep option) will tell if a variable has been placed in local memory during the first compilation phases, as it will be declared using the .local mnemonic and accessed using the ld.local and st.local mnemonics. Even if it has not, subsequent compilation phases might still decide otherwise if they find it consumes too much register space for the targeted architecture: inspection of the cubin object using cuobjdump will tell if this is the case. Also, the compiler reports total local memory usage per kernel (lmem) when compiling with the --ptxas-options=-v option. Note that some mathematical functions have implementation paths that might access local memory.

The local memory space resides in device memory, so local memory accesses have the same high latency and low bandwidth as global memory accesses and are subject to the same requirements for memory coalescing as described in Device Memory Accesses. Local memory is however organized such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (e.g., same index in an array variable, same member in a structure variable).
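To make that last paragraph concrete, here is a minimal host-side C++ sketch of the interleaving it describes. It is an illustrative model only, not the documented hardware mapping: the offset formula, the idea of a per-warp slice, the 128-byte segment size, and the names local_byte_offset and segments_touched are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// ASSUMPTION: illustrative model only. The guide states that consecutive
// 32-bit words are accessed by consecutive thread IDs; the exact address
// mapping used by the hardware is not given in the quoted text.
const std::size_t kWarpSize = 32;      // threads per warp
const std::size_t kWordBytes = 4;      // interleaving granularity: one 32-bit word
const std::size_t kSegmentBytes = 128; // assumed size of one coalesced segment

// Byte offset, within one warp's slice of local memory, of 32-bit word
// number `word` of a thread's local data, for the thread at lane `lane`.
std::size_t local_byte_offset(std::size_t word, std::size_t lane) {
    assert(lane < kWarpSize);
    return (word * kWarpSize + lane) * kWordBytes;
}

// Number of distinct segments touched when every lane in the warp accesses
// `words_per_value` consecutive words starting at word index `first_word`
// (the "same relative address" case from the guide).
std::size_t segments_touched(std::size_t first_word, std::size_t words_per_value) {
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        for (std::size_t w = 0; w < words_per_value; ++w) {
            segments.insert(local_byte_offset(first_word + w, lane) / kSegmentBytes);
        }
    }
    return segments.size();
}
```

In this model, segments_touched(0, 2) is 2. A double such as a1 occupies two 32-bit words: the low halves for all 32 lanes fill one 128-byte segment and the high halves fill the next, so the warp moves 256 bytes with none wasted. That is the sense in which an 8-byte variable can still be fully coalesced under 32-bit interleaving, provided every thread accesses the same variable.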
