Yes, the compiler does what you are describing.
Local memory in CUDA is a logical space (just like global memory is a logical space). Both are physically backed by GPU DRAM. Registers are another type of physical resource, but they do not represent the "backing" for variables: they are ephemeral (temporary in nature), and they are not addressable.
Local variables are physically backed by GPU DRAM. Where a given variable "lives" at any particular instant can only be discovered by inspecting a specific piece of compiled code. It's possible that a local variable never touches GPU DRAM, because the compiler had no need to make that happen. However, if you have enough local variables defined (and in use), some of them will likely be resident in GPU DRAM from time to time.
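As a sketch of the kind of code where this comes up (a hypothetical kernel, not from your program): a large per-thread array with runtime-computed indexing typically cannot be promoted to registers, so the compiler places it in the logical local space.

```cuda
__global__ void local_demo(int *out, int idx)
{
    // A per-thread array this large, combined with dynamic indexing
    // below, typically cannot be held entirely in registers, so the
    // compiler places it in the logical local space (backed by DRAM).
    int scratch[256];

    for (int i = 0; i < 256; i++)
        scratch[i] = i * threadIdx.x;

    // Dynamic (runtime-computed) indexing generally prevents the
    // compiler from promoting the array to registers.
    out[threadIdx.x] = scratch[idx % 256];
}
```

Whether any given element of `scratch` actually reaches DRAM at runtime still depends on caching and on what the compiler was able to optimize.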
There are caching effects here as well (just as there would be for logical global space accesses), and these caching effects vary by GPU architecture. But for this discussion we can ignore the caches.
When a local variable is resident in GPU DRAM and your code accesses it, the memory controller will generate DRAM access cycles to retrieve the data, just as you'd expect. However, the question you are raising is about the storage pattern. The compiler arranges the storage pattern (i.e. the physical addresses of variables in memory) such that adjacent threads reading the same local variable access DRAM in a coalesced fashion: the per-thread copies of a given variable are interleaved so that a warp's accesses fall on adjacent addresses.
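To illustrate the idea (the exact address mapping is an implementation detail; the comments below are a conceptual sketch, not the hardware formula):

```cuda
__global__ void coalesced_local(float *out)
{
    // Suppose each thread has a local variable x that the compiler
    // has placed in the local space. Conceptually the storage is
    // interleaved per-thread, something like:
    //
    //   addr(x, thread t) = base + t * sizeof(float)
    //
    // so when all 32 threads of a warp access x at the same time,
    // they touch 32 consecutive words of DRAM: a coalesced access,
    // just like a well-ordered global memory access pattern.
    float x = threadIdx.x * 2.0f;  // may live in a register or in local memory
    out[threadIdx.x] = x;
}
```

This is why "thread-local" storage does not imply scattered, uncoalesced DRAM traffic.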
When the compiler must write a local variable that is occupying a register out to GPU DRAM, this is referred to as a spill store. Likewise, when the compiler loads a local variable from GPU DRAM into a register, this is a spill load. Spill loads/stores are something you can inspect at compile time (by passing the right arguments to nvcc), so we can surmise that most of the decision making about local memory storage patterns and register usage is done at compile time. These are generally not runtime decisions, nor do they require runtime intervention.
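For example (`my_code.cu` and `my_exe` are hypothetical names), passing `-Xptxas -v` to nvcc makes ptxas print per-kernel resource usage, including spill activity, at compile time:

```shell
# -Xptxas -v forwards the verbose flag to ptxas, which reports
# register usage and spill loads/stores for each kernel:
nvcc -Xptxas -v -o my_exe my_code.cu
# Illustrative output (actual numbers depend on your code and GPU arch):
#   ptxas info : Used 40 registers, 168 bytes spill stores,
#                168 bytes spill loads ...
```

Nonzero spill loads/stores in this report tell you, before the program ever runs, that some register-resident local variables will be staged through the local space.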
To confirm these statements, or inspect this behavior in detail, you would probably use the CUDA binary utilities. I'm not going to give a tutorial here, but running

cuobjdump -sass my_exe

will display the compiled SASS code. The SASS instruction to load register data from the logical local space in DRAM is typically LDL, and the instruction to store register data to the local space is typically STL.
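If you just want to see whether (and where) local-space traffic appears, you can filter the SASS dump (again assuming `my_exe` is your compiled CUDA executable):

```shell
# Show only the local-space load/store instructions in the SASS:
cuobjdump -sass my_exe | grep -E "LDL|STL"
```

No output from this pipeline suggests the compiler kept everything in registers for the kernels in that binary.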
Summary documentation can be found here; scroll down to the "Local Memory" section.