Local memory layout and 32-bit words

Consider the example in the following forum post: https://forums.developer.nvidia.com/t/how-does-the-compiler-lay-out-local-variables-in-local-memory/176706. For simplicity, assume that all the doubles are stored in local memory for each thread.
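For concreteness, here is a minimal sketch of the kind of kernel being discussed (the actual code in the linked post may differ; the names a1..a4 and the assumption that they end up in local memory rather than registers are mine, for illustration only):

```
// Hypothetical sketch of the kind of kernel discussed in the linked post.
// Assume the compiler places a1..a4 in local memory for each thread.
__global__ void kernel(const double *in, double *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    double a1 = in[4 * tid + 0];
    double a2 = in[4 * tid + 1];
    double a3 = in[4 * tid + 2];
    double a4 = in[4 * tid + 3];

    out[tid] = a1 + a2 + a3 + a4;
}
```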

The following sentences are found in chapter 5 of the CUDA programming guide:

…local memory accesses… are subject to the same requirements for memory coalescing as described in Device Memory Accesses. Local memory is however organized such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (e.g., same index in an array variable, same member in a structure variable).

According to Robert Crovella (see the linked post), the compiler lays out all of the a1s for different threads next to each other in local memory, all of the a2s next to each other, etc. My question is this: What does

Local memory is however organized such that consecutive 32-bit words are accessed by consecutive thread IDs.

mean? Doubles are 64 bits. So, in the case of this example, we might instead say that consecutive 64-bit words are accessed by consecutive thread IDs. That seems to conflict with what is in the guide.

I looked in the linked post and didn’t see any comments from me there.

Sorry, that was the wrong link. It has now been corrected.

It’s certainly possible my statements are incorrect.

If I built a proper test case and observed LDL.64 instructions in the SASS code, my opinion would be that my statements are probably correct, although I acknowledge this probably isn’t 100% conclusive if we imagine lots of additional things going on under the hood.
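As an illustration of that kind of test (my own sketch, not something taken from this thread): dynamically indexing a per-thread array typically forces it into local memory, and the resulting SASS can then be inspected for LDL/LDL.64 instructions with cuobjdump.

```
// Sketch of a test case for inspecting local-memory load widths.
// The dynamic index usually forces 'a' into local memory, so the SASS
// should contain STL stores and an LDL load whose width can be checked.
__global__ void local_test(const int *idx, double *out)
{
    double a[8];                                // expected to live in local memory
    for (int i = 0; i < 8; ++i)
        a[i] = i * 1.5 + threadIdx.x;           // STL stores
    out[threadIdx.x] = a[idx[threadIdx.x] & 7]; // dynamic index -> LDL load
}

// Compile and dump the SASS, e.g. (file names are placeholders):
//   nvcc -arch=sm_70 -cubin -o local_test.cubin local_test.cu
//   cuobjdump -sass local_test.cubin | grep LDL
```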

I think for this particular example, whether the compiler lays out 32-bit words in an adjacent fashion or 64-bit words in an adjacent fashion would make no meaningful difference that I can think of. Let’s do a thought experiment. Suppose that the compiler breaks out those 64-bit quantities into 32-bit chunks, and then arranges the chunks in memory so that adjacent threads read adjacent 32-bit chunks. This is my attempt to conform almost exactly to your doc excerpt. In that case a 64-bit load across the threads would (somehow) be broken into a 32-bit warp-wide load, coalesced, followed by another 32-bit warp-wide load, coalesced. The end result would be 64 bits loaded per thread, pretty much optimally. The operation is broken into 2 transactions, each 128 bytes in length.
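Here is a toy address model of that first layout (the base address and the per-warp stride are assumptions chosen purely for illustration, not the actual formula the compiler or hardware uses):

```
#include <cstdio>

// Model: each 64-bit value is split into two 32-bit chunks, and consecutive
// threads own consecutive 32-bit words. 'base' and the warp size of 32 are
// assumptions for illustration only.
int main()
{
    const unsigned base = 0, warpSize = 32;
    for (unsigned half = 0; half < 2; ++half) {           // low/high 32-bit chunk
        unsigned lo = base + (half * warpSize + 0) * 4;    // thread 0's word
        unsigned hi = base + (half * warpSize + 31) * 4;   // thread 31's word
        // Each warp-wide 32-bit access touches 32 * 4 = 128 contiguous bytes.
        printf("chunk %u: bytes %u..%u (one 128-byte transaction)\n",
               half, lo, hi + 3);
    }
    return 0;
}
```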

The other possibility (I can imagine) is that 64-bit quantities are arranged in an adjacent fashion, and in that case I would assume the load mechanics would be exactly the same as global load mechanics. That means loading a 64-bit quantity per thread would result in 2 transactions: one of 128 bytes loading the first 16 threads’ quantities in their entirety, followed by a second transaction feeding the other 16 threads. Each of these transactions would be 128 bytes, fully coalesced.
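And a corresponding toy model of the second layout (again with an arbitrary base address chosen only for illustration), showing that a warp-wide 64-bit load spans 256 bytes and therefore splits into two 128-byte transactions:

```
#include <cstdio>

// Model: each thread's 64-bit value sits at base + tid * 8. A warp-wide
// 64-bit load then spans 32 * 8 = 256 bytes, i.e. two 128-byte transactions:
// threads 0..15 in the first, threads 16..31 in the second.
int main()
{
    const unsigned base = 0;
    for (unsigned tid = 0; tid < 32; ++tid) {
        unsigned addr = base + tid * 8;
        printf("tid %2u -> bytes %3u..%3u (transaction %u)\n",
               tid, addr, addr + 7, addr / 128);
    }
    return 0;
}
```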

Yes, the underlying storage pattern would be different. But for most typical usage that I can think of, the difference would be imperceptible from a programmer’s perspective and/or from a performance perspective.

If you would like to see an improvement in the CUDA docs, I suggest filing a bug.

I believe the docs are trying to communicate that the local storage pattern will place a given logical local item at consecutive addresses for consecutive threads, to promote coalescing. That is what I was trying to communicate, or should have tried to communicate.

