Consider the example in the following forum post: https://forums.developer.nvidia.com/t/how-does-the-compiler-lay-out-local-variables-in-local-memory/176706. For simplicity, assume that each thread's doubles are stored in local memory.
The following sentences are found in chapter 5 of the CUDA programming guide:
…local memory accesses… are subject to the same requirements for memory coalescing as described in Device Memory Accesses. Local memory is however organized such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (e.g., same index in an array variable, same member in a structure variable).
According to Robert Crovella (see the linked post), the compiler lays out all of the a1 variables for the different threads next to each other in local memory, then all of the a2 variables next to each other, and so on. My question is this: what does
Local memory is however organized such that consecutive 32-bit words are accessed by consecutive thread IDs.
mean? A double is 64 bits, i.e., two 32-bit words. So in this example we might instead expect that consecutive 64-bit words are accessed by consecutive thread IDs, which seems to conflict with the guide's statement about 32-bit words.