Suppose I have a bunch of local variable declarations inside a kernel:
There are enough of them that they can’t all fit in registers. (Aside: when we say there are N registers available per thread, what are the units of N? Bytes, 4-byte words, 8-byte words…?) So most, if not all, of these doubles will have to go to local memory.
How do the compiler and/or runtime lay these out in local memory?
Ideally, I’d like it to lay out the a1’s for all of the threads next to each other, followed by the a2’s for all of the threads next to each other, and so on. That way, when the threads all access their a1 variables at the same time, the accesses will be coalesced. (Well, if 8-byte offsets between consecutive threads count as properly coalesced; or do they have to be 4-byte offsets?)
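To make the layout I have in mind concrete, here is a small host-side sketch of the address arithmetic. It only describes the arrangement I am hoping for, not what the compiler or runtime actually does. The function names, the variable index `var` (0 for a1, 1 for a2, ...) and the thread count are all made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical "interleaved" layout (the one I would like): all threads'
// copies of a1 are stored contiguously, then all copies of a2, and so on.
// Returns the byte offset of variable `var` for thread `tid`.
constexpr std::size_t interleaved_offset(std::size_t var, std::size_t tid,
                                         std::size_t num_threads) {
    return (var * num_threads + tid) * sizeof(double);
}

// Hypothetical "per-thread" layout, for comparison: each thread's variables
// are stored together, so neighbouring threads' a1's are far apart.
constexpr std::size_t per_thread_offset(std::size_t var, std::size_t tid,
                                        std::size_t num_vars) {
    return (tid * num_vars + var) * sizeof(double);
}

// Interleaved: consecutive threads' a1's are sizeof(double) bytes apart.
static_assert(interleaved_offset(0, 1, 32) - interleaved_offset(0, 0, 32)
                  == sizeof(double), "adjacent threads should be adjacent");

// Per-thread with 40 variables: consecutive threads' a1's are
// 40 * sizeof(double) bytes apart, so the accesses are strided.
static_assert(per_thread_offset(0, 1, 40) - per_thread_offset(0, 0, 40)
                  == 40 * sizeof(double), "adjacent threads are strided");
```

With the interleaved arrangement, 32 threads reading a1 together would touch one contiguous run of 32 * sizeof(double) bytes. With the per-thread arrangement, the same read would be spread across 32 separate locations.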
Is this what the compiler will do? Or will the runtime allocate the local memory? Or do I have to be very explicit about how to lay stuff out in local memory?
(Edit: apologies, I posted this to the wrong forum. Sorry about that.)