The CUDA Programming Guide (version 2.1) states:
“Local memory accesses are always coalesced though since they are per-thread by definition.”
Is this true even when the local variable is an array (i.e. one array per thread)? What if the index into this local array is a runtime variable, so that the compiler cannot know a priori how to interleave the array to ensure coalescing?
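To make the question concrete, here is a minimal sketch of the kind of kernel I have in mind. The names (`localArrayKernel`, `scratch`, `indices`, `runLocalArrayDemo`) and the array size of 16 are made up for illustration. I am also assuming the compiler places `scratch` in local memory because it is indexed with a runtime value; that is up to the compiler and may not hold.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread owns a private array. Because it is
// indexed with a value known only at run time, the compiler may place it
// in local memory (an assumption, not something this code guarantees).
__global__ void localArrayKernel(const int *indices, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float scratch[16];                              // one array per thread
    for (int i = 0; i < 16; ++i)
        scratch[i] = static_cast<float>(tid + i);   // same index i in every thread

    int k = indices[tid] & 15;   // data-dependent index, unknown at compile time
    out[tid] = scratch[k];       // threads of a warp may read different elements
}

// Hypothetical host helper: copies the indices to the device, launches the
// kernel, and returns one result per thread. Error checking is omitted.
std::vector<float> runLocalArrayDemo(const std::vector<int> &hostIndices)
{
    int n = static_cast<int>(hostIndices.size());
    std::vector<float> hostOut(n, 0.0f);
    if (n == 0) return hostOut;

    int *dIndices = nullptr;
    float *dOut = nullptr;
    cudaMalloc(&dIndices, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIndices, hostIndices.data(), n * sizeof(int),
               cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    localArrayKernel<<<blocks, threads>>>(dIndices, dOut, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hostOut.data(), dOut, n * sizeof(float),
               cudaMemcpyDeviceToHost);

    cudaFree(dIndices);
    cudaFree(dOut);
    return hostOut;
}
```

In the initialization loop every thread uses the same index `i` at each step, whereas in the final read `k` can differ from thread to thread within a warp. The second case is the one I am unsure about.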
Thanks in advance for any clarification.