Because the compiler ‘knows’ how to lay out the data so that reads are coalesced. If all threads are accessing element 0 of their local arrays, they are in fact accessing element [0 + threadIdx] of a single interleaved array in global memory.
I think that (1) is the answer - I see a lot more reads from gmem in the slower case - but that makes me question why, as the code is identical.
The only difference is the way that the pointers are assigned. In one case they are assigned from global memory I allocated myself, and in the other case from local variables. The local variables are definitely in local/global memory, as the cubin file shows this.
I happened upon this thread and simply wonder how we can declare a pointer that points to local memory. I notice that the programming guide says it is the compiler’s duty to decide whether a certain variable is placed in local memory or in a register. So, how can we explicitly allocate local memory?
Turns out that NVIDIA have confirmed that local memory is auto-coalesced. I changed my global memory version of the code so that the arrays were interleaved, and got the same or better performance than when the arrays were placed in local memory. Shame that the CUDA compiler seems to limit locals to 16k, as it would be very, very handy not to have to explicitly code the interleaving.
I don’t think it is documented in the release notes - is it? I thought it was, but maybe it was passed on to me by an NVIDIA contact we’re working with. All I can see in the release notes is a comment that the profiler doesn’t disable incompatible features, so results look the same as when you have 0 non-coalesced accesses - not good, methinks. Anyway, this is such a fundamental part of CUDA development that it either needs to ‘slap you in the face’ or be fixed, but it definitely should not silently fail.
All I can say is that I saw it today in the Visual Profiler under Linux (the version released with the 2.1 beta). But I must say both my cards are compute 1.3, so maybe it only disables it when all cards fall into this category.
I am still very confused about the comment in the programming guide that “local memory accesses are always coalesced”. In this thread it was hypothesized that local memory arrays are interleaved in device memory, and this automatically gives coalescence. But that would only give coalescence if all threads were accessing the same index of their respective local memory arrays, and the programming guide doesn’t add that stipulation. Is this an omission in the programming guide, or is there some other resolution here?
I think you’re right. For ‘plain’ variables (not arrays) it makes sense that they would be automatically interleaved and always coalesced. But with arrays, I don’t see how it could always be coalesced when the indices can be variables.