Local faster than global. Why?

If I change my kernel to use an array in local memory rather than global memory, I see a big increase in performance and a big decrease in global loads and stores.

What I am wondering is why? Everything else in my code is the same apart from assigning the pointers to either a local array or a global array.

I know from the cubin output that the local automatic array is definitely in local memory.

The manual says that:

'As accesses are, by definition, per-thread, they are automatically coalesced'

Does this explain what I’m seeing? Perhaps I’m just dim but I don’t understand why they would automatically be coalesced.

thanks
Mark

Because the compiler ‘knows’ how to lay out the data to make sure accesses are coalesced. If all threads are accessing element 0 of their array, they are in fact accessing element [0 + threadIdx] of an interleaved global array, so consecutive threads hit consecutive addresses.
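A minimal sketch of that idea, assuming one possible interleaved layout (the names and the exact stride are illustrative, not what the compiler literally emits):

// Illustrative sketch only: mimicking an interleaved per-thread layout by hand.
// "backing" holds numElems entries per thread, interleaved across all threads.
__global__ void interleavedSketch(const float* backing, float* out, int numElems)
{
    int tid        = blockIdx.x * blockDim.x + threadIdx.x;
    int numThreads = gridDim.x * blockDim.x;

    float acc = 0.0f;
    for (int i = 0; i < numElems; ++i) {
        // "Element i of this thread's array" is backing[i * numThreads + tid].
        // For a fixed i, adjacent threads read adjacent words -> coalesced.
        acc += backing[i * numThreads + tid];
    }
    out[tid] = acc;
}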

I see two reasons that could explain why your code is faster with local memory:

1/ You have non-coalesced accesses in your global memory code. You can check that with the profiler.

2/ Some of your local memory variables are put into registers, and in this case it’s way faster than a global/local memory access.
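As a rough illustration of point 2 (hypothetical names, and the exact behaviour depends on the compiler): an array indexed only with compile-time constants can be promoted to registers, whereas indexing it with a runtime value generally forces it into local memory.

// Sketch: constant vs. dynamic indexing of a per-thread array.
__global__ void registersVsLocal(float* out, int k)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float a[4];                 // constant indices only: likely promoted to registers
    a[0] = 1.0f; a[1] = 2.0f; a[2] = 3.0f; a[3] = 4.0f;

    float b[4];                 // indexed with the runtime value k below: usually spilled to local memory
    for (int i = 0; i < 4; ++i)
        b[i] = a[i] * 2.0f;

    out[tid] = a[3] + b[k & 3]; // k is not known at compile time, so b[] needs addressable storage
}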

I think that (1) is the answer - I see a lot more reads from gmem in the slower case - but that makes me question why, since the code is identical.

The only difference is the way that the pointers are assigned. In one case they point into global memory allocated by me, and in the other case into the local arrays. The local arrays are definitely in local memory (i.e. device memory), as the cubin file shows.

// slow: pointers into global memory that I allocated
//TComplexReal* termNodes = termNodes_start + ((numT+1) * ref);
//TComplexReal* probNodes = probNodes_start + ((numT+1) * ref);

// fast: pointers into per-thread automatic (local) arrays
TComplexReal termNodesa[256];
TComplexReal probNodesa[256];
TComplexReal* termNodes = &termNodesa[0];
TComplexReal* probNodes = &probNodesa[0];

So, given that the code is the same - bar the bit above - and that the arrays are put into local memory - which physically resides in the same device memory as global memory - why does the local version go almost 6 times faster?

Local memory is auto-coalesced (interleaved per thread), I think

I think that you may well be right.

Hello!

I happened upon this thread and simply wonder how we can declare a pointer that points to local memory. I notice that the programming guide says it is the compiler’s duty to decide whether a certain variable is placed in local memory or in registers. So, how can we explicitly allocate local memory?

Thanks!

I don’t think that you can, but you can check where a variable has been placed by looking at the cubin file.

It turns out that NVIDIA have confirmed that local memory is auto-coalesced. I changed my global memory version of the code so that the arrays were interleaved and got the same or better performance than when the arrays were placed in local memory. It's a shame that the CUDA compiler seems to limit locals to 16k, as it would be very, very handy not to have to code the interleaving explicitly.
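For anyone wanting to do the same, here is a rough sketch of the manual interleaving, with assumed names and a placeholder TComplexReal definition (the real type, sizes, and work come from the code earlier in the thread):

typedef float2 TComplexReal;   // placeholder definition for the sketch

// One buffer of (numT+1) * numThreads elements replaces the per-thread
// termNodesa[256] local arrays; element i of thread "tid" lives at
// termNodes_g[i * numThreads + tid], so a warp reading the same i coalesces.
__global__ void interleavedTermNodes(TComplexReal* termNodes_g, int numT)
{
    int tid        = blockIdx.x * blockDim.x + threadIdx.x;
    int numThreads = gridDim.x * blockDim.x;

    for (int i = 0; i <= numT; ++i) {
        TComplexReal v = termNodes_g[i * numThreads + tid];
        v.x *= 2.0f;   // placeholder work
        v.y *= 2.0f;
        termNodes_g[i * numThreads + tid] = v;
    }
}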

One thing, which is not widely publicised (!), is that the profiler does not record non-coalesced memory accesses on the newer hardware e.g. GTX280 or C1060.

regards

Mark

When selecting signals in later versions of the profiler, the non-coalesced items are grayed out when running on this hardware. It has also been in the profiler's release notes for a long time.

The fact that people don’t read release notes (me included) cannot be held against NVIDIA.

I don’t think it is documented in the release notes - is it? I thought it was, but maybe it was passed on to me by an NVIDIA contact we’re working with. All I can see in the release notes is a comment that the profiler doesn’t disable incompatible features, so the results look the same as when you have 0 non-coalesced accesses; not good, methinks. Anyway, this is such a fundamental part of CUDA development that it either needs to ‘slap you in the face’ or be fixed, but it definitely should not fail silently.

All I can say is that I saw it today in the Visual Profiler under Linux (the version released with the 2.1 beta). But I must say both my cards are compute capability 1.3, so maybe it only disables them when all cards fall into that category.

Yes, I observed the same. I don’t understand what mistake NVIDIA made in implementing local memory such that reimplementing it manually through global memory gives better performance.

P.S. Hiding a comment in the Release Notes but not mentioning this huge fault anywhere else (esp the Guide) certainly counts as ‘not widely publicised’! NVIDIA seems to like doing that a lot.

I am still very confused about the comment in the programming guide that “local memory accesses are always coalesced”. In this thread it was hypothesized that local memory arrays are interleaved in device memory and that this automatically gives coalescence. But that would only give coalescence if all threads were accessing the same index of their respective local memory arrays, and the programming guide doesn’t add that stipulation. Is this an omission in the programming guide, or is there some other resolution here?

I think you’re right. For ‘plain’ variables (not arrays) it makes sense that they would be automatically interleaved and always coalesced. But with arrays, I don’t see how the accesses could always be coalesced when the indices can vary per thread.
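To make that concrete, a sketch under the assumption that a local array really is interleaved (element i of thread t at physical offset i * numThreads + t):

__global__ void localIndexing(float* out, const int* idx)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float a[64];                       // large / dynamically indexed: likely placed in local memory
    for (int i = 0; i < 64; ++i)
        a[i] = (float)i;

    // Same index for every thread: under the interleaved layout the warp's
    // physical addresses are consecutive, so the access coalesces.
    float same = a[7];

    // Per-thread index read from memory: each thread can touch a different "row"
    // of the interleaved layout, so the addresses scatter and coalescing is lost.
    float varying = a[idx[tid] & 63];

    out[tid] = same + varying;
}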