Local faster than global. Why?

If I change my kernel to use an array in local memory rather than global memory, I see a big increase in performance and a big decrease in global loads and stores.

What I am wondering is why? Everything else in my code is the same apart from assigning the pointers to either a local array or a global array.

I know from the cubin output that the local automatic array is definitely in local memory.

The manual says that:

'As accesses are, by definition, per-thread, they are automatically coalesced'

Does this explain what I’m seeing? Perhaps I’m just dim but I don’t understand why they would automatically be coalesced.

thanks
Mark

Because the compiler ‘knows’ how to lay out the data to make sure accesses are coalesced. If all threads are accessing element 0 of their array, they are in fact accessing element [0 + threadIdx] of an interleaved global array, so consecutive threads hit consecutive addresses.
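A minimal sketch of that idea, assuming one possible interleaved layout (the names and the exact stride are illustrative, not what the compiler literally emits):

// Illustrative sketch only: mimicking an interleaved per-thread layout by hand.
// "backing" holds numElems entries per thread, interleaved across all threads.
__global__ void interleavedSketch(const float* backing, float* out, int numElems)
{
    int tid        = blockIdx.x * blockDim.x + threadIdx.x;
    int numThreads = gridDim.x * blockDim.x;

    float acc = 0.0f;
    for (int i = 0; i < numElems; ++i) {
        // "Element i of this thread's array" is backing[i * numThreads + tid].
        // For a fixed i, adjacent threads read adjacent words -> coalesced.
        acc += backing[i * numThreads + tid];
    }
    out[tid] = acc;
}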

I see two reasons that could explain why your code is faster with local memory:

1/ You have non-coalesced accesses in your global memory code. You can check that with the profiler.

2/ Some of your local memory variables are put into registers, and in this case it’s way faster than a global/local memory access.
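As a rough illustration of point 2 (hypothetical names, and the exact behaviour depends on the compiler): an array indexed only with compile-time constants can be promoted to registers, whereas indexing it with a runtime value generally forces it into local memory.

// Sketch: constant vs. dynamic indexing of a per-thread array.
__global__ void registersVsLocal(float* out, int k)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float a[4];                 // constant indices only: likely promoted to registers
    a[0] = 1.0f; a[1] = 2.0f; a[2] = 3.0f; a[3] = 4.0f;

    float b[4];                 // indexed with the runtime value k below: usually spilled to local memory
    for (int i = 0; i < 4; ++i)
        b[i] = a[i] * 2.0f;

    out[tid] = a[3] + b[k & 3]; // k is not known at compile time, so b[] needs addressable storage
}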

I think that (1) is the answer - I see a lot more reads from gmem in the slower case - but that makes me question why, since the code is identical.

The only difference is the way that the pointers are assigned. In one case they point into global memory allocated by me, and in the other case into the local arrays. The local arrays are definitely in local memory (i.e. device memory), as the cubin file shows.

// slow: pointers into global memory that I allocated
//TComplexReal* termNodes = termNodes_start + ((numT+1) * ref);
//TComplexReal* probNodes = probNodes_start + ((numT+1) * ref);

// fast: pointers into per-thread automatic (local) arrays
TComplexReal termNodesa[256];
TComplexReal probNodesa[256];
TComplexReal* termNodes = &termNodesa[0];
TComplexReal* probNodes = &probNodesa[0];

So, given that the code is the same - bar the bit above - and that the arrays are put into local memory - which physically resides in the same device memory as global memory - why does the local version go almost 6 times faster?

Local memory is auto-coalesced (interleaved per thread), I think

I think that you may well be right.

Hello!

I happened upon this thread and simply wonder how we can declare a pointer that points to local memory. I notice that the programming guide says it is the compiler’s duty to decide whether a certain variable is placed in local memory or in registers. So, how can we explicitly allocate local memory?

Thanks!

I don’t think that you can, but you can check where a variable has been placed by looking at the cubin file.

It turns out that NVIDIA have confirmed that local memory is auto-coalesced. I changed my global memory version of the code so that the arrays were interleaved and got the same or better performance than when the arrays were placed in local memory. It's a shame that the CUDA compiler seems to limit locals to 16k, as it would be very, very handy not to have to code the interleaving explicitly.
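For anyone wanting to do the same, here is a rough sketch of the manual interleaving, with assumed names and a placeholder TComplexReal definition (the real type, sizes, and work come from the code earlier in the thread):

typedef float2 TComplexReal;   // placeholder definition for the sketch

// One buffer of (numT+1) * numThreads elements replaces the per-thread
// termNodesa[256] local arrays; element i of thread "tid" lives at
// termNodes_g[i * numThreads + tid], so a warp reading the same i coalesces.
__global__ void interleavedTermNodes(TComplexReal* termNodes_g, int numT)
{
    int tid        = blockIdx.x * blockDim.x + threadIdx.x;
    int numThreads = gridDim.x * blockDim.x;

    for (int i = 0; i <= numT; ++i) {
        TComplexReal v = termNodes_g[i * numThreads + tid];
        v.x *= 2.0f;   // placeholder work
        v.y *= 2.0f;
        termNodes_g[i * numThreads + tid] = v;
    }
}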

One thing, which is not widely publicised (!), is that the profiler does not record non-coalesced memory accesses on the newer hardware e.g. GTX280 or C1060.

regards

Mark

When selecting signals in later versions of the profiler, the non-coalesced items are grayed out when running on this hardware. It has also been in the profiler's release notes for a long time.

The fact that people don’t read release notes (me included) cannot be held against NVIDIA.

I don’t think it is documented in the release notes - is it? I thought it was, but maybe it was passed on to me by an NVIDIA contact we’re working with. All I can see in the release notes is a comment that the profiler doesn’t disable incompatible features, so the results look the same as when you have 0 non-coalesced accesses; not good, methinks. Anyway, this is such a fundamental part of CUDA development that it either needs to ‘slap you in the face’ or be fixed, but it definitely should not fail silently.

All I can say is that I saw it today in the Visual Profiler under Linux (the version released with the 2.1 beta). But I must say both my cards are compute capability 1.3, so maybe it only disables them when all cards fall into that category.

Yes, I observed the same. I don’t understand what mistake NVIDIA made in implementing local memory such that reimplementing it manually through global memory gives better performance.

P.S. Hiding a comment in the Release Notes but not mentioning this huge fault anywhere else (esp the Guide) certainly counts as ‘not widely publicised’! NVIDIA seems to like doing that a lot.

I am still very confused about the comment in the programming guide that “local memory accesses are always coalesced”. In this thread it was hypothesized that local memory arrays are interleaved in device memory and that this automatically gives coalescence. But that would only give coalescence if all threads were accessing the same index of their respective local memory arrays, and the programming guide doesn’t add that stipulation. Is this an omission in the programming guide, or is there some other resolution here?

I think you’re right. For ‘plain’ variables (not arrays) it makes sense that they would be automatically interleaved and always coalesced. But with arrays, I don’t see how the accesses could always be coalesced when the indices can vary per thread.
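To make that concrete, a sketch under the assumption that a local array really is interleaved (element i of thread t at physical offset i * numThreads + t):

__global__ void localIndexing(float* out, const int* idx)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float a[64];                       // large / dynamically indexed: likely placed in local memory
    for (int i = 0; i < 64; ++i)
        a[i] = (float)i;

    // Same index for every thread: under the interleaved layout the warp's
    // physical addresses are consecutive, so the access coalesces.
    float same = a[7];

    // Per-thread index read from memory: each thread can touch a different "row"
    // of the interleaved layout, so the addresses scatter and coalescing is lost.
    float varying = a[idx[tid] & 63];

    out[tid] = same + varying;
}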