local memory and GT200?

I’m working my way through the Dr. Dobb’s CUDA programming guide and have a question about local memory.
In the guide we build two programs for reversing an array: one that uses only global memory and one that also uses local memory. The latter is supposed to be much faster. However, when I run them on my Linux desktop with a GeForce 8600 GT I see no speed difference between the two programs, while on my MacBook Pro with a GeForce 9600M I do get the discussed speedup from using local memory.

So why does local memory not speed things up on my desktop? The programming guide mentions that the GT200 architecture relaxes the restrictions on accessing global memory, so is that it?

I’m buying some Tesla cards at the moment, so should I bother learning to use local memory properly, or will it not improve the speed of my programs (apart from on my laptop)?

Very confused…

Before answering, can you clarify if you mean “shared memory” rather than “local memory”?

Sorry, yes I do mean shared memory. I’m new to CUDA as you can probably tell, but very impressed so far.

How are you doing the timing? If you’re just measuring how long the program takes to run, there are potentially lots of things that could swamp the kernel execution time. (Don’t worry: correctly timing CUDA micro-benchmarks is non-trivial, and lots of people get it wrong the first few times. Life is easier when you are optimizing a large program, in which case the actual runtime of the whole program is what you care about.)
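If it helps, here is a minimal sketch of the usual approach: bracket just the kernel launch with CUDA events, after a warm-up launch, so that mallocs, host-device copies, and process startup don’t get counted. The reverseArray kernel and the sizes here are placeholders, not the guide’s code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for whichever version you are timing.
__global__ void reverseArray(int *d_out, const int *d_in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_out[n - 1 - i] = d_in[i];
}

int main()
{
    const int n = 1 << 20;
    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Warm-up launch so one-time setup costs don't pollute the measurement.
    reverseArray<<<grid, block>>>(d_out, d_in, n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // The events bracket only the kernel, not allocation or copies.
    cudaEventRecord(start, 0);
    reverseArray<<<grid, block>>>(d_out, d_in, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // kernel launches are asynchronous

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```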

To answer your general question: shared memory is a very valuable tool for some algorithms, and definitely worth understanding. It is most useful either for making more efficient use of limited global memory bandwidth or for letting threads in the same block exchange data. Both of these abilities vastly improve the performance of many CUDA algorithms.
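For example, a shared-memory reversal in the spirit of the guide’s version stages one block’s worth of elements in shared memory so that both the global read and the global write stay contiguous and in order; only the shuffling happens in fast shared memory. This is a sketch, and it assumes the array length is a multiple of the block size:

```cuda
// Each block loads blockDim.x elements into shared memory in reversed
// order, then writes them out at the mirrored block position. Both global
// accesses are contiguous and in order within each half-warp, so they
// coalesce even on compute capability 1.0/1.1 hardware.
__global__ void reverseArrayShared(int *d_out, const int *d_in)
{
    extern __shared__ int s_data[];

    int in = blockDim.x * blockIdx.x + threadIdx.x;

    // Reverse within the block while staging through shared memory.
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];
    __syncthreads();

    // Mirror the block's position in the output array.
    int out = blockDim.x * (gridDim.x - 1 - blockIdx.x) + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}
```

Launch it with the dynamic shared memory size as the third launch parameter, e.g. reverseArrayShared<<<numBlocks, blockSize, blockSize * sizeof(int)>>>(d_out, d_in);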

The GT200 memory controller does remove some limitations that shared memory was used to work around in the past, but there are still plenty of uses for it. Moreover, neither of your devices is based on GT200. Both are compute capability 1.1 devices, which require contiguous, aligned, in-order reads of 32-, 64-, or 128-bit elements from the threads in a half-warp to issue full-size memory transactions. Break any of those requirements and a compute capability 1.0/1.1 device will start issuing a memory transaction for each element, whereas GT200 will group the reads together efficiently.
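To make that concrete, here is an illustrative global-memory-only reversal (a sketch, not the guide’s exact code). The reads from d_in are contiguous and in order, so they coalesce on any device; the writes to d_out run backwards within each half-warp, so a compute capability 1.0/1.1 device issues sixteen separate write transactions per half-warp while GT200 groups them into one.

```cuda
// Thread i reads element i and writes element n-1-i. Within a half-warp
// the reads are in order (coalesced everywhere), but the writes run
// backwards through memory: 16 transactions on 1.0/1.1, one on GT200,
// since the reversed addresses still fall in a single 64-byte segment.
__global__ void reverseArrayGlobal(int *d_out, const int *d_in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[n - 1 - i] = d_in[i];
}
```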