Local memory usage GTX 260 vs GT 120

I have a (proprietary, unpostable) kernel which compiles, according to ptxas, with resource usage:

ptxas info : Used 10 registers, 148+16 bytes smem, 16 bytes cmem[1]

When I run this kernel on a GeForce GTX 260, the profiler reports 10 registers per thread, 16x16 thread blocks and an occupancy of 1.0 (16384 / 10 = 1638 threads max by register count, so I guess it runs the full 1024 threads per multiprocessor). It’s fast.

On a GeForce GT 120 the profiler reports a similar thread block configuration and occupancy (should be fine: 8192 / 10 = 819 threads max by register count, enough for 3 blocks of 16x16). However, the kernel makes a substantial number of stores (no loads) to local memory, which appear to have a significant effect on performance. On the GTX 260 the profiler reports no local stores.
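For reference, the occupancy arithmetic above can be sketched as plain host-side C++. This is only a rough check, not how the profiler computes occupancy: the register file sizes (16384 and 8192) and the 16x16 block size are from the figures above, while the resident-thread limits (1024 and 768 per multiprocessor) are assumptions inferred from the "1024 threads" and "3x16x16" numbers. The struct and function names are made up for illustration, and register allocation granularity and shared memory limits are ignored.

```cpp
// Hypothetical per-multiprocessor limits; the names are illustrative only.
struct SmLimits {
    int registers_per_sm;    // size of the register file on one multiprocessor
    int max_threads_per_sm;  // assumed cap on resident threads per multiprocessor
};

// Figures taken from the post; the thread caps are assumptions (see lead-in).
constexpr SmLimits GTX260{16384, 1024};
constexpr SmLimits GT120{8192, 768};

// Upper bound on resident threads if registers were the only constraint.
constexpr int max_threads_by_registers(SmLimits sm, int regs_per_thread) {
    return sm.registers_per_sm / regs_per_thread;
}

// Whole blocks that fit on one multiprocessor, limited by registers and by
// the thread cap. Allocation granularity and shared memory are ignored.
constexpr int resident_blocks(SmLimits sm, int regs_per_thread,
                              int threads_per_block) {
    const int by_registers =
        sm.registers_per_sm / (regs_per_thread * threads_per_block);
    const int by_threads = sm.max_threads_per_sm / threads_per_block;
    return by_registers < by_threads ? by_registers : by_threads;
}
```

With 10 registers per thread and 256 threads per block, `resident_blocks` gives 4 blocks on the GTX 260 and 3 on the GT 120. Under these assumptions both cards hit their thread cap, so registers are not the limiting factor on either one.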

I had assumed that the ptxas output would have reported local memory use if there were any. I’m not entirely sure why there is any; the compiler has not spilled any registers (and does not need to) on this architecture. I stripped the kernel down as far as I could, and the local memory usage only stops when the kernel is doing little more than an array copy with address computation.

So… can the PTX → device code stage introduce local memory usage? Would there be any obvious reason for it to? Furthermore, why do I see local memory stores but no loads?

These GPUs have different numbers of multiprocessors. The profiler connects to just one of these multiprocessors.

Now, if your kernel uses trigonometric functions, local memory use actually depends on the operands you pass to those functions. So it may well be that the profiler profiles different thread blocks depending on which graphics card you run it on, and some of these blocks may use local memory more than others.

That’s just a possible explanation, not necessarily the correct one.