I have a (proprietary, unpostable) kernel which compiles, according to ptxas, with resource usage:
ptxas info : Used 10 registers, 148+16 bytes smem, 16 bytes cmem[1]
When I run this kernel on a GeForce GTX 260, the profiler reports 10 registers per thread, 16x16 thread blocks, and an occupancy of 1.0 (16384 registers / 10 per thread = 1638 threads max by register pressure, so it presumably runs the hardware limit of 1024 threads per SM). It’s fast.
On a GeForce GT 120 the profiler reports the same thread block configuration and occupancy (which should be fine: 8192 registers / 10 per thread = 819 threads, enough for 3x16x16 = 768). However, the kernel performs a substantial number of stores (no loads) to local memory, which appears to hurt performance significantly. On the GTX 260 the profiler reports no local stores at all.
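For reference, here is the occupancy arithmetic above as a quick back-of-the-envelope helper (the per-SM register and thread limits are the documented values for compute capability 1.3 and 1.1; this ignores register allocation granularity, which the real CUDA occupancy calculator accounts for):

```python
def max_resident_threads(regs_per_sm, max_threads_per_sm,
                         regs_per_thread, threads_per_block):
    # Threads the register file alone could support.
    by_regs = regs_per_sm // regs_per_thread
    # Whole blocks that fit under both the register and thread limits.
    blocks = min(by_regs, max_threads_per_sm) // threads_per_block
    return blocks * threads_per_block

# GTX 260: compute capability 1.3 -> 16384 registers, 1024 threads per SM.
print(max_resident_threads(16384, 1024, 10, 256))  # 1024 -> full occupancy

# GT 120: compute capability 1.1 -> 8192 registers, 768 threads per SM.
print(max_resident_threads(8192, 768, 10, 256))    # 768 -> full occupancy
```

In both cases registers are not the limiting resource, which matches the profiler's occupancy figure of 1.0.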
I had assumed that the PTX stage would have flagged any use of local memory. I’m not entirely sure why there is any: the compiler has spilled no registers (and has no need to) on this architecture. When I strip the kernel down, the local memory usage only stops once the kernel is doing little more than an array copy with address computation.
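One way I know to check whether ptxas itself introduced the stores is to compile for each card's architecture separately and compare the resource reports, since ptxas reports any local memory it allocates as "lmem" (here `kernel.cu` is a placeholder for the real source file):

```shell
# ptxas -v reports per-architecture resource usage, including any
# local memory ("lmem") it allocates for that target.
nvcc --ptxas-options=-v -arch=sm_13 -c kernel.cu   # GTX 260 (CC 1.3)
nvcc --ptxas-options=-v -arch=sm_11 -c kernel.cu   # GT 120 (CC 1.1)
# A line like "Used 10 registers, 8+0 bytes lmem, ..." in the sm_11
# build only would suggest the PTX -> device code stage is responsible.
```

This is only a diagnostic sketch; the single report I quoted above was for whatever architecture my build defaulted to, which may not match the GT 120.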
So… can the PTX → device code stage (ptxas) introduce local memory usage? Is there any obvious reason why it would? And why do I see local memory stores but no loads?