Problem with local memory

I’m having some problems with a kernel: I get correct results, but the kernel is very slow. It seems like each thread uses about 1.6 KB of local memory. I thought the L1 cache on Fermi cards would be fast enough to handle this… if the L1 cache is 16 KB, does that mean only 10 threads can run in parallel on each multiprocessor? My GPU is an Nvidia GTX 480. In each thread I invert a 7x7 matrix, and this uses a lot of registers.

Compiler output for kernel
ptxas info : Used 63 registers, 1644+4 bytes lmem, 8192+0 bytes smem, 128 bytes cmem[0], 54332 bytes cmem[2], 16 bytes cmem[16]

Wow, that’s a lot of local memory. It seems a bit excessive for the inversion of a 7x7 matrix. How many copies of the matrix do you need to store in order to invert it? A matrix of that size is only a couple hundred bytes.
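For scale: if the matrices are double precision, one 7x7 matrix is 49 x 8 = 392 bytes, so 1644 bytes is roughly four matrix-sized buffers per thread. The kernel below is not the poster’s code, just a minimal sketch of the usual pattern: per-thread arrays indexed by loop variables generally end up in local memory, and 49 doubles alone would need 98 32-bit registers while the Fermi limit of 63 registers per thread is already hit here. (The “ptxas info” line above is what nvcc prints when compiled with -Xptxas -v.)

#include <cuda_runtime.h>

#define N 7

// Illustrative only: each thread inverts one 7x7 matrix held in per-thread
// arrays.  Because the row/column indices are loop variables, ptxas typically
// places a[] and inv[] (2 x 49 doubles = 784 bytes) in local memory rather
// than registers.
__global__ void invert7x7(const double* __restrict__ in,
                          double* __restrict__ out, int count)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= count) return;

    double a[N][N];     // working copy of the input matrix
    double inv[N][N];   // accumulates the inverse (starts as identity)

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            a[i][j]   = in[m * N * N + i * N + j];
            inv[i][j] = (i == j) ? 1.0 : 0.0;
        }

    // Gauss-Jordan elimination without pivoting (assumes well-conditioned input).
    for (int k = 0; k < N; ++k) {
        double p = 1.0 / a[k][k];
        for (int j = 0; j < N; ++j) { a[k][j] *= p; inv[k][j] *= p; }
        for (int i = 0; i < N; ++i) {
            if (i == k) continue;
            double f = a[i][k];
            for (int j = 0; j < N; ++j) {
                a[i][j]   -= f * a[k][j];
                inv[i][j] -= f * inv[k][j];
            }
        }
    }

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            out[m * N * N + i * N + j] = inv[i][j];
}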

No, the L1 cache doesn’t limit how many threads can run. It just means that a lot of your local memory traffic is likely going to get flushed all the way out to DRAM and read back when you run out of cache, which slows things down.

The number one thing to do is up the cache config to 48 KB of L1. See the cache-config functions in the reference manual for more information (if you’re running CUDA 3.2, the new cudaThreadSetCacheConfig call sets the preference device-wide and is simpler than calling cudaFuncSetCacheConfig per kernel).
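A minimal sketch of what that looks like with the runtime API (the kernel name here is just a placeholder, not the poster’s):

#include <cuda_runtime.h>

__global__ void invertKernel(double* data) { /* ... inversion work ... */ }

int main()
{
    // Per-kernel setting: request the 48 KB L1 / 16 KB shared split on Fermi.
    // The 8 KB of shared memory reported by ptxas still fits in the remaining 16 KB.
    cudaFuncSetCacheConfig(invertKernel, cudaFuncCachePreferL1);

    // CUDA 3.2 alternative: set the preference once for the whole device.
    // cudaThreadSetCacheConfig(cudaFuncCachePreferL1);

    // ... allocate memory and launch invertKernel<<<grid, block>>>(...) as usual ...
    return 0;
}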

The number two thing to do is take a close look at the code and try to figure out where all that local memory use is coming from. Perhaps with some reuse of arrays or some clever rethinking of your code structure you can boost the performance significantly.
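One way to act on the array-reuse point, assuming the inversion is Gauss-Jordan-like (an assumption, since the actual code isn’t shown): do the elimination in place, so each thread carries a single 7x7 working array instead of a separate matrix and inverse. The sketch below uses no pivoting and illustrative names; the point is only that one 392-byte array replaces two.

#include <cuda_runtime.h>

#define N 7

// Sketch of the "reuse of arrays" idea: in-place Gauss-Jordan inversion,
// so each thread keeps one 7x7 working array instead of matrix + inverse.
__global__ void invert7x7_inplace(double* mats, int count)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= count) return;

    double a[N][N];
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            a[i][j] = mats[m * N * N + i * N + j];

    // In-place Gauss-Jordan without pivoting: after the k-loop, a[][] holds the inverse.
    for (int k = 0; k < N; ++k) {
        double p = a[k][k];
        for (int j = 0; j < N; ++j)
            if (j != k) a[k][j] /= p;
        for (int i = 0; i < N; ++i) {
            if (i == k) continue;
            double f = a[i][k];
            for (int j = 0; j < N; ++j)
                if (j != k) a[i][j] -= f * a[k][j];
            a[i][k] = -f / p;
        }
        a[k][k] = 1.0 / p;
    }

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            mats[m * N * N + i * N + j] = a[i][j];
}

Checking the new ptxas output after a change like this confirms whether the local memory figure actually drops.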