I'm having problems with a kernel: it produces correct results, but it is very slow. According to the compiler output, each thread uses about 1.6 KB of local memory. I thought the L1 cache on the Fermi cards would be fast enough to handle this. If the L1 cache is 16 KB, does this mean that only 10 threads can run in parallel on each multiprocessor? My GPU is an NVIDIA GTX 480. Each thread inverts a 7x7 matrix, which uses a lot of registers.
Compiler output for the kernel:
ptxas info : Used 63 registers, 1644+4 bytes lmem, 8192+0 bytes smem, 128 bytes cmem, 54332 bytes cmem, 16 bytes cmem