Why don't I get full occupancy?

I’m trying to optimize a kernel to reach full occupancy. Using maxrregcount I limited the kernel to 21 registers per thread, so 21 * 1536 = 32256 < 32768, which is the number of 32-bit registers per multiprocessor on Fermi. My kernel uses 24576 bytes of shared memory and I launch 768 threads per block, so there should be room for 2 blocks per multiprocessor, since the shared memory is 49152 bytes. According to the Visual Profiler, however, I only get 0.5 occupancy (only one block on each MP). Can anyone explain this?
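The resource arithmetic above can be written out as a small host-side sketch. This is not the profiler's or the occupancy calculator's actual algorithm; `SmLimits`, `roundUp`, `regsPerWarp` and `residentBlocks` are hypothetical names, and the per-SM limits are assumed values for compute capability 2.0. In particular, the sketch assumes that registers are reserved per warp in units of 64 rather than per thread, which would make the real register cost higher than the simple 21 * 1536 product.

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-SM limits for compute capability 2.0 (Fermi).
struct SmLimits {
    int maxThreads   = 1536;   // resident threads per SM
    int maxBlocks    = 8;      // resident blocks per SM
    int registers    = 32768;  // 32-bit registers per SM
    int sharedBytes  = 49152;  // shared memory in the 48 KB configuration
    int regAllocUnit = 64;     // ASSUMPTION: registers are reserved per warp in units of 64
    int warpSize     = 32;
};

// Round x up to the next multiple of unit.
int roundUp(int x, int unit) { return (x + unit - 1) / unit * unit; }

// Registers one warp reserves under the assumed allocation granularity.
int regsPerWarp(int regsPerThread, const SmLimits& sm = SmLimits{}) {
    return roundUp(regsPerThread * sm.warpSize, sm.regAllocUnit);
}

// Resident blocks per SM: the tightest of the block, thread, register
// and shared-memory limits.
int residentBlocks(int threadsPerBlock, int regsPerThread, int smemPerBlock,
                   const SmLimits& sm = SmLimits{}) {
    assert(threadsPerBlock > 0 && regsPerThread > 0 && smemPerBlock > 0);
    int warpsPerBlock = roundUp(threadsPerBlock, sm.warpSize) / sm.warpSize;
    int byThreads = sm.maxThreads / threadsPerBlock;
    int byRegs    = sm.registers / (regsPerWarp(regsPerThread, sm) * warpsPerBlock);
    int bySmem    = sm.sharedBytes / smemPerBlock;
    return std::min({sm.maxBlocks, byThreads, byRegs, bySmem});
}
```

Under these assumptions, `regsPerWarp(21)` is 704 rather than 672, so two blocks of 24 warps would need 48 * 704 = 33792 registers, and `residentBlocks(768, 21, 24576)` comes out as 1. With 20 registers per thread the same call gives 2. Whether this is what actually limits the kernel would need to be checked against the profiler.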

If I use the occupancy calculator, I get “non valid value” when I try 768 threads per block (or any value above 512).

Without maxrregcount the kernel uses 28 registers, so I guess that with the limit of 21 the rest spills into local memory (the compiler reports 68 bytes of lmem), but thanks to the L1 cache, local memory should still be rather fast, I suppose?
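As a rough way to judge whether the spills can stay in L1, the footprint can be compared with the cache size. The sketch below uses the 68 bytes of lmem per thread reported by ptxas and the Fermi split of 64 KB on-chip memory between shared memory and L1. The helper names are hypothetical, and it assumes the worst case in which every spilled byte of every resident thread is live at once.

```cpp
#include <cstddef>

// Fermi splits 64 KB of on-chip memory per SM between shared memory and L1.
constexpr std::size_t kOnChipBytes = 64 * 1024;

// L1 size left over once a shared-memory configuration is chosen.
constexpr std::size_t l1Bytes(std::size_t sharedConfigBytes) {
    return kOnChipBytes - sharedConfigBytes;
}

// Worst-case local-memory footprint of the resident threads on one SM,
// ASSUMING every spilled byte of every thread is live at the same time.
constexpr std::size_t spillFootprint(std::size_t lmemPerThread,
                                     std::size_t residentThreads) {
    return lmemPerThread * residentThreads;
}
```

Since the kernel needs the 48 KB shared-memory configuration, `l1Bytes(48 * 1024)` is 16384 bytes, while `spillFootprint(68, 768)` is 52224 bytes for a single resident block. In this worst case the spills would not all fit in L1 at once, although the truly hot part of the spill area may be much smaller.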

Why can’t the L1 cache be used to make room for extra registers instead?

This is the output from the compiler:

ptxas info : Compiling entry function '_Z29Convolve_3D_Complex_7x7x7_NewPfP6float2S1_S1_iiiifiiii' for 'sm_20'
ptxas info : Used 21 registers, 68+0 bytes lmem, 24576+0 bytes smem, 100 bytes cmem[0], 13488 bytes cmem[2]