I have a kernel that uses 14 registers per thread, 16 KB of shared memory per thread block, and 512 threads per block. The NVIDIA GTX 680 has 48 KB of shared memory per multiprocessor, so I should be able to run 3 thread blocks per multiprocessor and thereby reach 1536 of the 2048 resident threads per MP, or 75% occupancy. However, according to the profiler I only achieve 25% occupancy. Do I have to set “PreferShared” (cudaFuncCachePreferShared) in order to get the full 48 KB of shared memory?
Isn’t some of the shared memory used for kernel parameters? If your thread block uses exactly 16 KB of shared memory, then I think the maximum you can launch is 2 blocks per multiprocessor. However, that should still give 50% occupancy, not 25%, so something else must be going on. How many registers does each thread use?
The programming guide says that the default configuration is 48 KB of shared memory and 16 KB of L1 cache, so that should not be the problem.
On Kepler, the kernel parameters are stored in constant memory, so they do not take up any shared memory.
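The occupancy arithmetic discussed in this thread can be sketched in a few lines of host code. This is only a rough estimate, not what the profiler or the occupancy calculator actually does: the struct `SmLimits` and the helpers `resident_blocks` and `occupancy_percent` are hypothetical names, the per-multiprocessor limits (2048 resident threads, 16 resident blocks, 65536 registers) are the values I am assuming for a compute capability 3.0 part such as the GTX 680, and register and shared memory allocation granularity is ignored.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits. The values passed in are assumptions for a
// compute capability 3.0 device; check them against your own hardware.
struct SmLimits {
    int max_threads;   // maximum resident threads per multiprocessor
    int max_blocks;    // maximum resident blocks per multiprocessor
    int registers;     // 32-bit registers per multiprocessor
    int shared_bytes;  // shared memory per multiprocessor in the current configuration
};

// Hypothetical helper: how many blocks fit on one multiprocessor.
// Simplification: allocation granularity for registers and shared memory is ignored.
int resident_blocks(const SmLimits& sm, int threads_per_block,
                    int regs_per_thread, int shared_per_block) {
    assert(threads_per_block > 0 && regs_per_thread > 0 && shared_per_block >= 0);
    int by_threads = sm.max_threads / threads_per_block;
    int by_regs = sm.registers / (regs_per_thread * threads_per_block);
    // A block that uses no shared memory is not limited by it.
    int by_shared = shared_per_block > 0 ? sm.shared_bytes / shared_per_block
                                         : sm.max_blocks;
    return std::min({sm.max_blocks, by_threads, by_regs, by_shared});
}

// Hypothetical helper: resident threads as a percentage of the maximum.
int occupancy_percent(const SmLimits& sm, int threads_per_block, int blocks) {
    return 100 * blocks * threads_per_block / sm.max_threads;
}
```

With `SmLimits{2048, 16, 65536, 48 * 1024}` and the numbers from the question (512 threads, 14 registers, 16 KB per block), this gives 3 resident blocks and 75%, with shared memory as the limiting resource. If a block needs even one byte more than 16 KB, only 2 blocks fit (50%). With `shared_bytes` set to 16 KB instead, only 1 block fits and the result is 25%, the same figure the profiler reports. That is not proof of the cause, but it suggests checking which shared memory / L1 configuration the kernel actually runs with.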