Amount of usable shared memory?

Hi all,
I have a kernel whose input size varies, which affects the amount of dynamically allocated shared memory. I’d like to run 2 blocks per SM (Tesla 2050), i.e. 28 blocks total. According to deviceQuery, I have 48K of shared memory per SM, which would limit each block to 24K.
However, the actual limit appears much lower. I can get 28 parallel blocks for inputs that use up to ~14K, but beyond that I end up with serialized execution.
I count shared memory per block as the sum of the shared memory determined at compile time (3088 bytes) and the size of the dynamically allocated array passed to the kernel.
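
To make the accounting above concrete, here is a minimal host-side sketch of the arithmetic. The 48K and 3088-byte figures come from this post; the helper names are made up for illustration, and the sketch assumes shared memory is the only limiting resource and ignores any allocation granularity the hardware may apply.

```cpp
#include <cassert>
#include <cstddef>

// Figures taken from the post: 48K of shared memory per SM (deviceQuery)
// and 3088 bytes of shared memory fixed at compile time.
constexpr std::size_t kSmemPerSM  = 48 * 1024;
constexpr std::size_t kStaticSmem = 3088;

// Hypothetical helper: blocks that fit on one SM if shared memory
// were the only limit (registers, threads etc. are ignored).
std::size_t blocksPerSM(std::size_t dynamicBytes) {
    return kSmemPerSM / (kStaticSmem + dynamicBytes);
}

// Hypothetical helper: largest dynamic allocation that still lets
// `blocks` blocks share one SM.
std::size_t maxDynamicBytes(std::size_t blocks) {
    assert(blocks > 0 && kSmemPerSM / blocks >= kStaticSmem);
    return kSmemPerSM / blocks - kStaticSmem;
}
```

Under these assumptions, maxDynamicBytes(2) gives 21488 bytes, so the dynamic array alone should be able to grow well past 14K before a second block stops fitting.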

Does anyone know of any hidden use of shared memory? Is there a way to find out where the missing memory is going?

Thanks for any hints


That seems weird. How do you determine whether blocks execute in parallel or sequentially?
Barring any other idea, you could vary the amount of dynamically allocated shared memory to do a binary search for the exact size where blocks start executing sequentially.
Note that the API makes no guarantees for parallel execution, so we are in undocumented land here. I don’t know whether the driver uses heuristics to decide whether blocks should preferably be spread out between SMs to improve speed or concentrated to keep entire SMs free for following kernels.
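
The binary search suggested above could look roughly like the sketch below. The predicate runsConcurrently is a placeholder: in practice it would launch the kernel with the given dynamic shared memory size and decide from timing (or whatever method you use) whether the blocks ran in parallel. The sketch assumes the behaviour is monotone, i.e. parallel up to some threshold and sequential beyond it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Returns the largest size in [lo, hi] for which runsConcurrently(size)
// is true. Assumes the predicate holds at `lo` and flips from true to
// false exactly once as the size grows.
std::size_t largestConcurrentSize(
        std::size_t lo, std::size_t hi,
        const std::function<bool(std::size_t)>& runsConcurrently) {
    assert(lo <= hi && runsConcurrently(lo));
    while (lo < hi) {
        // Upper midpoint, so the interval shrinks on every iteration.
        std::size_t mid = lo + (hi - lo + 1) / 2;
        if (runsConcurrently(mid)) {
            lo = mid;      // still parallel: threshold is at mid or above
        } else {
            hi = mid - 1;  // sequential: threshold is below mid
        }
    }
    return lo;
}
```

Searching the range 0 to 48K takes about 16 kernel launches, so it is cheap to run even if each probe has to be timed several times.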

I found a mistake in my code: a leftover from a previous version multiplied the requested shared memory by sizeof(short), which doubled the dynamically allocated requirement. With that taken into account, the maximum shared memory usage that allows for two blocks on an SM is exactly 24K.
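
For anyone hitting the same thing, a small sketch of the effect. The function names are made up, and the 24K budget and 3088-byte figure are the ones from this thread; the point is only that a byte count scaled by sizeof(short) a second time doubles the request.

```cpp
#include <cstddef>

constexpr std::size_t kBlockBudget = 24 * 1024;  // 48K per SM split between two blocks
constexpr std::size_t kStaticSmem  = 3088;       // fixed at compile time

static_assert(sizeof(short) == 2, "sketch assumes a 2-byte short");

// What the leftover code effectively did: the size was already in
// bytes, then got multiplied by sizeof(short) once more.
constexpr std::size_t buggyDynamicBytes(std::size_t requestedBytes) {
    return requestedBytes * sizeof(short);
}

// True if static + dynamic shared memory fits in half an SM.
constexpr bool twoBlocksFit(std::size_t dynamicBytes) {
    return kStaticSmem + dynamicBytes <= kBlockBudget;
}
```

With a 14000-byte request, twoBlocksFit(14000) holds, while twoBlocksFit(buggyDynamicBytes(14000)) does not, since 28000 + 3088 bytes exceeds the 24K budget. It is worth checking that the value passed as the third launch configuration parameter is a byte count computed exactly once.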