Shared memory per block vs. shared memory of a multiprocessor

Hello,
I’ve just started CUDA development. There’s one thing that isn’t clear to me. When running e.g. deviceQuery (included in the SDK), it reports:
Total amount of shared memory per block: 16384 bytes
Appendix A of the user guide says:
The amount of shared memory available per multiprocessor is 16 KB

What’s confusing me is the relation between blocks and multiprocessors. As I understand it, each block runs on a multiprocessor, but multiple blocks can run on one multiprocessor. Does this mean that all the blocks on a multiprocessor have to share its 16 KB of memory, or is the shared memory swapped out to e.g. global memory, so that each block can always access the full 16 KB of shared memory?

Assuming they have to share the shared memory: if I run a kernel with the maximum number of blocks (65536), the blocks will be divided over the 16 multiprocessors, meaning 65536 / 16 = 4096 blocks run on each multiprocessor. Does this mean that each block can allocate at most 16384 / 4096 = 4 bytes of shared memory?

Thanks,
Michiel

The runtime looks at how much shared memory your kernel requires, and will run more than one block per multiprocessor if there is enough shared memory to do so. If each block requires only 5 kB of shared memory, then up to 3 blocks can run simultaneously per multiprocessor (leaving some room for kernel parameters, which are also loaded into shared memory). If a block uses 15 kB of shared memory, then the scheduler will run only 1 block per multiprocessor. Blocks that cannot be scheduled immediately run after the first blocks finish. So if you request 65536 blocks, they will all run, but not all at the same time.

Running multiple blocks per multiprocessor helps mitigate some of the global memory latency, so it is good to keep shared memory usage low if possible, but it is not required.

Does this apply to the register file as well, i.e. must two blocks each use fewer than 8192/2 registers in order to be launched simultaneously?

/Lars

Basically, yes.

Shared memory and registers are partitioned among the threads of all concurrent blocks. So decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently on the same SM.