I’ve just started CUDA development. There’s one thing that isn’t clear to me. When running, for example, deviceQuery (included in the SDK), it says:
Total amount of shared memory per block: 16384 bytes
Appendix A of the user guide says:
The amount of shared memory available per multiprocessor is 16 KB
What’s confusing me is the relation between blocks and multiprocessors. As I understand it, each block runs on a single multiprocessor, but multiple blocks can run on one multiprocessor. Does this mean that all the blocks on a multiprocessor have to share its 16 KB of shared memory, or is shared memory swapped out to, e.g., global memory, so that each block can always access the full 16 KB?
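For concreteness, here is a minimal sketch of the kind of per-block allocation I mean (the kernel name and buffer size are just placeholders I made up; it needs a CUDA-capable device to run):

```cuda
// Each block that runs this kernel gets its own copy of `buf`
// in the shared memory of the multiprocessor it is scheduled on.
__global__ void myKernel(float *out)
{
    __shared__ float buf[256];   // 256 * 4 = 1024 bytes of shared memory per block

    int tid = threadIdx.x;
    buf[tid] = (float)tid;       // every thread writes one slot
    __syncthreads();             // wait until the whole block has written

    out[blockIdx.x * blockDim.x + tid] = buf[tid];
}
```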
Assuming they have to share it: if I run a kernel with the maximum number of blocks (65536), the blocks will be divided over the 16 multiprocessors, which means 65536 / 16 = 4096 blocks run on each multiprocessor. Does this mean that each block can allocate a maximum of only 16384 / 4096 = 4 bytes of shared memory?
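As a sanity check on those numbers, the values deviceQuery prints can also be read directly from the runtime API; a minimal sketch (error handling omitted, device 0 assumed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query properties of device 0

    // sharedMemPerBlock is the per-block limit deviceQuery reports (16384 here);
    // multiProcessorCount is the number of multiprocessors (16 in my case).
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Multiprocessors:         %d\n", prop.multiProcessorCount);
    return 0;
}
```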