I’m curious about the maximum number of blocks in CUDA. Let’s say we have a GeForce 8800 GTX with 16 streaming multiprocessors, 8 ALUs per multiprocessor, 16 KB of shared memory per multiprocessor, and a kernel in which each block requires 1060 bytes of shared memory.
As far as I understand, the maximum number of blocks that can run simultaneously is limited by the shared-memory requirements of the blocks. So I tried to calculate the maximum number of blocks for kernel execution as:
max number of blocks per multiprocessor: 16 KB / 1060 B → 15
max number of blocks on the device: 15 × 16 = 240
However, this calculation contradicts my experimental results. The aforementioned kernel achieves really good performance when launched on a 64x64 grid (4096 blocks) or larger. Yet with a 512x512 grid, the kernel crashes.
So I would expect that blocks are replaced with new ones on the same multiprocessor as soon as they finish. Is that correct, or is some other mechanism used?
Finally, what determines the maximum number of blocks that can be run
a) by one multiprocessor?
b) by one kernel launch?
Thanks in advance for helping me to better understand the execution model. :)