Limit of 8 blocks per SM in CUDA

Hi guys,
Can anyone tell me why there’s a limit of 8 blocks per SM in CUDA?

Because there is hardware required per block and 8 is a reasonable limit?

I have never seen an explanation. I would guess hardware limits. The switching hardware that looks after block state probably used a small number of bits for block id, and I am guessing that it must be expensive (in terms of die area) to increase that. The fact that successive generations of hardware have upped the threads per block, registers and shared memory each MP supports, while the limit of 8 blocks per MP hasn’t changed, might support the notion that the former are easier to do than the latter, or that there is limited return in performance terms to do so.
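To put numbers on that, here is a rough sketch of how the resident block count per MP falls out as the minimum of several per-MP limits. The names `SmLimits`, `kGenerations` and `residentBlocks` are made up for illustration. The figures are the published per-MP limits for compute capability 1.0, 1.3 and 2.0. Register and shared memory allocation granularity is ignored, so treat the result as an upper bound, not as what the hardware scheduler actually does.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper type: per-multiprocessor limits for one hardware generation.
struct SmLimits {
    const char* name;
    int maxBlocks;    // resident blocks per MP
    int maxThreads;   // resident threads per MP
    int registers;    // 32-bit registers per MP
    int sharedBytes;  // shared memory per MP, in bytes
};

// Published limits; note that only maxBlocks stays the same across generations.
const SmLimits kGenerations[] = {
    {"cc 1.0", 8, 768, 8192, 16 * 1024},
    {"cc 1.3", 8, 1024, 16384, 16 * 1024},
    {"cc 2.0", 8, 1536, 32768, 48 * 1024},
};

// Upper bound on resident blocks per MP. Simplification: allocation
// granularity of registers and shared memory is ignored.
int residentBlocks(const SmLimits& sm, int threadsPerBlock,
                   int regsPerThread, int sharedBytesPerBlock) {
    assert(threadsPerBlock > 0 && regsPerThread >= 0 && sharedBytesPerBlock >= 0);
    int byThreads = sm.maxThreads / threadsPerBlock;
    // A resource the kernel does not use cannot be the limiting factor.
    int byRegs = regsPerThread > 0
                     ? sm.registers / (regsPerThread * threadsPerBlock)
                     : sm.maxBlocks;
    int byShared = sharedBytesPerBlock > 0
                       ? sm.sharedBytes / sharedBytesPerBlock
                       : sm.maxBlocks;
    return std::min({sm.maxBlocks, byThreads, byRegs, byShared});
}
```

For a kernel with 64-thread blocks, 16 registers per thread and 1 KB of shared memory per block, `residentBlocks` returns 8 for all three entries, so the block cap is the binding limit and only 512 threads are resident out of 768, 1024 and 1536 respectively. With 256-thread blocks the cap stops mattering: the thread or register limit is reached first.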

“or that there is limited return in performance terms to do so.”
Why do you say so? Can you please elaborate on this?

“The switching hardware that looks after block state probably used a small number of bits for block id,”
Since blockIdx is common for all threads in a block, they could be using SM for storing this info, right?

Transistor budget versus performance. If having more blocks means a much larger or more complex SM design, that increase in complexity, die size and power consumption might not be justifiable in terms of performance or cost. Semiconductor design is always about compromises. This might be one of those compromises.
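A back-of-the-envelope sketch of the "limited return" side of that trade-off, looking only at the thread and block caps (registers and shared memory are deliberately left out). The function names are hypothetical, and the 16-block figure in the note below is a made-up what-if, not a real hardware limit.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: smallest block size at which the per-SM block cap
// no longer prevents the SM from being filled with threads.
int minBlockSizeToFillSm(int maxThreadsPerSm, int maxBlocksPerSm) {
    assert(maxThreadsPerSm > 0 && maxBlocksPerSm > 0);
    return (maxThreadsPerSm + maxBlocksPerSm - 1) / maxBlocksPerSm;  // ceiling division
}

// Resident threads per SM for a given block size, considering only the
// thread cap and the block cap.
int residentThreads(int maxThreadsPerSm, int maxBlocksPerSm, int threadsPerBlock) {
    assert(threadsPerBlock > 0 && threadsPerBlock <= maxThreadsPerSm);
    int blocks = std::min(maxBlocksPerSm, maxThreadsPerSm / threadsPerBlock);
    return blocks * threadsPerBlock;
}
```

With Fermi's 1536 threads per SM, `minBlockSizeToFillSm(1536, 8)` gives 192, so any kernel launched with at least 192 threads per block can already fill the SM under the 8-block cap. Raising the cap to a hypothetical 16 would lift `residentThreads(1536, 16, 64)` to 1024 from the 512 you get today with 64-thread blocks, but it would change nothing for the block sizes most kernels use.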

I was thinking about shared memory and code. Warps from the same block must be “wired” correctly to the same pieces of shared memory. On Fermi, blocks might not even be from the same kernel, which means multiple instruction streams must be managed. The hardware that looks after that functionality probably isn’t trivial in size or complexity, which leads back to the first point about transistor budget and design compromises.

BTW, when I said “they could be using SM for storing this info, right?”, I meant shared-mem :D

Hmm… the design compromise makes sense. Thanks man!