Overhead for an SM to switch to another set of blocks?

I’ve read in the manual that the runtime tries to schedule as many thread blocks as possible onto one SM (while satisfying the limits on shared memory, registers, max 7** threads per SM, …). This is also called occupancy, right?

What is the overhead, in cycles, for an SM to start executing another set of blocks after it finishes the previous set? What triggers the SM to switch to another set of blocks?
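To make the "as many blocks as possible" rule concrete, here is a rough C++ sketch of how the per-SM limits listed above combine: the tightest limit decides how many blocks are resident. The struct and function names are made up for illustration, the limit values are whatever the caller passes in (nothing is queried from a device), and the model ignores register and shared memory allocation granularity, so treat it as an estimate only. The second function expresses occupancy as the fraction of the SM's thread capacity that the resident blocks fill, which matches the usual warp-based definition when the block size is a multiple of the warp size.

```cpp
#include <algorithm>

// Per-SM resource limits. Values are supplied by the caller (assumption:
// taken from the documentation for the target GPU).
struct SmLimits {
    int max_threads;       // resident threads per SM
    int max_blocks;        // resident blocks per SM
    int registers;         // registers per SM
    int shared_mem_bytes;  // shared memory per SM
};

// Resource needs of one block of a hypothetical kernel.
struct BlockCost {
    int threads;               // threads per block
    int registers_per_thread;  // registers used by each thread
    int shared_mem_bytes;      // shared memory used by the block
};

// How many blocks fit under one limit; a zero cost means "not limiting".
int blocks_that_fit(int available, int cost_per_block, int cap) {
    return cost_per_block > 0 ? available / cost_per_block : cap;
}

// Simplified estimate: the tightest of the four limits wins.
int resident_blocks_per_sm(const SmLimits& sm, const BlockCost& b) {
    const int by_threads = blocks_that_fit(sm.max_threads, b.threads, sm.max_blocks);
    const int by_regs = blocks_that_fit(sm.registers,
                                        b.registers_per_thread * b.threads,
                                        sm.max_blocks);
    const int by_smem = blocks_that_fit(sm.shared_mem_bytes, b.shared_mem_bytes,
                                        sm.max_blocks);
    return std::min({sm.max_blocks, by_threads, by_regs, by_smem});
}

// Fraction of the SM's thread capacity filled by the resident blocks.
double occupancy_estimate(const SmLimits& sm, const BlockCost& b) {
    return static_cast<double>(resident_blocks_per_sm(sm, b) * b.threads) /
           sm.max_threads;
}
```

So the number of resident blocks and the occupancy are related but not the same thing: the first is a count, the second is a ratio.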


As I understand it, as soon as one block finishes, another block is brought in. I don’t know about the overhead. There is a post somewhere with a small benchmark that measures the overhead of blocks that do no work at all (they exit immediately); I believe it was by Sarnath or MisterAnderson42.
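To show what "another block is brought in as soon as one finishes" means compared with swapping whole sets, here is a toy C++ model. It is only a sketch of my understanding, not a description of the real hardware scheduler: `slots` stands for the total number of blocks that can be resident at once, block durations are invented inputs, and switching is assumed to cost nothing. All names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Per-block refill: whenever a resident block finishes, the next waiting
// block starts immediately in the freed slot. Returns the total run time.
long long finish_time_refill(const std::vector<long long>& durations,
                             std::size_t slots) {
    if (slots == 0) return 0;  // nothing can run without a slot
    // Min-heap of the finish times of the currently resident blocks.
    std::priority_queue<long long, std::vector<long long>,
                        std::greater<long long>> running;
    long long finish = 0;
    for (long long d : durations) {
        long long start = 0;
        if (running.size() == slots) {  // all slots busy: wait for the earliest
            start = running.top();
            running.pop();
        }
        running.push(start + d);
        finish = std::max(finish, start + d);
    }
    return finish;
}

// Set-by-set model for comparison: a new set of `slots` blocks starts only
// after every block of the previous set has finished.
long long finish_time_sets(const std::vector<long long>& durations,
                           std::size_t slots) {
    if (slots == 0) return 0;
    long long finish = 0;
    for (std::size_t i = 0; i < durations.size(); i += slots) {
        const std::size_t end = std::min(i + slots, durations.size());
        finish += *std::max_element(durations.begin() + i, durations.begin() + end);
    }
    return finish;
}
```

With uneven block durations the refill model never finishes later than the set-by-set model, because a slot freed by a short block is reused right away instead of idling until the slowest block of the set is done.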

The overhead is essentially zero.

Suppose there are 500 blocks in total to execute, and 8 blocks fit on one SM.

On the 8800GTX, which has 16 SMs, 8 * 16 = 128 blocks will be resident across the 16 SMs at once. I understand that there is no overhead for a particular SM to switch between blocks within its 8.

My question is: when a particular SM finishes its initially assigned 8 blocks, what is the overhead of switching to a different set of blocks from the remaining 372?

Is it still zero overhead?
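For reference, the block counts in the example above can be checked with a few lines of C++. The struct and function names are made up for illustration, and the "rounds" figure is only a counting aid, since it assumes blocks are processed in full sets, which may not be how the hardware actually hands them out.

```cpp
#include <algorithm>

struct BlockCounts {
    int resident;  // blocks that can be on the GPU at the same time
    int waiting;   // blocks still queued right after launch
    int rounds;    // ceil(total / capacity), a counting aid only
};

// Assumes blocks_per_sm and sm_count are both positive.
BlockCounts count_blocks(int total_blocks, int blocks_per_sm, int sm_count) {
    const int capacity = blocks_per_sm * sm_count;
    const int resident = std::min(total_blocks, capacity);
    const int rounds = (total_blocks + capacity - 1) / capacity;  // round up
    return {resident, total_blocks - resident, rounds};
}
```

For 500 blocks, 8 blocks per SM and 16 SMs, this gives 128 resident blocks and 372 waiting, and the 500 blocks amount to a little under 4 full sets of 128.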


Yes. If you think about it, the only initialization that really needs to be done is setting the threadIdx/blockIdx values, which takes essentially no time, especially if there is special register initialization hardware for this purpose.