Threads in flight


I am wondering: if I launch a kernel with more threads than the maximum number of threads in flight (2048 * # of SMs, as explained by Robert Crovella in this topic), will the GPU wait for the entire first group of threads in flight to terminate before scheduling more, or is this handled per SM?


The block scheduler will schedule new blocks on SMs approximately as soon as resources are available. If we are talking about blocks from a particular kernel launch (so that their resource requirements are identical from block to block), then as soon as a block “retires” on an SM, it’s possible for the GPU block scheduler to immediately schedule a new block there.

It doesn’t wait for more than one block to retire, whether on a particular SM or across SMs.
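
To make the difference concrete, here is a small host-only sketch in C++ that compares the two behaviors being asked about. It is a toy model, not the hardware's actual scheduling algorithm: `slots` (an assumed parameter) stands for the total number of blocks that can be resident on the GPU at once, `durations` holds made-up per-block run times, and the function names `greedy_makespan` and `wave_makespan` are hypothetical. The first function starts a pending block as soon as any single resident block retires, as described above; the second models the hypothetical "wait for the whole first group" behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Toy model only: `slots` is the assumed number of blocks that can be
// resident on the whole GPU at once; `durations[i]` is the made-up run
// time of block i. Real hardware scheduling is more involved.

// Behavior described in the answer: a pending block starts the moment
// any single resident block retires and frees its slot.
long greedy_makespan(const std::vector<long>& durations, int slots) {
    assert(slots > 0);
    // Min-heap of finish times of the currently resident blocks.
    std::priority_queue<long, std::vector<long>, std::greater<long>> finish;
    long makespan = 0;
    for (long d : durations) {
        long start = 0;
        if (finish.size() == static_cast<std::size_t>(slots)) {
            start = finish.top();  // earliest retirement frees a slot
            finish.pop();
        }
        long end = start + d;
        finish.push(end);
        makespan = std::max(makespan, end);
    }
    return makespan;
}

// Hypothetical alternative from the question: the next group of blocks
// starts only after the entire previous group has finished.
long wave_makespan(const std::vector<long>& durations, int slots) {
    assert(slots > 0);
    const std::size_t group = static_cast<std::size_t>(slots);
    long total = 0;
    for (std::size_t i = 0; i < durations.size(); i += group) {
        const std::size_t last = std::min(durations.size(), i + group);
        long longest = 0;
        for (std::size_t j = i; j < last; ++j) {
            longest = std::max(longest, durations[j]);
        }
        total += longest;  // a group lasts as long as its slowest block
    }
    return total;
}
```

With `durations = {4, 1, 1, 1, 1, 1}` and `slots = 2`, `greedy_makespan` returns 5 while `wave_makespan` returns 6: in the greedy model the short blocks keep cycling through the second slot while the long block is still running. When all blocks take the same time, the two models give the same result.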

Great! Thanks for this answer :)