Assign blocks to SMs

Is there a way to guarantee that each thread block will be scheduled on a separate hardware SM?

That is, I want to use all my GPU SMs, even if each thread block does not have the maximum number of threads.

I have two GPUs available, a Tesla K40c, and a Titan V.

Use the cudaGetDeviceProperties call to query the number of SMs on your GPU.

Launch at least that many blocks. The GPU block scheduler will deposit one or more blocks per SM, in most cases.
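A minimal sketch of the two steps above, assuming device 0 and a trivial kernel. The helper names (`current_sm_id`, `record_sm`, `launch_one_block_per_sm`) are made up for illustration. Each block reads the PTX special register `%smid` so you can see where it landed. Nothing here forces the placement; it only reports what the block scheduler did.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Returns the ID of the SM the calling thread runs on (PTX special register %smid).
__device__ unsigned int current_sm_id() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Thread 0 of every block records the SM that block was placed on.
__global__ void record_sm(unsigned int *sm_of_block) {
    if (threadIdx.x == 0) sm_of_block[blockIdx.x] = current_sm_id();
}

// Hypothetical helper: launches one block per SM and prints the placement.
// Returns 0 on success, 1 on any CUDA error.
int launch_one_block_per_sm(int device) {
    cudaDeviceProp prop;
    if (cudaSetDevice(device) != cudaSuccess) return 1;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 1;
    const int num_sms = prop.multiProcessorCount;  // number of SMs on this GPU
    printf("%s has %d SMs\n", prop.name, num_sms);

    unsigned int *d_sm = nullptr;
    if (cudaMalloc(&d_sm, num_sms * sizeof(unsigned int)) != cudaSuccess) return 1;

    // "At least that many blocks": here exactly one block per SM.
    record_sm<<<num_sms, 32>>>(d_sm);
    if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(d_sm); return 1; }

    std::vector<unsigned int> h_sm(num_sms);
    if (cudaMemcpy(h_sm.data(), d_sm, num_sms * sizeof(unsigned int),
                   cudaMemcpyDeviceToHost) != cudaSuccess) { cudaFree(d_sm); return 1; }
    for (int b = 0; b < num_sms; ++b) printf("block %d ran on SM %u\n", b, h_sm[b]);

    cudaFree(d_sm);
    return 0;
}

int main() { return launch_one_block_per_sm(0); }
```

Note that the PTX documentation does not promise that `%smid` values are contiguous, so treat them as labels rather than as indices in the range 0 to the SM count minus one.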

Kudos for remembering to add “in most cases”!

For most compute capabilities, you can force blocks to be spread evenly by making each block use all of the shared memory: allocate the maximum permissible amount of shared memory per block, minus the static shared memory used, as dynamically allocated shared memory via the third argument of the <<<>>> launch configuration operator.

This scheme may still fail for CCs 3.7, 5.2, 6.1 and 6.2, where the maximum permissible shared memory size per block is half of the SM's total shared memory or less.
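A rough sketch of the shared memory trick, under a few assumptions: CUDA 9 or newer (for `sharedMemPerBlockOptin` and `cudaFuncSetAttribute`), device 0, and a placeholder kernel called `work`. The helpers `dynamic_smem_request` and `fits_twice` are hypothetical names, not part of CUDA. The sketch asks the occupancy API how many blocks fit per SM with that request, so you can see whether the scheme holds on your card.

```cuda
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; the dynamic shared memory array is sized at launch.
__global__ void work(float *out) {
    extern __shared__ float scratch[];
    scratch[threadIdx.x] = static_cast<float>(blockIdx.x);
    __syncthreads();
    if (threadIdx.x == 0) out[blockIdx.x] = scratch[0];
}

// Hypothetical helper: dynamic bytes to request so static + dynamic hits the per-block maximum.
size_t dynamic_smem_request(size_t max_per_block, size_t static_bytes) {
    return max_per_block - static_bytes;
}

// Hypothetical helper: true if two such blocks could still share one SM.
bool fits_twice(size_t per_block_total, size_t per_sm) {
    return 2 * per_block_total <= per_sm;
}

int main() {
    cudaDeviceProp prop;
    cudaFuncAttributes attr;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    if (cudaFuncGetAttributes(&attr, work) != cudaSuccess) return 1;

    // Newer GPUs allow more than the default per-block limit, but only on request.
    const size_t max_per_block = std::max(prop.sharedMemPerBlock, prop.sharedMemPerBlockOptin);
    const size_t dyn = dynamic_smem_request(max_per_block, attr.sharedSizeBytes);
    if (max_per_block > prop.sharedMemPerBlock &&
        cudaFuncSetAttribute(work, cudaFuncAttributeMaxDynamicSharedMemorySize,
                             static_cast<int>(dyn)) != cudaSuccess) return 1;

    if (fits_twice(max_per_block, prop.sharedMemPerMultiprocessor))
        printf("warning: two blocks may still share an SM on this device\n");

    const int threads = 64;
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, work, threads, dyn);
    printf("occupancy API reports %d resident block(s) per SM\n", blocks_per_sm);

    float *d_out = nullptr;
    if (cudaMalloc(&d_out, prop.multiProcessorCount * sizeof(float)) != cudaSuccess) return 1;
    work<<<prop.multiProcessorCount, threads, dyn>>>(d_out);  // third argument: dynamic shared memory
    const cudaError_t err = cudaDeviceSynchronize();
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

If the occupancy call reports more than one block per SM, the request is not large enough to keep blocks apart on that device, which is the failure case described above.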

Another option is to launch more blocks than SMs, discover at runtime the distribution of active blocks amongst SMs, and exit all but one block per SM.
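One way this could look, as a sketch and not the implementation I used back then: every block reads `%smid`, the first block to arrive on an SM claims it with an atomic, and all later blocks on that SM exit immediately. The surviving blocks then pull work items from a global counter until nothing is left. The names (`one_block_per_sm`, `run_demo`, `kMaxSmSlots`) and the bound of 1024 on `%smid` values are assumptions for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr unsigned int kMaxSmSlots = 1024;  // assumed upper bound on %smid values

__device__ unsigned int current_sm_id() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__global__ void one_block_per_sm(unsigned int *claimed, unsigned int *next_item,
                                 unsigned int num_items, unsigned int *sm_of_item) {
    __shared__ bool winner;
    __shared__ unsigned int item;
    const unsigned int sm = current_sm_id();
    // The first block to arrive on an SM claims it.
    if (threadIdx.x == 0)
        winner = (sm < kMaxSmSlots) && (atomicCAS(&claimed[sm], 0u, 1u) == 0u);
    __syncthreads();
    if (!winner) return;  // uniform across the block, so the whole block exits

    // Surviving blocks fetch work items until none are left.
    while (true) {
        if (threadIdx.x == 0) item = atomicAdd(next_item, 1u);
        __syncthreads();
        const unsigned int my_item = item;
        __syncthreads();  // everyone has read `item` before thread 0 overwrites it
        if (my_item >= num_items) break;
        if (threadIdx.x == 0) sm_of_item[my_item] = sm;  // stand-in for real work
    }
}

// Hypothetical driver: returns 0 on success, 1 on any CUDA error.
int run_demo() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    const unsigned int num_items = 1u << 16;
    unsigned int *claimed = nullptr, *next_item = nullptr, *sm_of_item = nullptr;
    if (cudaMalloc(&claimed, kMaxSmSlots * sizeof(unsigned int)) != cudaSuccess) return 1;
    if (cudaMalloc(&next_item, sizeof(unsigned int)) != cudaSuccess) return 1;
    if (cudaMalloc(&sm_of_item, num_items * sizeof(unsigned int)) != cudaSuccess) return 1;
    cudaMemset(claimed, 0, kMaxSmSlots * sizeof(unsigned int));
    cudaMemset(next_item, 0, sizeof(unsigned int));

    const int blocks = 8 * prop.multiProcessorCount;  // more blocks than SMs
    one_block_per_sm<<<blocks, 128>>>(claimed, next_item, num_items, sm_of_item);
    const cudaError_t err = cudaDeviceSynchronize();

    std::vector<unsigned int> h_claimed(kMaxSmSlots, 0u);
    cudaMemcpy(h_claimed.data(), claimed, kMaxSmSlots * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    unsigned int workers = 0;
    for (unsigned int c : h_claimed) workers += c;
    printf("%u of %d SMs ran a worker block\n", workers, prop.multiProcessorCount);

    cudaFree(claimed); cudaFree(next_item); cudaFree(sm_of_item);
    return err == cudaSuccess ? 0 : 1;
}

int main() { return run_demo(); }
```

This guarantees at most one working block per SM. It does not guarantee that every SM gets one, since that still depends on where the scheduler places the oversubscribed blocks and on how quickly the first winners drain the work.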

I notice this is getting complicated. But I have successfully used these techniques in the days of Compute Capability 1.x, when it was still possible to outperform the block scheduler with a custom implementation.

lol nice :)

It feels really niche or misguided when people start worrying about block scheduling (not saying that it can’t give performance improvements).

I have vague memories of working on the GT200 in 2009, where I managed to store previous block IDs in shared memory that could be picked up by the next block scheduled to determine what it should execute on. Not very robust, to say the least :-)


I never tried passing data in shared memory between blocks. I didn’t have to, because with the custom block scheduler the block would only exit once all work was done.

IIRC the CC 1.x block scheduler was strictly round-robin. So anything that took actual load balance into account would beat it, even given that it had to use some global memory atomics.

And I have no problem being described as niche - that’s where I’ve been almost all my life.

My biggest concern is not performance improvements. My code is for a research project, so I am investigating other issues.

I think I got the point, but is there any way to check SM occupancy with nvprof?