You can force blocks to be spread evenly across SMs by having each block use all of the shared memory: allocate the maximum permissible amount of shared memory per block, minus any statically allocated shared memory, as dynamic shared memory via the third argument of the <<<>>> launch-configuration operator. This works for most compute capabilities.
This scheme may still fail on CCs 3.7, 5.2, 6.1 and 6.2, where the maximum permissible shared memory per block is half (or less) of an SM's total shared memory, so two or more blocks can still end up resident on the same SM.
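A minimal host-side sketch of that idea (assuming the kernel uses no static shared memory; the kernel body and launch parameters here are placeholders):

```cuda
#include <cstdio>

__global__ void kernel() {
    extern __shared__ char smem[];  // one big slab of dynamic shared memory
    // ... kernel body ...
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Prefer the opt-in limit where available; on newer CCs exceeding the
    // default 48 KiB per block requires opting in via cudaFuncSetAttribute.
    size_t smemPerBlock = prop.sharedMemPerBlockOptin
                              ? prop.sharedMemPerBlockOptin
                              : prop.sharedMemPerBlock;
    cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)smemPerBlock);

    // One block per SM, each claiming the SM's full shared-memory capacity,
    // so no two blocks can share an SM.
    kernel<<<prop.multiProcessorCount, 256, smemPerBlock>>>();
    cudaDeviceSynchronize();
    return 0;
}
```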
Another option is to launch more blocks than there are SMs, discover at runtime how the active blocks are distributed across SMs, and exit all but one block per SM.
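A sketch of that "oversubscribe and prune" idea, assuming each block identifies its SM by reading the %smid special register via inline PTX (the flag-array size and kernel body are placeholders):

```cuda
// Read this block's SM id from the %smid special register.
__device__ unsigned smId() {
    unsigned id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// One claim flag per SM, zero-initialized; size it for the actual device.
__device__ int g_claimed[128];

__global__ void oneBlockPerSm() {
    __shared__ int keep;
    if (threadIdx.x == 0)
        keep = (atomicExch(&g_claimed[smId()], 1) == 0);  // first claimant wins
    __syncthreads();
    if (!keep) return;  // redundant block on an already-claimed SM: exit
    // ... the one surviving block per SM does the real work ...
}
```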
I notice this is getting complicated. But I have successfully used these techniques in the days of Compute Capability 1.x, when it was still possible to outperform the block scheduler with a custom implementation.
It feels really niche, or even misguided, when people start worrying about block scheduling (not saying that it can’t give performance improvements).
I have vague memories of working on the GT200 in 2009, where I managed to store previous block IDs in shared memory so they could be picked up by the next block scheduled there, to determine what it should execute on. Not very robust, to say the least :-)
I never tried passing data in shared memory between blocks. I didn’t have to, because with the custom block scheduler the block would only exit once all work was done.
IIRC the CC 1.x block scheduler was strictly round-robin, so anything that took actual load balance into account would beat it, even though it had to use some global memory atomics.
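The kind of load-balancing scheme described above can be sketched as a persistent-blocks work queue driven by a single global atomic counter; this is an illustrative reconstruction, not the original GT200 code, and the work-item processing is a placeholder:

```cuda
__device__ unsigned g_next = 0;  // index of the next unprocessed work item

__global__ void persistentWorker(float *data, unsigned numItems) {
    __shared__ unsigned item;
    for (;;) {
        if (threadIdx.x == 0)
            item = atomicAdd(&g_next, 1u);  // grab one item for the whole block
        __syncthreads();                    // broadcast the claimed index
        if (item >= numItems) return;       // queue drained: block exits
        // ... all threads of the block cooperatively process data[item] ...
        __syncthreads();                    // finish before claiming the next item
    }
}
```

Because each block pulls work only when it is free, short and long items balance out automatically, which a strictly round-robin static assignment cannot do.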
And I have no problem being described as niche - that’s where I’ve been almost all my life.