How to do static block scheduling in CUDA? How to evenly assign blocks to SMs?

Hi everyone,
I want to ensure that the same number of blocks is assigned to each SM.
For example, the GPU has 40 SMs and the launch configuration has 400 blocks.
How can I assign exactly 10 blocks to each SM? Is such control over scheduling possible?

What I observe in my CUDA code is that, without any control, some SMs have fewer than 10 blocks while others have more than 10. So it seems to be completely dynamic scheduling without any control.
(The idea is similar to static thread scheduling in OpenMP; I hope that helps describe my example above.)

Thanks!

Not really. CUDA provides no formal or documented methods to do this.

I’ve not observed that, and I’m not sure how you are observing it. The general idea is to assume that the GPU designers are well aware of the idea that blocks should be distributed in a sensible fashion to maximize throughput, and to assume that that is happening. A slight imbalance (e.g. 11 vs. 9 blocks) could possibly arise, consistent with the above notions, if certain scheduling latencies arise. For example, if blocks have varying runtimes, and the number of blocks initially deposited is not 10 (say it is 8), then in the distribution of the remaining blocks, after some have retired, there may be some “imbalance”. This imbalance might still be consistent with the idea of maximum throughput.

It should be possible to “take control” of block scheduling by querying the SMID register, and making block actions/decisions accordingly. That strikes me as a lot of work, atypical, and not likely to be worth the effort or result in any noticeable improvement, in most cases.
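
For reference, reading the SMID register takes only a line of inline PTX. A minimal sketch (the helper name is mine):

```
// Returns the id of the SM on which the calling thread's block is
// currently resident (the %smid PTX special register). The value is
// whatever the scheduler chose at runtime; it is not knowable ahead
// of the launch.
__device__ unsigned get_smid() {
    unsigned smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}
```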

Hi Robert. Thanks for the reply.
Yes, I totally agree with the idea of maximizing throughput. I would call this approach “dynamic scheduling”, which makes sense to me.

However, just as OpenMP has schedule(static), I wonder whether it is possible to tune the scheduling of the launch configuration in CUDA. I could not find such a topic in the CUDA documentation.
What I want to achieve, following the earlier example, is to let each block work on its own unique piece of global memory. The GPU has 40 SMs, so 40 different global memory areas would be allocated, one for the blocks arriving on each of the 40 SMs.

My thinking is that, for a performance gain, each SM works on its own unique global memory, so interference and conflicts are minimized. For such a plan, dynamic scheduling at runtime is not acceptable, because the “slight imbalance” only shows up at runtime and is hard to handle in code. If static scheduling were possible, I could rely on the assumption that every SM is guaranteed to receive the same number of blocks (say, 10). Then my plan would be much easier to implement.
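
To make the plan concrete, the layout I imagine is roughly like this (the area size and names are just placeholders):

```
#define NUM_SMS   40        // SMs on my GPU
#define AREA_SIZE (1 << 20) // words per SM-private area (placeholder)

// inline-PTX SMID query, as in your snippet above
__device__ unsigned get_smid() {
    unsigned smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__global__ void plan_kernel(float *areas) {
    // every block on SM s touches only areas[s*AREA_SIZE .. (s+1)*AREA_SIZE)
    float *my_area = areas + (size_t)get_smid() * AREA_SIZE;
    if (threadIdx.x == 0)
        my_area[0] = 1.0f; // placeholder: the 10 blocks resident on this
                           // SM would cooperate inside my_area
}

int main() {
    float *areas;
    cudaMalloc(&areas, (size_t)NUM_SMS * AREA_SIZE * sizeof(float));
    plan_kernel<<<400, 128>>>(areas); // 10 blocks per SM *if* the split were even
    cudaDeviceSynchronize();
    cudaFree(areas);
    return 0;
}
```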

Could you comment on my implementation plan? Is this assumption too strong, and does it go against CUDA’s idea of “maximum throughput”? The intuition behind my plan is to gain performance by working locally.
(Yes, I am aware of the SMID register and am already querying it. What I want is a “totally fair balance”, which takes more than querying SMID. I do have an array sized to the number of SMs, and I expect every slot in that array to be incremented up to exactly 10 by the blocks landing on that SM, as in the fragment below. That is the “totally fair balance”.)
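
The counting part looks roughly like this today (a trimmed-down fragment; get_smid() is the inline-PTX helper from your reply):

```
#define NUM_SMS 40

__device__ unsigned get_smid() {
    unsigned smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__device__ unsigned per_sm_count[NUM_SMS]; // zeroed at module load

__global__ void my_kernel() {
    // one increment per block, on the counter of the SM it landed on
    if (threadIdx.x == 0)
        atomicAdd(&per_sm_count[get_smid()], 1u);
    // I expect every counter to reach exactly 10 after a 400-block
    // launch on 40 SMs, but the totals I read back are uneven.
}
```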

Yes, you can do static scheduling as you describe. I have no idea if it will provide a performance benefit. The article I linked isn’t focused precisely on that idea, but it shows generally how to make a specific block (on a specific SM) take responsibility for a specific piece of work, so I believe it has most or all of the constructive concepts needed. If you simply search on “cuda smid” you will find other references, including other code examples by me and others.
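
To give the flavor of it, here is an untested sketch (sizes and names are placeholders). Your counter array is actually most of the mechanism: the value atomicAdd returns gives each block a unique rank on its SM, and the (smid, rank) pair identifies the region it takes responsibility for:

```
#define NUM_SMS       40   // SM count in your example
#define BLOCKS_PER_SM 10   // desired quota per SM
#define REGION_WORDS  1024 // words per block-owned region (placeholder)

__device__ unsigned get_smid() {
    unsigned smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__device__ unsigned per_sm_count[NUM_SMS]; // zeroed at module load

__global__ void sched_kernel(float *regions) {
    __shared__ unsigned smid, rank;
    if (threadIdx.x == 0) {
        smid = get_smid();
        // arrival order (0, 1, 2, ...) of this block on its SM
        rank = atomicAdd(&per_sm_count[smid], 1u);
    }
    __syncthreads();

    if (rank < BLOCKS_PER_SM) {
        // this block owns the (smid, rank) slot and its region
        float *my_region = regions +
            ((size_t)smid * BLOCKS_PER_SM + rank) * REGION_WORDS;
        for (unsigned i = threadIdx.x; i < REGION_WORDS; i += blockDim.x)
            my_region[i] = (float)smid; // placeholder work
    }
    // blocks beyond the quota retire immediately; with a plain
    // 400-block launch, some (smid, rank) slots may then go unfilled,
    // and dealing with that is where the real design work is
}

int main() {
    float *d_regions;
    cudaMalloc(&d_regions,
               (size_t)NUM_SMS * BLOCKS_PER_SM * REGION_WORDS * sizeof(float));
    sched_kernel<<<NUM_SMS * BLOCKS_PER_SM, 128>>>(d_regions);
    cudaDeviceSynchronize();
    cudaFree(d_regions);
    return 0;
}
```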

Hi Robert.
But could you enlighten me with a few more hints on doing the static scheduling? From searching “cuda smid” I still cannot find how to guarantee that each SM ends up with exactly 10 blocks.
Thanks!