How to do static block scheduling in CUDA? How to evenly assign blocks to SMs?

Hi everyone,
I want to ensure that the same number of blocks is assigned to each SM.
For example, the GPU has 40 SMs and the launch configuration has 400 blocks.
How can I assign exactly 10 blocks to each SM? Is such control over scheduling possible?

What I observe in my CUDA code is that, without any control, some SMs have fewer than 10 blocks while others have more than 10. So it seems to be completely dynamic scheduling without any control.
(The idea is similar to static thread scheduling in OpenMP; I hope that helps describe my example above.)

Thanks!

Not really. CUDA provides no formal or documented methods to do this.

I’ve not observed that, and I’m not sure how you are observing it. The general idea is to assume that the GPU designers are well aware of the idea that blocks should be distributed in a sensible fashion to maximize throughput, and to assume that that is happening. A slight imbalance (e.g. 11 vs. 9 blocks) could possibly arise, consistent with the above notions, if certain scheduling latencies arise. For example, if blocks have varying runtimes, and the number of blocks initially deposited is not 10 (say it is 8), then in the distribution of the remaining blocks, after some have retired, there may be some “imbalance”. This imbalance might still be consistent with the idea of maximum throughput.

It should be possible to “take control” of block scheduling by querying the SMID register, and making block actions/decisions accordingly. That strikes me as a lot of work, atypical, and not likely to be worth the effort or result in any noticeable improvement, in most cases.
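
For reference, reading the SMID register takes only a line of inline PTX. A minimal sketch (the helper name is mine):

```
// Returns the id of the SM on which the calling thread's block is
// currently resident (the %smid PTX special register). The value is
// whatever the scheduler chose at runtime; it is not knowable ahead
// of the launch.
__device__ unsigned get_smid() {
    unsigned smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}
```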

Hi Robert. Thanks for the reply.
Yes, I totally agree with the idea of maximizing throughput. I would call this approach “dynamic scheduling”, which makes sense to me.

However, just as OpenMP has schedule(static), I wonder whether it is possible to tune the scheduling of the launch configuration in CUDA. I could not find such a topic in the CUDA documentation.
What I want to achieve, following the earlier example, is to let each block work on its own unique piece of global memory. The GPU has 40 SMs, so 40 different global memory areas would be allocated, one for the blocks arriving on each of the 40 SMs.

My thinking is that, for a performance gain, each SM works on its own unique global memory, so interference and conflicts are minimized. For such a plan, dynamic scheduling at runtime is not acceptable, because the “slight imbalance” only shows up at runtime and is hard to handle in code. If static scheduling were possible, I could rely on the assumption that every SM is guaranteed to receive the same number of blocks (say, 10). Then my plan would be much easier to implement.
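
To make the plan concrete, the layout I imagine is roughly like this (the area size and names are just placeholders):

```
#define NUM_SMS   40        // SMs on my GPU
#define AREA_SIZE (1 << 20) // words per SM-private area (placeholder)

// inline-PTX SMID query, as in your snippet above
__device__ unsigned get_smid() {
    unsigned smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__global__ void plan_kernel(float *areas) {
    // every block on SM s touches only areas[s*AREA_SIZE .. (s+1)*AREA_SIZE)
    float *my_area = areas + (size_t)get_smid() * AREA_SIZE;
    if (threadIdx.x == 0)
        my_area[0] = 1.0f; // placeholder: the 10 blocks resident on this
                           // SM would cooperate inside my_area
}

int main() {
    float *areas;
    cudaMalloc(&areas, (size_t)NUM_SMS * AREA_SIZE * sizeof(float));
    plan_kernel<<<400, 128>>>(areas); // 10 blocks per SM *if* the split were even
    cudaDeviceSynchronize();
    cudaFree(areas);
    return 0;
}
```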

Could you comment on my implementation plan? Is this assumption too strong, and does it go against CUDA’s idea of “maximum throughput”? The intuition behind my plan is to gain performance by working locally.
(Yes, I am aware of the SMID register and am already querying it. What I want is a “totally fair balance”, which takes more than querying SMID. I do have an array sized to the number of SMs, and I expect every slot in that array to be incremented up to exactly 10 by the blocks landing on that SM, as in the fragment below. That is the “totally fair balance”.)
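
The counting part looks roughly like this today (a trimmed-down fragment; get_smid() is the inline-PTX helper from your reply):

```
#define NUM_SMS 40

__device__ unsigned get_smid() {
    unsigned smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__device__ unsigned per_sm_count[NUM_SMS]; // zeroed at module load

__global__ void my_kernel() {
    // one increment per block, on the counter of the SM it landed on
    if (threadIdx.x == 0)
        atomicAdd(&per_sm_count[get_smid()], 1u);
    // I expect every counter to reach exactly 10 after a 400-block
    // launch on 40 SMs, but the totals I read back are uneven.
}
```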

Yes, you can do static scheduling as you describe. I have no idea if it will provide a performance benefit. The article I linked isn’t focused precisely on that idea, but it shows generally how to make a specific block (on a specific SM) take responsibility for a specific piece of work, so I believe it has most or all of the constructive concepts needed. If you simply search on “cuda smid” you will find other references, including other code examples by me and others.
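
To give the flavor of it, here is an untested sketch (sizes and names are placeholders). Your counter array is actually most of the mechanism: the value atomicAdd returns gives each block a unique rank on its SM, and the (smid, rank) pair identifies the region it takes responsibility for:

```
#define NUM_SMS       40   // SM count in your example
#define BLOCKS_PER_SM 10   // desired quota per SM
#define REGION_WORDS  1024 // words per block-owned region (placeholder)

__device__ unsigned get_smid() {
    unsigned smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

__device__ unsigned per_sm_count[NUM_SMS]; // zeroed at module load

__global__ void sched_kernel(float *regions) {
    __shared__ unsigned smid, rank;
    if (threadIdx.x == 0) {
        smid = get_smid();
        // arrival order (0, 1, 2, ...) of this block on its SM
        rank = atomicAdd(&per_sm_count[smid], 1u);
    }
    __syncthreads();

    if (rank < BLOCKS_PER_SM) {
        // this block owns the (smid, rank) slot and its region
        float *my_region = regions +
            ((size_t)smid * BLOCKS_PER_SM + rank) * REGION_WORDS;
        for (unsigned i = threadIdx.x; i < REGION_WORDS; i += blockDim.x)
            my_region[i] = (float)smid; // placeholder work
    }
    // blocks beyond the quota retire immediately; with a plain
    // 400-block launch, some (smid, rank) slots may then go unfilled,
    // and dealing with that is where the real design work is
}

int main() {
    float *d_regions;
    cudaMalloc(&d_regions,
               (size_t)NUM_SMS * BLOCKS_PER_SM * REGION_WORDS * sizeof(float));
    sched_kernel<<<NUM_SMS * BLOCKS_PER_SM, 128>>>(d_regions);
    cudaDeviceSynchronize();
    cudaFree(d_regions);
    return 0;
}
```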

Hi Robert.
But could you enlighten me with a few more hints on doing the static scheduling? From searching “cuda smid” I still cannot find how to guarantee that each SM ends up with exactly 10 blocks.
Thanks!