Hardware scheduler

I am curious about cta (block) scheduler on GPUs.

As far as I know, there is a GigaThread scheduler engine, implemented in hardware, that schedules CTAs to SMs in round-robin fashion. So we can assume the GPU schedules our CTAs efficiently, since the scheduling is largely dynamic. That’s all great!
However, when we look at the P100 Pascal GPU, there are 56 SMs, and each SM has 32 slots for resident CTAs (I’m not exactly sure about this number, but there is some fixed number of slots).

Now let’s consider all of this in an example. I have a very big array, and I create an excessive number of CTAs, say 1 million, so that each thread handles one index of the array. I assume the GigaThread engine schedules them efficiently whatever the number is.
1- Is scheduling 1M CTAs expensive for the GPU hardware?
2- If I solved the same-sized problem with fewer CTAs, would it be faster?
3- Is there any cost to creating 1M CTAs, or is it free compared to creating fewer?
4- Do CTAs wait in some sort of queue, since the P100 cannot host all the CTAs at the same time? If so, can I somehow measure the waiting time?
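For question 4, I imagine each block could record a timestamp as soon as it starts running, so the spread between the first and last block start times would show how long late blocks sat waiting. A rough sketch of what I mean (all names are mine; `%globaltimer` is the PTX nanosecond-resolution clock):

```cuda
#include <cstdio>
#include <vector>
#include <algorithm>
#include <cuda_runtime.h>

// Each block records the time at which it begins execution, read from the
// globaltimer special register (nanosecond clock shared by all SMs).
__global__ void record_start(unsigned long long *start_times)
{
    if (threadIdx.x == 0) {
        unsigned long long t;
        asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
        start_times[blockIdx.x] = t;
    }
}

int main()
{
    const int num_blocks = 1 << 20;  // 1M blocks
    unsigned long long *d_times;
    cudaMalloc(&d_times, num_blocks * sizeof(unsigned long long));

    record_start<<<num_blocks, 128>>>(d_times);
    cudaDeviceSynchronize();

    std::vector<unsigned long long> h(num_blocks);
    cudaMemcpy(h.data(), d_times, num_blocks * sizeof(unsigned long long),
               cudaMemcpyDeviceToHost);

    // Spread between the earliest and latest block start = total time the
    // last block spent waiting behind all the earlier waves.
    auto mm = std::minmax_element(h.begin(), h.end());
    printf("spread between first and last block start: %llu ns\n",
           *mm.second - *mm.first);
    cudaFree(d_times);
    return 0;
}
```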

Wouldn’t it be trivial to set up a quick experiment where the big array is handled by a variable number of thread blocks, and measure the performance? I have used grids with ~100,000 thread blocks without encountering any negative performance impact, so I would assume using 1M thread blocks wouldn’t be an issue either.
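Such an experiment could look something like the following sketch (kernel and variable names are my own): a grid-stride kernel keeps the total work fixed while the block count varies, and CUDA events time each launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride kernel: total work (n elements) is fixed; the grid size varies.
__global__ void scale(float *x, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        x[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 26;  // ~67M elements
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);

    // Same total work, launched with grids from 1K up to 1M blocks.
    for (int blocks = 1 << 10; blocks <= 1 << 20; blocks <<= 2) {
        cudaEventRecord(beg);
        scale<<<blocks, 256>>>(d_x, n);
        cudaEventRecord(end);
        cudaEventSynchronize(end);
        float ms;
        cudaEventElapsedTime(&ms, beg, end);
        printf("%7d blocks: %.3f ms\n", blocks, ms);
    }
    cudaFree(d_x);
    return 0;
}
```

If block scheduling carried a significant per-block cost, the 1M-block launches would show it as a slowdown relative to the smaller grids doing identical work.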

I don’t recall NVIDIA publishing detailed information on how the block scheduler works (it may well differ by architecture family), but you could probably find out a few things by setting up a microbenchmark. Google Scholar can easily find papers about reverse-engineering various GPU architecture features; they might provide a starting point for setting up your own effort.

njuffa, thank you for your reply.

I’ve tried the quick experiment and seen the same results. Apparently, 1M blocks is not an issue. That is actually my question: why isn’t it an issue, even though 900,000 more blocks are launched?

You are right, I haven’t seen any information from NVIDIA’s side about it. But I’ve come across a couple of research articles, and none of them discuss the overhead (if it exists) of creating many thread blocks.

I would guess there is a handful of people at NVIDIA who know the details of block scheduler operation, and as we can see historically, they are not talking. You could try digging through NVIDIA’s patent filings to see whether anything relevant is written there, but digging through the details of other companies’ patents is usually not advisable if you are employed in industry.

I would speculate that creating thread blocks is cheap, because it basically involves mapping hardware resources according to simple deterministic formulas. So you could envision a state machine feeding into a queue of thread blocks (it probably wouldn’t have to be very deep) in which new blocks get inserted at one end and assigned to the next available SM at the other end.
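One way to probe the assignment end of that hypothetical queue: each block can read the `%smid` special register to record which SM it was placed on. A sketch (names are mine); a plain round-robin scheduler would show block i landing on roughly SM (i mod number-of-SMs) for the first wave.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block records the ID of the SM it was assigned to.
__global__ void which_sm(unsigned int *sm_of_block)
{
    if (threadIdx.x == 0) {
        unsigned int smid;
        asm("mov.u32 %0, %%smid;" : "=r"(smid));
        sm_of_block[blockIdx.x] = smid;
    }
}

int main()
{
    const int num_blocks = 256;
    unsigned int h_map[num_blocks];
    unsigned int *d_map;
    cudaMalloc(&d_map, sizeof(h_map));

    which_sm<<<num_blocks, 64>>>(d_map);
    cudaMemcpy(h_map, d_map, sizeof(h_map), cudaMemcpyDeviceToHost);

    // Print the block -> SM mapping to inspect the assignment pattern.
    for (int i = 0; i < num_blocks; i++)
        printf("block %3d -> SM %u\n", i, h_map[i]);
    cudaFree(d_map);
    return 0;
}
```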

The actual scheduling could be as simple or as complicated as NVIDIA wishes to make it (trade-off: hardware simplicity versus load balancing). I recall that load-balancing issues occurred not infrequently on early GPUs, which is likely indicative of a very simple scheduling algorithm, not much more than straight round-robin. With more transistors available now, scheduling is likely more sophisticated.

I seem to recall a reverse engineering document posted somewhere where the authors figured out the basics of the scheduling algorithm used. I do not have a reference handy, but I believe this came out within the past three years. The “grandfather” of GPU microbenchmarking papers is:

Henry Wong, Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, and Andreas Moshovos. “Demystifying GPU microarchitecture through microbenchmarking.” In 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), pp. 235-246. IEEE, 2010.

If you look for papers citing the above, you can probably find more recent work giving details of newer GPU architectures.