Hardware scheduler

I am curious about cta (block) scheduler on GPUs.

As far as I know, there is a GigaThread scheduler engine, implemented in hardware, that schedules CTAs to SMs in round-robin fashion. So we can assume the GPU schedules our CTAs efficiently, since the scheduling is largely dynamic. That’s all great!
However, when we look at the P100 Pascal GPU, there are 56 SMs, and each SM has 32 slots for resident CTAs (I’m not exactly sure about this number, but there is some fixed number of slots).

Now let’s consider all of this in an example. I have a very big array, and I create an excessive number of CTAs, say 1 million, so that each thread handles one index of the array. I assume the GigaThread engine schedules them efficiently whatever the number is.
1- Is scheduling 1M CTAs expensive for the GPU hardware?
2- If I solved the same-sized problem with fewer CTAs, would it be faster?
3- Is there any cost to creating 1M CTAs, or is it free compared to creating fewer?
4- Do CTAs wait in some sort of queue, since the P100 cannot host all the CTAs at the same time? If so, can I somehow measure the waiting time?
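For question 4, I imagine each block could record a timestamp as soon as it starts running, so the spread between the first and last block start times would show how long late blocks sat waiting. A rough sketch of what I mean (all names are mine; `%globaltimer` is the PTX nanosecond-resolution clock):

```cuda
#include <cstdio>
#include <vector>
#include <algorithm>
#include <cuda_runtime.h>

// Each block records the time at which it begins execution, read from the
// globaltimer special register (nanosecond clock shared by all SMs).
__global__ void record_start(unsigned long long *start_times)
{
    if (threadIdx.x == 0) {
        unsigned long long t;
        asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
        start_times[blockIdx.x] = t;
    }
}

int main()
{
    const int num_blocks = 1 << 20;  // 1M blocks
    unsigned long long *d_times;
    cudaMalloc(&d_times, num_blocks * sizeof(unsigned long long));

    record_start<<<num_blocks, 128>>>(d_times);
    cudaDeviceSynchronize();

    std::vector<unsigned long long> h(num_blocks);
    cudaMemcpy(h.data(), d_times, num_blocks * sizeof(unsigned long long),
               cudaMemcpyDeviceToHost);

    // Spread between the earliest and latest block start = total time the
    // last block spent waiting behind all the earlier waves.
    auto mm = std::minmax_element(h.begin(), h.end());
    printf("spread between first and last block start: %llu ns\n",
           *mm.second - *mm.first);
    cudaFree(d_times);
    return 0;
}
```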

Wouldn’t it be trivial to set up a quick experiment where the big array is handled by a variable number of thread blocks, and measure the performance? I have used grids with ~100,000 thread blocks without encountering any negative performance impact, so I would assume using 1M thread blocks wouldn’t be an issue either.
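Such an experiment could look something like the following sketch (kernel and variable names are my own): a grid-stride kernel keeps the total work fixed while the block count varies, and CUDA events time each launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride kernel: total work (n elements) is fixed; the grid size varies.
__global__ void scale(float *x, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        x[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 26;  // ~67M elements
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaEvent_t beg, end;
    cudaEventCreate(&beg);
    cudaEventCreate(&end);

    // Same total work, launched with grids from 1K up to 1M blocks.
    for (int blocks = 1 << 10; blocks <= 1 << 20; blocks <<= 2) {
        cudaEventRecord(beg);
        scale<<<blocks, 256>>>(d_x, n);
        cudaEventRecord(end);
        cudaEventSynchronize(end);
        float ms;
        cudaEventElapsedTime(&ms, beg, end);
        printf("%7d blocks: %.3f ms\n", blocks, ms);
    }
    cudaFree(d_x);
    return 0;
}
```

If block scheduling carried a significant per-block cost, the 1M-block launches would show it as a slowdown relative to the smaller grids doing identical work.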

I don’t recall NVIDIA publishing detailed information on how the block scheduler works (it may well differ by architecture family), but you could probably find out a few things by setting up a microbenchmark. Google Scholar can easily find papers about reverse-engineering various GPU architecture features; they might provide a starting point for setting up your own effort.

njuffa, thank you for your reply.

I’ve tried the quick experiment and seen the same results. Apparently, 1M blocks is not an issue. That is actually my question: why isn’t it an issue, even though 900,000 more blocks are launched?

You are right, I haven’t seen any information from NVIDIA’s side about it. But I’ve come across a couple of research articles, and none of them discuss the overhead (if it exists) of creating many thread blocks.

I would guess there is a handful of people at NVIDIA who know the details of block scheduler operation, and as we can see historically, they are not talking. You could try digging through NVIDIA’s patent filings to see whether anything relevant is written there, but digging through the details of other companies’ patents is usually not advisable if you are employed in industry.

I would speculate that creating thread blocks is cheap, because it basically involves mapping hardware resources according to simple deterministic formulas. So you could envision a state machine feeding into a queue of thread blocks (it probably wouldn’t have to be very deep) in which new blocks get inserted at one end and assigned to the next available SM at the other end.
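One way to probe the assignment end of that hypothetical queue: each block can read the `%smid` special register to record which SM it was placed on. A sketch (names are mine); a plain round-robin scheduler would show block i landing on roughly SM (i mod number-of-SMs) for the first wave.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block records the ID of the SM it was assigned to.
__global__ void which_sm(unsigned int *sm_of_block)
{
    if (threadIdx.x == 0) {
        unsigned int smid;
        asm("mov.u32 %0, %%smid;" : "=r"(smid));
        sm_of_block[blockIdx.x] = smid;
    }
}

int main()
{
    const int num_blocks = 256;
    unsigned int h_map[num_blocks];
    unsigned int *d_map;
    cudaMalloc(&d_map, sizeof(h_map));

    which_sm<<<num_blocks, 64>>>(d_map);
    cudaMemcpy(h_map, d_map, sizeof(h_map), cudaMemcpyDeviceToHost);

    // Print the block -> SM mapping to inspect the assignment pattern.
    for (int i = 0; i < num_blocks; i++)
        printf("block %3d -> SM %u\n", i, h_map[i]);
    cudaFree(d_map);
    return 0;
}
```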

The actual scheduling could be as simple or as complicated as NVIDIA wishes to make it (trade-off: hardware simplicity versus load balancing). I recall that load-balancing issues occurred not infrequently on early GPUs, which is likely indicative of a very simple scheduling algorithm, not much more than straight round-robin. With more transistors available now, scheduling is likely more sophisticated.

I seem to recall a reverse engineering document posted somewhere where the authors figured out the basics of the scheduling algorithm used. I do not have a reference handy, but I believe this came out within the past three years. The “grandfather” of GPU microbenchmarking papers is:

Henry Wong, Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, and Andreas Moshovos. “Demystifying GPU microarchitecture through microbenchmarking.” In 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), pp. 235-246. IEEE, 2010.

If you look for papers citing the above, you can probably find more recent work giving details of newer GPU architectures.