Scheduling block execution – Do multiprocessors block each other?

Hello,

I ran into the following statement:

This is completely new and unexpected to me; perhaps I haven’t read the manual attentively enough. Sarnath, thanks for raising this issue!

Hence a couple of questions, and I would appreciate a comment from an NVidia engineer:

  1. In what way are independent blocks scheduled to run on a board that has several multiprocessors?

Specifically: say I have a kernel that does not require any synchronization across blocks, and the kernel is launched over a large number of blocks. Would a longer execution time of just one block cause the whole GPU board to wait for that block to complete before another set of blocks is scheduled to run?

  2. Is there any performance penalty/gain due to code divergence between blocks that run on different multiprocessors at the same time?

Specifically: say one block happens to run 15 iterations of some iterative process (poor convergence), and another block happens to run just 3 of similar iterations (good convergence). The two blocks run on different multiprocessors. Obviously, the RAM/cache access patterns exhibited by these blocks would be quite different. Is it actually good or bad, from the RAM bus performance standpoint, that the two blocks request data from the RAM at different instances of time?

Thanks!

Of course, all kernels are such that blocks cannot be synchronized. There’s no way to do it.

Also note that, once a block is scheduled on an MP, it runs until completion. Also, an MP can schedule more than one block if there are enough resources to handle them.

My problem was that my kernel would run 3 blocks simultaneously on an MP (CUDA occupancy). Each block has 64 threads, so I get 192 threads executing simultaneously (which is required to hide register latencies). Now say one block completes execution early: that leaves the MP with only 128 active threads (with latencies exposed). My experiments show that the block scheduler is NOT filling the void immediately, thereby exposing latencies. I remember posting some results on that; you may check that out.

No answer from NVIDIA yet. But I think this is too internal an issue, one that NVIDIA people would prefer to stay away from.

Hmm. Which thread did that come from? There was a big debate on the forums a while ago about this. I eventually sort of solved it with a microbenchmark.

As it should be. That is not how the hardware operates. In all the benchmarks I performed and every bit of code I’ve ever run, I have never found any evidence against the idea that the GPU replaces a completed block immediately with a fresh one.

Only NVIDIA knows for sure. Presumably, the “central controller” maintains a list of blocks to run, and whenever an MP is free to execute a block, it pulls one off the list and runs it. The ordering is completely non-deterministic; that much we do know.

No. One long-running block will happily run on its MP for a while. All the other MPs (and the other open slots on the one with the long block) will continue processing blocks as they are completed. The only situation where one block limits the computation time is if the execution time of that one block was so extremely long that it took more time to run than all of the other blocks combined running on all other available MPs. Your “long” block would probably need to be thousands of times longer than the others for this to happen.

No. There is also zero penalty when different warps follow different paths, even within the same block.

Don’t concern yourself with “requesting data from RAM at different instances of time”. This is precisely what the GPU hardware is built for. As long as you have a good memory access pattern (coalesced, or data-local texture reads) and you have enough blocks running to fully saturate all MPs, then the device will reward you with near-peak bandwidth. There is an advanced memory controller that deals with all of the varied requests coming in random orders from all the various warps on the device. It is all automatic and you don’t have to think about anything except your spatial access pattern.
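To illustrate what a “good memory access pattern” means in practice, here is a generic sketch (not tied to any kernel in this thread) contrasting coalesced and strided access:

    // Coalesced: consecutive threads of a warp touch consecutive words, so each
    // warp's request is serviced with a few wide memory transactions.
    __global__ void copyCoalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Strided: consecutive threads touch addresses far apart, so each warp's
    // request splits into many transactions and effective bandwidth drops.
    __global__ void copyStrided(const float *in, float *out, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n)
            out[i] = in[i];
    }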

Well, Mr.Anderson, Here we go again :-)

This is the URL: http://forums.nvidia.com/index.php?showtop…mp;#entry325083

Kindly check the post toward the end. I fixed the bug in your benchmark and posted some results, which were supportive of my theory. I am posting a portion of the final results for your convenience.

But that discussion was on blocks that are completely skipped. Your “non-optimal results” are intuitively consistent with a non-uniform distribution, among the MPs, of the blocks that do actually run. Nobody ever did the Monte Carlo simulations to see what the % workload should be given random scheduling of many skipped blocks among a small number of MPs.

This thread is dealing with a not-so-degenerate case where some of the blocks take longer than others, but none are skipped. So the situation should be better. A new microbenchmark needs to be run to see how things work in this case. Maybe I’ll write one later this week…

So, here is a more carefully designed experiment based on the benchmark in the thread linked above.

Setup:

40,000 blocks run on a GTX 280

The kernel does factor[blockIdx.x] * work units of computation. The amount of work is high enough to make any kernel launch overhead completely negligible.

The kernel compiles to 5 registers.

A block size of 128 is chosen so we have 8 active blocks on each MP.

Benchmark 1 is performed with all factor values = 1. Benchmark 2 is another with all factor values = 2. A third with half of the factor values at 1 and the other half at 2, called benchmark 1.5 for obvious reasons.

Null hypothesis: blocks are not replaced immediately, but are run in “batches”, so when one batch of 8 blocks starts, no new blocks start until the whole batch has finished. Assuming random scheduling for benchmark 1.5, the probability that all blocks in a batch are factor=1 is 0.5 ^ 8 = 0.39%. So, as a best-case time, the null hypothesis predicts that benchmark 1.5 takes (2 * T_bmark1 - 0.003906 * T_bmark1), i.e. essentially as long as benchmark 2.
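(For concreteness, the kernel is along these lines. This is a sketch reconstructed from the description above, not the exact code; factor, work and out are placeholder names.)

    __global__ void benchmark(const int *factor, int work, float *out)
    {
        // Each thread spins for factor[blockIdx.x] * work iterations of simple
        // arithmetic; the result is written out so the compiler cannot remove
        // the loop.
        float acc = threadIdx.x;
        int iters = factor[blockIdx.x] * work;
        for (int i = 0; i < iters; ++i)
            acc += 1.0f;
        out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
    }

    // Host side, error checking omitted:
    //     benchmark<<<40000, 128>>>(d_factor, WORK, d_out);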

Results:

Benchmark 1

447.251831 ms

Benchmark 1.5

733.122986 ms

Benchmark 2

894.075439 ms

The null hypothesis is rejected: blocks are not run in batches. If you don’t believe me, I can give you the full statistical distributions and perform the appropriate test, but I don’t really think it is needed in this case; the numbers are obvious.


Now that we have established that with high confidence, the next question someone will ask is “how immediately” are thread blocks replaced when they complete. I don’t know how to design an experiment to determine that. We can get a bit of a feel for it by comparing the expected time vs. the actual time in the above benchmark as the number of blocks with factor=2 is slowly increased. I’m running that benchmark now and will post the results when it completes.

I can confirm what Sarnath is saying from doing many tests myself, and this does not simply occur when you immediately “skip” a block.

To clarify, for MisterAnderson42 and anyone else’s benefit: what I found through repeated experiments is that if a given kernel invocation’s resource requirements are such that more than one threadblock can execute concurrently on a single multiprocessor (given the resource and maximum warp/thread limits of each multiprocessor), then the set of threadblocks assigned to each multiprocessor is sequentially ordered, and the multiprocessor will not begin processing a new set of threadblocks until ALL threadblocks in the previous set have completed.

So basically, when multiple threadblocks are allocated to a single multiprocessor, the time taken to finish processing those threadblocks is determined by the threadblock in that set whose threads return last.

This isn’t directly related to occupancy, because you can still achieve high occupancy with only one threadblock executing on a multiprocessor by using more threads and thus more warps. But in situations where you are using a modest number of threads with relatively modest shared/register requirements, and the ideal processing time of each block varies drastically within sequential sets of n threadblocks (where n is the number of threadblocks concurrently assigned to the same multiprocessor), this can affect performance significantly.

As one last exhaustive example, in case there was any doubt: try creating 2 kernels that require 256 threads with 10 registers and no shared memory on a 1.0/1.1 device. The device will assign 3 threadblocks concurrently to the same multiprocessor to increase occupancy. Structure the 1st kernel such that every 2nd and 3rd threadblock executes faster than the 1st (by how much makes no difference; you could even have the threads return immediately, with no divergence, for those blocks), and in the 2nd kernel have each threadblock perform the same amount of work. Both kernels will execute in exactly the same amount of time, even though theoretically kernel #1 should complete in less time.
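A rough sketch of the pair of kernels being described (my reconstruction: the loop body is arbitrary, and the actual register count will depend on the compiler):

    // Kernel 1: within each group of 3 consecutive blocks, only the first does
    // the full work; the 2nd and 3rd return almost immediately.
    __global__ void unevenBlocks(float *out, int work)
    {
        float acc = threadIdx.x;
        if (blockIdx.x % 3 == 0)
            for (int i = 0; i < work; ++i)
                acc += 1.0f;
        out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
    }

    // Kernel 2: every block does the full amount of work.
    __global__ void evenBlocks(float *out, int work)
    {
        float acc = threadIdx.x;
        for (int i = 0; i < work; ++i)
            acc += 1.0f;
        out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
    }

    // Launch both with 256 threads per block and time them; the observation is
    // that on a 1.0/1.1 device they take the same amount of time.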

I must admit I’ve never tested this on compute model 1.3 devices because I do not yet own one, but I can say beyond a shadow of a doubt that what Sarnath and myself are saying is at least true for 1.0 and 1.1 devices.

Just a quick question related to one of the responses. Sarnath, you mentioned

“Also note that - once a block is scheduled in a MP – it runs until completion. Also, MP can schedule more than 1 block if there are enough resources to handle them.”

Is that positively true? Once a block is scheduled on an MP, it runs until completion? There is no block swapping in/out on an MP?

Mr. Anderson, the results of your experiment seem to directly contradict parlance’s.

What about the following explanation: the association of blocks with factors in your experiment just happened to be such that all the blocks loaded onto a given MP were [almost] always associated with the same factor?

Just to give you an example of such a setup: say your GPU has 16 MPs and the scheduler allocates block #1 to MP #1, block #2 to MP #2, … block #17 to MP #1, block #18 to MP #2, etc. (round robin). Were I designing the scheduler, this would be my first approach, in order to achieve a better distribution of blocks over MPs when the number of blocks is small. Now, what if your factors were assigned in the following way: factor=1 to block #1, factor=2 to block #2, factor=1 to block #3, factor=2 to block #4, and so on? Then all the blocks executed on MP1 would run with factor=1, on MP2 with factor=2, on MP3 with factor=1, etc. So I am unable to reject Sarnath’s null hypothesis in this particular setup.
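A tiny host-side illustration of the alignment being described (the block-to-MP mapping here is the hypothesis under discussion, not a documented fact):

    #include <stdio.h>

    // Hypothetical round-robin mapping: block b runs on MP (b % numMPs).
    // With an even number of MPs and factors alternating 1,2,1,2,...
    // every MP only ever sees blocks of a single factor.
    int main(void)
    {
        const int numMPs = 16;
        for (int b = 0; b < 32; ++b) {
            int mp     = b % numMPs;
            int factor = 1 + (b % 2);
            printf("block %2d -> MP %2d, factor %d\n", b, mp, factor);
        }
        return 0;
    }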

Please, do post the results of your further investigation. This gets quite intriguing.

@Mr.Anderson,
While using 128 threads per block, just running 2 blocks is enough to hide latencies. That is why your code works as per your expectation. You need to set things up so that the latencies actually bite when there is no dynamic block replacement; that’s what I did in the other thread.
Also, in case you are using your previous code, you need to work on two things:

  1. The number of threads per block.
  2. The skips logic was not right. I think we discussed this.

@Parlance,
We are on the same page. Thanks for posting.

My dear Enemy (wow! That’s an interesting way to address…),

Blocks don’t move around between MPs. They stay stuck to their MP until they die, just like a plant that stays rooted in its soil until death.

To see why, think about it from the other side:

Suppose that blocks could be moved around between MPs… what would have to happen?

  1. One would need to move the shared memory contents between MPs to preserve the values of shared memory variables.

  2. One would need to move the “register” contents between MPs to preserve local variable contents (??!?!?!?!?!)

  3. Not to mention the instruction pointers of the various warps.

  4. The texture cache benefit is forgone.

And above all, it will be s…l…o…w

Latency hiding has nothing to do with my expectations. There are no latencies to hide! The test kernel just does += a few thousand times. I do this intentionally as good experiment design. I’m separating out any complications that latency hiding will add to the analysis. The entire purpose of the experiment is to make groups of blocks take a controllable amount of time to run to completion and test the ability of the block-level scheduler. After all, this entire forum thread is about the block scheduler, right? Latency hiding is an orthogonal issue.

Are you calling me stupid? Of course I fixed the logic in assigning skips. And I told you that I changed the number of threads per block to a given value. Responses like this strongly incline me to stop making these kinds of posts.

I also ran on a 9800 GX2 with the same results; all times were just a little larger, with the same relative ratios.

I don’t see this as a necessary test. The results I posted did not just assume random scheduling vs. round robin; I enforced it by reassigning the factor values over several hundred benchmark runs and looking at the distributions of benchmarked times. Still, it only takes a few minutes to modify the code.

And those few minutes are worth it, I guess. The results are interesting. I expanded the benchmark so that only a quarter of the blocks get factor=2 (benchmark 1.25) too.

Back on the GTX 280:

Block size = 128: 8 blocks per MP

benchmark 1: 179.163712 ms

benchmark 1.25: 269.694183 ms

benchmark 1.5: 357.865417 ms

benchmark 2: 357.891174 ms

Block size = 256: 4 blocks per MP

benchmark 1: 358.290619 ms

benchmark 1.25: 539.473816 ms

benchmark 1.5: 715.666809 ms

benchmark 2: 715.731934 ms

With every other block set to factor=2, the kernel does take the same amount of time as when all blocks are at factor=2. This rejects the hypothesis that individual blocks are replaced immediately. However, benchmark 1.25 takes approximately 1.25 * 179 ms, again rejecting the hypothesis that blocks are run in batches of “num blocks per MP”.

So the reality is more complicated than either “ideal” situation. We are already stretching many assumptions to the limit to get this far, so I’m not optimistic that we can learn much more about the hardware with experiments like this. It would seem that blocks are indeed co-scheduled in groups of some kind (groups of 2, maybe? but then what would the hardware do with an odd number of blocks per MP?). To really learn more, we either need info from NVIDIA (not likely) or a set of more detailed predictive hypotheses to test. As the benchmark mix smoothly goes from 1 to 2 (with the randomly assigned factors), a distinct curve shows up. A Monte Carlo simulation of a given model that reproduces that curve would lend some weight to that model. I, however, have wasted enough time on this already and am not about to do any more.
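(For anyone who does want to try the Monte Carlo route, a toy host-side simulator of the “co-scheduled groups” model could start from something like the sketch below. It is purely illustrative, and the policy of handing the next group to whichever MP frees up first is itself just a guess.)

    #include <stdio.h>
    #include <stdlib.h>

    #define NBLOCKS 40000
    #define NUMMP   30

    // Toy model: each MP takes 'group' blocks at a time, and a group lasts as
    // long as its slowest block.  A block takes 2 time units with probability p,
    // otherwise 1.  Compare the curve (as p goes from 0 to 1) with measurements.
    static double simulate(int group, double p)
    {
        double mpTime[NUMMP] = {0};
        int next = 0;
        while (next < NBLOCKS) {
            int mp = 0;                          // MP that frees up first
            for (int i = 1; i < NUMMP; ++i)
                if (mpTime[i] < mpTime[mp]) mp = i;
            int slowest = 1;                     // duration of the group's slowest block
            for (int i = 0; i < group && next < NBLOCKS; ++i, ++next)
                if ((double)rand() / RAND_MAX < p) slowest = 2;
            mpTime[mp] += slowest;
        }
        double total = 0;
        for (int i = 0; i < NUMMP; ++i)
            if (mpTime[i] > total) total = mpTime[i];
        return total;
    }

    int main(void)
    {
        srand(12345);
        for (double p = 0.0; p <= 1.0; p += 0.125)
            printf("p=%.3f  group=1: %.0f  group=2: %.0f  group=8: %.0f\n",
                   p, simulate(1, p), simulate(2, p), simulate(8, p));
        return 0;
    }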

Here are a few more runs on a 9800 GX2 (:

Block size 96: 8 blocks per MP

benchmark 1: 198.512497 ms

benchmark 1.25: 396.541992 ms

benchmark 1.5: 396.620911 ms

benchmark 2: 396.655670 ms

Block size 128: 6 blocks per MP

benchmark 1: 263.871918 ms

benchmark 1.25: 526.891846 ms

benchmark 1.5: 527.182617 ms

benchmark 2: 527.407776 ms

Block size 192: 4 blocks per MP

benchmark 1: 395.408295 ms

benchmark 1.25: 789.900879 ms

benchmark 1.5: 789.941101 ms

benchmark 2: 789.982422 ms

It would seem that compute 1.1 hardware is indeed not as efficient at block scheduling as compute 1.3, which makes additional analysis even harder.

Adding the random factor assignments back in:

benchmark 1: 198.515320 ms

benchmark 1.25: 275.918335 ms

benchmark 1.5: 321.258392 ms

benchmark 2: 396.679688 ms

Compute 1.1 hardware now handles the situation.

So the moral is: while there is some kind of limitation, the hardware is capable of efficiently handling blocks that take varied amounts of time to execute, as long as you aren’t in the degenerate cases where even-numbered blocks take significantly longer than odd-numbered ones, or where so many blocks take exactly the same amount of time to run. Even a small amount of randomness in the time it takes blocks to complete (e.g. due to latency hiding in real kernels that have global memory reads) will break up this supposed round-robin scheduling and take us away from the degenerate cases where the hardware does not handle the situation well.

Block size (rather, the number of blocks running on each MP) also seems to have a big effect on performance. I don’t see this as a big deal: it is just yet another reason, to add to the already long list of reasons, why one should benchmark kernels over all possible block sizes to determine the fastest configuration on each piece of hardware.

Oh, and as an additional check: I just ran a count through the random factor data (for benchmark 1.25) and found that 4468 of the 5000 groups of 8 blocks have a factor=2 block in them. Keeping your assumption of round-robin scheduling, the hypothesis that blocks are run in groups of 8 on this 9800 GX2 would predict a time of (4468 * benchmark 2 time + 532 * benchmark 1 time) / 5000 = 375 ms. Benchmark 1.25 clocks in at 275.91 ms, again providing evidence against that hypothesis and again confirming that the real way block scheduling works is more complicated than any of us is thinking. Whatever it is doing, the hardware seems to load-balance the random situations best.

The round-robin scheduling was Sarnath’s hypothesis. I personally would wager that the blockIDs of the blocks assigned concurrently to one multiprocessor (when occupancy allows more than one threadblock per MP) are linearly contiguous on 1.0 devices. Even though round robin would be more effective for small numbers of blocks, NVidia may have chosen to do it this way for simplicity of implementation, and on the assumption that few if any kernel calls will ever have such a low total number of blocks as to be affected by this, since that would be a very inefficient way to use the device anyway.

Taking it further, I’d wager, based on the results you just posted, that it isn’t quite as cut and dried as that, and that different devices use different algorithms to decide which blockIDs are assigned to a single multiprocessor: the simplest method (linearly contiguous) on 1.0 devices, and a more advanced distribution on 1.3 devices. Furthermore, it would seem that on 1.0 devices a multiprocessor must completely empty itself of all running threadblocks before accepting a new batch, while on 1.1 and 1.3 devices only N must complete before N can be replaced (when the number of threadblocks resident on a multiprocessor is greater than N). N would seem to vary by device, with better devices getting a smaller N, as this is more efficient but probably harder to implement in hardware. For the GTX 280 you’re testing, N appears to be 2.

I agree, though, that we’re all basically guessing at this point, but your initial assumption that N = 1 for all devices is demonstrably false. I’d like to see NVidia post here and clarify this for us, as it has important performance implications.

@Mr.Anderson,

Sorry about the previous post. I did not mean to offend you in any way. Let us forget it.

And, I do admire your hard work and accuracy - when it comes to proving a theory. Thanks for your time,

Best Regards,
Sarnath

ps: edited to remove some wrong info.

A possible reason why 1.25 behaviour varies between GTX 280 and 9800 GX2 could be the fact that:

  1. GTX280 has 30 multiprocessors - which is not divisible by 4.
  2. GX2 has 16 multiprocessors - which is divisible by 4.

Round-robin scheduling would result in a mix of factor-1 and factor-2 workloads on some MPs (the even-numbered ones) on the GTX 280.
The same scheduling would NOT mix factor-1 and factor-2 workloads within a given MP on the GX2.
At least for the initial loading.

Subsequent loading could depend on which MPs finish their workload (of a block, or a batch of blocks) first. Subsequent loading, down the line, could trigger mixing of 1s and 2s within an MP on the GX2 as well.

Also, if you look at the case of 128 threads with 6 active blocks (GX2): out of the 6, if we assume one block had a factor of 2, then that block would stay behind for some time with only 128 active threads (<192) before finishing. This would expose register latencies (if there were no block replacement). It was this scenario that I had described in the wishlist (and that was quoted again in this thread by cudesnick).

One other idea to get a macro-level view of block scheduling: one can write a kernel which

  1. Uses block 0 to provide a TICK COUNTER. Thread 0 of this block merely increments a global variable continuously.
  2. The other blocks, when they finish their work and exit, record the current TICK COUNTER value in a result array (array size = number of blocks).

This way we can form a macro-level view of the block-scheduling order for a particular kernel. The approach assumes that block 0 will be scheduled first, which is a reasonable premise to start with.
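A sketch of such a kernel (assuming, as above, that block 0 is scheduled first and stays resident; the fixed iteration count for the counter loop is a placeholder, since the real code would need some stopping condition):

    __global__ void traceOrder(volatile unsigned int *tick,
                               unsigned int *order,
                               unsigned int *sink,
                               int work)
    {
        if (blockIdx.x == 0) {
            // Block 0, thread 0 acts as the TICK COUNTER: it just keeps
            // incrementing a global variable for a generously long time.
            if (threadIdx.x == 0)
                for (int i = 0; i < work * 1000; ++i)
                    *tick = *tick + 1;
        } else {
            // Every other block does some dummy work, then records the
            // current tick value on exit.
            unsigned int acc = threadIdx.x;
            for (int i = 0; i < work; ++i)
                acc = acc * 3u + 1u;
            sink[blockIdx.x * blockDim.x + threadIdx.x] = acc;  // keep the loop alive
            if (threadIdx.x == 0)
                order[blockIdx.x] = *tick;
        }
    }

Sorting the order array afterwards gives a rough picture of the order in which blocks finished.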

Here is my attempt at solving the scheduler mystery.

I wrote a program in G80 native assembly to precisely dump all scheduling data.

(Actually I could have written it in PTX instead, but that’s too hi-level for me :) )

Source is at http://gala.univ-perp.fr/~scollang/ware/cu…heduling.tar.gz (cubin included so you don’t need cudasm, only a C compiler).

Usage :

scheduling numblocks mode maxtime smsize

  numblocks : number of blocks

  mode : {constant, random, mod2, mod4, mod16, block16}

  maxtime : target duration in cycles

  smsize : Shared Memory in bytes

It runs a user-given number of blocks of 1 thread each (yeah, it is inefficient, but we only want to test block scheduling here).

Each thread reads a target duration from memory, then runs a busy loop until that duration is elapsed (according to the global clock cycle counter). It then writes its starting and ending dates and Streaming Multiprocessor ID to memory.

The main executable inits the target durations according to a policy given on the command-line (‘constant’ is always maxtime, ‘random’ is random between 0 and maxtime, ‘mod2’ to ‘mod16’ are maxtime when blockIdx % n == 0 and 0 otherwise, ‘block16’ is maxtime for every 16th chunk of 16 blocks.)

Changing the shared memory size makes it possible to control how many blocks can run concurrently on an SM.
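For reference, roughly the same measurement can be sketched in plain CUDA C using clock() and the %smid PTX special register (a sketch only, and note that clock() is a per-SM counter, so unlike the assembly version the start times are only directly comparable between blocks on the same SM):

    __global__ void schedProbe(const int *duration, int *smid,
                               unsigned int *start, unsigned int *elapsed)
    {
        unsigned int t0 = (unsigned int)clock();
        // Busy-wait until the requested number of cycles has elapsed.
        while ((unsigned int)clock() - t0 < (unsigned int)duration[blockIdx.x])
            ;
        unsigned int t1 = (unsigned int)clock();

        unsigned int sm;
        asm("mov.u32 %0, %%smid;" : "=r"(sm));  // which multiprocessor ran this block

        start[blockIdx.x]   = t0;
        elapsed[blockIdx.x] = t1 - t0;
        smid[blockIdx.x]    = (int)sm;
    }

    // Launched with 1 thread per block; requesting extra dynamic shared memory
    // in the launch limits how many blocks fit on one SM:
    //     schedProbe<<<numBlocks, 1, smBytes>>>(d_dur, d_smid, d_start, d_elapsed);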

Some results on a G92 (9800GX2) :

scollang@kenny:~/g80_test/scheduling$ ./scheduling 32 mod2 10000 16000

#Block  TPC  SM   start  duration
     0    0   0     475     10092
     1    1   0     315       492
     2    2   0     234     10089
     3    3   0     175       456
     4    4   0     129     10092
     5    5   0     160       475
     6    6   0      41     10068
     7    7   0       8       483
     8    0   1     473     10092
     9    1   1     314       493
    10    2   1     230     10091
    11    3   1     171       460
    12    4   1     124     10095
    13    5   1     154       477
    14    6   1      36     10055
    15    7   1       0       485
    16    0   0   10691     10074
    17    1   0     961       306
    18    2   0   10478     10057
    19    3   0     814       299
    20    4   0   10365     10056
    21    5   0     842       305
    22    6   0   10229     10066
    23    7   0     715       324
    24    0   1   10715     10054
    25    1   1     981       296
    26    2   1   10490     10097
    27    3   1     821       300
    28    4   1   10368     10069
    29    5   1     844       303
    30    6   1   10232     10065
    31    7   1     715       324

Here, blocks are statically scheduled round-robin. The odd-numbered TPCs 1, 3, 5, 7 finish their work earlier than the even-numbered ones, then have to wait for the other TPCs (this is still true when running more blocks).

So there is not any kind of load-balancing between TPCs, only a static scheduling to which TPCs stick until the end.

Here are other results :

> scheduling 256 mod16 10000 0

> scheduling 256 block16 10000 0

This shows that the scheduler waits until all blocks have finished before starting a new wave of blocks on an SM.

So it seems there is nothing fancy about the CUDA scheduler, just a static round-robin without pipelining… (at least on G92 with 180.44 drivers)

Mr Anderson,

I agree with Sarnath that the results of the experiments you publish here are very informative and thought-provoking.

So, to test this hypothesis, it would be helpful to conduct a “spectral analysis” of multi-block kernel execution time. Specifically, if Mr Anderson could execute his kernel on the GTX 280 with factor allocation steps (with respect to the block ID) of 2, 3, and 5 (divisors of 30, the number of MPs on a GTX 280), then, according to this RR theory, he should see a faster execution time, correct?

Moreover, since GTX260 is essentially the same board with fewer MPs in it, the exact same experiment with different factor allocation steps (divisors of the number of MPs on GTX260, not sure what it is) should uncover different “spectral lines” on that card.

Well. Actually, the allocation steps of 2,3,5 will lead to the slowest times :)

With those in place, we get evidence for the static scheduling again.

benchmark 1: 7.468608 ms

benchmark 1.5: 14.575840 ms

benchmark 1.33: 14.577056 ms

benchmark 1.20: 14.573472 ms

benchmark 2.0: 14.580448 ms

So now we have a pretty good body of evidence, from multiple tests, that the GPU does dumb static round-robin scheduling of blocks.

The problem with applying that theory in the general case is that we have counter-examples where this is definitely not what happens. Firstly, we have the runs where I assigned factor=2 randomly, yet a huge fraction of all the block groups had at least one factor=2 block in them.

Secondly, we have this new run I did. I kept exactly the same setup as in the above run (skips of 2,3,5 on GTX 280 with a block size of 128). What I changed is that I added a global memory read into the time-wasting loop. And we get…

benchmark 1: 346.994934 ms

benchmark 1.5: 529.773499 ms

benchmark 1.33: 470.849701 ms

benchmark 1.20: 427.183044 ms

benchmark 2.0: 692.284912 ms

near perfect load-balancing.

I don’t know how to make much sense of these counter-examples. And I don’t really care: it works in the real-world cases, and what else really matters?