I think I have a new addition to our theory of static RR scheduling.
My initial experiments (plus Mr. Anderson's) suggest that the hardware schedules blocks across all the MPs as one batch, i.e. block replacement does NOT happen until all MPs have finished their current blocks.
I need to experiment a bit more before confirming this. I will post source code and results later (it's a holiday here today).
This would explain why Mr. Anderson observed the GREATEST (rather than the smallest, as I incorrectly suggested) time for his step=2,3,5 experiment. Indeed, when he overloaded some MPs with several slow kernels, all the other MPs had to wait for the overloaded ones to finish.
I wonder what Mr. Anderson's experiment would yield for step = 15, 20* and 30 on a GTX 280. If Sarnath's static RR scheduling theory is correct, then these runs should be as slow as the other “spectral lines” (step = 2,3,5), i.e. finish in about 14.57 ms, even though the fraction of slow kernels is only ~3% in the step=30 experiment.
During the first 500 cycles, blocks 0 to 127 are scheduled to the 16 SMs of the 8 TPCs.
Block 0 runs for 10097 cycles on SM 0 of TPC 0, while all other blocks run for 500 to 1000 cycles.
Now around cycle 1000, SMs of TPCs 1 to 7 start executing new waves of blocks while Block 0 is still running.
But all blocks scheduled on both SMs of TPC 0 have to wait until Block 0 finishes, even though SM 1 stands idle all this time.
From an architectural point of view, I believe there is a separate scheduler inside each TPC. Each of those schedulers is assigned a bunch of blocks round-robin at startup, and then runs them without caring about the other TPCs.
TPC 0 is assigned blocks 0, 8, 16, 24 … to 120 and not blocks 0, 1, 2… to 15.
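Under that hypothesis, the block-to-TPC mapping is just a modulo. A minimal sketch (the 8-TPC count is taken from the example above; the function name is my own, not anything from the hardware docs):

```cpp
#include <cassert>

// Hypothesized static round-robin dispatch: block i is pinned to
// TPC (i % num_tpcs), so TPC 0 gets blocks 0, 8, 16, ..., 120
// rather than the contiguous range 0..15.
int tpc_of_block(int block_id, int num_tpcs) {
    return block_id % num_tpcs;
}
```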
This is good for load-balancing if adjacent blocks are more likely to behave the same way.
However, since the TPC is where the texture cache resides, this is probably bad for texture sampling:
If we assume that blocks with very different block numbers are likely to access very different locations in memory, while blocks with close IDs will access closer locations, this scheduling should cause a lot of contention in caches, poor data reuse rate and a high cache miss rate as a result.
(Disclaimer: this is a hypothesis which needs confirmation by a specific benchmark, and even then the effect may be completely insignificant in real-world apps.)
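One way to quantify the hypothesis is to count the distinct cache lines a co-resident set of blocks touches. A toy model, assuming each block streams a contiguous range of lines plus a small halo shared with its neighbour (all sizes here are made up for illustration, not measured):

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy cache-footprint model: block b reads cache lines
// [b*lines_per_block, (b+1)*lines_per_block + halo). Returns how many
// distinct lines the whole co-resident set touches; fewer distinct
// lines means more inter-block sharing (better for a tiny cache).
long footprint(const std::vector<int>& block_ids, int lines_per_block, int halo) {
    std::set<long> lines;
    for (int b : block_ids)
        for (long l = (long)b * lines_per_block;
             l < (long)(b + 1) * lines_per_block + halo; ++l)
            lines.insert(l);
    return (long)lines.size();
}
```

With 4 lines per block and a 2-line halo, the contiguous set {0..15} touches 66 distinct lines, while the strided set {0, 8, ..., 120} touches 96: the strided assignment forgoes all inter-block sharing.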
With only an 8 KB texture cache to begin with, you'd have to be processing a very tiny amount of data per block for anything to survive to the next block anyway, so it is probably a non-issue for performance; the cache is really only large enough to accelerate fetches within a block. Thanks for your tests: for me they confirmed what I suspected, and this information can be used to gain about a 30% boost in my application.
I meant that in the example, all 16 blocks 0, 8, 16, … 120 run on the same TPC at the same time. So they will all compete for the same 16 kB cache instead of collaborating by sharing cache lines. The tininess of the cache only aggravates this.
I’m not even dreaming cached data could survive from a block scheduling round to the next. :)
So there is an actual practical use of knowing the architecture better…
Nope. My experiment disproves this theory. The block dispatcher waits for all the multiprocessors to finish their jobs before scheduling the next wave.
Thus, if one MP is still doing some busy work, all the other MPs will sit idle waiting for that one MP to finish its job.
From your post, I understand one TPC can control multiple SMs. So, according to you, an 8800 GTX should have 8 TPCs with 2 SMs each. My experiment shows blocks are scheduled to all TPCs at a time; if one TPC is busy, the rest all wait for it to complete.
I have written a micro-workload, and it also indicates that the block scheduling policy is round-robin. In detail, block 0 is assigned to TPC 0, block 1 to TPC 1, and so on (my card is a 9600 GT: 4 TPCs, each with 2 SMs). This is the same as Sylvain Collange described: the physical scheduling unit is the TPC rather than the SM.
But my experimental data also show that blocks are dispatched batch by batch, with no load balancing. In detail, each batch's run time is determined by the longest-running block in that batch. In other words, if one block runs for a long time, then even after all the other blocks have exited, the block dispatcher cannot assign the next batch of blocks to the TPCs until that longest block finishes. This conclusion is the same as Sarnath described.
My workload is simple: each block has only one thread. The time-burning block repeatedly loads from global memory (takes 8.54 ms); the other blocks just check their block ID and exit immediately (0.041 ms; whether a block loads global memory or exits is determined by its block ID). By testing, the number of blocks concurrently active on the GPU at one time is 64 (8×8: 8 SMs times a maximum of 8 active blocks per SM), which also matches the CUDA occupancy calculator's result.
blocks   run time (ms)   notes
64       0.041           all blocks check their block ID and exit
1        8.54            only one block, loads global memory
64       8.54            only block 0 loads memory; the others exit immediately
65       17.29           only blocks 0 and 64 load memory, others exit; blocks 0-63 are the first batch, block 64 the second
128      17.31           only blocks 0 and 64 load memory, others exit; blocks 0-63 are the first batch, blocks 64-127 the second
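These numbers fit a simple batch model: with 64 blocks resident at a time, a run of N blocks takes ceil(N/64) batches, and each batch costs as much as its slowest block. A sketch of that model (the 8.54 ms and 0.041 ms per-block times are the measured values above; launch overhead, which the model ignores, would explain the small gap between a predicted 17.08 ms and the measured 17.29 ms):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Batch-scheduling model: blocks run in fixed-size batches, and the next
// batch is not dispatched until every block in the current one has
// finished, so each batch costs as much as its slowest member.
double batch_runtime(const std::vector<double>& block_ms, size_t batch_size) {
    double total = 0.0;
    for (size_t i = 0; i < block_ms.size(); i += batch_size) {
        size_t end = std::min(i + batch_size, block_ms.size());
        total += *std::max_element(block_ms.begin() + i, block_ms.begin() + end);
    }
    return total;
}
```

For 65 blocks with blocks 0 and 64 slow, the model gives 8.54 + 8.54 = 17.08 ms, close to the measured 17.29 ms; for 64 blocks with only block 0 slow, it gives exactly 8.54 ms.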
Did you check whether you missed the “volatile” qualifier? (Use ld.volatile if you are writing PTX.) Maybe the load gets translated into just spinning on a register.
Also, which GPU did you profile on, and how many multiprocessors does it have? Perhaps it has so many multiprocessors that all 255 blocks are scheduled in one stretch. Or perhaps your results are not valid because your kernel crashed… (I am just enumerating some options, no offence please.)
No, I have never used the fence thing. I use CUDA 2.1.
My hypothesis explains Mr. Anderson's test results too, but your experiments seem to say otherwise. Are you sure block 255 is executed while block 0 is still running? How did you verify that? Can you share some details?
Well, I code directly in assembly to avoid compiler issues, I dump the TPC and SM IDs to know exactly which block is scheduled on which TPC/SM, and I used my own timeout to avoid crashing the kernel, so I hope I got myself covered here ;)
However I don’t know how much confidence I can have in the device clock() function. I know the counter is local to each SM (or each TPC), and is not guaranteed to be synchronized with the other SM counters. (It looks like they are synchronized at the beginning of the program, but running functions such as cuMemset/cudaMemset can desynchronize them. Or I just don’t understand anything.)
It would be good if you could memset the gendtimes and gid arrays to zero before launching the kernel, so we know they were written by the kernel (and not possibly left over from a previous execution of the same application).
I am yet to go through your entire code. I am still looking. Thanks.
After thinking twice about the results, I think you are right and my analysis was wrong.
The likely explanation for my results is that after a TPC finishes its work (its wave of blocks), it is put in a low-power state while waiting for the other TPCs to finish their own share of the work.
To save power, clock-gating techniques are used, and this causes the TPC or SM timer to stop incrementing until wakeup.
So when the next wave of blocks is scheduled, the clock register will still contain the time when the TPC was put to sleep instead of the actual current time.
Which means all my measurements were meaningless (at least those based on the clock. The TPC and SM IDs should still be valid).
Sorry for the confusion…
(Well, this is good news for me: the simpler the scheduling policy is, the easier it will be to implement in Barra. And then say: “we improved the scheduling algorithm and get a X% speedup” :) )
Resurrecting this old thread:
Has anyone seen whether Fermi has better block scheduling? We KNOW the block scheduler is very different on Fermi (since you can even run multiple kernels on the same SM), so this simple “wave of blocks” scheduling has likely changed as well.
And a second point. If blocks really are scheduled in batches, we could completely hide that inefficiency by doing our own scheduling. Make a single global atomic integer. At the start of your block, atomically increment it and use the returned value as your “work unit number.” If that value is greater than the number of work units you have, your block exits. When the block has finished its work unit, it goes back to the start and does the atomic increment again. So to the scheduler it is really a single persistent wave of blocks, but you are feeding new work units onto all the SMs asynchronously, exactly as you would want for efficiency.
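A host-side sketch of that idea, using C++ threads to stand in for the persistent blocks and std::atomic for the global counter (on the GPU the fetch-and-add would be an atomicAdd on an int in global memory; this is an illustrative simulation, not CUDA code, and all names are my own):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Persistent-block work queue: a fixed number of workers ("blocks")
// repeatedly grab the next work-unit index from a shared atomic counter
// until the work runs out, so one long-running unit never stalls the
// dispatch of the remaining units to the other workers.
void run_work_queue(int num_workers, int num_units, std::vector<int>& done) {
    std::atomic<int> next{0};
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&] {
            for (;;) {
                int unit = next.fetch_add(1);   // atomicAdd on the GPU
                if (unit >= num_units) return;  // no work left: "block" exits
                done[unit] += 1;                // stand-in for the real work
            }
        });
    }
    for (auto& t : workers) t.join();
}
```

Each unit index is claimed by exactly one worker, so every unit is processed exactly once regardless of how the workers' speeds vary.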
Hmm, I will try this on some of my kernels to see if it makes any difference on G200 (and/or Fermi). I never expected it to, but this thread showed me that assumption may not hold.