Scheduling block execution: Do multiprocessors block each other?

@MrAnderson,

Thanks for the experiments.

I think I have a new addition to our theory of static RR scheduling.

My initial experiments (plus Mr. Anderson's experiments) tend to suggest that the hardware schedules blocks to all the MPs as one batch, i.e. block replacement does NOT happen until all MPs have finished their jobs.

I need to experiment a bit more before confirming this. I will post source code and results later (it's a holiday here today).

Thanks,

Best Regards,
Sarnath

PTX is high-level?? Oh OK… You guys wrote Barra… Understandable.

Cool. My initial experiments are showing that the scheduler waits for all the MPs to finish their jobs before placing the new wave of blocks. Can you confirm this?

Also, btw, what is a TPC?

Happy holiday!

This would explain why Mr. Anderson observed the GREATEST (rather than the smallest, as I incorrectly suggested) time for his step = 2, 3, 5 experiment. Indeed, when he overloaded some MPs with several slow kernels, all the other MPs had to wait for the overloaded ones to finish.

I wonder what Mr. Anderson's experiment would yield for step = 15, 20* and 30 on a GTX 280? If Sarnath's static RR scheduling theory is correct, then these runs should be as slow as the other "spectral lines" (step = 2, 3, 5), i.e. finish in about 14.57 ms, even though the fraction of slow kernels is only ~3% in the step = 30 experiment.

*EDIT: step=20 is irrelevant to this experiment

TPC = Texture Processing Cluster.

TPCs are the parts that contain a texture cache and 2 (G8x-G9x) or 3 (GT200) Streaming Multiprocessors.

So the G92 has 8 TPCs, each containing 2 SMs, and the GT200 has 10 TPCs, each with 3 SMs.

I included it in the results because it is actually related to scheduling:

Almost.

I would say instead: "the scheduler waits for all the MPs (SMs) of a TPC to finish their jobs…"

Consider this experiment:

> scheduling 256 mod128 10000 0

During the first 500 cycles, blocks 0 to 127 are scheduled to the 16 SMs of the 8 TPCs.

Block 0 runs for 10097 cycles on SM 0 of TPC 0, while all the other blocks run for 500 to 1000 cycles.

Now around cycle 1000, SMs of TPCs 1 to 7 start executing new waves of blocks while Block 0 is still running.

But all blocks scheduled in both SMs of TPC 0 have to wait until Block 0 finishes, even though SM 1 stands idle all this time.
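(For anyone who wants to approximate this kind of trace in plain CUDA C rather than raw assembly, here is a rough sketch of the logging idea. This is my own reconstruction, not the actual test from the scheduling package: the array names are made up, and the %smid PTX special register used here only reports an SM index, from which the TPC would have to be derived, and is only exposed on later PTX versions.)

#include <cuda_runtime.h>

// Sketch: thread 0 of every block records which SM the block ran on and the
// block's start time / duration, read from that SM's cycle counter.
__global__ void logSchedule(unsigned int *blockSm,
                            unsigned int *blockStart,
                            unsigned int *blockDuration)
{
	if (threadIdx.x == 0) {
		unsigned int smid;
		asm("mov.u32 %0, %%smid;" : "=r"(smid));	// physical SM this block landed on
		unsigned int start = (unsigned int)clock();

		// ... the block's actual workload would go here ...

		blockSm[blockIdx.x]       = smid;
		blockStart[blockIdx.x]    = start;
		blockDuration[blockIdx.x] = (unsigned int)clock() - start;
	}
}

(As discussed further down in the thread, the per-SM counters are not guaranteed to be synchronized across SMs, so the recorded start times are only directly comparable within one SM.)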

From an architectural point of view, I believe there is a separate scheduler inside each TPC. Each of those schedulers is assigned a bunch of blocks round-robin at startup, and then runs them without caring about the other TPCs.

Also, note the scheduling order.

TPC 0 is assigned blocks 0, 8, 16, 24 … to 120 and not blocks 0, 1, 2… to 15.
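(In other words, under this hypothesis the first wave is dealt out round-robin; a toy illustration of the assumed mapping, not anything from the actual test:)

// Toy illustration of the observed round-robin deal of the first wave of blocks:
// with 8 TPCs (G92), TPC 0 receives blocks 0, 8, 16, ..., 120.
int tpcForBlock(int blockIdInWave, int numTPCs)
{
	return blockIdInWave % numTPCs;
}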

This is good for load-balancing if adjacent blocks are more likely to behave the same way.

However, since the TPC is where the texture cache resides, this is probably bad for texture sampling:
if we assume that blocks with very different block numbers are likely to access very different memory locations, while blocks with close IDs access closer locations, this scheduling should cause a lot of cache contention, poor data reuse and, as a result, a high cache miss rate.

(Disclaimer: this is a hypothesis which needs confirmation by a specific benchmark, and even then the effect may be completely insignificant in real-world apps.)

With only an 8 kB texture cache to begin with, you'd have to be processing a very tiny amount of data from it in each block for anything to survive to the next block anyway, so it is probably a non-issue for performance. The cache is really only large enough to accelerate fetches within a block. Thanks for your tests; for me they confirmed what I suspected, and this information can be used to gain about a 30% boost in my application.

I meant that in the example, all 16 blocks 0, 8, 16, … 120 run on the same TPC at the same time. So they all compete for the same 16 kB cache, instead of collaborating by sharing cache lines. The tininess of the cache only aggravates this.

I’m not even dreaming cached data could survive from a block scheduling round to the next. :)

Great!

So there is an actual practical use of knowing the architecture better…

Nope. My experiment disproves this theory. The block dispatcher waits for all the multiprocessors to finish their job before scheduling the next wave.

Thus, if one MP is busy with some work, all the other MPs sit idle waiting for that one MP to finish its job.

–edit–

From your post, I understand one TPC can control multiple SMs. So, according to you, an 8800 GTX should have 8 TPCs with 2 SMs each. My experiment shows blocks are scheduled to all TPCs at once. If one TPC is busy, all the rest wait for it to complete.

–end edit–

Proof below. See this code:

#include <stdio.h>

/* Block 0 spins until the last block in the grid has set its flag;
   every other block just writes its flag and exits. */
__global__ void mykernel(volatile int *data)
{
	if (blockIdx.x == 0)
	{
		while (data[gridDim.x-1] == 0);
	} else {
		data[blockIdx.x] = blockIdx.x + 1;
	}
}

void callKernel(int device)
{
	struct cudaDeviceProp p;
	int totalBlocks;
	void *gpuPtr;
	cudaError_t error;

	cudaGetDeviceProperties(&p, device);

	/* Two full waves: 8 resident blocks per multiprocessor, times 2 */
	totalBlocks = (p.multiProcessorCount*8)*2;

	printf("Number of Multi-processors = %d in device %d\n", p.multiProcessorCount, device);
	printf("Spawning %d blocks with 32 threads each\n", totalBlocks);

	cudaMalloc(&gpuPtr, totalBlocks*sizeof(int));
	cudaMemset(gpuPtr, 0, totalBlocks*sizeof(int));

	mykernel <<< totalBlocks, 32 >>> ((int*)gpuPtr);
	cudaThreadSynchronize();

	error = cudaGetLastError();
	printf("Error = %d\n", error);

	return;
}
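(For completeness, a trivial driver for the function above; this main is my addition and assumes the test runs on device 0:)

int main(void)
{
	callKernel(0);	/* run the test on device 0 */
	return 0;
}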

The code is simple and straightforward. On my 8800 GTX (16 MPs), this code times out with error code 6.

If I replace "gridDim.x-1" with any value < 128 (16*8), the code works.

I have yet to see the details of your experiment. In the meantime, kindly share your thoughts on the experiment above.

PS:

My previous idea of a "ticking counter" in block 0 would never have worked in the first place.

Best Regards,

Sarnath

I have written a micro-workload, and it also indicates that the block scheduling policy is RR. In detail, block 0 is assigned to TPC 0, block 1 to TPC 1, and so on (my card is a 9600 GT, with 2 SMs per TPC and 4 TPCs in total). This is the same as Sylvain Collange described: the physical scheduling unit is the TPC rather than the SM.

But my experimental data also show that blocks are dispatched batch by batch, with no load balancing. In detail, each batch's run time is determined by the longest block in that batch. In other words, if one block runs for a long time, then even after all the other blocks have exited, the block dispatcher can't assign the next batch of blocks to the TPCs until that longest block finishes. This conclusion is the same as Sarnath described.

My workload is simple: each block has only one thread. The time-burning blocks are implemented by repeatedly loading from global memory (needing 8.54 ms), while the other blocks only check their block ID and exit immediately (needing 0.041 ms; whether a block loads global memory or exits is determined by its block ID). By testing, the number of blocks concurrently active on the GPU at one time is 64 (8*8: the first 8 is the 8 SMs, the second 8 is the maximum number of active blocks per SM; this 64 also matches the CUDA occupancy calculator's result).

blocks	run time (ms)	notes
64	0.041	all blocks check their block ID and exit
1	8.54	only one block, which loads global memory
64	8.54	only block 0 loads memory, the others exit immediately
65	17.29	only blocks 0 and 64 load memory, the others exit; blocks 0-63 are the first batch, block 64 is the second batch
128	17.31	only blocks 0 and 64 load memory, the others exit; blocks 0-63 are the first batch, blocks 64-127 the second batch
256	34.31	only blocks 0, 64, 128, 196 load memory, the others exit; blocks 0-63, 64-127, 128-195, 196-255 are batches 0, 1, 2, 3 respectively
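(For reference, a rough reconstruction of this kind of workload. This is not the original code: the pointer-chasing chain and all names are mine, and it implements the variant where the first block of each 64-block batch is slow. Launched as batchProbe<<<numBlocks, 1>>>(...), one thread per block as in the original test.)

// Reconstruction sketch: block 0 of each 64-block batch burns time with
// dependent global loads; every other block checks its ID and exits at once.
// chain[] is a host-initialized index permutation so the loads stay serialized.
__global__ void batchProbe(const int *chain, int *sink, int burnLength)
{
	if (blockIdx.x % 64 != 0)
		return;				// "fast" blocks exit immediately

	int idx = 0;
	for (int i = 0; i < burnLength; ++i)
		idx = chain[idx];		// serialized global loads burn time

	sink[blockIdx.x] = idx;			// keep the loop from being optimized away
}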

You are right. Your program crashed my machine as expected. :)

Thanks for providing the test.

I rewrote your code in assembly and merged it with mine so I could see what's happening.

Results show that block 255 is scheduled and finishes early, while block 0 is still running.

But block 0 still never sees the value change, and times out.

So maybe this is not a scheduling issue, but rather a memory coherency issue.

For example, the value stored by block 255 may still reside in the internal buffers of TPC 7 and never get written to DRAM.

(But this hypothesis still does not explain why it works with less than 128 blocks…)

Did you try to use __threadfence() and compile with CUDA 2.2?
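(For reference, the writer side of Sarnath's test would then look something like this. This is only a sketch of the suggestion: __threadfence() requires CUDA 2.2, and adding it may or may not change the observed behaviour.)

__global__ void mykernel(volatile int *data)
{
	if (blockIdx.x == 0) {
		// Consumer: spin until the flag written by the last block becomes visible.
		while (data[gridDim.x-1] == 0);
	} else {
		data[blockIdx.x] = blockIdx.x + 1;
		__threadfence();	// push the store out of the SM/TPC write buffers
	}
}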

Did you check that you haven't missed the "volatile" qualifier? (Use ld.volatile if you are using PTX.) Maybe it gets translated to just spinning on a register.

Also, which GPU did you profile it on? How many multiprocessors does it have? Perhaps it has so many multiprocessors that all the blocks are scheduled in one stretch. Or perhaps your results are not valid because your kernel crashed… (I am just enumerating some options, no offence intended.)

No, I have never used the fence thing. I use CUDA 2.1.

My hypothesis explains Mr. Anderson's test results too, but your experiments seem to say otherwise. Are you sure block 255 is executed while block 0 is still running? How did you verify that? Can you share some details?

Thanks,

Best Regards,

Sarnath

Well, I code directly in assembly to avoid compiler issues, I dump the TPC and SM IDs to know exactly which block is scheduled on which TPC/SM, and I used my own timeout to avoid crashing the kernel, so I hope I got myself covered here ;)

However I don’t know how much confidence I can have in the device clock() function. I know the counter is local to each SM (or each TPC), and is not guaranteed to be synchronized with the other SM counters. (It looks like they are synchronized at the beginning of the program, but running functions such as cuMemset/cudaMemset can desynchronize them. Or I just don’t understand anything.)

I updated my test program at http://gala.univ-perp.fr/~scollang/ware/cu…heduling.tar.gz.

The new option ‘lock’ performs your locking test. The ‘maxtime’ argument is the timeout.

For example for

./scheduling 256 lock 100000 0

I get:

#Block	TPC	SM	start	duration
0	0	0	522	100231
1	1	0	354	537
2	2	0	273	575
...
126	6	1	6	486
127	7	1	30	511
128	0	0	100883	212
129	1	0	1166	210
...
255	7	1	1554	234

Which tells me that block 0 was scheduled on TPC 0, SM 0 from cycle 522 to 100753, and block 255 was scheduled on SM 1 of TPC 7 from 1554 to 1788.

Of course I am far inside the undocumented zone, so there is absolutely no guarantee that the counters I use are meaningful…

Thanks for your time.

It would be good if you could memset gendtimes and gid to zero before launching the kernel, so we know they were written by the kernel (and not possibly left over from a previous execution of the same application).

I have yet to go through your entire code. I am still looking. Thanks.

From sarnath_kernel in scheduling.asm:

loopstart:
	mov.u32 $r4, g[$r1]
	mov.u32 $r3, 0
	set.eq.s32 $p0|_, $r4, $r3
	@$p0.ne mov.b32 $r6, %clock
	@$p0.ne sub.b32 $r6, $r6, $r7
	@$p0.ne set.lt.u32 $p0|_, $r6, $r5
	@$p0.ne bra.label loopstart

–edited–

Am I right in saying that the predicate condition on the last four statements (@$p0.ne) is NOT OK?

Both the compared values are 0 initially for block 0. So block 0 does NOT loop and comes out making way for other blocks.

Is that right?

Predicates in G80 assembly are tricky and misleading. Actually, they are 4-bit flags which contain the Zero, Negative, Carry and Overflow bits.

They can be used together with a condition code to predicate an instruction. For instance @$p0.ne executes the operation when bit Zero is not set.

This is consistent with the way condition code works in most other architectures.

The set instruction is primarily used to set a register, for example:

set.eq.s32 $r0, $r4, $r3

will assign 0xffffffff to r0 if r4=r3, and 0 otherwise.

Now most (all?) ALU instructions can write to a predicate. The set instruction just follows the general case.

So the Zero and Negative bits of the predicate are computed from the result of the instruction, here 0 or 0xffffffff = -1.

The underscore is a bit bucket to say we don’t want the result to be written to a register, as we are only interested in the predicate.

Instructions predicated with @$p0.ne will be executed when bit Zero is not set, or equivalently when the output of set is not zero, or when r4 = r3.

This is misleading because the condition looks reversed.

Then

@$p0.ne set.lt.u32 $p0|_, $r6, $r5

will update the predicate again if it was already set.

This is the same behavior as (cond1 && cond2) in C, where cond2 is not tested when cond1 is false.

So what I meant was:

while(scratch[gridDim.x-1] == 0 && clock() - start_time < limit) {}

@Sylvain,

Thanks for the detailed explanation. I have encountered such things in other architectures as well. It makes sense now.

Will post after analyzing the code. Thanks.

After thinking twice about the results, I think you are right and my analysis was wrong.

The likely explanation for my results is that after a TPC finishes its work (wave of blocks), it is put in a low-power state while waiting for the other TPCs to finish their own share of work.
To save power, clock-gating techniques are used, and this causes the TPC or SM timer to stop incrementing until wakeup.

So when the next wave of blocks is scheduled, the clock register will still contain the time when the TPC was put to sleep instead of the actual current time.
This means all my measurements were meaningless (at least those based on the clock; the TPC and SM IDs should still be valid).

Sorry for the confusion…

(Well, this is good news for me: the simpler the scheduling policy is, the easier it will be to implement in Barra. And then we can say: "we improved the scheduling algorithm and got an X% speedup" :) )

Sylvain,

I don't know much about power saving.

But as a programmer, I can say that whatever you found is a brilliant catch!

Good Luck!

Best Regards,
Sarnath

Resurrecting this old thread:
Has anyone seen whether Fermi has better block scheduling methods? We KNOW the block scheduler is very different on Fermi (since you can even run multiple kernels on the same SM), so this simple "wave of blocks" scheduling has likely changed as well.

And a second point: if blocks really are scheduled in batches, we could completely hide that inefficiency by doing our own scheduling. Make a single global atomic integer. At the start of your block, you increment the integer and use THAT value as your "block number". If the value is greater than the number of work units you have, your block exits. When the block has finished its work, it goes back to its start and does the atomic increment again. So to the scheduler it's really a single persistent block, but you're filling in new work units asynchronously to all the other SMs, which is exactly what you want for efficiency.

Hmm, I will try this on some of my kernels to see if it makes any difference on GT200 (and/or Fermi). I never expected it to, but this thread showed me that that assumption may not hold.
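(A minimal sketch of this persistent-blocks scheme; the names workCounter, doWorkUnit and persistentKernel are placeholders of mine, not an existing API, and global 32-bit atomics require compute capability 1.1 or later.)

// Persistent-blocks sketch: each resident block repeatedly grabs the next work
// unit from a global counter instead of relying on the hardware block scheduler.
// Reset workCounter (e.g. via cudaMemcpyToSymbol) before every launch.
__device__ unsigned int workCounter = 0;

__device__ void doWorkUnit(unsigned int unit)
{
	// ... the real per-"block" work goes here ...
}

__global__ void persistentKernel(unsigned int numWorkUnits)
{
	__shared__ unsigned int unit;

	while (true) {
		if (threadIdx.x == 0)
			unit = atomicAdd(&workCounter, 1u);	// grab the next "block number"
		__syncthreads();

		if (unit >= numWorkUnits)
			return;					// nothing left: the block exits

		doWorkUnit(unit);				// process this work unit
		__syncthreads();				// everyone done before the next grab
	}
}

Launched with just enough blocks to fill the machine (e.g. the number of SMs times the number of blocks that fit per SM), this keeps every SM busy until the work queue is drained, regardless of how the hardware dispatches waves.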

It’s a totally different scheduling mode. It’s a lot better.

As to your second idea: it works on GT200 with suitably heterogeneous block runtimes; it doesn't work on GF100, despite the improved atomic performance.