Scheduling of thread blocks on Streaming Multiprocessors

According to the CUDA manual, at most 8 thread blocks can be scheduled to execute concurrently on the same streaming multiprocessor. The exact number of blocks depends on register and shared memory usage. So are there any details on how thread blocks are assigned to processors, i.e., in a block-wise OR a cyclic way? Normally it requires analyzing register/shared memory usage through the PTX code to work out the exact number of concurrent thread blocks on a processor. Is there an easy way/tool to determine this? Thanks.

Just got the CUDA occupancy calculator. Second problem solved.
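Besides the spreadsheet-based occupancy calculator, later CUDA releases expose the same calculation at runtime through `cudaOccupancyMaxActiveBlocksPerMultiprocessor`. A minimal sketch (the kernel and block size here are just placeholders; the result depends on your kernel's actual register/shared-memory usage):

```cuda
#include <cstdio>

// Placeholder kernel; occupancy depends on its register/shared-memory usage.
__global__ void myKernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

int main() {
    int numBlocks = 0;          // active blocks per multiprocessor (output)
    const int blockSize = 256;  // threads per block, an assumed launch config
    const size_t dynSmem = 0;   // dynamic shared memory per block

    // Ask the runtime how many blocks of myKernel fit on one SM at once.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel,
                                                  blockSize, dynSmem);
    printf("Active blocks per SM: %d\n", numBlocks);
    return 0;
}
```

This avoids re-entering register counts by hand, since the runtime reads them from the compiled kernel itself.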

From a developer’s point of view, you should not expect any predictable behavior in the scheduling of thread blocks.

Sorry, I was trying to ask a very simple question which should be easy to answer. For example, I have 30 Streaming Multiprocessors (as on GT200), the active thread blocks per Multiprocessor is 4, and 120 thread blocks numbered 1 to 120. So when scheduled, all 120 thread blocks will be up and running, i.e., assigned to some Multiprocessor. My question is: at the i-th Multiprocessor, which thread blocks are assigned to it? For example, will the 1st Multiprocessor contain thread blocks numbered 1 to 4, OR thread blocks numbered 1, 31, 61, 91? I think this assignment should be static and follow certain rules, such as block-wise or cyclic.

Also, when a thread block (of the 4 blocks assigned to the same Multiprocessor) quits, will another thread block be scheduled in, OR will the Multiprocessor stay `less' occupied until the other 3 all quit?


And the correct answer, which you have already been provided with, is that the behaviour is undefined.

Just dug up an old yet still-active thread:…st&p=533631

It seems the conclusion reached by that thread is that, as far as GT200 goes (Fermi not included), thread blocks are assigned to TPCs in a round-robin (RR) way, and each Multiprocessor carries one or more thread blocks (up to 8). So in the case where thread blocks numbered 1 to 240 run on a GT200 GPU, with 4 thread blocks per multiprocessor:

(1) the first 120 TBs will be up and running at the same time
(2) TBs 1, 11, 21, 31 will run on Multiprocessor 1; TBs 2, 12, 22, 32 will run on Multiprocessor 2; TBs 3, 13, 23, 33 will run on Multiprocessor 3. They all belong to TPC 1.
(3) scheduling of the second 120 TBs won’t start until the first 120 TBs have all finished.

Is this right? Thanks.
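One way to check such assumptions empirically, rather than rely on them, is to have each block read the `%smid` PTX special register and report which SM it landed on. A hedged sketch (grid/block sizes are arbitrary here; SM IDs and the observed mapping can vary by hardware, driver, and run):

```cuda
#include <cstdio>

// Return the ID of the SM the calling thread is currently running on.
__device__ unsigned int smid() {
    unsigned int id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// One thread per block records which SM the block ran on.
__global__ void recordSM(unsigned int *blockToSM) {
    if (threadIdx.x == 0)
        blockToSM[blockIdx.x] = smid();
}

int main() {
    const int numBlocks = 240;  // matches the example above; arbitrary choice
    unsigned int *d_map, h_map[numBlocks];
    cudaMalloc(&d_map, numBlocks * sizeof(unsigned int));
    recordSM<<<numBlocks, 64>>>(d_map);
    cudaMemcpy(h_map, d_map, sizeof(h_map), cudaMemcpyDeviceToHost);
    for (int b = 0; b < numBlocks; ++b)
        printf("block %3d ran on SM %u\n", b, h_map[b]);
    cudaFree(d_map);
    return 0;
}
```

Note that this only observes one run; since the mapping is undefined, a program still must not depend on whatever pattern this prints.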

If you write an app that depends on that exact behavior, you will rapidly become very unhappy.

Ok… I take your advice. I can understand that with new hardware, i.e., Fermi, and new drivers, scheduling may change. More flexible scheduling is definitely more desirable. And is there going to be an API for arranging the initial assignment of thread blocks?

I’m curious: What’s the use case for an API to send blocks to specific SMs?

What I was considering is quite straightforward. GPUs before Fermi have a 2-level read-only texture cache, and exploiting it requires certain knowledge of how threads use the texture cache, including footprint size, etc. For Fermi, since the caches are coherent and larger, there may be more performance potential in tuning this carefully. The binding of thread blocks to SMs indeed matters in this respect.