Is this possible? I checked with CUDA programming guide 2.3 and could not find a clear answer:
“When a multiprocessor is given one or more thread blocks to execute, it splits them into warps that get scheduled by the SIMT unit. The way a block is split into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0.”
From above words, my understanding is that warp CAN consist threads from different blocks but which threads from different block consist the warp is not clear. Is this correct? Thanks!
No, it doesn’t work that way. A warp is always threads from a single block. If the number of threads in a block isn’t an even multiple of the warp size, “dummy” threads are added to the warp, and their execution masked out. Experimentation and poring over the NVIDIA patents suggests that new blocks are not scheduled on a multi processor until all current executing blocks are completed.
And regarding your answer: “new blocks are not scheduled on a multi processor until all current executing blocks are completed”, another question comes to me about the relation between block and SMs:
If I issue N blocks and there are n<N SMs on my GPU, these N blocks will wait in a queue for available SMs.
So even if the work load of each block is quite different, the GPU SMs are still always busy if there is enough blocks (assuming threads doesn’t stall).
Is this correct?
E.g., 4 blocks * 32 threads/block, and
block 0 does 32 addings,
block 1 does 64 addings,
block 2 does 64 addings,
block 3 does 32 addings,
and there are 2 SMs on GPU, the work load of SMs are still balanced because SM1 does 96 addings (from block 0 and 2) and SM2 also does 96 addings (from block 1 and 3) though the work load of block are not. Right?
That isn’t necessarily the case. A multiprocessor can run up to 8 blocks concurrently, if there are sufficient resources (shared memory, total thread count and registers). The rest of your question isn’t necessarily true either. You might want to look at section 5.2 of the programming guide again.
The example may not be accurate. My case is that the tasks cannot be divided evenly among blocks or threads, so I’m trying to assign many blocks and in my understanding as long as one block finishes all its tasks, it would exit from SM and another waiting block will get in (I understand that there may already several other blocks running in that SM).
If my above understanding is correct, I can still achieve good load balancing among SMs (although not among blocks) because every SM is always busying with tasks (except the last round where no block is waiting so some SMs are idle). Do you think this is correct?
Repeating what I already posted, no it most likely isn’t correct. The consensus view is that scheduling of blocks on current hardware is done on a simple round robin basis, and a multiprocessor must reach an idle state (all scheduled blocks must exit) before new blocks are scheduled, and the block allocation is done in groups of blocks whose size is determined by the constraints described in section 5.2 of the programming guide. So the scheduling model is basically “first-come, first served” - there is no load balancing or anything like that.
No. A block is retired when all its threads are finished executing (meaning they have executed one of the return/exit paths in the kernel code). Any thread that is stalled waiting for instructions or data will prevent its parent block from being retired. If you manage to engineer an intrablock synchronization race condition (some threads sitting at a __syncthreads instruction while others have already exited for example), the block will effective “hang” and never retire without host side intervention.
Is it known if the idle state must be reached per multiprocessor or for the entire card? That is, if one multiprocessor finishes its blocks substantially before the others, will it get new blocks immediately, or does it have to wait for the other multiprocessors?
My understanding (I couldn’t find the long thread from last year where this was thrashed out), that it is at the MP level. So when a multiprocessor has retired all its current active blocks, it is scheduled with more. Certainly the alternative hypothesis (that blocks are replaced as they are retired individually, rather than as MP groups) was disproven.
On the bright side of things, from the marketing material on Fermi’s “Gigathread” engine and the specification that Fermi can run up to 4 kernels concurrently it seems clear (hopefully) that Fermi will replace blocks immediately when they finish. I suspect there will be a number of people testing this when Fermi comes out.
As an interesting side-point to this, it seems that this is true even on AMD GPUs. The reason being that in graphics, it is necessary to perform updates on pixels in-order and blocks that are launched at once typically update a single frame before a new wave of blocks are launched to update a new frame. If these are scheduled together, allowing blocks from the first group to complete after the second group would allow pixels from previous frames to bleed into newer frames. They could easily get around this by adding explicit rather than implicit barriers between groups of blocks, and maybe that is what we will see in Fermi.
You’re right that it’s an easy experiment and one we’ll all test with! The steps are straightforward:
Get a Fermi board.
Make a simple kernel that takes a LONG time to compute… just a while loop or something spinning for one second.
Launch the kernel asynchronously 32 times with ONE BLOCK each. These launches need to be from the same context, but with different streams. The 3.0 docs states Fermi won’t multi-schedule kernels from different streams.
Measure the wall clock execution time.
On any G200 device, the runtime will be 32 seconds… A kernel will launch its block, most of the SMs will sit idle while one spinloops the dummy kernel. After it’s done, the next kernel will launch. Repeat 32 times for 32 seconds.
On a 16 SM Fermi… we’re not sure what will happen.
It could be that the total time takes 2 seconds… each SM effectively gets one kernel with one block, so the 16 SMs each process 2 kernels and finish in 2 seconds.
The other possibility, and probably the likely behavior, is the run will take 8 seconds… we know Fermi can simultaneously execute 4 kernels. So 4 blocks will run at once, 12 SMs will sit idle and 4 SMs will keep executing 1-block kernels.
(The above is based on a 16 SM Fermi, but it may be 14SMs in actual parts… we’re not sure yet!)
A variant of the test above is to make 32 different unique kernels… it might show if Fermi can multischedule identical kernels more times than it can multischedule different kernels. (Just use a C++ template to force “custom” kernels by numeric template argument)
The BIG question is whether or not OS graphics acts like a “kernel”… allowing you to devote 1/4 of your card to display and use the remainder for processing with no watchdog time limit. This is by far the most important question to answer in practice.
i meant test for showing behaviour in scheduling blocks. Anyway thank you for test on gigathread :) Actually i do not doubt running time will take <= 8 sec, cause ability for 4 kernels simultaneously is officially announced