I can’t seem to find a complete explanation of exactly how kernels get scheduled and executed, so let me ask if I’ve got this correct.
- The GPU selects a block that hasn't yet been executed and assigns it to a single, available MP.
- A block is only ever assigned to one MP (so that shared memory works; see the sketch below).
- An MP is only assigned one block at a time: it will not be assigned another block until all threads in that first block have finished executing. [?]
So block scheduling takes place at the MP level. The diagram in the CUDA reference manual implies that blocks are allocated to MPs round-robin before execution, but also that blocks can execute in any order.
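For concreteness, here's a minimal kernel sketch of my own (not from the manual; it assumes a launch with 256 threads per block, a power of two) showing why I believe a block must live on one MP: `__shared__` storage and `__syncthreads()` are both block-scoped, and the per-block results can land in any order.

```cuda
// Hypothetical reduction kernel (my own sketch). __shared__ storage is
// per-block and __syncthreads() only barriers threads within one block,
// which is why a block can't span MPs. Assumes blockDim.x == 256.
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float cache[256];              // one copy per resident block
    unsigned int tid = threadIdx.x;
    cache[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                          // barrier for THIS block only

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            cache[tid] += cache[tid + s];     // tree reduction in shared memory
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = cache[0];           // blocks may finish in any order
}
```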
Now it gets hazy…
Each of the 12 MPs (on my GT240) has 8 cores. So the GPU assigns up to 8 of the block's warps, one warp to a core, within the block's assigned MP.
If there are fewer than 8 warps in the block, cores go unused. [This doesn't sound right.]
If the block contains more than 8 warps, the GPU will assign the remaining warps to cores either as previous warps complete, or when a running warp stalls waiting on a resource (like memory I/O, or a __syncthreads() barrier).
I'm not clear on whether warps can switch to alternative cores, or whether a core can have multiple warps assigned to it.
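One piece I'm fairly confident about is the decomposition itself: which warp and lane a thread belongs to is fixed purely by its thread index. A toy kernel (my own, hypothetical names) to make that concrete:

```cuda
// Toy kernel: a thread's warp and lane are determined by its index
// within the block; warpSize is the CUDA built-in (32 on current parts).
__global__ void whoAmI(int *warpIds, int *laneIds)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    warpIds[tid] = threadIdx.x / warpSize;   // which warp within the block
    laneIds[tid] = threadIdx.x % warpSize;   // position within that warp
}
```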
Each warp executes its 32 (or fewer) threads in lock-step, one instruction every 4 clock cycles (I/O waits notwithstanding).
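One consequence of the lock-step model that I can at least test: a branch that splits threads within a warp serializes both paths (inactive lanes masked off), while a branch that splits exactly at a warp boundary costs nothing extra. A sketch of my own, assuming warpSize is 32:

```cuda
// Divergence demo (my own sketch). The first branch splits lanes WITHIN
// every warp, so each warp executes both paths serially under a lane
// mask. The second branch splits at a warp boundary, so each warp takes
// exactly one path.
__global__ void divergence(float *a, float *b)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (threadIdx.x % 2 == 0)        // diverges inside every warp
        a[tid] = 1.0f;
    else
        a[tid] = 2.0f;

    if (threadIdx.x < warpSize)      // diverges only between warps
        b[tid] = 1.0f;
    else
        b[tid] = 2.0f;
}
```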
And now the fog descends…
- I don't believe all 8 warps on the 8 cores execute independently in parallel, but I don't have a clear picture of how they execute relative to the clock. Something about an MP only being able to issue one instruction per clock cycle?
How this varies across compute architectures is another question.
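At least the raw per-device numbers I've been reasoning from can be queried rather than guessed, using the standard runtime call cudaGetDeviceProperties:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the hardware parameters this whole question hinges on.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0
    printf("%s: %d MPs, warp size %d, compute capability %d.%d\n",
           prop.name, prop.multiProcessorCount, prop.warpSize,
           prop.major, prop.minor);
    return 0;
}
```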
What score do I get for the above, or can someone point me towards the best online material covering this rather hardware-oriented perspective?