- Does a Block stay active on an MP until it finishes the kernel, or does the MP time-slice between other active Blocks?
- Likewise for a Warp: does it stay active until it finishes, or does the MP time-slice between active Warps?
- Can I assume that all Threads in a Warp run in parallel?
No, a block can finish before the whole kernel does, and another block is then scheduled in its place. There can also be more than one block running on an MP at any given time. Warps are indeed time-sliced (among the warps ready for execution).
Yes, you can.
So what you mean is: if there are N active blocks on the same MP, then when Block 0 finishes, Block 1 starts, and so on, without time-slicing between blocks. But inside a block, each warp gets a slice of MP time: warp 0 runs, then after some milliseconds warp 0 waits and warp 1 runs, the same happens to warp 1 while warp 2 runs, then warp 0 continues, and so on…?
Not exactly. If you can have e.g. 4 active blocks per MP, the warps from those 4 blocks are intermingled to hide the access latency to global memory. A single kernel call can queue thousands of blocks per MP, so when the first of those 4 finishes, block number 5 is scheduled in its place, and so on. (Experiments have shown that this scheduling of not-yet-finished blocks is very fast indeed, so the scheduler is very good at keeping the processors busy.)
Also, the warps are scheduled every clock cycle; there is no switching overhead like on a CPU, so it is not like a CPU scheduler.
By "intermingled" do you mean that if the block size is not a multiple of the warp size, threads from block k form a warp together with threads from block k+1?
And by "warps are scheduled every clock cycle", do you mean that every clock cycle there is a switch between active warps, cycling over all of them?
No, a Block with 33 threads will be broken into two warps, the second one having 31 of its 32 threads masked out. Such “reduced” warps will not be merged automatically, hence it’s advised to keep block sizes aligned to warp size.
Warps are intermingled in the sense that the scheduler is free to select any warp from any block running on a given MP for execution on this MP. And warps are time-sliced. For example, if there are 4 blocks queued on an MP, each consisting of 3 warps, there's a total of 12 warps ready to run. At any time, only one of them is executing, but the order of execution and preemption is up to the scheduler (one should not assume anything about how they will be sliced and mixed). If our MP is currently executing warp 0 from block 1 and reaches a point where this warp makes a global memory read (resulting in something like 500 cycles of waiting), it may switch to any other warp from those 12 (e.g. warp 2 from block 0) and execute it instead. There's a scoreboard mechanism that keeps track of the state of memory reads, and when warp 0's data is ready, the scheduler will switch back to it.
AFAIK there should be no cost of switching the context (mainly thanks to the shared register file), so we do not lose cycles when warps get sliced (thus every cycle a new instruction is issued to some warp queued on the MP). At least that's the theory. I'm not sure that's the whole truth, as benchmarks reveal different performance depending on block size (unrelated to occupancy).