I’m trying to understand how the Fermi warp scheduler works. So far I have figured out the following:
Each thread block (TB) is assigned to a specific streaming multiprocessor (SM). A TB is never migrated between SMs, because the TB's shared memory and the thread contexts of its threads are allocated on the assigned SM.
Each SM has two instruction dispatch units, each of which issues instructions to 16 CUDA cores (and a few other units) in parallel (32 threads in total).
Each SM picks two different warps and schedules them half-by-half (the first 16 cores run a half-warp of the first warp while the other 16 cores run a half-warp of the second warp), using the two dispatch units to drive the 32 cores. I guess the half-warp memory-coalescing restrictions come from this half-warp scheduling/dispatch policy (?).
My question is: do those two warps come from the same TB, or could they come from two different TBs assigned to the same SM? I'm also a bit puzzled about how this dual-issue mechanism increases efficiency (as opposed to having one dispatch unit per SM and scheduling one whole warp at once); any explanations are very welcome!
I know I don't need this information to program in CUDA; this is for a little survey (research).