Warp scheduler of the Fermi architecture

For devices of compute capability 2.0, each SM has 32 cores and 2 warp schedulers. A warp scheduler can issue an instruction to only 16 cores at a time, so it takes two clock cycles to issue an instruction for all 32 threads of a warp. These two warp schedulers allow “two warps to be issued and executed concurrently” (from the Fermi white paper). Does this mean that half of one warp is executed on 16 cores while half of a different warp is executed on the other 16 cores? If so, does this conflict with the statement that “a warp executes one common instruction at a time” (from the programming guide), which seems to imply that all 32 cores should be executing one instruction from the same warp at the same time? Also, what happens on a device of compute capability 2.1, which has 48 cores per SM? Thanks for any clarification!

The answer is “yes” to all yes/no questions.
I believe “a warp executes one common instruction at a time” should not be taken too literally. It is meant in contrast to “a warp executes its threads one after another”. The meaning of “one instruction at a time” is blurred anyway, as instructions need somewhere between 16…24 cycles to execute but execution is highly pipelined. From a programmer’s point of view, it doesn’t make a difference whether a warp is executed in 16 double-pumped pipelines or 32 single-pumped ones.

“One instruction at a time” gets even more blurred on compute capability 2.1 devices, where two different arithmetic instructions from the instruction stream of one warp may be issued to two groups of 16 cores in parallel. This does however make a difference to the programmer (performance-wise), as the scheduler may not be able to find two independent instructions for a warp, or the register bandwidth might not be sufficient to supply all operands for two instructions at the same time.

It’s somewhat complicated unless you know the right hardware terminology. If you want to get more confused, the compute 1.x cards actually have only 8 cores per multiprocessor and still have a warp of 32 threads, so each warp instruction takes 4 clock cycles to issue.
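To make the arithmetic concrete, here is a minimal sketch of the issue-cycle count discussed above. The function name `issue_cycles` and the constant `kWarpSize` are made up for illustration, and the model is deliberately simplified: it only divides the warp size by the number of cores one scheduler can feed per clock and ignores pipeline latency entirely.

```cpp
// Simplified model (an assumption, not a hardware spec): the number of clock
// cycles needed to issue one warp-wide instruction when the scheduler can
// feed `lanes_per_issue` cores per clock. Uses ceiling division so a partial
// group of threads still costs a full cycle.
constexpr int issue_cycles(int warp_size, int lanes_per_issue) {
    return (warp_size + lanes_per_issue - 1) / lanes_per_issue;
}

constexpr int kWarpSize = 32;  // threads per warp on all devices discussed here

// Compute 1.x: 8 cores per multiprocessor -> 4 cycles per warp instruction.
static_assert(issue_cycles(kWarpSize, 8) == 4, "compute 1.x");

// Compute 2.0: each scheduler issues to 16 cores -> 2 cycles per warp instruction.
static_assert(issue_cycles(kWarpSize, 16) == 2, "compute 2.0");
```

The `static_assert` lines just restate the two cases from this thread; they say nothing about how long an instruction takes to complete, only how long it takes to issue.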

What actually happens is that each warp has a single instruction counter (program counter). Not all threads issue the same instruction at exactly the same time. Rather, it takes several cycles for the warp to finish issuing the current instruction, and no thread can continue to the next instruction until all threads have finished the current one. With compute 2.1 it’s even more complicated, as it has 3 x 16 cores but only two warp schedulers, so you have to use ILP (instruction-level parallelism) to fully utilize the entire multiprocessor.
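As a rough illustration of what “use ILP” means at the source level, here is a sketch in plain C++ (so it compiles without a GPU). The function names are hypothetical. In a real kernel the same loop bodies would sit inside a `__global__` function, and whether the hardware actually dual-issues anything depends on the compiler and the scheduler, which this sketch does not show.

```cpp
#include <cstddef>
#include <vector>

// One dependency chain: every add needs the result of the previous add,
// so there is never a second independent instruction available to issue.
float sum_single_chain(const std::vector<float>& x) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        acc += x[i];
    }
    return acc;
}

// Two independent chains: the add into acc0 and the add into acc1 do not
// depend on each other, which gives a scheduler two candidate instructions
// from the same thread at each step.
float sum_two_chains(const std::vector<float>& x) {
    float acc0 = 0.0f;
    float acc1 = 0.0f;
    std::size_t i = 0;
    for (; i + 1 < x.size(); i += 2) {
        acc0 += x[i];
        acc1 += x[i + 1];
    }
    if (i < x.size()) {
        acc0 += x[i];  // leftover element when the length is odd
    }
    // Note: this changes the order of the float additions, so for general
    // inputs the result can differ from the single chain by rounding.
    return acc0 + acc1;
}
```

Both functions compute the same sum for inputs that are exactly representable (small integers, for example). The only difference is the dependency structure, which is the thing that matters on compute 2.1.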