warp scheduler of Fermi architecture

King_Crimson · February 1, 2012, 7:27am

For a device of compute capability 2.0, each SM has 32 cores and 2 warp schedulers. A warp scheduler can issue an instruction to only 16 cores at a time, so it takes two clock cycles to issue an instruction for a full warp of 32 threads. These two warp schedulers allow “two warps to be issued and executed concurrently” (from the Fermi white paper). Does this mean that half of one warp is executed on 16 cores while half of a different warp is executed on the other 16 cores? If so, does this conflict with the statement that “a warp executes one common instruction at a time” (from the programming guide), which seems to imply that all 32 cores should be executing one instruction from the same warp at the same time? Also, what happens on a device of compute capability 2.1, which has 48 cores? Thanks for any clarification!

tera · February 1, 2012, 12:53pm

The answer is “yes” to all yes/no questions.
I believe “a warp executes one common instruction at a time” should not be taken too literally; it is meant in contrast to “a warp executes each of its threads serially”. The meaning of “one instruction at a time” is blurred anyway, as instructions need somewhere between 16 and 24 cycles to execute, but execution is highly pipelined. From a programmer’s point of view, it doesn’t make a difference whether a warp is executed in 16 double-pumped pipelines or 32 single-pumped ones.

“One instruction at a time” gets even more blurred on compute capability 2.1 devices, where two different arithmetic instructions from the instruction stream of one warp may be issued to two groups of 16 cores in parallel. This does, however, make a difference to the programmer (performance-wise), as the scheduler may not be able to find two independent instructions for a warp, or the register bandwidth might not be sufficient to supply all operands for two instructions at the same time.

laughingrice · February 5, 2012, 12:52pm

It’s somewhat complicated unless you know the right hardware terminology. If you want to get more confused, the compute 1.x cards actually have 8 cores per multiprocessor and still have a warp of 32 threads, so each warp instruction takes 4 clock cycles to issue.
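The cycle counts quoted in this thread follow from simple division, and a minimal sketch makes that explicit. The function name `issue_cycles` is made up for illustration, and the model is deliberately simplified: it ignores pipeline latency and any difference between scheduler and core clocks.

```python
import math

WARP_SIZE = 32  # threads per warp on the architectures discussed here


def issue_cycles(lanes: int, warp_size: int = WARP_SIZE) -> int:
    """Cycles needed to issue one warp instruction across `lanes` cores.

    Simplified model: one thread per core per cycle, nothing else considered.
    """
    return math.ceil(warp_size / lanes)


# compute 1.x: 8 cores per multiprocessor
print("compute 1.x:", issue_cycles(8), "cycles per warp instruction")
# compute 2.0: each warp scheduler issues to 16 cores
print("compute 2.0:", issue_cycles(16), "cycles per warp instruction")
```

In this model the warp size stays fixed at 32 and only the number of cores fed by one scheduler changes.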

What actually happens is that each warp has a single instruction counter. Not all threads execute the same instruction at exactly the same time; rather, it takes several cycles for the warp to finish the current instruction, and no thread can continue to the next instruction until all threads have finished the current one. With compute 2.1 it’s even more complicated, as it has 3 x 16 cores but only two instruction schedulers, so you have to use ILP (instruction-level parallelism) to fully utilize the entire multiprocessor.
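To see why ILP matters on compute 2.1, here is a rough back-of-the-envelope model. It assumes, following the description earlier in the thread, that each of the two schedulers can issue up to two independent instructions from one warp per cycle; the function name and parameters are hypothetical, and the model ignores register bandwidth, load/store units and special function units.

```python
def core_group_utilization(
    independent_per_warp: int,
    schedulers: int = 2,              # warp schedulers per multiprocessor
    dispatch_per_scheduler: int = 2,  # assumed dual-issue limit per scheduler
    core_groups: int = 3,             # 3 x 16 cores on compute 2.1
) -> float:
    """Fraction of 16-core groups kept busy per cycle in this toy model."""
    # Each scheduler issues as many instructions as the warp has independent
    # ones available, capped by its assumed dispatch limit.
    issued = schedulers * min(independent_per_warp, dispatch_per_scheduler)
    # There are only `core_groups` groups of cores to receive them.
    return min(issued, core_groups) / core_groups


# No ILP: each scheduler finds one instruction, so only 2 of 3 groups are used.
print(f"no ILP:   {core_group_utilization(1):.2f}")
# With ILP: at least one scheduler dual-issues, so all 3 groups can be used.
print(f"with ILP: {core_group_utilization(2):.2f}")
```

Under these assumptions, code without independent instructions per warp leaves about a third of the cores idle, which matches the point that ILP is needed to use all 48 cores.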
