Warp threads execution model

tonhead · January 15, 2010, 8:44am

CUDA Programming Guide:
“The multiprocessor SIMT unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps.”

Does that actually mean that one instruction is issued by the MP on every tick, and that this instruction is “applied” to all 32 threads of the warp (if no branching occurs within the warp)?
This suggests that a minimum of 32 threads per block is required to achieve a speedup (in other words, running 32 threads in a block will take as much time as running 8 threads in a block, for instance).

True or false?

Sarnath · January 15, 2010, 10:43am

True

tonhead · January 15, 2010, 11:07am

A thread here discusses branch instructions. It was said:

Having 8 scalar cores in MP, and 32 threads in a warp, we can get 4 instructions issued on each scalar core (that is, one instruction per one thread in a warp).

True or false?

Nevertheless, that is only instruction issue effort and can be neglected compared to instruction execution effort. On the other hand, execution effort IS instructions to be issued. Wow, now it’s a mess. Can someone explain the process in detail, please?

Sarnath · January 15, 2010, 11:10am

Well, that is where “deep” pipelines, RAW hazards et al. come into the picture…

I leave the discussion to the more knowledgeable ones like Sylvain et al.

tonhead · January 15, 2010, 11:27am

Thank you, Sarnath, for your replies.

avidday · January 15, 2010, 11:54am

Neither. It takes 4 clock cycles to issue an instruction to all 32 threads in a non-divergent warp. 8 cores x 4 cycles = 32 instructions. So effectively, the MP instruction scheduler is issuing a new instruction at a maximum rate of one per four cycles. Not every instruction can be retired in one clock. Double precision instructions on current hardware must take a minimum of 8 cycles to retire (the double precision FPU is shared by all 8 scalar cores).
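
To put numbers on the issue arithmetic above, here is a rough back-of-the-envelope model in Python (not CUDA code, and not a measurement). It assumes the hardware discussed in this thread (8 scalar cores per MP, 32 threads per warp) and that a partly filled warp still occupies its full issue slots. The function names are made up for this sketch, and it ignores pipeline latency, scheduling between warps, and memory stalls.

```python
import math

WARP_SIZE = 32     # threads per warp, per the Programming Guide quote
CORES_PER_MP = 8   # scalar cores per multiprocessor (assumed: hardware in this thread)


def warps_per_block(threads_per_block):
    """Number of warps a block occupies; a partial warp still counts as one."""
    return math.ceil(threads_per_block / WARP_SIZE)


def issue_cycles_per_instruction(threads_per_block):
    """Cycles to issue one instruction to every warp of the block (simplified)."""
    cycles_per_warp = WARP_SIZE // CORES_PER_MP  # 32 / 8 = 4
    return warps_per_block(threads_per_block) * cycles_per_warp


for n in (8, 32, 33, 64):
    print(n, warps_per_block(n), issue_cycles_per_instruction(n))
```

Under these assumptions a block of 8 threads and a block of 32 threads both cost 4 issue cycles per instruction, which is the sense in which the original "32 threads take as long as 8" statement holds. A block of 33 threads spills into a second warp and costs 8.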

tonhead · January 19, 2010, 11:11am

Is it ever gonna change?

avidday · January 19, 2010, 11:22am

In Fermi, yes - there are now four times as many double precision capable FPUs per multiprocessor. So that theoretical 8-cycle minimum drops to two cycles for Fermi. The double precision retire rate is only the most obvious example. I am fairly sure that one of the NVIDIA people who post here (maybe Simon Green or Mark Harris, I can’t remember off the top of my head) has hinted in the past that not all instructions execute in a single cycle. The programming guide notes that a full 32-bit integer multiply is a very costly instruction on current hardware, for example, which is why there is the __mul24() version, which does execute in a single cycle. It also says that this should change in future hardware.
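
For anyone wondering what the 24-bit multiply gives up, here is a small Python model of the arithmetic (again a sketch, not CUDA code). To avoid sign handling it models the unsigned sibling __umul24(), which multiplies the low 24 bits of each operand and keeps the low 32 bits of the product. The names umul24_model and mul32_model are invented for this illustration.

```python
MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1


def umul24_model(a, b):
    """Model of a 24-bit multiply: the top 8 bits of each operand are ignored."""
    return ((a & MASK24) * (b & MASK24)) & MASK32


def mul32_model(a, b):
    """Model of a full 32-bit multiply: low 32 bits of the product."""
    return (a * b) & MASK32


# Operands that fit in 24 bits: both multiplies agree.
print(umul24_model(1000, 3000), mul32_model(1000, 3000))

# An operand that needs bit 24: the 24-bit version silently drops it.
print(umul24_model(1 << 24, 3), mul32_model(1 << 24, 3))
```

So the cheaper multiply is only a drop-in replacement when both operands are known to fit in 24 bits, which is usually true for thread and block indices but not for arbitrary integers.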

seibert · January 19, 2010, 4:11pm

In fact, the Fermi whitepaper says the ALU has been upgraded from 24-bit to 32-bit multiplication, so the future is soon. :)
