Processing of a warp


How are the 32 warp threads executed concurrently (assume same instructions case) by SM with 8 processors?


The ALUs are clocked 4x faster than the instruction decoder. Thus an instruction is docoded and then the ALUs are ticked 4 times with the same instruction => warp size of 32.

And the pipelining in the stream processors requires more instructions in-flight than there are ALUs for best efficiency. Folding up a 32-thread warp to run 4 deep in each ALU does this very well.

it’s actually 2x, which is why the half-warp matters for some things. I believe it’s a forward compatibility issue–16-wide is conceivable in the future, and this way it wouldn’t affect CUDA at all.