Fermi doesn't keep all execution units busy?

Uncle_Joe · February 24, 2010, 5:08am

I asked this earlier in the CUDA 3.0 thread, but not many people will find it.

Can someone enlighten me on what’s good about the dual warp scheduler? Originally, I thought it allowed issuing 2 instructions that use different execution units in 1 cycle, so that a load could execute at the same time as an arithmetic instruction. That would likely be a 1.5x speedup for shared memory convolution code, which, without register blocking, needs 2 loads for every arithmetic instruction:

before:

slot 0: load
slot 1: load
slot 2: mad

after (hopefully):

slot 0: load
slot 1: load mad
slot 2: load
slot 3: load mad
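
The 1.5x figure can be checked with a toy issue-slot count. This is only a sketch in Python, not a model of real Fermi hardware: it assumes every instruction takes exactly one issue slot and that a mad can always be paired with a load when dual issue is allowed. The function names are made up for illustration.

```python
def slots_single_issue(n_mads, loads_per_mad=2):
    """One instruction per slot: every load and every mad takes its own slot."""
    return n_mads * (loads_per_mad + 1)


def slots_dual_issue(n_mads, loads_per_mad=2):
    """Hypothetical pairing: each mad shares a slot with one of its loads.

    Assumes loads_per_mad >= 1, so the loads alone set the slot count.
    """
    if loads_per_mad < 1:
        raise ValueError("pairing needs at least one load per mad")
    return n_mads * loads_per_mad


n = 100
speedup = slots_single_issue(n) / slots_dual_issue(n)
print(speedup)  # 300 slots vs 200 slots
```

With 2 loads per mad, the ratio is 3 slots to 2 slots, which is where the hoped-for 1.5x comes from.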

However, Appendix G in the new programming manual says, “A warp scheduler can issue an instruction to only half of the CUDA cores”. So does that mean that even though an arithmetic instruction and a load execute in the same slot, half of the load units and ALUs are idle?

I don’t see any reason why they can’t allow the load units to execute at the same time as the ALUs. Dual issue gains a lot for little cost (Pentium 1, newer ARMs).


SPWorley · February 24, 2010, 6:10am

The schedulers aren’t trying to coordinate multiple instruction issue on the same warp. They work on independent warps so there’s little wastage or idle cores.

One of the schedulers handles even numbered warps, and one handles odd warps. Each scheduler uses only half the cores, so they pretty much chug away independently.
Either scheduler can inhibit the other one for a cycle when a double precision compute is needed, allowing the running scheduler to suck up the resources of ALL the cores (not just half) to handle the wider DP op.

This leads to the question about lost __syncthreads() efficiency in Fermi. If you have lots of work to do in warps 1 and 3, but not in warps 2 and 4, then one scheduler and half your cores will be sitting idle at the next __syncthreads() barrier, even if there’s plenty of work left to do in warps 1 and 3.
Is this going to be a common inefficiency? Probably not, but you may need to be careful if you’re doing a lot of too-clever per-warp task allocation.
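
The barrier concern can be put in numbers with a toy model. This is a sketch in Python under strong assumptions: each scheduler issues one instruction per cycle to its own warps only, the scheduler is picked by warp ID parity, and latency and stalls are ignored. The function name and the instruction counts are hypothetical.

```python
def cycles_to_barrier(work_by_warp):
    """Return (cycles until all warps reach the barrier, work per scheduler).

    work_by_warp maps warp ID -> instructions left before the barrier.
    Assumption: scheduler 0 owns even warp IDs, scheduler 1 owns odd ones,
    and neither can pick up work from the other.
    """
    per_scheduler = [0, 0]
    for warp_id, instructions in work_by_warp.items():
        per_scheduler[warp_id % 2] += instructions
    return max(per_scheduler), per_scheduler


# All remaining work sits in odd warps: one scheduler does everything.
lopsided = cycles_to_barrier({1: 10, 2: 0, 3: 10, 4: 0})

# Same total work, split across one odd and one even warp.
balanced = cycles_to_barrier({1: 10, 2: 10, 3: 0, 4: 0})

print(lopsided, balanced)
```

In this model the lopsided case takes twice as long to reach the barrier as the balanced one, even though the total work is the same.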

It might also affect block sizing a bit… it’s likely good to create blocks with threads in a multiple of 2 warps just to help load-balance the two schedulers. It’s clearly a bad idea to create a block with 32 or even 96 threads. We’ll see in practice when we all have hardware. (Though maybe Fermi will transparently flip the scheduler “phase” of warps when there are multiple running blocks in one SM just to help this scheduler load balancing?)
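
To see why 32 or 96 threads per block looks bad under this picture, here is a small Python sketch that counts how many warps of one block land on each scheduler. It assumes a warp size of 32 and that warp parity within the block decides the scheduler, with no “phase” flipping between blocks; whether the hardware really works that way is exactly the open question above. The function name is made up for illustration.

```python
WARP_SIZE = 32


def warps_per_scheduler(threads_per_block):
    """Return (even_warps, odd_warps) for a single block.

    Assumption: warp 0, 2, 4, ... go to one scheduler and
    warp 1, 3, 5, ... go to the other.
    """
    warps = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE  # round up
    even_warps = (warps + 1) // 2
    odd_warps = warps // 2
    return even_warps, odd_warps


for threads in (32, 64, 96, 128):
    print(threads, warps_per_scheduler(threads))
```

A 32-thread block leaves one scheduler with nothing, a 96-thread block splits 2 to 1, and block sizes that are a multiple of 64 threads split evenly.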

Uncle_Joe · February 24, 2010, 6:20am

Right, I know that, even though it may have seemed I was talking about issue within a single warp when I compared it to the Pentium’s dual issue. But the problem remains:

(my interpretation - could be wrong)

32 load/store units are idle, all ALUs busy

16 load/store units idle, 16 ALUs idle

all load/store units busy, 32 ALUs idle

Now that I recall, GT200 allowed multiply and mad to be dual issued. So why not allow loads to be dual issued with arithmetic, which is probably more useful?
