Fermi doesn't keep all execution units busy?

I asked this earlier in the CUDA 3.0 thread, but not many people will find it.

Can someone enlighten me on what’s good about the dual warp scheduler? Originally, I thought it allowed issuing 2 instructions that use different execution units in 1 cycle, so that a load could execute at the same time as an arithmetic instruction. That would likely be a 1.5x speedup for shared memory convolution code, which, without register blocking, needs 2 loads for every arithmetic instruction:


before:

slot 0: load
slot 1: load
slot 2: mad

after (hopefully):

slot 0: load
slot 1: load mad
slot 2: load
slot 3: load mad
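
For what it's worth, the 1.5x figure can be checked by counting issue slots. The sketch below is only a toy model: it assumes every instruction takes exactly one slot and that the loop body is 2 loads plus 1 mad, as in the listings above. The function names are made up for illustration and say nothing about how the hardware really schedules.

```cpp
#include <cassert>

// Assumed toy model (not measured hardware behavior): every instruction
// occupies exactly one issue slot, and each loop iteration is 2 loads + 1 mad.

// Single issue: each load and each mad gets its own slot.
inline int slots_single_issue(int mads) {
    assert(mads > 0);
    return 3 * mads;  // 2 loads + 1 mad per iteration
}

// Hoped-for dual issue: each mad shares a slot with a load,
// so only the 2 loads per iteration cost slots.
inline int slots_load_mad_paired(int mads) {
    assert(mads > 0);
    return 2 * mads;
}

// Ratio of the two slot counts, i.e. the best-case speedup in this model.
inline double ideal_speedup(int mads) {
    return static_cast<double>(slots_single_issue(mads)) /
           slots_load_mad_paired(mads);
}
```

With 2 mads this gives 6 slots versus 4, which matches the "after" listing (4 slots, 2 mads) and the 1.5x estimate.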

However, Appendix G of the new programming manual says, “A warp scheduler can issue an instruction to only half of the CUDA cores”. So does that mean that even when an arithmetic instruction and a load execute in the same slot, half of the load units and half of the ALUs are idle?

I don’t see any reason why they can’t allow the load units to execute at the same time as the ALUs. Dual issue gains a lot for little cost (Pentium 1, newer ARMs).

This post has been edited by Uncle Joe: Yesterday, 08:34 PM

The schedulers aren’t trying to coordinate multiple-instruction issue within the same warp. They work on independent warps, so there’s little wastage and few idle cores.

One of the schedulers handles even-numbered warps, and the other handles odd-numbered warps. Each scheduler uses only half the cores, so they pretty much chug away independently.
Either scheduler can inhibit the other one for a cycle when a double-precision operation is needed, allowing the running scheduler to use the resources of ALL the cores (not just half) to handle the wider DP op.
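
To make the even/odd split concrete, here is a minimal host-side sketch of the mapping as described in this post. It assumes warps are numbered from 0 by linear thread index within a block and that scheduler 0 takes the even warps; how the hardware actually numbers warps across several resident blocks is not something this thread establishes. The names `warp_of_thread` and `scheduler_of_warp` are hypothetical.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // threads per warp

// Warp index of a thread, given its linear index within the block.
inline int warp_of_thread(int linear_thread_id) {
    assert(linear_thread_id >= 0);
    return linear_thread_id / kWarpSize;
}

// Assumed mapping from the post: even warps go to scheduler 0,
// odd warps go to scheduler 1.
inline int scheduler_of_warp(int warp_id) {
    assert(warp_id >= 0);
    return warp_id % 2;
}
```

Under these assumptions, threads 0 to 31 land on scheduler 0, threads 32 to 63 on scheduler 1, threads 64 to 95 on scheduler 0 again, and so on.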

This leads to a question about lost __syncthreads() efficiency on Fermi. If you have lots of work to do in warps 1 and 3 but not in warps 2 and 4, then one scheduler and half your cores will sit idle until the next __syncthreads() barrier, even though there’s plenty of work left in warps 1 and 3.
Is this going to be a common inefficiency? Probably not, but you may need to be careful if you’re doing a lot of too-clever per-warp task allocation.
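
A rough way to picture the barrier cost: add up the work each scheduler owns and take the larger total as the time to reach the barrier. The sketch below is a simplified model, not a description of real hardware. It assumes warps are indexed from 0, even indices belong to one scheduler and odd indices to the other, each scheduler runs its warps back to back, and "work" is an abstract cycle count. `BarrierCost` and `cost_to_barrier` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct BarrierCost {
    long even_busy;          // total work owned by the even-warp scheduler
    long odd_busy;           // total work owned by the odd-warp scheduler
    long cycles_to_barrier;  // the slower scheduler decides when the barrier clears
};

// work_per_warp[w] is the abstract amount of work warp w must finish
// before the barrier (warps indexed from 0 in this model).
inline BarrierCost cost_to_barrier(const std::vector<long>& work_per_warp) {
    long busy[2] = {0, 0};
    for (std::size_t w = 0; w < work_per_warp.size(); ++w) {
        assert(work_per_warp[w] >= 0);
        busy[w % 2] += work_per_warp[w];  // assumed even/odd ownership
    }
    return {busy[0], busy[1], std::max(busy[0], busy[1])};
}
```

In this model, putting 100 units of work on each of two warps that share a scheduler, e.g. `{0, 100, 0, 100}`, takes 200 cycles to reach the barrier with the other scheduler doing nothing, while `{100, 100, 0, 0}` takes 100 because each scheduler owns one of the busy warps.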

It might also affect block sizing a bit… it’s likely good to create blocks whose thread count is a multiple of 2 warps (64 threads), just to help load-balance the two schedulers. It’s clearly a bad idea to create a block with 32 or even 96 threads. We’ll see in practice when we all have hardware. (Though maybe Fermi will transparently flip the scheduler “phase” of warps when there are multiple running blocks in one SM, just to help this scheduler load balancing?)
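
The block-sizing point is easy to check on paper. The sketch below counts how many warps of a block would fall on each scheduler under the even/odd assumption, and rounds a thread count up to a whole number of warp pairs. It ignores any cross-block rebalancing the hardware might do, and `split_warps` and `round_up_to_warp_pair` are hypothetical helper names.

```cpp
#include <cassert>

constexpr int kThreadsPerWarp = 32;

struct WarpSplit {
    int even;  // warps 0, 2, 4, ... (one scheduler)
    int odd;   // warps 1, 3, 5, ... (the other scheduler)
};

// Count the warps each scheduler would own for a single block,
// assuming warps are numbered from 0 within the block.
inline WarpSplit split_warps(int threads_per_block) {
    assert(threads_per_block > 0);
    int warps = (threads_per_block + kThreadsPerWarp - 1) / kThreadsPerWarp;
    return {(warps + 1) / 2, warps / 2};
}

// Round a thread count up to a multiple of 2 warps (64 threads).
inline int round_up_to_warp_pair(int threads_per_block) {
    assert(threads_per_block > 0);
    const int pair = 2 * kThreadsPerWarp;
    return (threads_per_block + pair - 1) / pair * pair;
}
```

A 32-thread block gives a 1/0 split and a 96-thread block gives 2/1, while rounding 96 up to 128 threads gives an even 2/2.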

Right, I know that, even though it sounded like I was talking about issue within a warp when I compared it to the Pentium’s dual issue. But the problem remains:

(my interpretation - could be wrong)

32 load/store units are idle, all ALUs busy

16 load/store units idle, 16 ALUs idle

all load/store units busy, 32 ALUs idle

Now that I recall, GT200 allowed multiply and mad to be dual issued. So why not allow loads to be dual issued with arithmetic, which probably is more useful?