FP unit throughput?

I’m quite surprised that FP throughput appears to be one instruction almost every cycle,
even for dependent FP instructions (where the next instruction depends on the previous FP output).
Is that possible, or am I doing something wrong?

I don’t know much about the FP design of NVIDIA GPUs, but does this mean they do data forwarding effectively?
The latency of each FP instruction is definitely not a single cycle…

I’d appreciate any comments…

No, it’s not possible to run dependent operations back to back. That is why you should load the GPU with roughly 24 threads per floating-point unit, so that the latency can always be hidden with instructions from independent threads (GPUs optimize for throughput, not latency).
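
For illustration (this is just a sketch I made up, not measured numbers, and the kernel name and launch parameters are arbitrary), something like the following makes the effect visible: each thread runs a chain of dependent FMAs, so a single resident warp exposes the full pipeline latency, while a block full of warps gives the scheduler independent instructions to fill the bubbles. With 32x the threads doing 32x the total work, the runtime should stay roughly the same once the latency is hidden.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread runs a chain of dependent FMAs; every iteration needs the
    // result of the previous one, so a lone warp cannot keep the FP units busy.
    __global__ void dependent_chain(float *out, float a, float b, int iters)
    {
        float x = threadIdx.x * 0.001f;
        for (int i = 0; i < iters; ++i)
            x = x * a + b;            // depends on the previous iteration
        out[threadIdx.x] = x;         // keep the chain from being optimized away
    }

    int main()
    {
        const int iters = 1 << 20;
        float *d_out;
        cudaMalloc(&d_out, 1024 * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Warm-up launch so context creation doesn't skew the first timing.
        dependent_chain<<<1, 32>>>(d_out, 1.000001f, 1e-7f, 1);
        cudaDeviceSynchronize();

        // 32 threads = one warp (latency exposed) vs. 1024 threads = 32 warps
        // (latency hidden by interleaving independent warps).
        int configs[2] = {32, 1024};
        for (int c = 0; c < 2; ++c) {
            cudaEventRecord(start);
            dependent_chain<<<1, configs[c]>>>(d_out, 1.000001f, 1e-7f, iters);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms;
            cudaEventElapsedTime(&ms, start, stop);
            printf("%4d threads: %.3f ms\n", configs[c], ms);
        }
        cudaFree(d_out);
        return 0;
    }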

Thanks, I see that back-to-back latency should be hidden by other warps.

I would still like to know roughly what the latency is for a single FP instruction issued to a single warp.

~24 cycles on compute capability 1.x devices. Apparently a bit shorter on 2.x devices (~16 cycles are reported for some instructions), although I’m not aware of a published systematic measurement.
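
If you want to measure it yourself, a rough sketch along these lines works (my own ad-hoc approach, not a published benchmark): time a chain of dependent multiplies with the per-SM cycle counter and divide by the chain length. clock64() needs compute capability 2.0; on 1.x devices use clock() instead. Loop overhead inflates the number a little, and only one warp is launched so nothing can hide the latency.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Time n dependent multiplies with the per-SM cycle counter, divide by n.
    __global__ void fp_latency(float seed, float mul, int n,
                               float *out, long long *cycles)
    {
        float x = seed;
        long long t0 = clock64();
        for (int i = 0; i < n; ++i)
            x = x * mul;              // each multiply waits for the previous one
        long long t1 = clock64();
        if (threadIdx.x == 0) {
            *out = x;                 // keep the chain from being optimized away
            *cycles = t1 - t0;
        }
    }

    int main()
    {
        const int n = 10000;
        float *d_out;
        long long *d_cycles, cycles;
        cudaMalloc(&d_out, sizeof(float));
        cudaMalloc(&d_cycles, sizeof(long long));

        fp_latency<<<1, 32>>>(1.000001f, 1.000002f, n, d_out, d_cycles);  // one warp
        cudaMemcpy(&cycles, d_cycles, sizeof(cycles), cudaMemcpyDeviceToHost);

        printf("~%.1f cycles per dependent multiply (includes loop overhead)\n",
               (double)cycles / n);
        cudaFree(d_out);
        cudaFree(d_cycles);
        return 0;
    }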

Was the 16-cycle latency reported by me? I was testing on a GTX 460, with 2 schedulers capable of issuing 3 instructions/clock… sorry for giving the wrong numbers :(

@Michael

Anyway, the latency depends on the operands: it is longer when the same register is used several times as different operands of the same instruction. That’s probably due to some issue in the register file read / operand fetch stage.

You can take a look here; it’s messy and the results aren’t comprehensive, but you can ask the people who have done more thorough measurements (I haven’t, anyway):

https://groups.google.com/forum/?fromgroups#!topic/asfermi/eEjCVpYpZ-s
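
If you want to poke at it from CUDA C, a rough sketch of mine is to compare a chain where one value feeds every operand slot with a chain where the operands come from distinct values; note that ptxas decides the actual register allocation, so to really pin the operand pattern you’d need an assembler like asfermi from that thread, and the generated SASS should be checked with cuobjdump -sass.

    // Same value used as all three FMA operands: x = x*x + x.
    __global__ void same_operand_chain(float *out, int n)
    {
        float x = 0.999999f;
        for (int i = 0; i < n; ++i)
            x = x * x + x;            // one register feeds every operand slot
        out[threadIdx.x] = x;
    }

    // Distinct values for multiplicand and addend: x = x*a + b.
    __global__ void distinct_operand_chain(float *out, float a, float b, int n)
    {
        float x = 0.999999f;
        for (int i = 0; i < n; ++i)
            x = x * a + b;            // operands come from different registers
        out[threadIdx.x] = x;
    }

Timed the same way as the clock64() sketch earlier in the thread, any difference between the two chains would hint at an operand-fetch effect, but that’s only suggestive without looking at the actual register assignment.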

Might well be; I didn’t bother to search the forums and just cited from memory. The generally accepted number seems to be 18 cycles, with the caveat that it might be higher in some cases, so 24 cycles seems a safe bet on all devices.