I’m quite surprised that my measured FP throughput is almost one instruction per cycle,
even for dependent FP instructions (each instruction consumes the output of the previous one).
Is that possible, or am I doing something wrong?
I don’t know much about the FP pipeline design of NVIDIA GPUs, but it looks as if they do data forwarding very effectively,
since the latency of a single FP instruction is definitely more than one cycle…
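For reference, this is the kind of microbenchmark I mean. It is only a sketch (kernel and variable names are mine, not from any official sample): a single chain of dependent FMAs timed with the per-SM `clock()` counter, launched with one warp so no other threads can hide the latency.

```cuda
// Sketch of a dependent-FMA latency microbenchmark (illustrative, untested
// on your particular device; assumes a CUDA-capable GPU).
__global__ void fma_latency(float *out, int iters)
{
    float x = out[threadIdx.x];        // load so the compiler can't constant-fold
    unsigned int start = clock();      // per-SM cycle counter
    for (int i = 0; i < iters; ++i)
        x = x * x + x;                 // each FMA depends on the previous result
    unsigned int stop = clock();
    out[threadIdx.x] = x;              // keep the chain live past dead-code elimination
    if (threadIdx.x == 0)
        out[blockDim.x] = (float)(stop - start) / iters;  // ~cycles per dependent FMA
}
```

Launched as `fma_latency<<<1, 32>>>(d_out, 10000);`, the last output element should approximate the dependent-instruction latency rather than the throughput; if it comes out near 1 cycle, the benchmark is probably being optimized away or is timing independent work.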
No, it’s not possible to issue dependent operations back to back. That is why you should load the GPU with roughly 24 threads per floating-point unit, so that the latency can always be hidden with instructions from independent threads (GPUs optimize for throughput, not latency). With 8 FP units per multiprocessor on compute capability 1.x, for example, that works out to about 192 threads per SM.
The latency is about 24 cycles on compute capability 1.x devices. It is apparently somewhat shorter on 2.x devices (~16 cycles have been reported for some instructions), although I’m not aware of a published systematic measurement.
In any case, the latency depends on the operands: it is longer when the same register is used for several operands of the same instruction. That is probably due to a bottleneck in register file reads / operand fetch.
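To illustrate the operand-reuse effect, a sketch comparing two dependent chains, one that feeds the same register into all three FMA operands and one that spreads the operands across distinct registers (again my own illustrative code, not a published benchmark):

```cuda
// Illustrative comparison of operand reuse in a dependent FMA chain.
// Timing differences, if any, are device-dependent; this only shows the setup.
__global__ void operand_reuse(float *out, int iters)
{
    float a = out[0], b = out[1], c = out[2];
    unsigned int t0 = clock();
    for (int i = 0; i < iters; ++i)
        a = a * a + a;                  // same register as all three operands
    unsigned int t1 = clock();
    for (int i = 0; i < iters; ++i)
        c = a * b + c;                  // three distinct registers
    unsigned int t2 = clock();
    out[3] = (float)(t1 - t0) / iters;  // cycles/op, reused-operand chain
    out[4] = (float)(t2 - t1) / iters;  // cycles/op, distinct-operand chain
    out[5] = a + c;                     // keep both chains live
}
```

If operand fetch is the bottleneck, the first chain would report a higher per-operation latency than the second on affected devices.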
You can take a look here: it’s messy and the results aren’t comprehensive, but you can ask the people who have done these measurements. I haven’t done any myself, anyway.
Might well be; I didn’t bother to search the forums and just cited from memory. The generally accepted number seems to be 18 cycles, with the caveat that it might be higher in some cases. So ~24 cycles seems a safe assumption on all devices.