I’m quite surprised that my measured FP throughput is almost one instruction per cycle,
even for dependent FP instructions (each instruction consumes the output of the previous one).
Is that possible, or am I doing something wrong?
I don’t know much about the FP pipeline design of NVIDIA GPUs, but it looks as if they do data forwarding very effectively,
since the latency of a single FP instruction is definitely more than one cycle…
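For reference, this is the kind of microbenchmark I mean. It is only a sketch (kernel and variable names are mine, not from any official sample): a single chain of dependent FMAs timed with the per-SM `clock()` counter, launched with one warp so no other threads can hide the latency.

```cuda
// Sketch of a dependent-FMA latency microbenchmark (illustrative, untested
// on your particular device; assumes a CUDA-capable GPU).
__global__ void fma_latency(float *out, int iters)
{
    float x = out[threadIdx.x];        // load so the compiler can't constant-fold
    unsigned int start = clock();      // per-SM cycle counter
    for (int i = 0; i < iters; ++i)
        x = x * x + x;                 // each FMA depends on the previous result
    unsigned int stop = clock();
    out[threadIdx.x] = x;              // keep the chain live past dead-code elimination
    if (threadIdx.x == 0)
        out[blockDim.x] = (float)(stop - start) / iters;  // ~cycles per dependent FMA
}
```

Launched as `fma_latency<<<1, 32>>>(d_out, 10000);`, the last output element should approximate the dependent-instruction latency rather than the throughput; if it comes out near 1 cycle, the benchmark is probably being optimized away or is timing independent work.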
No, it’s not possible to issue dependent operations back to back. That is why you should load the GPU with roughly 24 threads per floating-point unit, so that the latency can always be hidden with instructions from independent threads (GPUs optimize for throughput, not latency). With 8 FP units per multiprocessor on compute capability 1.x, for example, that works out to about 192 threads per SM.
The latency is about 24 cycles on compute capability 1.x devices. It is apparently somewhat shorter on 2.x devices (~16 cycles have been reported for some instructions), although I’m not aware of a published systematic measurement.
In any case, the latency depends on the operands: it is longer when the same register is used for several operands of the same instruction. That is probably due to a bottleneck in register file reads / operand fetch.
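To illustrate the operand-reuse effect, a sketch comparing two dependent chains, one that feeds the same register into all three FMA operands and one that spreads the operands across distinct registers (again my own illustrative code, not a published benchmark):

```cuda
// Illustrative comparison of operand reuse in a dependent FMA chain.
// Timing differences, if any, are device-dependent; this only shows the setup.
__global__ void operand_reuse(float *out, int iters)
{
    float a = out[0], b = out[1], c = out[2];
    unsigned int t0 = clock();
    for (int i = 0; i < iters; ++i)
        a = a * a + a;                  // same register as all three operands
    unsigned int t1 = clock();
    for (int i = 0; i < iters; ++i)
        c = a * b + c;                  // three distinct registers
    unsigned int t2 = clock();
    out[3] = (float)(t1 - t0) / iters;  // cycles/op, reused-operand chain
    out[4] = (float)(t2 - t1) / iters;  // cycles/op, distinct-operand chain
    out[5] = a + c;                     // keep both chains live
}
```

If operand fetch is the bottleneck, the first chain would report a higher per-operation latency than the second on affected devices.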
You can take a look here: it’s messy and the results aren’t comprehensive, but you can ask the people who have done these measurements. I haven’t done any myself, anyway.
Might well be; I didn’t bother to search the forums and just cited from memory. The generally accepted number seems to be 18 cycles, with the caveat that it might be higher in some cases. So ~24 cycles seems a safe assumption on all devices.