Pipeline Latencies on GPU vs CPU: typical CPU pipeline latencies?

Hi there,

We’re taught to run lots of independent operations to help hide pipeline latencies on the GPU, so it was my impression that pipeline latency on the CPU is considerably lower. For single precision it’s around 24 to 26 clock cycles on the GPU.

What are some typical CPU pipeline latencies?

For multiply, 3 to 4 cycles; other instructions take 1 cycle, except division, which takes more than 20, perhaps 40.

You say other operations take 1 clock; are you referring to floating point addition? So the throughput is the same as the latency in this case?

Volkov does some experiments to evaluate pipeline latency in the paper

Vasily Volkov, James W. Demmel, Benchmarking GPUs to Tune Dense Linear Algebra. In SC ’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. Piscataway, NJ, USA, 2008, IEEE Press.

you can download the paper in the thread http://forums.nvidia.com/index.php?showtopic=89084

No, throughput would be four times higher for register-to-register operations.

yes that is where i got my 24-26 clock cycles for latency on GPU… I’m curious about CPU latency

oh so you’re saying 4 clock cycle latency on CPU for floating point ADD and 12-16 clock cycles latency for floating point MUL?

so then add is 6 times faster on CPU than GPU when talking about latency? that’s not bad

EVERY operation has at least a 12 clock latency due to potential register read-after-write stalls… the systems guys know a lot more than I do, but this 12 clock latency is a floor. This is why you should schedule at least 192 active threads per SM: since 16 threads operate in one clock, 12*16=192 threads will cover this 12 clock latency.
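The arithmetic in the post above can be written out as a small Python sketch. The function name and its parameters are hypothetical, and the inputs (a 12 clock latency floor and 16 threads per clock) are just the figures quoted in this thread, not measured values.

```python
def threads_to_cover_latency(latency_cycles: int, threads_per_clock: int) -> int:
    """Threads needed to keep issuing work on every clock while earlier
    threads are still waiting out the pipeline latency."""
    return latency_cycles * threads_per_clock


# Figures quoted in the post above (assumptions, not measurements).
print(threads_to_cover_latency(12, 16))  # 192
```

If either figure differs on a given chip, the same product gives the corresponding thread count.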

oh no I’m talking about CPU latencies

GPU latencies are clearly detailed in the paper LSChien mentioned

http://gmplib.org/~tege/x86-timing.pdf

jjp, if that was for floating point operations, it would be awesome

just to clarify… when it says something like MUL throughput 1/5, does it mean that one integer multiply completes every 5 clock cycles? isn’t GPU throughput for floating point ADD, MAD, MUL 1 clock cycle?

I don’t know of any data on floating point operations, unfortunately. Maybe there is something in the manuals from Intel/AMD.

Yes, that is what it means and you are correct about the throughput on GPUs.

No, I am saying that the datatype for SSE and friends is float4 and therefore the throughput is 4 x clockrate.

This comparison isn’t completely valid though, since iterating only on the data that you happen to have in registers would soon leave you out of steam without anything reasonable to do. You’d have to get more data from at least cache level one, write back the results, read them back again, leading to much longer delays - and this is where you should start realizing that what Nvidia calls “registers” is approximately what Intel calls “cache level 1” and what Intel calls “registers” is what Nvidia considers part of the pipeline in the execution units.

The wiring is different, optimized for different goals, but the distances internally on the silicon are similar, which is what determines the latencies.

thanks jma, that’s very interesting

what i was hoping to gain from this comparison was some perspective on the number of clock cycles it would take for a single cpu core to complete a series of dependent floating point operations (i.e. forcing full pipeline latency after each operation) vs the number of clock cycles for a single gpu core. This of course is hypothetical, so imagine that’s all this program does… how many times faster is the cpu? (maybe we can compare a cpu’s L1 cache to nvidia’s register file space)

Typical latencies are 3 cycles for an FP add, 4 cycles for an FP mul and 5 cycles for an FMA, at 3 GHz.
This only includes the latency of the FP units. Complete pipeline depth (=cost of a branch misprediction) is about 20 cycles.

The G80 has a FP unit latency around 10 cycles, for an overall pipeline depth of about 28 cycles at 1.5 GHz (or rather 14 at 750 MHz).
But the G80 pipeline is also much simpler than the pipeline of a modern out-of-order x86 CPU…
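To compare these figures in wall-clock terms, the cycle counts can be divided by the clock rate. The sketch below is only a unit conversion over the numbers quoted in this post; the helper name `cycles_to_ns` is made up for illustration, and real latencies depend on the exact chip and instruction.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds (1 GHz = 1 cycle per ns)."""
    return cycles / clock_ghz


# Figures quoted in the post above (assumptions, not measurements).
cpu_fp_add_ns = cycles_to_ns(3, 3.0)        # CPU FP add, FP unit only
cpu_pipeline_ns = cycles_to_ns(20, 3.0)     # CPU full pipeline depth
gpu_fp_unit_ns = cycles_to_ns(10, 1.5)      # G80 FP unit
gpu_pipeline_ns = cycles_to_ns(28, 1.5)     # G80 full pipeline at 1.5 GHz
gpu_pipeline_alt_ns = cycles_to_ns(14, 0.75)  # same pipeline counted at 750 MHz

print(f"CPU FP add:       {cpu_fp_add_ns:.2f} ns")
print(f"CPU pipeline:     {cpu_pipeline_ns:.2f} ns")
print(f"G80 FP unit:      {gpu_fp_unit_ns:.2f} ns")
print(f"G80 pipeline:     {gpu_pipeline_ns:.2f} ns")
print(f"G80 / CPU FP add: {gpu_pipeline_ns / cpu_fp_add_ns:.1f}x")
```

With these inputs, 28 cycles at 1.5 GHz and 14 cycles at 750 MHz come out to the same time, which is why the two G80 figures are equivalent.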

hey thanks Sylvian, if there is no misprediction, would the overall cpu latency be ~3-5 clocks? (also are you quoting double precision on cpu?)

Latency depends on the instruction in question. Usually division has a lot more latency than multiplication.

The Intel Optimization guide has a list of latencies for various microarchitectures and for all instructions. Note that division latency has improved considerably in the latest i7… but it is still many times the latency of mul.

The guide defines “latency” as total clock cycles taken to execute the instruction.
It defines “throughput” as number of clock cycles to WAIT until the same instruction type can be issued again.
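The difference between the two definitions shows up when counting cycles for a run of identical instructions. The following is a rough model under simplifying assumptions (one execution unit, no other stalls); the function name `chain_cycles` and the example numbers are illustrative and not taken from the guide.

```python
def chain_cycles(n_ops: int, latency: int, issue_interval: int, dependent: bool) -> int:
    """Rough cycle count for n_ops identical instructions on one execution unit.

    latency:        cycles from issue until the result is available
    issue_interval: cycles to wait before the same instruction type can issue again
                    (the guide's "throughput")
    dependent:      True if each op needs the previous op's result
    """
    if n_ops <= 0:
        return 0
    # A dependent op cannot start until the previous result is ready;
    # an independent op only has to wait for the issue interval.
    step = max(latency, issue_interval) if dependent else issue_interval
    # The last op still has to run to completion, hence the trailing latency.
    return (n_ops - 1) * step + latency


# Example: 100 multiplies, assuming latency 4 and issue interval 1.
print(chain_cycles(100, 4, 1, dependent=True))   # 400
print(chain_cycles(100, 4, 1, dependent=False))  # 103
```

In this model a fully dependent chain pays the latency on every instruction, while independent instructions pay it only once, which is the reason independent work hides pipeline latency.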

Also note that CPU optimization also comes from engaging all the execution units in parallel. For example, an FP MUL and an FP ADD can happen at the same time because they run on independent execution units… Modern Intel cores have 6 execution units… So, if you rearrange your code intelligently and keep the CPU busy, you can even hide your L1 cache latency… Intelligent prefetching can make you totally latency free…

This is the kind of thinking that goes into the design of libraries like MKL.

Sarnath, you bring up an interesting point. I was thinking of latencies because there are always going to be some operations that depend on other operations. For the CPU, however, there are many tools (such as MKL) available to help reduce the CPU latency even further than it already is compared to the GPU.