Pipeline Latencies on GPU vs CPU: typical CPU pipeline latencies?

Nikolai · December 4, 2009, 12:21am

Hi there,

We’re taught to run lots of independent operations to help hide pipeline latencies on the GPU, so my impression was that CPU pipeline latency is considerably lower. For single precision it’s around 24 to 26 clock cycles on the GPU.

What are some typical CPU pipeline latencies?

jma · December 4, 2009, 12:49am

About 3 - 4 cycles for multiply and 1 for other instructions, except division, which takes more than 20, perhaps 40.

Nikolai · December 4, 2009, 1:04am

You say other operations take 1 clock; are you referring to floating point addition? So the throughput is the same as the latency in this case?

LSChien · December 4, 2009, 1:43am

Volkov did some experiments to evaluate pipeline latency in this paper:

Vasily Volkov, James W. Demmel, Benchmarking GPUs to Tune Dense Linear Algebra. In SC ’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. Piscataway, NJ, USA, 2008, IEEE Press.

You can download the paper from this thread: http://forums.nvidia.com/index.php?showtopic=89084

jma · December 4, 2009, 1:54am

No, throughput would be four times higher for register to register operations.

Nikolai · December 4, 2009, 1:57am

yes that is where i got my 24-26 clock cycles for latency on GPU… I’m curious about CPU latency

Nikolai · December 4, 2009, 2:01am

oh so you’re saying 4 clock cycle latency on CPU for floating point ADD and 12-16 clock cycles latency for floating point MUL?

so then add is 6 times faster on CPU than GPU when talking about latency? that’s not bad

SPWorley · December 4, 2009, 5:15am

EVERY operation has at least a 12 clock latency due to potential register read-after-write stalls… the systems guys know a lot more than I do, but this 12 clock latency is a floor. This is why you should schedule at least 192 (active) threads per SM: since 16 threads operate in one clock, 12*16=192 threads will cover this 12 clock latency.
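The arithmetic in that rule of thumb can be written out as a rough sketch. The function name is made up for illustration, and the 12-clock floor and 16 threads per clock are simply the figures quoted in the post, not measured values.

```python
# Rough sketch of the "threads needed to cover latency" rule of thumb.
# Assumption: both inputs are the figures quoted in the post above.

def threads_to_cover_latency(latency_clocks: int, threads_per_clock: int) -> int:
    """Threads that must be in flight so that, by the time the last one
    has issued, the first one's result is ready."""
    return latency_clocks * threads_per_clock

threads = threads_to_cover_latency(latency_clocks=12, threads_per_clock=16)
warps = threads // 32  # a warp is 32 threads

print(threads, "threads =", warps, "warps")
```

With the quoted figures this gives 192 threads, i.e. 6 warps of 32 threads per SM.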

Nikolai · December 4, 2009, 5:18am

oh no I’m talking about CPU latencies

GPU latencies are clearly detailed in the paper LSChien mentioned

jjp · December 4, 2009, 7:32am

http://gmplib.org/~tege/x86-timing.pdf

Nikolai · December 4, 2009, 8:27am

jjp, if that was for floating point operations, it would be awesome

just to clarify… when it says something like MUL throughput 1/5, does it mean one integer multiply completes every 5 clock cycles? isn’t GPU throughput for floating point ADD, MAD, MUL 1 clock cycle?

jjp · December 4, 2009, 8:45am

I don’t know of any data on floating point operations, unfortunately. Maybe there is something in the manuals from Intel/AMD.

Yes, that is what it means and you are correct about the throughput on GPUs.

jma · December 4, 2009, 5:50pm

No, I am saying that the datatype for SSE and friends is float4 and therefore the throughput is 4 x clockrate.

This comparison isn’t completely valid though, since iterating only on the data that you happen to have in registers would soon leave you out of steam without anything reasonable to do. You’d have to get more data from at least cache level one, write back the results, read them back again, leading to much longer delays - and this is where you should start realizing that what Nvidia calls “registers” is approximately what Intel calls “cache level 1” and what Intel calls “registers” is what Nvidia considers part of the pipeline in the execution units.

The wiring is different, optimized for different goals - but the internal distances on the silicon are similar, which is what determines the latencies.

Nikolai · December 4, 2009, 8:58pm

thanks jma, that’s very interesting

what i was hoping to gain from this comparison was some perspective on the # of clock cycles it would take for a single cpu core to complete a series of dependent floating point operations (i.e. forcing full pipeline latency after each operation) vs the # of clock cycles for a single gpu core. This of course is hypothetical so imagine that’s all this program does… how many times faster is the cpu ( maybe we can compare a cpu’s L1 cache to nvidia’s register file space )

Sylvain_Collange · December 4, 2009, 9:00pm

Typical latencies are 3 cycles for an FP add, 4 cycles for an FP mul and 5 cycles for an FMA, at 3 GHz.
This only includes the latency of the FP units. Complete pipeline depth (=cost of a branch misprediction) is about 20 cycles.

The G80 has a FP unit latency around 10 cycles, for an overall pipeline depth of about 28 cycles at 1.5 GHz (or rather 14 at 750 MHz).
But the G80 pipeline is also much simpler than the pipeline of a modern out-of-order x86 CPU…
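Since the two chips run at different clock rates, it can help to convert the cycle counts above into nanoseconds. This is only a back-of-the-envelope sketch using the figures quoted in this post; the helper name is made up for illustration.

```python
# Convert the cycle counts quoted above into wall-clock time.
# Assumption: the latencies and clock rates are the approximate figures
# from the post, not measurements.

def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """At f GHz one cycle lasts 1/f nanoseconds."""
    return cycles / clock_ghz

cpu = {"FP add": 3, "FP mul": 4, "FMA": 5, "full pipeline": 20}  # at 3 GHz
g80 = {"FP unit": 10, "full pipeline": 28}                       # at 1.5 GHz

for name, cycles in cpu.items():
    print(f"CPU {name}: {cycles_to_ns(cycles, 3.0):.2f} ns")
for name, cycles in g80.items():
    print(f"G80 {name}: {cycles_to_ns(cycles, 1.5):.2f} ns")
```

Under these numbers an FP add on the CPU takes about 1 ns, while the G80 FP unit takes roughly 6.7 ns, so the gap in wall-clock time is larger than the raw cycle counts suggest.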

Nikolai · December 4, 2009, 9:11pm

hey thanks Sylvain, if there is no misprediction, would the overall cpu latency be ~3-5 clocks? (also are you quoting double precision on cpu?)

Sarnath · December 7, 2009, 4:46am

Latency depends on the instruction in question. Division usually has much higher latency than multiplication.

The Intel Optimization guide has a list of latencies for various microarchitectures and for all instructions. Note that division latency has improved considerably in the latest i7… but it is still many times the latency of mul.

The guide defines “latency” as total clock cycles taken to execute the instruction.
It defines “throughput” as number of clock cycles to WAIT until the same instruction type can be issued again.
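A small cost model may make the difference between the two definitions concrete. This is a simplified sketch, not how any real CPU schedules instructions: it assumes a single execution unit, a 4-cycle FP mul latency (the figure quoted earlier in the thread), and an assumed wait of 1 cycle between issues. The function names are made up for illustration.

```python
# Simplified cycle-count model for n instructions of the same type.
# Assumptions: one execution unit, fixed latency, fixed issue interval.

def dependent_chain_cycles(n: int, latency: int) -> int:
    """Each instruction needs the previous result, so the full latency
    is paid every time."""
    return n * latency

def independent_cycles(n: int, latency: int, issue_interval: int) -> int:
    """Independent instructions overlap in the pipeline: a new one issues
    every `issue_interval` cycles and the last one still needs `latency`
    cycles to finish."""
    return (n - 1) * issue_interval + latency

n = 100
print(dependent_chain_cycles(n, latency=4))                 # dependent multiplies
print(independent_cycles(n, latency=4, issue_interval=1))   # independent multiplies
```

In this model 100 dependent multiplies cost 400 cycles, while 100 independent ones cost 103, which is why latency matters mostly for dependent chains and throughput for everything else.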

Also note that CPU optimization also comes from engaging all the execution units in parallel. For example, an FP MUL and an FP ADD can happen at the same time because they operate on independent execution units… Modern Intel cores have 6 execution units… So, if you rearrange your code intelligently and keep the CPU busy, you can even hide your L1 cache latency… Intelligent prefetching can make you totally latency free…

This is the kind of thinking that goes into the design of libraries like MKL.

Nikolai · December 7, 2009, 5:47am

Sarnath, you bring up an interesting point. I was thinking of latencies because there are always going to be some operations that depend on other operations. For the cpu, however, there are many tools (such as MKL) available to help reduce cpu latency even further than its existing advantage over the gpu.
