We’re taught to issue lots of independent operations to help hide pipeline latency on the GPU, so my impression was that the CPU’s pipeline latency is considerably lower. For single precision it’s around 24–26 clock cycles on the GPU.
Volkov ran some experiments to evaluate pipeline latency in this paper:
Vasily Volkov, James W. Demmel, Benchmarking GPUs to Tune Dense Linear Algebra. In SC ’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. Piscataway, NJ, USA, 2008, IEEE Press.
EVERY operation has at least a 12-clock latency due to potential register read-after-write (RAW) stalls… the systems guys know a lot more than I do, but this 12-clock latency is a floor. This is why you should schedule at least 192 (active) threads per SM: since 16 threads issue in one clock, 12 × 16 = 192 threads will cover this 12-clock latency.
jjp, if that was for floating point operations, it would be awesome
Just to clarify… when it says something like MUL throughput 1/5, does that mean an integer multiply can only be issued once every 5 clock cycles? Isn’t GPU throughput for floating-point ADD, MAD, and MUL one result per clock cycle?
No, I am saying that the datatype for SSE and friends is float4 and therefore the throughput is 4 × the clock rate.
This comparison isn’t completely valid though, since iterating only on the data you happen to have in registers would soon leave you out of steam without anything reasonable to do. You’d have to fetch more data from at least the L1 cache, write back the results, and read them back again, leading to much longer delays. This is where you should start realizing that what Nvidia calls “registers” is approximately what Intel calls “L1 cache”, and what Intel calls “registers” is what Nvidia considers part of the pipeline in the execution units.
The wiring is different, optimized for different goals, but the internal distances on the silicon are similar, and that is what determines the latencies.
What I was hoping to gain from this comparison was some perspective on the number of clock cycles it would take a single CPU core to complete a series of dependent floating-point operations (i.e. forcing full pipeline latency after each operation) versus the number of clock cycles for a single GPU core. This is of course hypothetical, so imagine that’s all the program does… how many times faster is the CPU? (Maybe we can compare a CPU’s L1 cache to Nvidia’s register file space.)
Typical latencies are 3 cycles for an FP add, 4 cycles for an FP mul and 5 cycles for an FMA, at 3 GHz.
This only includes the latency of the FP units. Complete pipeline depth (=cost of a branch misprediction) is about 20 cycles.
The G80 has a FP unit latency around 10 cycles, for an overall pipeline depth of about 28 cycles at 1.5 GHz (or rather 14 at 750 MHz).
But the G80 pipeline is also much simpler than the pipeline of a modern out-of-order x86 CPU…
Latency depends on the instruction in question; division usually has much higher latency than multiplication.
The Intel Optimization Guide has a list of latencies for all instructions across the various microarchitectures. Note that division latency has improved considerably in the latest i7… but it is still several times the latency of a mul.
The guide defines “latency” as the total number of clock cycles taken to execute the instruction.
It defines “throughput” as the number of clock cycles to WAIT before another instruction of the same type can be issued.
Also note that CPU optimization also comes from engaging all the execution units in parallel. For example, an FP MUL and an FP ADD can happen at the same time because they use independent execution units… Modern Intel cores have six execution units… So, if you re-arrange your code intelligently and keep the CPU busy, you can even hide your L1 cache latency… Intelligent prefetching can make you totally latency-free…
This is the kind of thinking that goes into the design of libraries like MKL.
Sarnath, you bring up an interesting point. I was thinking of latencies because there are always going to be some operations that depend on other operations. For the CPU, however, there are many tools (such as MKL) available to help reduce its effective latency even further than it already is relative to the GPU.