I would like a quick answer about the performance difference between computations using integers vs. floats.
I cannot find any published info about this topic in the documentation.
Common sense says integer computations should be faster, but I need proof.
As far as I'm aware, GPUs are optimized for floating point math. This is especially true for the G80 chip. Always prefer floating point math (I assume single precision).
Section 5.1.1.1 in the programming guide explains the arithmetic performance. Single precision floating point multiply, add, and multiply-add take 4 clock cycles per warp, as do integer addition and bitwise operations. Integer multiply actually takes 16 clock cycles per warp, unless you use the special intrinsic function __mul24(), which only performs a 24-bit multiply. Integer and floating point division are even slower.
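To illustrate (a minimal sketch, not taken from the guide; the kernel and variable names are mine): the plain * operator compiles to the full 32-bit multiply, while __mul24() is the intrinsic mentioned above and only uses the low 24 bits of each operand.

```
__global__ void multiply_kernel(const int *a, const int *b,
                                int *out_full, int *out_24, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out_full[i] = a[i] * b[i];          // full 32-bit multiply: 16 cycles/warp on G80
        out_24[i]   = __mul24(a[i], b[i]);  // 24-bit multiply: 4 cycles/warp on G80
    }
}
```

Keep in mind that __mul24() only gives the correct product when both operands fit into 24 bits.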
Thanks, but the info in the Programming Guide is about throughput; I would like to know about latencies too. The difference is that latency shows how many clock cycles are needed for the result to be ready, while throughput shows how many clock cycles are needed before the next instruction can begin execution. Or perhaps there is no such concept in CUDA, and that is why I don't see it published. It would be good if we could get the clock cycle info from C and make measurements ourselves. As far as I know this is possible only from PTX.
Regarding integer use: in some cases there is a reason to use integers, not floating point.
Floating point arithmetic introduces errors into the computations.
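A tiny example of what I mean (plain host C, nothing GPU-specific): single precision has a 24-bit mantissa, so not every 32-bit integer survives a round trip through float, while integer arithmetic in range is exact.

```
#include <stdio.h>

int main(void)
{
    int   i = 16777217;           // 2^24 + 1, exactly representable as a 32-bit int
    float f = 16777217.0f;        // rounds to 16777216.0f: does not fit a 24-bit mantissa
    printf("int:   %d\n", i);     // prints 16777217
    printf("float: %.1f\n", f);   // prints 16777216.0
    return 0;
}
```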
Latencies on arithmetic instructions are negligible if you are running enough threads. Section 5.2 suggests 192 or 256 threads per block in general, although 64 threads per block also works if you are running multiple blocks per multiprocessor.
Edit: By “negligible” I mean that they don’t matter for kernel performance, not that the latencies are zero. :)
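For example (an illustrative sketch only; the kernel itself is just a placeholder), choosing 256 threads per block gives 8 warps per block to keep each multiprocessor busy:

```
#include <cuda_runtime.h>

// Trivial placeholder kernel; the point of this sketch is only the launch configuration.
__global__ void saxpy_like(float *y, const float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

void launch(float *d_y, const float *d_x, float a, int n)
{
    int threadsPerBlock = 256;  // 8 warps per block, in line with the 192-256 suggestion
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy_like<<<blocks, threadsPerBlock>>>(d_y, d_x, a, n);
}
```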
What specific benchmarking methods are you using?
I haven’t benchmarked this particular issue. I’m just reiterating what the programming guide says here.
So we need good benchmarks. I will work on it.
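As a starting point, something along these lines (a rough sketch; kernel names, iteration counts and array size are arbitrary choices of mine) times a chain of dependent integer vs. single-precision operations with CUDA events:

```
#include <cstdio>
#include <cuda_runtime.h>

// Each thread runs a chain of dependent operations so arithmetic, not memory, dominates.
__global__ void int_kernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int x = data[i];
        for (int k = 0; k < 256; ++k)
            x = x * 3 + 1;              // dependent 32-bit integer multiply-adds
        data[i] = x;
    }
}

__global__ void float_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 256; ++k)
            x = x * 3.0f + 1.0f;        // dependent single-precision multiply-adds
        data[i] = x;
    }
}

int main()
{
    const int n = 1 << 20;
    int   *d_i;  cudaMalloc(&d_i, n * sizeof(int));
    float *d_f;  cudaMalloc(&d_f, n * sizeof(float));
    cudaMemset(d_i, 0, n * sizeof(int));
    cudaMemset(d_f, 0, n * sizeof(float));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Warm-up launches so context setup does not pollute the timings.
    int_kernel<<<grid, block>>>(d_i, n);
    float_kernel<<<grid, block>>>(d_f, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    int_kernel<<<grid, block>>>(d_i, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms_int;
    cudaEventElapsedTime(&ms_int, start, stop);

    cudaEventRecord(start);
    float_kernel<<<grid, block>>>(d_f, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms_float;
    cudaEventElapsedTime(&ms_float, start, stop);

    printf("int kernel:   %.3f ms\n", ms_int);
    printf("float kernel: %.3f ms\n", ms_float);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_i);
    cudaFree(d_f);
    return 0;
}
```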
Have a look at vvolkov’s work:
http://mc.stanford.edu/cgi-bin/images/6/65…_Volkov_GPU.pdf
Integer operations are performed by the same pipeline as FP operations, so the latencies are the same.
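One way to check this directly (a rough sketch; the loop length and names are arbitrary) is to time a chain of dependent operations with the clock() device function, which is callable from CUDA C:

```
__global__ void time_add_latency(int *out, clock_t *cycles)
{
    int x = threadIdx.x;
    clock_t start = clock();
    #pragma unroll
    for (int i = 0; i < 128; ++i)
        x = x + i;                      // each add depends on the previous result
    clock_t stop = clock();
    out[threadIdx.x] = x;               // keep x live so the chain is not optimized away
    if (threadIdx.x == 0)
        cycles[blockIdx.x] = stop - start;
}
```

Launched with a single warp, (stop - start) / 128 approximates the dependent-add latency; swapping the integer chain for a float chain shows whether the two really match.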
Basically we are talking about the pipeline. Here is an interesting note I found:
NVIDIA GeForce GTX 280 Pipeline Update
http://www.anandtech.com/video/showdoc.aspx?i=3336
“These sources reveal that properly hiding instruction latency requires 6 active warps per SM. The math on this comes out to an effective latency of 24 cycles before a warp can be scheduled to process the next instruction in its instruction stream. Each warps takes 4 cycles to process in an SM (4 threads from a warp are processed on each of the 8 SPs) and 6*4 is 24. You can also look at it as 6 warps * 32 threads/warp = 192 threads and 192 threads / 8 SPs = 24 threads per SP, and with a throughput of 1 instruction per cycle = 24 cycles.”
Also, from NVIDIA's Mark Harris in one of the threads:
“The latency is approximately 22 clocks (this is the 1.35 GHz clock on 8800 GTX), and it takes 4 clocks to execute an arithmetic instruction (ADD, MUL, MAD, etc,) for a whole warp.”
And more:
"Which brings us to a broader point. NVIDIA is going to have to give CUDA developers more detail in order for them to effectively use the hardware."
;)