I would like a quick answer about the performance difference between computations using integers vs. floats.
I cannot find any published info about this topic in the documentation.
Common sense says integer computations should be faster, but I need proof.
As far as I'm aware, GPUs are optimized for floating point math. This is especially true for the G80 chip. Always prefer floating point math (I assume single precision).
Section 5.1.1.1 in the programming guide explains the arithmetic performance. Single precision floating point multiply, add, and multiply-add take 4 clock cycles per warp, as do integer addition and bitwise operations. Integer multiply actually takes 16 clock cycles per warp, unless you use the special intrinsic function __mul24(), which only performs a 24-bit multiply. Integer and floating point division are even slower.
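To illustrate (a minimal sketch, not taken from the guide; the kernel and variable names are mine): the plain * operator compiles to the full 32-bit multiply, while __mul24() is the intrinsic mentioned above and only uses the low 24 bits of each operand.

```
__global__ void multiply_kernel(const int *a, const int *b,
                                int *out_full, int *out_24, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out_full[i] = a[i] * b[i];          // full 32-bit multiply: 16 cycles/warp on G80
        out_24[i]   = __mul24(a[i], b[i]);  // 24-bit multiply: 4 cycles/warp on G80
    }
}
```

Keep in mind that __mul24() only gives the correct product when both operands fit into 24 bits.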
Thanks, but the info in the Programming Guide is about throughput; I would like to know about latencies too. The difference is that latency shows how many clock cycles are needed for the result to be ready, while throughput shows how many clock cycles are needed before the next instruction can begin execution. Or perhaps there is no such concept in CUDA, and that is why I don't see it published. It would be good if we could get the clock cycle info from C and make measurements ourselves. As far as I know this is possible only from PTX.
Regarding integer use: in some cases there is a reason to use integers, not floating point.
Floating point arithmetic introduces errors into the computations.
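A tiny example of what I mean (plain host C, nothing GPU-specific): single precision has a 24-bit mantissa, so not every 32-bit integer survives a round trip through float, while integer arithmetic in range is exact.

```
#include <stdio.h>

int main(void)
{
    int   i = 16777217;           // 2^24 + 1, exactly representable as a 32-bit int
    float f = 16777217.0f;        // rounds to 16777216.0f: does not fit a 24-bit mantissa
    printf("int:   %d\n", i);     // prints 16777217
    printf("float: %.1f\n", f);   // prints 16777216.0
    return 0;
}
```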
Latencies on arithmetic instructions are negligible if you are running enough threads. Section 5.2 suggests 192 or 256 threads per block in general, although 64 threads per block also works if you are running multiple blocks per multiprocessor.
Edit: By “negligible” I mean that they don’t matter for kernel performance, not that the latencies are zero. :)
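For example (an illustrative sketch only; the kernel itself is just a placeholder), choosing 256 threads per block gives 8 warps per block to keep each multiprocessor busy:

```
#include <cuda_runtime.h>

// Trivial placeholder kernel; the point of this sketch is only the launch configuration.
__global__ void saxpy_like(float *y, const float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

void launch(float *d_y, const float *d_x, float a, int n)
{
    int threadsPerBlock = 256;  // 8 warps per block, in line with the 192-256 suggestion
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy_like<<<blocks, threadsPerBlock>>>(d_y, d_x, a, n);
}
```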
What specific benchmarking methods are you using?
I haven’t benchmarked this particular issue. I’m just reiterating what the programming guide says here.
So we need good benchmarks. I will work on it.
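As a starting point, something along these lines (a rough sketch; kernel names, iteration counts and array size are arbitrary choices of mine) times a chain of dependent integer vs. single-precision operations with CUDA events:

```
#include <cstdio>
#include <cuda_runtime.h>

// Each thread runs a chain of dependent operations so arithmetic, not memory, dominates.
__global__ void int_kernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int x = data[i];
        for (int k = 0; k < 256; ++k)
            x = x * 3 + 1;              // dependent 32-bit integer multiply-adds
        data[i] = x;
    }
}

__global__ void float_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 256; ++k)
            x = x * 3.0f + 1.0f;        // dependent single-precision multiply-adds
        data[i] = x;
    }
}

int main()
{
    const int n = 1 << 20;
    int   *d_i;  cudaMalloc(&d_i, n * sizeof(int));
    float *d_f;  cudaMalloc(&d_f, n * sizeof(float));
    cudaMemset(d_i, 0, n * sizeof(int));
    cudaMemset(d_f, 0, n * sizeof(float));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Warm-up launches so context setup does not pollute the timings.
    int_kernel<<<grid, block>>>(d_i, n);
    float_kernel<<<grid, block>>>(d_f, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    int_kernel<<<grid, block>>>(d_i, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms_int;
    cudaEventElapsedTime(&ms_int, start, stop);

    cudaEventRecord(start);
    float_kernel<<<grid, block>>>(d_f, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms_float;
    cudaEventElapsedTime(&ms_float, start, stop);

    printf("int kernel:   %.3f ms\n", ms_int);
    printf("float kernel: %.3f ms\n", ms_float);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_i);
    cudaFree(d_f);
    return 0;
}
```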
Have a look at vvolkov’s work:
http://mc.stanford.edu/cgi-bin/images/6/65…_Volkov_GPU.pdf
Integer operations are performed by the same pipeline as FP operations, so the latencies are the same.
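One way to check this directly (a rough sketch; the loop length and names are arbitrary) is to time a chain of dependent operations with the clock() device function, which is callable from CUDA C:

```
__global__ void time_add_latency(int *out, clock_t *cycles)
{
    int x = threadIdx.x;
    clock_t start = clock();
    #pragma unroll
    for (int i = 0; i < 128; ++i)
        x = x + i;                      // each add depends on the previous result
    clock_t stop = clock();
    out[threadIdx.x] = x;               // keep x live so the chain is not optimized away
    if (threadIdx.x == 0)
        cycles[blockIdx.x] = stop - start;
}
```

Launched with a single warp, (stop - start) / 128 approximates the dependent-add latency; swapping the integer chain for a float chain shows whether the two really match.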
Basically we are talking about the pipeline. Here is an interesting note I found:
NVIDIA GeForce GTX 280 Pipeline Update
http://www.anandtech.com/video/showdoc.aspx?i=3336
“These sources reveal that properly hiding instruction latency requires 6 active warps per SM. The math on this comes out to an effective latency of 24 cycles before a warp can be scheduled to process the next instruction in its instruction stream. Each warps takes 4 cycles to process in an SM (4 threads from a warp are processed on each of the 8 SPs) and 6*4 is 24. You can also look at it as 6 warps * 32 threads/warp = 192 threads and 192 threads / 8 SPs = 24 threads per SP, and with a throughput of 1 instruction per cycle = 24 cycles.”
Also, from NVIDIA's Mark Harris in one of the threads:
“The latency is approximately 22 clocks (this is the 1.35 GHz clock on 8800 GTX), and it takes 4 clocks to execute an arithmetic instruction (ADD, MUL, MAD, etc,) for a whole warp.”
And more:
"Which brings us to a broader point. NVIDIA is going to have to give CUDA developers more detail in order for them to effectively use the hardware."
;)