About instruction throughputs

Hello,

The “Best Practices Guide” states in chapter 5.1 that “Single-precision floats provide the best performance, and their use is highly encouraged”.

On the other hand, integers and single-precision floats have the same throughput for arithmetic instructions according to Table 5-1 in the “Programming Guide” (ver. 3.0).
(On CC 1.x this only holds if __mul24() is used.)

Why does the use of single-precision floats provide the best performance?

Single precision is faster than double precision by a factor of 8 on most cards, and is faster than integers in multiplication unless the __mul24() function is used (as you point out). Seems like a reasonable suggestion to me. :)
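
For what it’s worth, here is a minimal sketch of what that looks like in a kernel. __mul24() is a documented CUDA intrinsic; the kernel itself and its names are my own illustration, not code from the guides:

// Compares a plain 32-bit multiply with the 24-bit intrinsic (sketch).
// On CC 1.x the plain multiply expands to several instructions, while
// __mul24() maps to the fast 24-bit multiplier (correct only if both
// operands fit in 24 bits).
__global__ void scale(const int *in, int *out, int factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int full = in[i] * factor;            // full 32-bit multiply
        int fast = __mul24(in[i], factor);    // fast 24-bit multiply
        out[i] = full + fast;                 // keep both live so neither is optimized away
    }
}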

Thanks seibert.

I’ve heard a report that using integers gives lower performance than using floats, even for integer arithmetic.
The report also says that integers may be converted internally to floats, which hurts performance.

Additionally, I’ve run CUDA-Z on a GTX 480 (CC 2.0) and got the results below:

Single-precision Float : 1336920 Mflop/s
Double-precision Float : 168151 Mflop/s
32-bit Integer : 671568 Miop/s
24-bit Integer : 670700 Miop/s

As I pointed out before, integers and single-precision floats are supposed to have the same throughput for arithmetic instructions (on CC 2.0).
I wonder why single-precision float scores twice as high as 32-bit integer.
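
For reference, the kind of kernel that would reproduce such a peak-throughput number looks roughly like this (my own sketch, not CUDA-Z’s actual code): each thread runs a long dependent chain of multiply-adds so that the arithmetic pipelines, not memory, are the bottleneck, and the two kernels are then timed with cudaEvent timers on the host.

// Dependent multiply-add chains for a peak-throughput comparison (sketch).
__global__ void fma_float(float *out, int iters)
{
    float x = threadIdx.x * 0.001f;
    for (int i = 0; i < iters; ++i)
        x = fmaf(x, 1.000001f, 0.5f);        // one floating-point multiply-add per iteration (2 flops)
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

__global__ void mad_int(int *out, int iters)
{
    int x = threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = x * 3 + 1;                       // integer multiply-add; may not be a single instruction
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}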

Does the hardware have an integer multiply-add instruction? Multiply-add can be done for floating point in one instruction, but if there is no corresponding version for integers, then that would give you a factor of 2 in peak throughput.
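
In device-code terms the argument looks like this (fmaf() is a real CUDA intrinsic; the integer comparison is my reading of the point, sketched below):

__device__ float axpy_f(float a, float x, float y)
{
    return fmaf(a, x, y);    // compiles to a single floating-point multiply-add instruction
}

__device__ int axpy_i(int a, int x, int y)
{
    return a * x + y;        // without an integer multiply-add, this takes two instructions
}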

There is a mad24 instruction in PTX and in DECUDA, so I would say definitely yes.
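
As far as I know there is no __mad24() C intrinsic, but the compiler usually fuses __mul24(a, b) + c into a single mad24.lo, and it can also be forced with inline PTX. Treat this wrapper as a sketch to check against the PTX ISA, not an official API:

__device__ int mad24_c(int a, int b, int c)
{
    return __mul24(a, b) + c;    // typically fused into one mad24.lo.s32
}

__device__ int mad24_asm(int a, int b, int c)
{
    int d;
    asm("mad24.lo.s32 %0, %1, %2, %3;" : "=r"(d) : "r"(a), "r"(b), "r"(c));
    return d;
}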

From what I can tell, single-precision float and int24 arithmetic are done in the same unit (fine-grained reconfiguration), which should be possible since a float has a 24-bit significand.

Likewise, on Fermi, double precision and float are handled by the same unit, though double runs at half the throughput, as I found here.

This reconfigurable design is very good because it offers the flexibility of keeping all units busy, rather than letting some sit idle as would happen with discrete units.

The only remaining question is whether the Fermi float/double-precision unit also handles 32-bit integer arithmetic.