What's the peak performance with 32-bit integers?

As I understand it, peak FP performance is counted as #ALU * ALU speed * 3; the “3” is there because it’s possible to issue a MAD+MUL pair in one clock, so, for example, a GTX260 with 192 SPs peaks at 192 * 1.242 * 3 = 715 GFLOPS.
However, I’m interested in peak performance with 32-bit integers only, where no MAD/MUL is needed. So it should peak at 192 * 1.242 = 238.5 G(IOPS? Not sure what to call it, since IOPS usually means “input/output operations per second”…). Anyway, real results give me performance about 1.5x higher than the theoretical value. It’s not a compiler issue (which could optimize something away). So now I’m puzzled about why this happens.
I’ve tested the same calculations on ATI GPUs, and on an HD4850 the practical results perfectly match the theoretical ones (the HD4850 runs 160 ALUs, each able to do 5 operations per clock @ 625 MHz: 160 * 5 * 0.625 == 500 GIOPS).
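
My test boils down to something like this (simplified sketch, not the exact code; the kernel name, launch dimensions, and iteration count here are made up for illustration):

    // Integer-ADD throughput sketch: long chains of dependent 32-bit adds,
    // timed with CUDA events; total ops / elapsed time gives GIOPS.
    #include <cstdio>
    #include <cuda_runtime.h>

    #define ITERS 4096

    __global__ void int_add_bench(int *out, int seed)
    {
        int a = seed + threadIdx.x;
        int b = blockIdx.x | 1;
        for (int i = 0; i < ITERS; ++i) {
            a += b;   // plain 32-bit integer adds,
            b += a;   // no MAD/MUL anywhere
        }
        out[blockIdx.x * blockDim.x + threadIdx.x] = a ^ b;  // keep results live
    }

    int main()
    {
        const int blocks = 256, threads = 256;
        int *d_out;
        cudaMalloc(&d_out, blocks * threads * sizeof(int));

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        cudaEventRecord(t0);
        int_add_bench<<<blocks, threads>>>(d_out, 1);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        double ops = 2.0 * ITERS * blocks * threads;  // 2 adds per loop iteration
        printf("%.1f GIOPS\n", ops / (ms * 1e6));     // ops per second, in billions
        cudaFree(d_out);
        return 0;
    }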

What am I missing? This 1.5x looks really weird; at the very least it should be close to an integer value :).

What kind of integer operations are you testing? There is a large speed difference between integer addition and (32-bit) integer multiplication on the GPU.

No multiplications at all. In fact it’s SHA-1: many logical operations, adds, and shifts, nothing complex.
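
For reference, one SHA-1 round looks roughly like this (schematic, not my actual kernel), so the mix is rotates, logical ops, and 32-bit adds:

    // One SHA-1 round (rounds 0-19), schematically: shifts, logic, 32-bit adds.
    __device__ unsigned int rotl(unsigned int x, int n)
    {
        return (x << n) | (x >> (32 - n));  // no native rotate, so SHL+SHR+OR
    }

    __device__ void sha1_round(unsigned int &a, unsigned int &b, unsigned int &c,
                               unsigned int &d, unsigned int &e,
                               unsigned int w, unsigned int k)
    {
        unsigned int f = (b & c) | (~b & d);          // round function (Ch)
        unsigned int t = rotl(a, 5) + f + e + w + k;  // chain of plain adds
        e = d; d = c; c = rotl(b, 30); b = a; a = t;
    }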

Hmm, does nobody know about this?

Actually it isn’t only integers; the same applies to SP floats without heavy MAD/MUL usage. And to DP FP too.

With ATI it’s possible to get performance degradation if the instruction flow cannot be packed into 5 operations per ALU (this happens when data depends heavily on previous calculations), so vectorization is required even within a single thread. AFAIK there are no such problems with NVIDIA’s GPUs, as each ALU can perform only 1 instruction per clock.
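
Roughly, the difference looks like this (generic illustration in CUDA C syntax; the constants and function names are arbitrary):

    // Serial chain: every step needs the previous result, so a 5-wide VLIW
    // word can be filled with only one useful operation.
    __device__ int serial_chain(int x)
    {
        for (int i = 0; i < 64; ++i)
            x = (x ^ 0x5bd1e995) + (x << 3);  // each step depends on the last
        return x;
    }

    // Four independent chains: the compiler can pack them into wide VLIW
    // words (on ATI), or the hardware can overlap them (on NVIDIA).
    __device__ int parallel_chains(int x0, int x1, int x2, int x3)
    {
        for (int i = 0; i < 64; ++i) {
            x0 = (x0 ^ 0x5bd1e995) + (x0 << 3);
            x1 = (x1 ^ 0x5bd1e995) + (x1 << 3);
            x2 = (x2 ^ 0x5bd1e995) + (x2 << 3);
            x3 = (x3 ^ 0x5bd1e995) + (x3 << 3);
        }
        return x0 ^ x1 ^ x2 ^ x3;
    }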

And looking at decuda’s output for the compiled cubin, I don’t see many fused operations (like add.half.b32 $r3, $r11, $r12), so my theory that the 1.5x performance boost comes from fused operations has failed.

At least the throughput of DP should be fairly predictable. Are you sure of your timing measurements? Do you count non-arithmetic operations (mov) as well?

The same problem exists on NVIDIA GPUs, to a lesser extent. They have two execution pipes (MAD and SFU) that likely have higher latencies than ATI’s. The difference is that in the NVIDIA case instructions are scheduled by the hardware (superscalar) instead of the compiler (VLIW).

Performance is more difficult to predict on a superscalar than on a VLIW…

The “half” instructions seem to exist only to save instruction cache and memory bandwidth. They don’t execute faster than regular full-width operations.

You are getting higher performance because the hardware instruction scheduler dispatches instructions to both execution pipes (MAD+MUL is just one possibility; it can be anything like {MAD,MUL,ADD,MOV,AND,SHL…}+{MUL,MOV,RCP…}). MUL and MOV can be scheduled to the second unit, so they are essentially free if the MAD unit is the bottleneck.

So the peak integer performance is 2 ops/cycle/SP, if you have the right instruction mix.
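
For example, a mix like this could in principle keep both pipes busy (hypothetical sketch; whether a given pairing actually gets co-issued is up to the hardware scheduler):

    // Integer ADD/XOR go to the MAD pipe; the SP MUL is eligible for the
    // second (SFU/MUL) pipe, so it can come "for free" when the MAD pipe
    // is the bottleneck.
    __global__ void mixed_issue(int *iout, float *fout, int n)
    {
        int   a = threadIdx.x, b = blockIdx.x | 1;
        float x = 1.0001f;
        for (int i = 0; i < n; ++i) {
            a += b;           // MAD pipe
            b ^= a;           // MAD pipe
            x *= 1.0000001f;  // SP MUL, candidate for the second pipe
        }
        iout[threadIdx.x] = a ^ b;
        fout[threadIdx.x] = x;
    }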

Thanks a lot, this explains pretty much everything.

Is it possible to get a list of all the possible instructions for each execution pipe somewhere? If the second (“MUL”) pipe is limited to only MOV & SP FP MUL, it isn’t that good for integer calculations (at least without a 32-bit integer MUL or some shifter).