What's the peak performance with 32-bit integers?

empty_knapsack · June 27, 2009, 9:14am

As I understand it, peak FP performance is counted as #ALUs * ALU clock * 3. The “3” is there because it’s possible to do a MAD plus a MUL in one clock, so, for example, a GTX260 with 192 SPs peaks at 192 * 1.242 * 3 = 715 GFLOPS.
However, I’m interested in peak performance with 32-bit integers only, where no MAD/MUL is needed. So it should peak at 192 * 1.242 = 238.5 G(IOPS? I’m not sure what to call this, as IOPS usually means “input/output operations per second”…). Anyway, real results give me performance about 1.5x higher than the theoretical value. It’s not a compiler issue (such as the compiler optimizing something out). So now I’m puzzled why it happens.
I’ve tested the same calculations on ATI GPUs, and with the HD4850 practical results perfectly match the theoretical ones (the HD4850 has 160 ALUs, each able to do 5 operations per clock @ 625 MHz: 160 * 5 * 0.625 = 500).

What am I missing? This 1.5x looks really weird; I’d at least expect it to be close to an integer value :).
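
For reference, the peak figures quoted in this post can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function name `peak_gops` is made up for illustration, and the unit counts and clocks are the ones given in the post.

```python
def peak_gops(units, clock_ghz, ops_per_clock):
    """Theoretical peak in giga-operations per second.

    units         -- number of ALUs / SPs (from the post)
    clock_ghz     -- ALU clock in GHz (from the post)
    ops_per_clock -- operations each unit is assumed to retire per clock
    """
    return units * clock_ghz * ops_per_clock


# GTX260 with 192 SPs at 1.242 GHz.
gtx260_fp = peak_gops(192, 1.242, 3)    # MAD (2 flops) + MUL (1 flop) per clock
gtx260_int = peak_gops(192, 1.242, 1)   # one integer op per clock, no dual issue

# HD4850 as described in the post: 160 ALUs, 5 ops per clock, 625 MHz.
hd4850 = peak_gops(160, 0.625, 5)

print(round(gtx260_fp, 1), round(gtx260_int, 1), round(hd4850, 1))
```

The puzzling measurement is then roughly 1.5 * `gtx260_int`, i.e. about 358 G integer ops/s instead of the expected 238.5.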

seibert · June 27, 2009, 8:41pm

What kind of integer operations are you testing? There is a large speed difference between integer addition and (32-bit) integer multiplication on the GPU.

empty_knapsack · June 27, 2009, 8:45pm

No multiplications at all. In fact it’s SHA1, i.e. many logical operations, adds and shifts, nothing complex.
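
To make the instruction mix concrete, here is a plain Python sketch of the standard SHA-1 compression function. It is not the kernel being benchmarked in this thread (that code is not shown); it only illustrates that each of the 80 rounds consists of AND/OR/XOR/NOT, 32-bit adds, and rotates written as two shifts and an OR. The helper names `rotl`, `sha1_compress` and `sha1` are chosen for this example.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF  # keep every value in 32 bits, as a GPU register would


def rotl(x, n):
    # Rotate left: two shifts and an OR.
    return ((x << n) | (x >> (32 - n))) & MASK


def sha1_compress(state, block):
    # Message schedule: XORs and a rotate by 1.
    w = list(struct.unpack(">16I", block))
    for t in range(16, 80):
        w.append(rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))

    a, b, c, d, e = state
    for t in range(80):
        if t < 20:
            f, k = (b & c) | (~b & d), 0x5A827999
        elif t < 40:
            f, k = b ^ c ^ d, 0x6ED9EBA1
        elif t < 60:
            f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
        else:
            f, k = b ^ c ^ d, 0xCA62C1D6
        # Only adds, logical ops and rotates; no multiplications.
        temp = (rotl(a, 5) + (f & MASK) + e + k + w[t]) & MASK
        a, b, c, d, e = temp, a, rotl(b, 30), c, d

    return tuple((s + v) & MASK for s, v in zip(state, (a, b, c, d, e)))


def sha1(message):
    state = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)
    # Standard padding: 0x80, zeros, then the bit length as 64-bit big-endian.
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += struct.pack(">Q", 8 * len(message))
    for i in range(0, len(padded), 64):
        state = sha1_compress(state, padded[i:i + 64])
    return struct.pack(">5I", *state).hex()


# Sanity check against the standard library implementation.
print(sha1(b"abc") == hashlib.sha1(b"abc").hexdigest())
```

Note the several register-to-register moves implied by the `a, b, c, d, e` rotation at the end of each round; whether a compiled GPU kernel keeps them as MOV instructions depends on how the loop is unrolled.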

empty_knapsack · July 11, 2009, 5:31am

Hmm, does nobody know?

Actually it isn’t only integers; the same applies to SP floats without heavy MAD/MUL usage. And DPFP too.

With ATI it’s possible to get performance degradation if the instruction flow cannot be packed into 5 operations per ALU (it happens when data depends heavily on previous calculations), so vectorization is required even within a single thread. AFAIK there are no such problems with nVidia’s GPUs, as an ALU can perform only 1 instruction per clock.

And looking at decuda’s output for the compiled cubin I don’t see many fused operations (like add.half.b32 $r3, $r11, $r12), so my theory that the 1.5x performance boost comes from fused operations has failed.

Sylvain_Collange · July 11, 2009, 3:30pm

At least the throughput of DP should be fairly predictable. Are you sure of your timing measurements? Do you count non-arithmetic operations (mov) as well?

The same problem exists on NVIDIA GPUs, to a lesser extent. They have two execution pipes (MAD and SFU) that likely have higher latencies than ATI’s. The difference is that in the NVIDIA case instructions are scheduled by the hardware (superscalar) instead of the compiler (VLIW).

Performance is more difficult to predict on a superscalar than on a VLIW…

The “half” instructions seem to exist only to save instruction cache and memory bandwidth. They don’t execute faster than regular full-width operations.

You are getting higher performance because the hardware instruction scheduler dispatches instructions to both execution pipes (MAD+MUL is just a possibility, it can be anything like {MAD,MUL,ADD,MOV,AND,SHL…}+{MUL,MOV,RCP…}). MUL and MOV can be scheduled to the second unit, so they are essentially free if the MAD unit is the bottleneck.

So the peak integer performance is 2 ops/cycle/SP, if you have the right instruction mix.
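
A back-of-the-envelope model of this dual-issue explanation can be written in Python. It is a simplification with loudly stated assumptions: the scheduler is ideal, latencies and dependencies are ignored, the first pipe is assumed able to run every instruction in the mix, and only a fraction `second_pipe_fraction` of instructions is eligible for the second pipe. The function name and parameter are invented for this sketch.

```python
def dual_issue_speedup(second_pipe_fraction):
    """Speedup over the 1 op/clock/SP model under ideal dual issue.

    second_pipe_fraction -- share of instructions (0..1) that the second
    pipe is able to execute; an assumed input, not a measured value.
    """
    f = second_pipe_fraction
    if not 0.0 <= f <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    # Per instruction, the first pipe needs at least (1 - f) cycles for the
    # work only it can do, and two pipes can never beat 0.5 cycles.
    cycles_per_instruction = max(1.0 - f, 0.5)
    return 1.0 / cycles_per_instruction


base = 192 * 1.242  # GTX260 figure from the first post, in G ops/s

for f in (0.0, 1.0 / 3.0, 0.5):
    s = dual_issue_speedup(f)
    print(f"fraction={f:.2f}  speedup={s:.2f}  peak={base * s:.1f} Gops/s")
```

Under these assumptions, a 1.5x speedup corresponds to about one third of the instructions (for example MOVs) landing on the second pipe, and the ceiling of 2 ops/cycle/SP is reached once at least half of them can.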

empty_knapsack · July 11, 2009, 6:56pm

Thanks a lot, this explains pretty much everything.

Is there a list somewhere of all possible instructions for each execution pipe? If the second (“MUL”) pipe is limited to MOV & SPFP MUL, it isn’t that good for integer calculations (at least without a 32-bit integer MUL or some shifter).
