As I understand it, peak FP performance is counted as #ALUs * ALU clock * 3. The “3” is because it’s possible to do a MAD + MUL in one clock (the MAD counts as 2 FLOPs, the MUL as 1), so, for example, a GTX260 with 192 SPs peaks at 192 * 1.242 * 3 = 715 GFLOPS.
However, I’m interested in peak performance with 32-bit integers only, where no MAD/MUL is needed. So it should peak at 192 * 1.242 = 238.5 G(IOPS? I’m not sure what to call this, since IOPS usually means “input/output operations per second”…). Anyway, my real results show performance about 1.5x higher than the theoretical value. It’s not a compiler issue (i.e. the compiler optimizing something out). So now I’m puzzled as to why this happens.
I’ve tested the same calculations on ATI GPUs, and with an HD4850 the practical results perfectly match the theoretical ones (the HD4850 runs 160 ALUs, each able to do 5 operations per clock @ 625 MHz: 160 * 5 * 0.625 = 500).
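To make the arithmetic explicit, here is a quick Python sketch of the numbers I’m comparing. The helper `peak_gops` and the variable names are just mine for illustration, and `gtx260_implied` is not a measured value; it is only what the ~1.5x ratio would imply.

```python
def peak_gops(units: int, clock_ghz: float, ops_per_clock: int = 1) -> float:
    """Theoretical peak in giga-operations per second: units * clock * ops/clock."""
    return units * clock_ghz * ops_per_clock


# GTX260: 192 SPs at 1.242 GHz
gtx260_fp = peak_gops(192, 1.242, 3)   # MAD + MUL per clock counted as 3 FLOPs
gtx260_int = peak_gops(192, 1.242)     # one 32-bit integer op per clock (my assumption)

# What the observed ~1.5x ratio would imply (derived, NOT a measurement)
gtx260_implied = 1.5 * gtx260_int

# HD4850: 160 ALUs, 5 operations per clock each, at 0.625 GHz
hd4850_int = peak_gops(160, 0.625, 5)

print(f"GTX260 FP peak:       {gtx260_fp:.1f} GFLOPS")
print(f"GTX260 int peak:      {gtx260_int:.1f} Gops/s")
print(f"GTX260 implied (1.5x): {gtx260_implied:.1f} Gops/s")
print(f"HD4850 int peak:      {hd4850_int:.1f} Gops/s")
```

So on the GTX260 I’d expect about 238.5 but see something near 357.7, while the HD4850 lands right on its 500.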
What am I missing? This 1.5x looks really weird; I’d at least expect it to be close to an integer value :).