CUDA thread processors vs. ATI stream processors

When comparing ATI cards against NVIDIA cards, the ATI ones always list vastly more stream processors for roughly the same performance.

What is it that NVIDIA does with its thread processors to achieve so much more throughput per processor?

Apologies for the trivial question; I have read through the programming guide, but this point seems unclear. Thanks for any help.

ATI Radeon HD 4870: 800 streaming processors, 750 MHz
NVIDIA GTX 280: 240 streaming processors, 1296 MHz

So the ATI chip has 3.3x as many streaming processors, but they are clocked at 58% of the frequency of the NVIDIA chip. That accounts for part of the difference, but it still means each NVIDIA streaming processor is, on average, doing roughly twice as much work per clock as an ATI streaming processor.
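Spelling out the arithmetic from those two spec lines:

    800 / 240   = 3.33   (ATI has 3.33x the SP count)
    750 / 1296  = 0.58   (but runs at 58% of the clock)
    3.33 * 0.58 = 1.93   (raw SP-clock throughput ratio, ATI : NVIDIA)

So for the two cards to end up at comparable performance, each NVIDIA SP has to do about 1.93x the per-clock work of an ATI SP.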

Does anyone know what the IPC is for the ATI chips? That's the only other explanation I can imagine: the NVIDIA chips are pipelined such that one instruction can complete per clock cycle.

NVIDIA and ATI are counting SPs differently.
Each of ATI's 800 SPs is capable of processing one 32-bit value, while each of NVIDIA's 240 SPs is capable of processing 4 32-bit values in parallel, and this takes 4 cycles (if I recall correctly).
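For what it's worth, the widely quoted die breakdowns line up with "counting differently" (these figures are from the public RV770/GT200 specs, not from the programming guide):

    HD 4870 (RV770): 10 SIMD cores x 16 VLIW units x 5 ALU lanes = 800 "stream processors"
    GTX 280 (GT200): 30 multiprocessors x 8 scalar SPs           = 240 "streaming processors"

ATI counts every ALU lane of its 5-wide VLIW units; NVIDIA counts each scalar processor once.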

Also, the HD4870 and GTX280 may have the "same performance" in games, but not in computing. For compute-bound kernels the ATI cards are better (well, if you can figure out how to write something for them – the documentation is awful).

For example: I have a kernel for NVIDIA and ATI which performs the same computations. On the HD4870 it is ~35% faster than on the GTX280. And I know it could be at least another 25% faster on ATI, because the hardware is 5-way superscalar and I'm only processing data 4 elements at a time (using 4 of the 5 lanes).
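To make "processing 4 elements at a time" concrete, here is a minimal CUDA-style sketch (the kernel names and the multiply-by-k operation are invented for illustration; the real kernel is whatever computation is actually being benchmarked). The point is that the vectorized version exposes four independent component operations per thread, which ATI-style 5-wide VLIW hardware can pack into one instruction, filling 4 of its 5 lanes:

    #include <cuda_runtime.h>

    // Scalar version: one 32-bit operation per thread per step.
    __global__ void scale_scalar(const float *in, float *out, float k, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * k;
    }

    // Vectorized version: four 32-bit operations per thread per step.
    // n4 = n / 4; assumes n is a multiple of 4.
    __global__ void scale_vec4(const float4 *in, float4 *out, float k, int n4)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) {
            float4 v = in[i];        // one 128-bit load, 4 elements at once
            v.x *= k; v.y *= k;      // four independent multiplies --
            v.z *= k; v.w *= k;      // on VLIW hardware these fill 4 lanes
            out[i] = v;
        }
    }

Restructuring the work to expose 5 independent operations per instruction would fill the fifth lane too, which is where the extra ~25% (5/4 = 1.25) would come from.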

1.35 * 1.25 = 1.68.
If we compare raw throughput (#SP * frequency), the ATI-to-NVIDIA ratio is (800 * 750) / (240 * 1296) = 1.93, which is only 15% higher than the expected 1.68x speedup (1.93 / 1.68 ≈ 1.15).