The G80 GTX is rated at around 518 GFLOPS (http://en.wikipedia.org/wiki/G80#GeForce_8800), which I know is a theoretical value, since the arithmetic unit is shared between the 8 SPs in an MP. But still, if I use the following calculation, 128 SPs * 2 (ops per clock) * 1.35 GHz, I get about 345 GFLOPS.
Where do the remaining (518 - 345) 173 GFLOPS come from? Again, I know that 518 is a theoretical value, but 173 is really a lot. Is this the PTX virtual machine overhead?
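Redoing the arithmetic from the question (the final back-solve to flops per SP per clock is my own addition, not something stated in the thread):

```python
# Peak-FLOPS arithmetic for the 8800 GTX (G80), as discussed above.
sps = 128          # scalar processors ("SPs") across all multiprocessors
shader_ghz = 1.35  # shader clock in GHz

# Counting a MAD as 2 flops per SP per clock:
mad_only = sps * 2 * shader_ghz
print(f"{mad_only:.1f} GFLOPS")                  # 345.6 GFLOPS

# Gap to the marketing number:
marketing = 518.4
print(f"{marketing - mad_only:.1f} GFLOPS gap")  # 172.8 GFLOPS gap

# Back-solving the marketing number to flops per SP per clock:
print(f"{marketing / (sps * shader_ghz):.1f} flops/SP/clock")  # 3.0 flops/SP/clock
```

Note that 518.4 works out to exactly 3 flops per SP per clock, i.e. one extra flop per clock beyond the MAD's two.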
The extra FLOPS come from the texture interpolators (which are presumably hardwired in silicon rather than run on the multiprocessors). I’m not sure exactly how the math works out to count the extra GFLOPS from them, though. Note that the CUDA Programming Guide only claims ~340 GFLOPS in Figure 1.
I’m not sure I understand you here, but the arithmetic units are not shared. Each multiprocessor has 8 ALUs (which NVIDIA calls “processors”) but only one instruction decoder. So to use all 8 ALUs, you need them all to execute the same instruction (though, of course, each ALU operates on different registers).
But yes, MisterAnderson’s explanation is correct. The marketing materials for the 8800 GTX include the computations done by the texture units in the total, which is highly optimistic unless you are doing nothing but texture math. The CUDA manual computes its GFLOPS figure from just the ALUs (“processors”) at full utilization.
Well, each MP contains 8 SPs but only one instruction decoder; that is why the same instruction is executed on all 8 at once. But I’m not sure whether the MAD and MUL units are included in each SP. I read somewhere that they are shared, but I’m not certain.
Anyway, the most important point is that two arithmetic operations per clock are possible, provided the expression maps to a multiply-add, i.e. a * b + c, and not to something like (a * b) * c or cos(a * b).
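A toy way to see what that implies for throughput (my own illustration with a simplified issue model, not real G80 scheduling): a multiply-add retires 2 flops in one instruction slot, while a chain of plain multiplies retires only 1 flop per slot, so it runs at half the peak rate.

```python
# Toy instruction-count model: each SP issues one instruction per clock;
# a MAD retires 2 flops, a plain MUL or ADD retires 1. (Special functions
# like cos go to a separate, slower path and are ignored here.)
def flops_per_clock(instructions):
    """instructions: list of 'mad', 'mul', or 'add' issued back to back."""
    flops = {"mad": 2, "mul": 1, "add": 1}
    return sum(flops[i] for i in instructions) / len(instructions)

# a*b + c -> one MAD: full rate, 2 flops/clock per SP
print(flops_per_clock(["mad"]))         # 2.0

# (a*b)*c -> two dependent MULs: half rate, 1 flop/clock per SP
print(flops_per_clock(["mul", "mul"]))  # 1.0
```

Scaled up, the MAD-only case is what the 128 * 2 * 1.35 GHz figure assumes; code dominated by plain multiplies or additions can reach at most half of it.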