I mostly understand the difference between instructions, cycles, and floating-point operations, but the CUDA programming guide doesn’t tell the whole story. I would like to compare my application’s computing performance to the hardware optimum. I have an 8800GTX, so 346 GFLOPS is obviously the optimum (for GPGPU), and GPUBench reports a maximum of 165 billion scalar instructions per second.
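For reference, that 346 GFLOPS figure presumably comes from multiplying the stream-processor count by the shader clock and the two FLOPS of a MAD. A quick sanity check, assuming the published 8800GTX specs of 128 stream processors at a 1.35 GHz shader clock:

```python
# Theoretical peak for an 8800GTX, counting a MAD as 2 FLOPS.
# Assumed specs: 128 stream processors at a 1.35 GHz shader clock.
sps = 128
shader_clock_hz = 1.35e9
flops_per_mad = 2  # one multiply-add issues as a single instruction

peak_gflops = sps * shader_clock_hz * flops_per_mad / 1e9
print(peak_gflops)  # -> 345.6, rounded to 346 above
```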

First, when counting FLOPS, I know that a MAD counts as 2 FLOPS within a single instruction. I also know that an exp is one instruction but must internally perform a small number of FLOPS. How many, though? Page 49 states that most instructions take 2 cycles to issue, and that more complex instructions like floating-point reciprocal, exp, and sin take 16 cycles. Does that mean I should count 8 FLOPS or 16?
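To make the two interpretations concrete (using the 2-cycle and 16-cycle issue figures quoted from the guide), this is the arithmetic behind my 8-versus-16 question:

```python
# If a simple instruction issues in 2 cycles and exp issues in 16,
# an exp occupies the issue pipeline as long as 16 / 2 = 8 simple
# instructions -- so should it count as 8 FLOPS...
simple_issue_cycles = 2
exp_issue_cycles = 16

equivalent_simple_ops = exp_issue_cycles // simple_issue_cycles
print(equivalent_simple_ops)  # -> 8
# ...or should every one of its 16 cycles count as a FLOP, giving 16?
```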

Secondly, do the single-precision versions of exp and sin require fewer cycles, or count as fewer FLOPS? I see a 15% performance boost when using them.

As it is, my app issues about 130 billion instructions per second, so I am about as close to the optimum as I expect to get. Calculating cycles per second, though, I get something like 374 billion. What number should I compare that to? Multiplying memory speed by the number of processors only gives 230 billion cycles per second (1.8G * 128).
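For clarity, this is the only per-cycle baseline I could construct (assuming 1.8 GHz is the right clock to multiply by; please correct me if the baseline itself is wrong):

```python
# Candidate cycles-per-second baseline: clock rate * processor count.
# Assumed: 1.8 GHz effective clock across 128 stream processors.
clock_hz = 1.8e9
processors = 128

cycles_per_second = clock_hz * processors / 1e9
print(cycles_per_second)  # -> 230.4, the ~230 billion figure above
```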