How to compute performance in GFLOPS?

The CUDA 2.0 guide mentions that one ADD or MADD instruction takes 4 clock cycles, not just one … which leaves me a bit confused.
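To put numbers on my confusion: if I take an 8800 GTX with 128 SPs and a 1.35 GHz shader clock (assuming I have the specs right) and count a MADD as 2 flops, then 4 cycles per MADD would naively give 128 × 1.35 GHz × 2 / 4 ≈ 86 GFLOP/s, nowhere near the peak NVIDIA quotes (128 × 1.35 GHz × 2 ≈ 346 GFLOP/s for MADs alone).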

What am I getting wrong?

I don’t know why they document it that way. Looked at from the hardware side, yes, it takes 4 cycles for a 32-thread warp to do a MADD across the 8 ALUs of an SM. Looked at from the software side, it’s 1 cycle.
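The arithmetic is simple: a warp is 32 threads and an SM has 8 ALUs, so one MADD for the whole warp needs 32 / 8 = 4 clocks. But 8 MADDs still complete every clock, so averaged per thread it looks like 1 cycle per instruction; the peak flop rate comes out the same either way you count.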

Maybe it’s that it takes four cycles to issue the first instruction, but any subsequent instructions take 1 cycle?

Or maybe the instruction pipeline is just that: pipelined, so while it might take 4 cycles to issue an instruction, the hardware can execute instructions while issuing new ones.

I’ve seen kernels that do 250+ GFLOP/s on a 9800GT, so there must be a way to crunch all those instructions together.
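For reference (going from memory on the specs, so double-check them): a 9800GT has 112 SPs at roughly 1.5 GHz shader clock, so the MAD-only peak is about 112 × 1.5 GHz × 2 flops ≈ 336 GFLOP/s, and counting the dual-issued MUL pushes the theoretical number toward ~504. 250+ GFLOP/s measured is a big chunk of that, so those 4-cycle warp instructions clearly overlap rather than serialize.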

Saying that an instruction in one thread takes 1 cycle is an abstraction. In truth, instructions actually execute per warp, and a 32-thread warp takes (at a minimum) 4 cycles to execute on an 8-ALU multiprocessor. The documentation uses the more physical definition of “instruction.” But when reading the documentation, you can divide every instruction cycle count by 4 to get a more natural “thread-level” count. That way, an ADD takes 1 cycle instead of 4, a modulus takes 8 instead of 32, and so on. In fact, that’s what the documentation should have said from the start.
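If you want to measure it rather than take my word for it, here is a rough sketch (my own throwaway code, not from any SDK sample; the kernel name, launch configuration, and ITERS constant are arbitrary choices) of a MAD-bound kernel timed with CUDA events:

// Rough MAD-throughput sketch: names and sizes are arbitrary choices.
#include <stdio.h>
#include <cuda_runtime.h>

#define ITERS 4096   // MADs per thread; each MAD counts as 2 flops

__global__ void mad_kernel(float *out, float a, float b)
{
    float x = threadIdx.x * 0.001f;
    #pragma unroll 64
    for (int i = 0; i < ITERS; ++i)
        x = x * a + b;                                   // one MAD per iteration
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;      // keep the result live
}

int main()
{
    const int blocks = 1024, threads = 256;
    float *d_out;
    cudaMalloc((void **)&d_out, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    mad_kernel<<<blocks, threads>>>(d_out, 1.0001f, 0.0001f);   // warm-up launch
    cudaEventRecord(start);
    mad_kernel<<<blocks, threads>>>(d_out, 1.0001f, 0.0001f);   // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    double flops = 2.0 * ITERS * (double)blocks * threads;      // 2 flops per MAD
    printf("%.1f GFLOP/s\n", flops / (ms * 1e-3) / 1e9);

    cudaFree(d_out);
    return 0;
}

Each thread runs a dependent chain of MADs, so a single warp on its own would be latency-bound, but with enough warps resident per SM the scheduler overlaps them and the measured number should land within shouting distance of the MAD-only peak.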

Thanks for the clarification.

ok, that makes sense, thanks.