Meaning of Operation Throughput


The CUDA Guide says that a compute capability 2.0 GPU has a throughput of 32 single-precision floating-point multiplications per clock cycle per multiprocessor, but that a warp has to wait 22 clock cycles for the result of such a computation. To me this means that each multiprocessor must contain 32 pipelines, each holding 22 multiplications in various stages of completion. However, 22 stages for a multiplication seems an awful lot to me, and I don't really see any other way to arrive at these numbers… Am I missing something?


No, you are seeing it right. The difference is that, unlike modern CPUs, GPUs are optimized not for low latency but for throughput. (Moderate) latency is easily hidden by having plenty of other warps (groups of 32 threads) ready to be scheduled onto the multiprocessor.
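The arithmetic behind this latency hiding can be sketched with Little's law (work in flight = throughput × latency), using the figures quoted above; this is a rough back-of-the-envelope sketch, not an exact occupancy model:

```python
# Little's law: operations in flight = throughput * latency.
throughput_ops_per_cycle = 32   # FP32 multiplies per cycle per SM (CC 2.0)
latency_cycles = 22             # cycles until a result is available

ops_in_flight = throughput_ops_per_cycle * latency_cycles
print(ops_in_flight)            # 704 operations must be in flight to keep the SM busy

# One warp instruction covers 32 operations (one per lane), so the SM
# needs roughly this many independent warp instructions in flight:
warps_needed = ops_in_flight // 32
print(warps_needed)             # ~22, matching the 22-cycle latency figure
```

This is why a multiprocessor wants on the order of 22 or more resident warps with independent work: as long as the scheduler can issue a new warp's multiply every cycle, the 22-cycle pipeline stays full and the 32-ops/cycle throughput is sustained.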

Also, those 22 cycles include instruction fetch and decode etc., unlike the roughly 4-cycle multiply latency that common out-of-order CPUs achieve.

I see, thanks!