Hello,
The CUDA Guide says that a compute capability 2.0 GPU has a throughput of 32 float multiplications per clock cycle per multiprocessor, but that a warp has to wait 22 clock cycles for the result of such a computation. To me this means that each multiprocessor must contain 32 pipelines, each holding 22 multiplications in various stages of completion. Still, 22 stages for a single multiplication seems like an awful lot to me, and I don't see any other way to arrive at these numbers… Am I missing something?
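To spell out the arithmetic behind my reading, here is a small Python sketch. It assumes the multiprocessor is kept completely busy and that both figures from the guide apply to the same instruction. The constant and function names are made up for illustration, and the warp size of 32 threads is the only number not quoted above.

```python
# Figures quoted from the CUDA Guide as I read them; names are illustrative only.
THROUGHPUT_PER_CLOCK = 32  # float multiplications completed per clock per multiprocessor
LATENCY_CLOCKS = 22        # clocks a warp waits for the result of one multiplication
WARP_SIZE = 32             # threads per warp (assumption added for this sketch)


def in_flight(throughput_per_clock: int, latency_clocks: int) -> int:
    """Operations in progress at once when the units are kept full.

    This is just throughput multiplied by latency.
    """
    return throughput_per_clock * latency_clocks


total = in_flight(THROUGHPUT_PER_CLOCK, LATENCY_CLOCKS)
per_pipeline = total // THROUGHPUT_PER_CLOCK  # depth if there are 32 parallel pipelines
warps_worth = total // WARP_SIZE              # same work counted in warp-wide instructions

print(total, per_pipeline, warps_worth)  # prints: 704 22 22
```

So under these assumptions there would be 704 multiplications in progress per multiprocessor, which is where my "32 pipelines of 22 stages" picture comes from. Counted another way, that is 22 warp-wide instructions in flight at once.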
Best,
Nikolaus