I’m in the final stages of writing up a paper concerning a CUDA implementation of a cross-correlation algorithm for radio astronomy. The code is highly optimized, and on large enough problem sizes achieves over 1060 GFLOPS on a GTX 480. (The algorithm is one of those ideal ones like GEMM, in that it has O(n^2) compute, but only O(n) memory traffic.)
I am currently trying to build a performance model of this code, to see what the limiting factors are. Each thread does the following computation in each iteration of a for loop:
128x fma instructions
2x floating point additions
32x 32-bit loads from shared memory
4x 32-bit stores to shared memory
2x 64-bit texture requests
1x branch compare (from the for loop)
This comes to a total of 171 instructions issued per loop iteration. The 128 fma instructions are the only desired computation, and they are what I am counting when I report the 1060 GFLOPS figure, which corresponds to 79% of peak performance. I’m trying to work out what the actual utilization is when I count the other instructions as well.
To the best of my knowledge the GF100/GF110 cannot co-issue instructions within the same warp (unlike GF104/GF114), so I had assumed that every issued instruction that is not computation subtracts from the available GFLOPS. Thus, the effective utilization would be 79% * (171/128) = 106%. Clearly this can’t be correct, so what am I doing wrong here? Can GF100 co-issue some instructions, e.g., shared memory loads and fma instructions?
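For anyone who wants to check the arithmetic, here is a minimal sketch of the calculation in Python. The 480 CUDA cores and 1401 MHz shader clock are assumed GTX 480 reference specifications, not numbers taken from the post, and the function names are purely illustrative. The model assumes each core can issue at most one instruction per cycle, so a result above 100% means that assumption (or one of the counts) does not hold.

```python
# Back-of-the-envelope issue-slot model for the loop described above.
# ASSUMPTION: GTX 480 reference specs (480 CUDA cores, 1.401 GHz shader clock).
CUDA_CORES = 480
SHADER_CLOCK_GHZ = 1.401
FLOPS_PER_FMA = 2  # one fused multiply-add counts as two floating point operations


def peak_gflops(cores=CUDA_CORES, clock_ghz=SHADER_CLOCK_GHZ):
    """Theoretical single-precision peak if every core retires one fma per cycle."""
    return cores * clock_ghz * FLOPS_PER_FMA


def fma_fraction_of_peak(measured_gflops, peak):
    """Fraction of peak reached when only the fma instructions are counted."""
    return measured_gflops / peak


def issue_slot_utilization(fma_fraction, total_instr, fma_instr):
    """Implied issue-slot utilization if every instruction costs one issue slot."""
    return fma_fraction * total_instr / fma_instr


if __name__ == "__main__":
    peak = peak_gflops()
    frac = fma_fraction_of_peak(1060.0, peak)       # 1060 GFLOPS as reported
    util = issue_slot_utilization(frac, 171, 128)   # 171 issued, 128 of them fma
    print(f"peak:                {peak:.0f} GFLOPS")
    print(f"fma fraction:        {frac:.1%}")
    print(f"implied utilization: {util:.1%}")
```

The sketch uses the quoted total of 171 instructions. The per-iteration counts listed above sum to 169, which gives a slightly lower figure that is still above 100%.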