instruction throughput > 1?

I have been testing a toy program on my integrated Nvidia card (compute capability 1.1), and found that the instruction throughput is 1.11475.

Does this make any sense? Thanks!

From the Visual Profiler User Guide:

“instruction throughput: Instruction throughput ratio. This is the ratio of achieved instruction rate to peak single issue instruction rate. The achieved instruction rate is calculated using the “instructions” profiler counter. The peak instruction rate is calculated based on the GPU clock speed. In the case of instruction dual-issue coming into play, this ratio shoots up to greater than 1. This is calculated as instructions / (gpu_time * clock_frequency)”
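To make that formula concrete, here is a minimal host-side sketch that just plugs numbers into it. The values are made up (they are not from any profiler run, they are simply chosen so the result lands near the 1.11 reported above):

```cpp
// Hedged sketch: evaluating the profiler's instruction throughput formula
// with hypothetical numbers. "instructions" stands for the profiler's
// warp-instruction counter, gpu_time for the kernel time, clock_frequency
// for the shader clock; none of these are real measurements.
#include <stdio.h>

int main(void)
{
    double instructions    = 1.2e6;   // hypothetical 'instructions' counter value
    double gpu_time        = 1.0e-3;  // hypothetical kernel time in seconds
    double clock_frequency = 1.08e9;  // hypothetical shader clock in Hz

    // achieved instruction rate / peak single-issue instruction rate
    double throughput = instructions / (gpu_time * clock_frequency);
    printf("instruction throughput = %f\n", throughput);  // ~1.11 with these numbers
    return 0;
}
```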

Yes, even the earliest CUDA-capable cards (starting from compute capability 1.0) are able to dual-issue instructions, so you can achieve more than one instruction per cycle on average.

Oh, really? I thought this was something supported only by Fermi.

Then what is the maximum instruction throughput? Thanks!

Compute capability 2.1 is the first that can issue two multiply-adds in the same cycle. 1.x devices, however, were already able to co-issue a mul in the special function units together with an add, mul, or multiply-add in the FPU.
All compute capabilities are able to issue some non-arithmetic instructions, like loads and stores, in parallel with an arithmetic instruction.
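For illustration, here is a minimal kernel sketch (the names, constants, and loop structure are mine, not from this thread) whose inner loop keeps two independent dependency chains alive, one of mads and one of muls, which is the kind of instruction mix a 1.x SM could in principle co-issue across the CUDA cores and the SFUs:

```cuda
// Hedged sketch: two independent chains, one of MADs and one of MULs, so the
// hardware has the option of co-issuing the MUL on the SFUs with the MAD on
// the CUDA cores. Whether dual-issue actually happens is up to the compiler
// and the scheduler; this only creates the opportunity.
__global__ void mad_mul_mix(float *out, int iters)
{
    float a = 1.0f + threadIdx.x * 1e-6f;
    float b = 1.000001f;
    float c = 0.999999f;

    for (int i = 0; i < iters; ++i) {
        a = a * b + 1e-7f;   // typically compiles to a MAD (FPU / CUDA cores)
        c = c * b;           // independent MUL the SFUs could pick up
    }

    out[blockIdx.x * blockDim.x + threadIdx.x] = a + c;
}
```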

I believe the maximum to be 2 (dual-issue at most).

I don’t see how you can achieve 2. You don’t have enough execution units for that. If you try real hard, you might get 1.375 on a 1.x device, 1.125 on a 2.0 device, and 1.167 on a 2.1 device.

Could you explain a little bit where those numbers are from?

This might be completely wrong: Is the dual-issuing related to ‘core clock’ and ‘shader clock’? Thanks!

How do you arrive at those numbers?

Of course 2 is a theoretical upper limit that is at least hard to approach in practice. And I do see that you can’t co-issue a memory transaction on every cycle, because memory just doesn’t have the bandwidth necessary. But what would prevent a long sequence of alternating mad and mul instructions from asymptotically approaching 2?

Dual-issue is not related to the ratio between core and shader clock: 2 instructions can be issued per every (fast) shader clock cycle.

1.375 is (8+2+1)/8, 1.125 is (32+4)/32, and 1.167 is (48+8)/48. A long sequence of alternating mad and mul instructions should (as far as I understand) asymptotically approach 1 (as long as you count mad as 1 instruction).

Sorry, I don’t understand your calculations. Can you elaborate a bit?

Yes, please elaborate… It looks like those numbers 8, 2, etc. are the numbers of FPUs and SFUs per SM for each generation (although I don’t know where the 1 is from for the 1.x generation).

Take a compute capability 1.x device.

The peak single issue instruction rate is 1 warp instruction (=32 operations) per 4 clocks.

During each of these 4 clocks, each SM can perform 8 operations in its main CUDA cores, 2 single-precision floating-point multiplications or transcendentals in its two SFUs, and 1 double-precision operation in its double-precision unit.

So if all these units were loaded to 100%, you’d see the throughput of (8+2+1)*4/32 = 1.375.
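The same back-of-the-envelope arithmetic, written out as a tiny check. The 1.x unit counts are the ones stated above; the 2.0 and 2.1 per-SM counts (32 cores + 4 SFUs, and 48 cores + 8 SFUs) are my assumption based on the usual Fermi SM layouts, not something stated earlier in the thread:

```cpp
// Hedged sketch: redoing the quoted ratios. Peak single-issue rate is taken
// as one operation per CUDA core per clock; the extra units (SFUs, plus the
// DP unit on 1.x) are what can push the ratio above 1.
#include <stdio.h>

int main(void)
{
    // compute capability 1.x: 8 CUDA cores + 2 SFUs + 1 double-precision unit per SM
    printf("CC 1.x: %.3f\n", (8.0 + 2.0 + 1.0) / 8.0);   // 1.375
    // compute capability 2.0: 32 CUDA cores + 4 SFUs per SM (assumed counts)
    printf("CC 2.0: %.3f\n", (32.0 + 4.0) / 32.0);        // 1.125
    // compute capability 2.1: 48 CUDA cores + 8 SFUs per SM (assumed counts)
    printf("CC 2.1: %.3f\n", (48.0 + 8.0) / 48.0);        // 1.167
    return 0;
}
```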

HOWEVER:

  1. There’s an article online that says that the double precision unit and SFUs can’t be active at the same time. No mention of that in official documentation.

  2. That same article seems to claim (in a roundabout way) that each SFU can do 4 single-precision floating-point multiplications per clock.

If that is the case, then the theoretical peak would indeed be 2.

I also stand corrected about my comment about the string of mads and muls. I did not realize that SFUs could do multiplications.

Is it possible to actually test it (the theoretical peak)?
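One rough way to probe it, as a hedged sketch only: launch a kernel packed with independent mul and mad chains, time it with CUDA events, and compare the resulting operation rate against what the cores alone could sustain at the shader clock. All names, the unrolling factor, and the launch configuration below are arbitrary choices, and register pressure, launch overhead, or the compiler rescheduling things can easily keep you below the theoretical peak:

```cuda
// Hedged microbenchmark sketch: independent MAD and MUL chains, timed with
// CUDA events. The printed rate can then be compared against
// (CUDA cores per SM) * (number of SMs) * (shader clock).
#include <stdio.h>
#include <cuda_runtime.h>

#define ITERS 4096

__global__ void throughput_kernel(float *out)
{
    float a = 1.0f + threadIdx.x * 1e-6f;
    float b = 1.000001f;
    float c = 0.999999f;
    float d = 1.000002f;

    #pragma unroll 16
    for (int i = 0; i < ITERS; ++i) {
        a = a * b + 1e-7f;   // MAD chain on the CUDA cores
        c = c * d;           // independent MUL chain for the SFUs
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + c;
}

int main(void)
{
    const int blocks = 64, threads = 256;
    float *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    throughput_kernel<<<blocks, threads>>>(d_out);  // warm-up launch
    cudaEventRecord(start);
    throughput_kernel<<<blocks, threads>>>(d_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // 2 arithmetic instructions (1 MAD + 1 MUL) per thread per iteration.
    // Divide by 32 for warp instructions, and by the number of SMs, to
    // compare against the per-SM issue rate at the shader clock.
    double instr = 2.0 * ITERS * blocks * threads;
    printf("time: %.3f ms, ~%.0f thread-level arithmetic instructions\n", ms, instr);
    printf("rate: %.2f Ginstr/s\n", instr / (ms * 1e-3) / 1e9);

    cudaFree(d_out);
    return 0;
}
```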

Could you provide a link to the article? Thanks.