CUDA Profiler: instruction throughput ratio > 1.00?

I’ve been profiling a program, and the profiler reports that I am achieving an instruction throughput ratio of 1.10. I am a little unclear as to what that actually means, since intuitively it seems the maximum should be 1.00. The description of the instruction throughput ratio in the README is not specific enough, and web searches have turned up nothing helpful.

Could anybody help me understand the meaning of “instruction throughput ratio” better than the README does?

And what is the actual maximum ratio that can be achieved, since apparently 1.00 is not it?

Thanks!

Dual issue? Are you doing MADs and MULs in close proximity to each other? (I don’t actually know, but that guess makes sense to me)

There are sections of code within loops that have MADs and MULs in proximity (roughly like the sketch below). Is there any documentation on which instructions can be dual issued together?
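A simplified, hypothetical sketch of the loop shape (not the actual kernel): the `a * b + c` expressions compile to MADs and the plain products to MULs, so the two instruction types sit next to each other in the instruction stream.

```
// Hypothetical sketch of the loop structure, not the real kernel.
// a * b + c compiles to a MAD; d * e compiles to a plain MUL, so the
// two instruction types end up adjacent in the instruction stream.
__global__ void mixed_math(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float a = in[i];
    float d = a + 1.0f;
    for (int k = 0; k < 64; ++k) {
        a = a * 1.0001f + 0.5f;   // MAD
        d = d * 0.9999f;          // MUL
    }
    out[i] = a + d;
}
```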

That is a good question… The normal ALUs and the SFU can be issued instructions together, but I can’t recall any NVIDIA documentation describing exactly how that works. The Real World Technologies (RWT) GT200 article might be useful.

OK, thanks! That article helps a lot. So, in theory, if it were possible to sustain dual issue, and since the instruction throughput ratio is measured against a single-issue rate, the highest achievable throughput ratio would be 2.00, correct?

This is also a good question. I assumed that since there is only one SFU but 8 ALUs, a MUL instruction for a warp sent to the SFU would take 8 times longer to finish, so in the time the ALUs issue 8 instructions you squeeze in only 1 extra one, giving 9/8 = 1.125. This is similar to double precision, which also takes 8 times longer because there is only one double-precision ALU per multiprocessor. That would imply a maximum throughput ratio of 1.125.

I’m speculating here, of course.

The SFU is actually 2 execution units though, not 1, so by your reasoning it would take 4 times longer, not 8, making the maximum throughput ratio 10/8 = 1.25. But the article states that with sustained dual-issue execution “the computational throughput of the shader core is increased by 50%”, implying that the maximum throughput ratio is 1.50.

Making things even more confusing, the article goes on to mention that MULs complete on the SFU in 4 cycles, just like on the ALU, and “in a single fast clock cycle, the execution units can perform up to 8 FMADs and 8 FMULs.” That makes it sound like the maximum is 2.00.

scratches head

So all of this would mean that, to get up to 1 TFLOP, I would need to push this ratio toward a maximum somewhere between 1.125 and 2.00, not toward 1.00?

This actually makes sense.

Each of the 2 SFUs can execute either 1 special function or 4 MULs each cycle.

MADs are counted as 2 flops.

So 8 MADs + 8 MULs count as 24 flops, 50% more than the 16 flops of 8 MADs…

I suppose the profiler counts either a MAD or a MUL as 1 operation, so the maximum ratio would be 2?
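Spelling that out per multiprocessor per hot clock (using the 8 SP + 2 SFU organization from the article): MAD issue alone gives 8 × 2 = 16 flops/clock; sustained MAD + MUL dual issue gives 8 × 2 + 8 × 1 = 24 flops/clock, which is the 50% increase (24/16 = 1.5) the article quotes; but counted as instructions it is (8 + 8)/8 = 2.0 times the single-issue rate, which would correspond to a profiler ratio of 2.00.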

I can confirm that a test application that does nothing but long runs of MADs and MULs achieves 3 flops per SP per clock cycle (2 from the MAD, 1 from the MUL) on GT200.
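For anyone who wants to reproduce this, here is a minimal sketch of that kind of test (the kernel name, loop count, and launch configuration are mine, not the original benchmark): a long loop with one MAD and one independent MUL per iteration, timed with CUDA events to back out the flop rate.

```
// Minimal sketch of a MAD+MUL throughput test (names and sizes are
// illustrative, not the original benchmark). Each iteration issues one
// MAD (a*b + c, 2 flops) and one independent MUL (d*e, 1 flop), so a
// dual-issue part can retire 3 flops per SP per clock on this loop.
#include <cstdio>
#include <cuda_runtime.h>

#define ITERS 4096

__global__ void madmul(float *out, float seed)
{
    float a = seed + threadIdx.x;
    float d = seed - threadIdx.x;
    #pragma unroll 16
    for (int k = 0; k < ITERS; ++k) {
        a = a * 1.0000001f + 0.0000001f;  // MAD: 2 flops
        d = d * 0.9999999f;               // MUL: 1 flop
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + d;  // keep results live
}

int main()
{
    const int blocks = 240, threads = 256;  // plenty of warps to hide latency
    float *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    madmul<<<blocks, threads>>>(d_out, 1.0f);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    double flops = 3.0 * ITERS * (double)blocks * threads;  // 2 (MAD) + 1 (MUL)
    printf("%.1f GFLOP/s\n", flops / (ms * 1e6));

    cudaFree(d_out);
    return 0;
}
```

Comparing the printed rate against the MAD-only peak of the card should show whether the extra MUL is actually being dual issued.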

I hate to drag up old threads, but I can add a small additional piece of experimental evidence to the puzzle of instruction throughput: I have a rather compute-heavy kernel which the profiler reports as having an instruction throughput of 1.30182 on a compute capability 1.1 device.