Is there a way I can tell whether I'm getting concurrent floating point instructions on cc86, cc89, +?

The Ampere GA102, GA103 and Lovelace AD102, AD103 chips (cc86 and cc89, respectively) can set up two concurrent fp32 operations, as opposed to one int32 and one fp32 instruction in the earlier Turing hardware. Thus, the theoretical peak fp32 FLOPS of an A40 is nearly double that of an A100 (which does not have such concurrent instructions available, although it does do “beast mode” fp64).

Is there a way I can tell whether, to any extent, my code is taking advantage of such concurrent instructions? I have a slick way to rejigger the most compute-heavy kernel in my code that will present two identical calculations in the inner loop, although it’s not a slam dunk (each calculation involves an if-then based on whether two particles are within range of one another, and while it is very likely that both pairs are within range if either is, I may have to crunch through some blanks in a way that SIMT can’t mitigate to any degree). I’d like to see whether I’m already getting some concurrency, however.

This question would appear to be a better fit for the Nsight Compute sub-forum.

Looking through the events available I don’t see any that appear to directly provide a count of FP32 dual issue events. Indirectly one could conclude that when there are more FP32 instructions issued than cycles elapsed, there must have been FP32 dual issues (by the pigeonhole principle), but this kind of weak statement probably is not helpful for the scenario described.
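The pigeonhole argument above can be written down as a tiny check. This is only a sketch: the function name and its two inputs are hypothetical, and it assumes both counts were collected over the same scope (for example, the same SM sub-partition over the same kernel launch). A False result is inconclusive, which is exactly why the statement is weak.

```python
def fp32_dual_issue_guaranteed(fp32_instructions_issued: int, cycles_elapsed: int) -> bool:
    """Return True only when the pigeonhole principle forces a dual issue.

    Both arguments are hypothetical inputs and must cover the same scope.
    True  -> more FP32 instructions than cycles, so some cycle held two.
    False -> inconclusive; dual issue may or may not have happened.
    """
    if cycles_elapsed <= 0:
        raise ValueError("cycles_elapsed must be positive")
    return fp32_instructions_issued > cycles_elapsed


# Made-up numbers, for illustration only.
print(fp32_dual_issue_guaranteed(1_500, 1_000))
print(fp32_dual_issue_guaranteed(800, 1_000))
```

In practice most kernels will land in the inconclusive case, since loads, integer work and stalls keep the FP32 instruction count below the cycle count even when some instructions do overlap.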

Sounds good. I’ll… test!

Are you asking about the usage of the fmalite vs. fmaheavy pipelines? Are they shown separately or as a single FMA entry?

Nsight Compute does not have a metric that reports whether back-to-back FP32 instructions were issued using the new feature. It does show the utilization of the new math pipe, fmalite.

The utilization of the two FP32 pipes can be captured with:

  • sm__pipe_fmaheavy_cycles_active.avg.pct_of_peak_sustained_elapsed
  • sm__pipe_fmalite_cycles_active.avg.pct_of_peak_sustained_elapsed

FMAheavy executes FP32, FP16, IMAD, and a few other instructions.
FMAlite executes FP32 (FMA, FADD, FMUL) and FP16.

If the average of the two metrics is > 50%, then the new 2x rate is likely being leveraged. I say “likely” because FMAheavy could also be busy with IMAD.
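As a rough illustration of the rule of thumb above, here is a minimal Python sketch. It assumes you have already collected the two metrics for one kernel (for example by passing them to `ncu --metrics`) and typed the percentages in by hand. The helper name and the sample numbers are made up for illustration.

```python
FMAHEAVY = "sm__pipe_fmaheavy_cycles_active.avg.pct_of_peak_sustained_elapsed"
FMALITE = "sm__pipe_fmalite_cycles_active.avg.pct_of_peak_sustained_elapsed"


def likely_using_2x_fp32(metrics: dict, threshold_pct: float = 50.0):
    """Apply the 'average of both pipes > 50%' heuristic.

    metrics maps metric name -> percentage (0..100) for a single kernel.
    Returns (average_pct, verdict). The verdict is only a hint, because
    FMAheavy also executes IMAD and a few other instructions.
    """
    average_pct = (metrics[FMAHEAVY] + metrics[FMALITE]) / 2.0
    return average_pct, average_pct > threshold_pct


# Hypothetical values copied from a profiler report.
sample = {FMAHEAVY: 70.0, FMALITE: 40.0}
average_pct, verdict = likely_using_2x_fp32(sample)
print(f"average pipe utilization: {average_pct:.1f}% -> 2x rate likely used: {verdict}")
```

If FMAheavy is high while FMAlite stays near zero, it is worth checking the instruction mix for IMAD before drawing any conclusion.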