Is there a way I can tell whether I'm getting concurrent floating point instructions on cc86, cc89, +?

dscerutti · October 14, 2024, 12:34am

The Ampere GA102, GA103 and Lovelace AD102, AD103 chips (cc86 and cc89, respectively) can set up two concurrent fp32 operations, as opposed to one int32 and one fp32 instruction in the earlier Turing hardware. Thus, the theoretical peak fp32 FLOPS of an A40 is nearly double that of an A100 (which does not have such concurrent instructions available, although it does do “beast mode” fp64).

Is there a way I can tell whether, to any extent, my code is taking advantage of such concurrent instructions? I have a slick way to rejigger the most compute-heavy kernel in my code that will present two identical calculations in the inner loop, although it’s not a slam dunk (each calculation involves an if-then based on whether two particles are within range of one another, and while it will be very likely that both particles are within range if either is within range, I may have to crunch through some blanks in a way that SIMT can’t mitigate to any degree). I’d like to see whether I’m already getting some concurrency, however.

njuffa · October 14, 2024, 3:13am

This question would appear to be a better fit for the Nsight Compute sub-forum.

Looking through the available events, I don’t see any that appear to directly provide a count of FP32 dual-issue events. Indirectly, one could conclude that when more FP32 instructions are issued than cycles have elapsed, there must have been FP32 dual issues (by the pigeonhole principle), but this kind of weak statement is probably not helpful for the scenario described.
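As a rough illustration of the pigeonhole argument above, here is a minimal Python sketch. The function name and both inputs are hypothetical: they stand for an FP32 instruction count and a cycle count that you would have to obtain yourself, measured over the same piece of hardware and in the same units. The check can only confirm that some dual issue happened; a False result says nothing either way.

```python
def dual_issue_implied(fp32_issued: int, cycles_elapsed: int) -> bool:
    """Pigeonhole check: True only if dual issue MUST have occurred.

    Both arguments are hypothetical counters (not Nsight Compute metric
    names). If more FP32 instructions were issued than cycles elapsed,
    at least one cycle must have issued two or more of them.
    """
    if fp32_issued < 0 or cycles_elapsed < 0:
        raise ValueError("counts must be non-negative")
    return fp32_issued > cycles_elapsed


# Illustrative numbers only, not measured data.
print(dual_issue_implied(1500, 1000))  # more instructions than cycles
print(dual_issue_implied(800, 1000))   # inconclusive case
```

A False return value is the weak part of the statement: dual issue may still have occurred in some cycles while others issued nothing.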

dscerutti · October 14, 2024, 4:02am

Sounds good. I’ll… test!

Curefab · October 14, 2024, 12:49pm

Are you asking about the usage of the fmalite vs. fmaheavy pipelines? Are they shown separately or as one FMA entry?

Greg · October 14, 2024, 3:42pm

Nsight Compute does not have a metric that reports whether back-to-back FP32 instructions were issued using the new feature. It does show the utilization of the new math pipe, fmalite.

The utilization of the two FP32 pipes can be captured with:

sm__pipe_fmaheavy_cycles_active.avg.pct_of_peak_sustained_elapsed
sm__pipe_fmalite_cycles_active.avg.pct_of_peak_sustained_elapsed

FMAheavy executes FP32, FP16, IMAD, and a few other instructions.
FMAlite executes FP32 (FMA, FADD, FMUL) and FP16.

If the average of the two metrics is > 50%, then the new 2x rate is likely being leveraged. I say "likely" because FMAheavy could also be used for IMAD.
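The rule of thumb above can be written down as a small Python helper. This is only a sketch: the function name, argument names, and sample values are made up for illustration, and the two percentages are assumed to be the values Nsight Compute reports for the fmaheavy and fmalite metrics listed earlier.

```python
def fp32_dual_rate_likely(fmaheavy_pct: float,
                          fmalite_pct: float,
                          threshold_pct: float = 50.0):
    """Average the two pipe utilizations and compare with the threshold.

    Inputs are percentages (0 to 100), assumed to come from the
    sm__pipe_fmaheavy_cycles_active and sm__pipe_fmalite_cycles_active
    pct_of_peak_sustained_elapsed metrics. Returns (average, likely).
    """
    for value in (fmaheavy_pct, fmalite_pct):
        if not 0.0 <= value <= 100.0:
            raise ValueError("utilization must be between 0 and 100")
    average = (fmaheavy_pct + fmalite_pct) / 2.0
    # "Likely" only: FMAheavy cycles may include IMAD work, not just FP32.
    return average, average > threshold_pct


# Illustrative values only, not measured data.
average, likely = fp32_dual_rate_likely(80.0, 60.0)
print(f"average pipe utilization: {average:.1f}%, 2x rate likely: {likely}")
```

The metric values themselves can be collected by passing both metric names, comma-separated, to the `--metrics` option of `ncu` when profiling your application.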
