Discrepancy in Tensor Core FP16 Performance Ceiling on H100 SXM Observed in Nsight Compute

output-file-roofline.nsight-cuprof-report-2.zip (40.4 MB)
I suspect there is an issue with the performance ceiling reported in the Nsight Compute roofline analysis. I am conducting tests on an H100 SXM GPU. According to the official documentation (NVIDIA H100 Tensor Core GPU Datasheet), the Tensor Core FP16 performance limit should be approximately 900 TFLOPS. However, Nsight Compute reports a limit of around 700 TFLOPS, which matches the theoretical peak for the PCIe version.

I have attached the Nsight Compute profiling file. I would greatly appreciate it if an NVIDIA expert could help review and clarify this discrepancy. Thank you!

By the way, what is OP/s? The official documentation uses TFLOPS. Since I am using half precision, do I need to multiply OP/s by 2 to convert it to FLOPS? (The value seems strange.)

The Nsight Compute roofline is based on the measured clock frequency, not on the boost clock (which the tool cannot necessarily query). In the report header the SM Frequency is 1.39 GHz. The whitepaper uses the boost clock when calculating the maximum OP/s.
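A quick back-of-the-envelope check makes the gap plausible. This sketch assumes the publicly quoted H100 SXM dense FP16 Tensor Core peak of roughly 989 TFLOPS and a 1.98 GHz boost clock (both are assumed values from public specs, not taken from the attached report):

```python
# Scale the datasheet peak by the ratio of measured clock to boost clock.
# Assumed values: ~989.4 TFLOPS dense FP16 Tensor Core peak at a
# 1.98 GHz boost clock (public H100 SXM specifications).
datasheet_peak_tflops = 989.4
boost_clock_ghz = 1.98
measured_clock_ghz = 1.39   # SM Frequency from the report header

ceiling_tflops = datasheet_peak_tflops * measured_clock_ghz / boost_clock_ghz
print(f"{ceiling_tflops:.0f} TFLOPS")  # ~695, close to the ~700 Nsight reports
```

So a ceiling near 700 TFLOPS is exactly what you would expect when the GPU is running at 1.39 GHz instead of the boost clock.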

Nsight Compute defaults to collecting with the option --clock-control=base, which fixes the GPU clocks to the base clock. This allows Nsight Compute to replay a kernel or range more consistently, resulting in more consistent and accurate multi-pass metrics (e.g. the roofline). The option --clock-control=none can be used to prevent Nsight Compute from setting the GPU to base clocks. When using none, it is recommended to use nvidia-smi or another tool to lock the GPU clocks to a fixed rate, such as the documented boost clock.
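For example, a sketch of that workflow (the 1980 MHz value, the report name, and the kernel binary are assumptions; check the clocks your board actually supports with `nvidia-smi -q -d SUPPORTED_CLOCKS` first):

```shell
# Lock the GPU clocks to a fixed rate -- here the documented boost clock.
# 1980 MHz is an assumed value; query supported clocks before using it.
nvidia-smi -i 0 --lock-gpu-clocks=1980,1980

# Profile without Nsight Compute overriding the clocks.
ncu --clock-control=none -o roofline_report ./my_kernel

# Restore default clock behavior afterwards.
nvidia-smi -i 0 --reset-gpu-clocks
```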

By the way, what is OP/s? The official documentation uses TFLOPS. Since I am using half precision, do I need to multiply OP/s by 2 to convert it to FLOPS? (The value seems strange.)

OP = arithmetic OPerations. This unit already includes the weighting for multi-op instructions (e.g. FMA = 2 operations per thread).
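In other words, no extra factor of 2 is needed: the reported OP/s already counts an FMA as two operations, so it is directly comparable with the datasheet FLOPS figure. A small sketch with a hypothetical GEMM and a hypothetical measured runtime:

```python
# Operation count for an M x N x K half-precision GEMM (C = A @ B).
# Each output element needs K multiply-add pairs, i.e. 2*K operations,
# which matches how the roofline's OP unit counts an FMA.
M, N, K = 4096, 4096, 4096
ops = 2 * M * N * K          # ~137.4 GOP for this problem size

runtime_s = 0.5e-3           # hypothetical measured kernel time
achieved_tops = ops / runtime_s / 1e12
print(f"{achieved_tops:.0f} TOP/s")  # directly comparable to TFLOPS
```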

The term FLOP/s (or FLOPS) is not used because the roofline charts support bit formats, integer formats, and various floating-point formats, as well as "mixed precision" operations that do not align with the classical usage of FLOP = FP32 and DFLOP = FP64 operations. It is also significantly easier in a generic UI to use OPerations and specify the source data format, leaving terms such as FLOP, DFLOP, IOP, and TOP to the user.