Discrepancy in Tensor Core FP16 Performance Ceiling on H100 SXM Observed in Nsight Compute

output-file-roofline.nsight-cuprof-report-2.zip (40.4 MB)
I suspect there is an issue with the performance ceiling reported in the Nsight Compute roofline analysis. I am conducting tests on an H100 SXM GPU. According to the official documentation (NVIDIA H100 Tensor Core GPU Datasheet), the Tensor Core FP16 performance limit should be approximately 900 TFLOPS. However, Nsight Compute reports a limit of around 700 TFLOPS, which matches the theoretical peak for the PCIe version.

I have attached the Nsight Compute profiling file. I would greatly appreciate it if an NVIDIA expert could help review and clarify this discrepancy. Thank you!

By the way, what is OP/s? The official documentation uses TFLOPS. Since I am using half precision, do I need to multiply OP/s by 2 to convert it to FLOPS? (The value seems strange.)

The Nsight Compute roofline is based on the measured clock frequency, not on the boost clock (which the tool cannot necessarily query). In the report header the SM Frequency is 1.39 GHz. The whitepaper uses the boost clock when calculating the maximum OP/s.
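A quick back-of-the-envelope check makes the gap plausible. This sketch assumes the publicly quoted H100 SXM dense FP16 Tensor Core peak of roughly 989 TFLOPS and a 1.98 GHz boost clock (both are assumed values from public specs, not taken from the attached report):

```python
# Scale the datasheet peak by the ratio of measured clock to boost clock.
# Assumed values: ~989.4 TFLOPS dense FP16 Tensor Core peak at a
# 1.98 GHz boost clock (public H100 SXM specifications).
datasheet_peak_tflops = 989.4
boost_clock_ghz = 1.98
measured_clock_ghz = 1.39   # SM Frequency from the report header

ceiling_tflops = datasheet_peak_tflops * measured_clock_ghz / boost_clock_ghz
print(f"{ceiling_tflops:.0f} TFLOPS")  # ~695, close to the ~700 Nsight reports
```

So a ceiling near 700 TFLOPS is exactly what you would expect when the GPU is running at 1.39 GHz instead of the boost clock.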

Nsight Compute defaults to collecting with the option --clock-control=base, which fixes the GPU clocks to the base clock. This allows Nsight Compute to replay a kernel or range more consistently, resulting in more consistent and accurate multi-pass metrics (e.g. the roofline). The option --clock-control=none can be used to prevent Nsight Compute from setting the GPU to base clocks. When using none, it is recommended to use nvidia-smi or another tool to lock the GPU clocks to a fixed rate, such as the documented boost clock.
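For example, a sketch of that workflow (the 1980 MHz value, the report name, and the kernel binary are assumptions; check the clocks your board actually supports with `nvidia-smi -q -d SUPPORTED_CLOCKS` first):

```shell
# Lock the GPU clocks to a fixed rate -- here the documented boost clock.
# 1980 MHz is an assumed value; query supported clocks before using it.
nvidia-smi -i 0 --lock-gpu-clocks=1980,1980

# Profile without Nsight Compute overriding the clocks.
ncu --clock-control=none -o roofline_report ./my_kernel

# Restore default clock behavior afterwards.
nvidia-smi -i 0 --reset-gpu-clocks
```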

By the way, what is OP/s? The official documentation uses TFLOPS. Since I am using half precision, do I need to multiply OP/s by 2 to convert it to FLOPS? (The value seems strange.)

OP = arithmetic OPerations. This unit already includes the weighting for multi-op instructions (e.g. FMA = 2 operations per thread).
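In other words, no extra factor of 2 is needed: the reported OP/s already counts an FMA as two operations, so it is directly comparable with the datasheet FLOPS figure. A small sketch with a hypothetical GEMM and a hypothetical measured runtime:

```python
# Operation count for an M x N x K half-precision GEMM (C = A @ B).
# Each output element needs K multiply-add pairs, i.e. 2*K operations,
# which matches how the roofline's OP unit counts an FMA.
M, N, K = 4096, 4096, 4096
ops = 2 * M * N * K          # ~137.4 GOP for this problem size

runtime_s = 0.5e-3           # hypothetical measured kernel time
achieved_tops = ops / runtime_s / 1e12
print(f"{achieved_tops:.0f} TOP/s")  # directly comparable to TFLOPS
```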

The term FLOP/s (or FLOPS) is not used because the roofline charts support bit formats, integer formats, and various floating-point formats, as well as "mixed precision" operations that do not align with the classical usage of FLOP = FP32 and DFLOP = FP64 operations. It is also significantly easier in a generic UI to use OPerations and specify the source data format, leaving terms such as FLOP, DFLOP, IOP, and TOP to the user.