H100 PCIe hgemm cannot reach peak performance

Hi there,

I evaluated hgemm on an H100 PCIe with the CUTLASS profiler, cuBLAS, and Triton, but the performance tops out at about 400 TFlops. However, the peak performance shown in the whitepaper is 756 TFlops. I am not sure whether the 400 TFlops result is by design or whether I evaluated it incorrectly.
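For reference, here is a minimal sketch of how a TFlops figure like the ones above is usually derived from a timed GEMM. It assumes the conventional count of 2·M·N·K floating-point operations per GEMM. The function names are hypothetical, and the profilers mentioned above may count operations differently.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved throughput of one (m x k) @ (k x n) GEMM, in TFlops."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Assumption: one multiply and one add per inner-product term.
    flops = 2 * m * n * k
    return flops / seconds / 1e12


def fraction_of_peak(achieved_tflops: float, peak_tflops: float) -> float:
    """Ratio of measured throughput to the quoted peak."""
    return achieved_tflops / peak_tflops


# Figures quoted in this post: ~400 TFlops measured, 756 TFlops in the whitepaper.
print(f"{fraction_of_peak(400.0, 756.0):.1%}")  # 52.9%
```

The time passed in should be an average over many back-to-back kernel launches after a warm-up, so that launch overhead and clock ramp-up do not dominate.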

I ran the evaluation with driver 535.129.03 and CUDA 12.2.1. nvidia-smi shows that the H100 runs at 1 GHz, 350 W, and ~60 °C, but the peak frequency should be 1.75 GHz. However, if I run hgemm with zero matrices as inputs, the H100 reaches 1.75 GHz and 700+ TFlops.
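As a rough sanity check, the sketch below scales the whitepaper figure by the observed clock. It rests on two assumptions that are not confirmed here: that tensor-core throughput scales linearly with SM clock, and that the 756 TFlops figure corresponds to the 1.75 GHz peak clock (the whitepaper may define its peak at a different clock). The function name is hypothetical.

```python
def clock_scaled_tflops(peak_tflops: float,
                        peak_clock_ghz: float,
                        observed_clock_ghz: float) -> float:
    """Scale a quoted peak linearly to the clock actually observed."""
    return peak_tflops * observed_clock_ghz / peak_clock_ghz


# Numbers from this post: 756 TFlops peak, 1.75 GHz peak clock, 1 GHz observed.
expected = clock_scaled_tflops(756.0, 1.75, 1.0)
print(f"{expected:.0f} TFlops")  # 432 TFlops
```

Under those assumptions, about 400 TFlops at 1 GHz is close to what the reduced clock alone would allow, which points at the clock rather than the kernels.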

Similar results have been reported by others as well, for example on Reddit.

It’s not realistic to expect to reach peak performance.

Running gemm or tensorcore codes will often cause the GPU to throttle its clocks to stay within an appropriate power envelope.

Yes, the input data pattern can affect power consumption, and therefore measured performance.
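One way to check whether the clocks are being held down by the power limit is to sample nvidia-smi while the GEMM is running. The sketch below is only illustrative: the helper names and the 95% margin are assumptions, and the heuristic is not an official throttle indicator. It assumes every queried field returns a number (a field reported as `[N/A]` would raise an error).

```python
import subprocess

# nvidia-smi --query-gpu fields: SM clock and its maximum (MHz),
# power draw and limit (W), GPU temperature (degrees C).
FIELDS = ["clocks.sm", "clocks.max.sm", "power.draw", "power.limit", "temperature.gpu"]


def parse_sample(csv_line: str) -> dict:
    """Parse one 'csv,noheader,nounits' line into a field -> float mapping."""
    values = [float(v) for v in csv_line.strip().split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}")
    return dict(zip(FIELDS, values))


def read_sample(gpu_index: int = 0) -> dict:
    """Query one GPU through nvidia-smi (needs the NVIDIA driver installed)."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.splitlines()[0])


def looks_power_limited(sample: dict, margin: float = 0.95) -> bool:
    """Heuristic: SM clock below its maximum while power draw is near the limit."""
    below_max_clock = sample["clocks.sm"] < sample["clocks.max.sm"]
    near_power_cap = sample["power.draw"] >= margin * sample["power.limit"]
    return below_max_clock and near_power_cap
```

Calling `read_sample()` in a loop from a second process during the benchmark, once with random inputs and once with zero matrices, should show whether the clock drop coincides with power draw sitting at the limit.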

Hi Robert, thanks for your quick reply. Since I want to calibrate the H100 for further performance evaluations, are there any reference values for achievable hgemm performance? I am afraid that I did not install the H100 correctly, since it achieves less than 60% of the theoretical performance.

I’m not aware of any published data in this area. It is common for GPUs to achieve varying percentages of their peak performance.
