I evaluated hgemm on an H100 PCIe with the CUTLASS profiler, cuBLAS, and Triton, but the performance tops out at about 400 TFLOPS. The peak performance given in the whitepaper, however, is 756 TFLOPS. I am not sure whether the 400 TFLOPS result is by design or whether I evaluated it incorrectly.
I ran the evaluation with driver 535.129.03 and CUDA 12.2.1. nvidia-smi shows that the H100 runs at 1 GHz, 350 W, and ~60 °C, whereas the peak frequency should be 1.75 GHz. However, if I run hgemm with zero matrices as inputs, the H100 reaches 1.75 GHz and 700+ TFLOPS.
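As a rough sanity check on these numbers, here is a minimal Python sketch that scales the whitepaper peak by the observed clock. It assumes that peak tensor throughput is proportional to the SM clock. The function name `clock_scaled_peak` is hypothetical and only for illustration.

```python
def clock_scaled_peak(peak_tflops, peak_clock_ghz, observed_clock_ghz):
    """Estimate the peak at the observed clock.

    Assumption: throughput scales linearly with SM clock.
    """
    return peak_tflops * observed_clock_ghz / peak_clock_ghz


# Values from this post: 756 TFLOPS whitepaper peak, 1.75 GHz peak clock,
# and about 1 GHz reported by nvidia-smi while hgemm is running.
limit = clock_scaled_peak(756.0, 1.75, 1.0)
print(f"clock-limited peak: {limit:.0f} TFLOPS")
print(f"400 TFLOPS is {400.0 / limit:.0%} of that")
```

Under that assumption, about 400 TFLOPS would be close to what a 1 GHz clock allows, which would point at the clock drop under load rather than at the kernels themselves.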
Hi Robert, thanks for your quick reply. Since I want to calibrate the H100 for further performance evaluations, are there any reference values for the achievable hgemm performance? I am afraid that I did not install the H100 correctly, since it achieves less than 60% of the theoretical performance.
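For comparing any reference values against my own runs, this is a minimal sketch of how the achieved figure can be computed from a timed GEMM, using the usual 2·M·N·K operation count. The helper names and the example timing are hypothetical and not measured on my system.

```python
def gemm_tflops(m, n, k, seconds):
    """Achieved TFLOPS for one M x N x K GEMM, counting 2*M*N*K operations."""
    return 2.0 * m * n * k / seconds / 1e12


def fraction_of_peak(achieved_tflops, peak_tflops):
    """Ratio of achieved throughput to a stated peak."""
    return achieved_tflops / peak_tflops


# Illustrative numbers only: an 8192 x 8192 x 8192 hgemm timed at 2.75 ms,
# compared against the 756 TFLOPS whitepaper peak quoted above.
achieved = gemm_tflops(8192, 8192, 8192, 2.75e-3)
share = fraction_of_peak(achieved, 756.0)
print(f"{achieved:.0f} TFLOPS, {share:.0%} of the whitepaper peak")
```

The timing should be an average over many iterations after warm-up, since a single launch includes startup overhead.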