FP64 Performance - Power Limitation - H100 vs A100

Robert_Crovella · December 30, 2025, 9:52pm

To get close to the theoretical peak performance you are referring to (whether on A100 or H100), you would basically need a matrix-multiply code. In my experience, any suggestion that some other algorithm or computational sequence also meets the necessary criteria is immediately doubtful.

The vast majority of codes are memory bound.

The 1.92 ratio you report between the two GPUs would be consistent with the memory bandwidth ratio, and as you have already pointed out, inconsistent with the referenced compute throughput ratio.
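
The reasoning behind that statement can be sketched with a simple roofline model. This is only an illustration: the symbols $P_{\text{peak}}$ (peak FP64 throughput), $B$ (memory bandwidth) and $I$ (arithmetic intensity of the code, in FLOP per byte) are introduced here and are not from the original question, and it assumes the same code with the same $I$ runs on both GPUs.

$$
P_{\text{attainable}} = \min\left(P_{\text{peak}},\; I \cdot B\right)
$$

If the code is memory bound on both GPUs, i.e. $I \cdot B < P_{\text{peak}}$ on each, then

$$
\frac{P_{\text{attainable}}^{\text{H100}}}{P_{\text{attainable}}^{\text{A100}}}
= \frac{I \cdot B^{\text{H100}}}{I \cdot B^{\text{A100}}}
= \frac{B^{\text{H100}}}{B^{\text{A100}}}
$$

so the observed speedup tracks the bandwidth ratio, and the peak compute ratio does not enter at all.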

If I wanted to test the claim I am implying here, since non-tensor-core FP64 (FP64-non-TC) cannot be easily accessed via libraries like cuBLAS (on A100 and H100), I would seek to construct a code that nevertheless had many back-to-back DFMA sequences, and see how that compares between the two GPUs.
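
A minimal sketch of such a test might look like the following. It is not a tuned benchmark: the kernel name `dfma_chain`, the constants `ILP` and `ITERS`, and the launch configuration are all assumptions chosen for illustration. Each thread keeps several independent accumulators in registers and updates them with `fma()` in a long loop, so the kernel does almost no memory traffic and should be limited by the FP64 FMA pipe rather than by bandwidth.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed, illustrative parameters (not from the original post).
constexpr int ILP   = 8;        // independent FMA chains per thread
constexpr int ITERS = 1 << 16;  // FMA steps per chain

// a and b are runtime arguments so the compiler cannot fold the loop away.
__global__ void dfma_chain(double *out, double a, double b)
{
    double acc[ILP];
    #pragma unroll
    for (int i = 0; i < ILP; ++i)
        acc[i] = double(threadIdx.x + i);

    for (int n = 0; n < ITERS; ++n) {
        #pragma unroll
        for (int i = 0; i < ILP; ++i)
            acc[i] = fma(acc[i], a, b);   // one double-precision FMA = 2 FLOP
    }

    // Store a result that depends on every chain so none is eliminated.
    double sum = 0.0;
    #pragma unroll
    for (int i = 0; i < ILP; ++i)
        sum += acc[i];
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int threads = 256;
    const int blocks  = prop.multiProcessorCount * 8;  // assumed: enough to fill every SM

    double *out = nullptr;
    cudaMalloc(&out, sizeof(double) * threads * blocks);

    // Warm-up launch, not timed.
    dfma_chain<<<blocks, threads>>>(out, 0.999999, 1e-6);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dfma_chain<<<blocks, threads>>>(out, 0.999999, 1e-6);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // 2 FLOP per FMA, ILP chains per thread, ITERS steps per chain.
    const double flop = 2.0 * ILP * double(ITERS) * double(threads) * double(blocks);
    printf("%s: %.3f ms, %.2f TFLOP/s FP64 (non-tensor-core FMA)\n",
           prop.name, ms, flop / (ms * 1e-3) / 1e12);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(out);
    return 0;
}
```

Built with something like `nvcc -O3 -arch=sm_80` for A100 or `-arch=sm_90` for H100, the inner loop should compile to a run of DFMA instructions, which can be confirmed with `cuobjdump -sass`. Running the same binary source on both GPUs and comparing the reported TFLOP/s gives a compute-bound ratio to set against the 1.92 figure.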
