FP64 Performance - Power Limitation - H100 vs A100

To get close to the theoretical peak FP64 throughput you are referring to (whether on A100 or H100), you would basically need a matrix-multiply code. Based on my own experience, I am immediately doubtful of any suggestion that some other algorithm or computational sequence also meets the necessary criteria.

The vast majority of codes are memory bound, meaning their performance tracks memory bandwidth rather than arithmetic throughput.

The 1.92 ratio you report between the two GPUs would be consistent with the memory bandwidth ratio, and as you have already pointed out, inconsistent with the referenced compute throughput ratio.

If I wanted to test the claim I am implying here, I would construct a code with many back-to-back DFMA instructions and see how it compares between the two GPUs. I would write that code by hand because the non-tensor-core FP64 path cannot easily be accessed via libraries like cuBLAS (on A100 and H100).
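
A minimal sketch of such a test might look like the following. This is only an illustration under stated assumptions, not a tuned benchmark: the kernel name dfma_kernel, the constants ILP and ITERS, the grid sizing, and the multiplier/addend values are all hypothetical choices of mine. Each thread runs several independent chains of dependent double-precision FMAs entirely in registers, so global memory is touched only once at the end to keep the compiler from discarding the work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical tuning constants (assumptions, not measured optima).
constexpr int ILP   = 8;        // independent FMA chains per thread
constexpr int ITERS = 1 << 16;  // FMA steps per chain

// Each thread performs ILP * ITERS double-precision FMAs with no memory
// traffic inside the timed loop. a and b are runtime arguments so the
// compiler cannot precompute the result.
__global__ void dfma_kernel(double *out, double a, double b)
{
    double acc[ILP];
#pragma unroll
    for (int i = 0; i < ILP; ++i)
        acc[i] = static_cast<double>(threadIdx.x + i);

    for (int it = 0; it < ITERS; ++it) {
#pragma unroll
        for (int i = 0; i < ILP; ++i)
            acc[i] = fma(acc[i], a, b);  // expected to compile to DFMA
    }

    // Combine and store so the FMA chains are not eliminated as dead code.
    double sum = 0.0;
#pragma unroll
    for (int i = 0; i < ILP; ++i)
        sum += acc[i];
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }

    const int threads = 256;
    const int blocks  = prop.multiProcessorCount * 8;  // assumed: enough to fill the GPU

    double *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(double) * threads * blocks);

    // Warm-up launch so the timed run excludes one-time startup costs.
    dfma_kernel<<<blocks, threads>>>(d_out, 0.999999, 1e-6);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dfma_kernel<<<blocks, threads>>>(d_out, 0.999999, 1e-6);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // One FMA counts as two floating-point operations.
    const double flops = 2.0 * ILP * ITERS * static_cast<double>(threads) * blocks;
    printf("%s: %.2f TFLOP/s FP64 (non-tensor-core FMA)\n",
           prop.name, flops / (ms * 1e-3) / 1e12);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

Something like nvcc -O3 -arch=sm_80 (A100) or -arch=sm_90 (H100) should build it. I would check the SASS with cuobjdump -sass to confirm that the inner loop really is a run of DFMA instructions, and watch clocks and power with nvidia-smi during the run, since a power cap can pull the sustained rate below what the clocks alone would suggest. If the ratio between the two GPUs from a test like this is well above 1.92, that would support the idea that the original code is limited by memory bandwidth rather than by FP64 throughput.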