Parallel usage of FP64 and Tensor cores in H100


Hi,
I have looked at the H100 whitepaper (NVIDIA H100 Tensor Core GPU Architecture Overview) and have the following query regarding its FP64 computation capability.

The peak computational throughputs of the FP64 execution units and the FP64 Tensor Cores are 33.5 and 66.9 TFLOPS, respectively.

Since these execution units are independent, can I not use them both together (in parallel) and get about 100 TFLOPS for FP64 computations?

If this is not possible, could someone please comment on why? Thanks.

AFAIK the simultaneous usage is possible.
Depending on the algorithm it may take additional work to combine them effectively.
FP64 is easier in this regard, as the required data bandwidth is lower than for FP32 and FP16.
With FP32 and FP16, getting to peak computational speed is often limited by moving in enough new data. Here the MMA architecture helps, as it combines data from different threads and reuses it in a matrix fashion; for some problems this is more difficult with the conventional FP32 execution units.
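As an illustration of that reuse, here is a minimal sketch using CUDA's WMMA API, which exposes the FP64 Tensor Cores as 8x8x4 tiles on sm_80 and newer; the pointers and sizes are placeholders, not code from the whitepaper:

```cpp
#include <mma.h>
using namespace nvcuda;

// One warp computes an 8x8 FP64 tile: C = A (8x4) * B (4x8) + C.
// The operand fragments are distributed across the 32 threads of the warp,
// so each loaded element is reused by several threads instead of being
// re-fetched from memory for every multiply-add.
// Requires sm_80 or newer (e.g. compile with -arch=sm_90 for H100).
__global__ void dmma_tile(const double* A, const double* B, double* C)
{
    wmma::fragment<wmma::matrix_a, 8, 8, 4, double, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 8, 8, 4, double, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 8, 8, 4, double> c_frag;

    wmma::fill_fragment(c_frag, 0.0);
    wmma::load_matrix_sync(a_frag, A, 4);            // A: 8x4, row-major, ld = 4
    wmma::load_matrix_sync(b_frag, B, 4);            // B: 4x8, col-major, ld = 4
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // issued to the FP64 Tensor Cores
    wmma::store_matrix_sync(C, c_frag, 8, wmma::mem_row_major);
}
```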

I have tested cublasDgemm on the H100 PCIe variant, assuming that cuBLAS would use both (Tensor Cores and FP64 cores) to maximize performance. However, the performance peaks at 39.13 TFLOPS, reached for square matrices of size 2048.
But the theoretical peaks for this variant are 51.2 and 25.6 TFLOPS for the Tensor Cores and the FP64 cores, respectively.

Hence my query on whether it is possible to use both the Tensor Cores and the FP64 cores simultaneously.
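For reference, the measurement is roughly of this form (a simplified sketch; the size, iteration count and missing error checks are only illustrative, not my exact benchmark):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

// Times repeated cublasDgemm calls and reports TFLOPS.
int main()
{
    const int n = 2048;
    const int iters = 50;
    const double alpha = 1.0, beta = 0.0;
    const size_t bytes = (size_t)n * n * sizeof(double);

    double *A, *B, *C;
    cudaMalloc((void**)&A, bytes);
    cudaMalloc((void**)&B, bytes);
    cudaMalloc((void**)&C, bytes);
    cudaMemset(A, 0, bytes);
    cudaMemset(B, 0, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Warm-up call so clocks and cuBLAS heuristics settle.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double tflops = 2.0 * n * n * (double)n * iters / (ms * 1e-3) / 1e12;
    printf("DGEMM n=%d: %.2f TFLOPS\n", n, tflops);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```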

It is possible to use them at the same time. That does not mean that every algorithm can feasibly profit from it. At least it does not seem to be implemented in cuBLAS. The TFLOPS numbers are the theoretical peak numbers for the computation alone.
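To make "at the same time" concrete, here is a minimal sketch (my own illustration, not anything cuBLAS does internally) that overlaps a DGEMM on one stream, which judging by your numbers already runs on the FP64 Tensor Cores, with a plain FP64 kernel on another stream. Whether blocks of both actually co-reside on the SMs, and whether anything is gained, depends on occupancy and the algorithm; all names and sizes are placeholders:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Plain FP64 work for the conventional execution units: scalar fused multiply-adds.
__global__ void fp64_fma(const double* x, const double* y, double* z, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = fma(x[i], y[i], z[i]);
}

void overlap(cublasHandle_t handle, int n,
             const double* A, const double* B, double* C,
             const double* x, const double* y, double* z, int m)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    const double alpha = 1.0, beta = 0.0;
    cublasSetStream(handle, s1);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);          // Tensor Core work

    fp64_fma<<<(m + 255) / 256, 256, 0, s2>>>(x, y, z, m); // FP64-unit work

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```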

For matrix multiplications, providing the data fast enough is at least as important.
If you assume each FLOP moves 2 × 8 bytes in and 8 bytes out, that is 24 bytes per FLOP; at the rates above you get around 1 Petabyte/s. Tensor Core instructions help to keep this bandwidth in check, because the data is routed internally within the warp.
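Spelled out with the ~39 TFLOPS you measured (a pure back-of-the-envelope check, not a real traffic measurement):

```cpp
// Naive operand traffic if every FLOP went through memory individually.
double flop_rate      = 39.13e12;             // measured DGEMM rate, FLOP/s
double bytes_per_flop = 2 * 8 + 8;            // two 8-byte loads + one 8-byte store
double traffic        = flop_rate * bytes_per_flop;  // ~9.4e14 B/s, i.e. roughly 1 PB/s
```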