I understand that Tensor Cores are mainly used for low-precision and mixed-precision computation. I have also noticed that they can operate on FP64 data without any loss of precision. How are they able to do this, yet not able to do FP32 arithmetic?
I’m not an internal expert at NVIDIA, but I recall that in the early days Tensor Cores only supported half precision (on V100). Support for more formats and sizes was added gradually. It’s likely a trade-off, isn’t it? Since large models rarely use FP64 for computation, there’s probably no Tensor Core support for it on most GPUs.
FP64 Tensor Core support is there for the datacenter cards; on all other GPUs FP64 only runs at low speed and is mainly kept for compatibility.
FP64 is mostly used for scientific computing and has its own dedicated hardware; FP32 can be done with the normal (non-Tensor Core) operations.
To do FP32 with Tensor Cores:
If you have a datacenter card, just use FP64; otherwise there are ways to increase the effective precision by manually combining several lower-precision INT or FP Tensor Core operations (see the sketch below).
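To make the splitting idea concrete, here is a minimal NumPy sketch of the numerics only: it imitates "FP16-precision inputs with an FP32 accumulator" on the CPU and does not actually run on Tensor Cores. Real implementations use schemes like the 3xTF32 emulation in CUTLASS; the matrix sizes and helper name below are just made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
A = rng.standard_normal((n, n)).astype(np.float32)
B = rng.standard_normal((n, n)).astype(np.float32)

def fp16_inputs(M):
    # Round to FP16 precision but keep FP32 storage, so the matmul below
    # multiplies FP16-precision inputs with FP32 accumulation -- roughly
    # what a Tensor Core does with FP16 in / FP32 out.
    return M.astype(np.float16).astype(np.float32)

# Split each FP32 matrix into a "high" FP16 part plus a small remainder.
A_hi = fp16_inputs(A); A_lo = fp16_inputs(A - A_hi)
B_hi = fp16_inputs(B); B_lo = fp16_inputs(B - B_hi)

ref = A.astype(np.float64) @ B.astype(np.float64)         # FP64 reference
one_product    = fp16_inputs(A) @ fp16_inputs(B)          # single low-precision matmul
three_products = A_hi @ B_hi + A_hi @ B_lo + A_lo @ B_hi  # emulated FP32 matmul

def rel_err(X):
    return np.linalg.norm(X - ref) / np.linalg.norm(ref)

print(f"plain FP32 matmul        : {rel_err(A @ B):.1e}")
print(f"one FP16-input matmul    : {rel_err(one_product):.1e}")
print(f"three FP16-input matmuls : {rel_err(three_products):.1e}")
```

The point of the three-product sum is that the dropped A_lo @ B_lo term is tiny, so the combined result lands close to plain FP32 accuracy even though every input was rounded to FP16 precision.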
Those numbers probably use sparse matrices for FP16. Up to H100 the FP64:FP16 factor is 1:16 (for the enterprise cards), which is reasonable, since the computational cost grows roughly with the square of the bit width. With Blackwell, however, it is 1:64: Nvidia has reduced the relative FP64 Tensor Core performance.
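As a rough sanity check of that square law (my own back-of-the-envelope arithmetic, not an official figure):

$$\left(\frac{64}{16}\right)^2 = 16, \qquad \frac{1/64}{1/16} = \frac{1}{4}$$

so a 1:16 FP64:FP16 ratio matches the quadratic cost of operands four times as wide, while 1:64 leaves FP64 a further 4x below that scaling.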