As I understand it, TF32 is just a different representation of the value: it has 1 sign bit, 8 exponent bits, and 10 mantissa bits. When I actually create a TF32 variable, it still occupies 32 bits of memory, which I verified with the following code.
using ElementInputA = cutlass::tfloat32_t;
printf("%zu\n", sizeof(ElementInputA));  // sizeof yields size_t, so use %zu rather than %d
// output: 4
However, I have read blog posts saying that training is often done in FP16, BF16, and TF32 to reduce the memory footprint. From what I measured above, I don't see how using TF32 would reduce memory usage.
I think your question might get more attention and replies in the AI category, maybe cuDNN or TensorRT, but I can try to shed some light.
What you state with respect to memory footprint is correct. To accommodate the TensorFloat data type we need to use 4 bytes: its 19 bits (1 + 8 + 10) would fit in 3 bytes, but 3-byte data types are not common in popular programming languages.
Internally, however, there are optimizations that take advantage of the data type in combination with the GPU's Tensor Cores, allowing up to a 10x speedup compared to common FP32 operations, on A100 for example (see "What is the TensorFloat-32 Precision Format?" on the NVIDIA Blog for more detail).
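To make the storage point concrete, here is a minimal host-side sketch in plain C++ that emulates the TF32 bit layout inside an ordinary 32-bit float. The helper names (`truncate_to_tf32`, `tf32_significant_bits`, `tf32_self_check`) are hypothetical and not part of CUTLASS or CUDA, and simple truncation of the low mantissa bits is an assumption made for illustration; actual hardware and library conversions may round instead.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not a CUTLASS/CUDA API): keep the sign bit, the 8
// exponent bits and the top 10 mantissa bits of an FP32 value, and zero the
// remaining 13 low mantissa bits. Truncation is an assumption; real
// conversions may round to nearest instead.
inline float truncate_to_tf32(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));   // read the raw FP32 bit pattern
    bits &= 0xFFFFE000u;                    // clear the 13 least significant bits
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}

// Hypothetical helper: number of bits that carry information in TF32.
constexpr int tf32_significant_bits() { return 1 + 8 + 10; }

// Small usage example: the value loses precision, but not storage size.
inline void tf32_self_check() {
    // 2^-10 fits in a 10-bit mantissa, so it survives the truncation.
    assert(truncate_to_tf32(1.0f + 1.0f / 1024.0f) == 1.0f + 1.0f / 1024.0f);
    // 2^-11 needs an 11th mantissa bit, so it is dropped.
    assert(truncate_to_tf32(1.0f + 1.0f / 2048.0f) == 1.0f);
    // The result is still held in a 4-byte container.
    static_assert(sizeof(truncate_to_tf32(1.0f)) == 4, "still 4 bytes");
}
```

The sketch shows why TF32 by itself does not shrink memory: only 19 of the 32 bits carry information, but the value still sits in a 4-byte slot. Any benefit comes from how the Tensor Cores process those bits, not from smaller storage.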