If I use tensor core instructions with float (fp32) accumulators (e.g. m16n8k8), will the numerics be the same as running a float matmul on the CPU with the fp16 inputs cast to fp32?
In other words,
I have inputs A, B, C, where A and B are fp16 and C is fp32.
On the CPU, I first cast A and B to fp32, then compute D = A * B + C in fp32, and finally cast D to fp16.
On the GPU, I use tensor cores (m16n8k8 instructions) to perform D = A * B + C directly with fp32 accumulators. At the end, I cast D to fp16.
Will these two paths be numerically equivalent (ignoring the impact of operation ordering)?
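For reference, here is a minimal NumPy sketch of the CPU path described above. The shapes are arbitrary stand-ins for a single m16n8k8 tile (these names and sizes are illustrative, not part of the question):

```python
import numpy as np

# Illustrative shapes matching one m16n8k8 tile (assumption for this sketch).
rng = np.random.default_rng(0)
A = rng.standard_normal((16, 8)).astype(np.float16)  # fp16 input
B = rng.standard_normal((8, 8)).astype(np.float16)   # fp16 input
C = rng.standard_normal((16, 8)).astype(np.float32)  # fp32 addend/accumulator

# CPU reference path: promote A and B to fp32, do the whole
# multiply-accumulate in fp32, then round to fp16 once at the end.
D32 = A.astype(np.float32) @ B.astype(np.float32) + C
D16 = D32.astype(np.float16)

print(D16.dtype)
```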
I think the results should be comparable, but I don't know if they would be identical in all cases. Because of operation ordering and floating-point mechanics in general (e.g. rounding), I personally don't look for identical results between two floating-point computation paths, especially when the things being compared run on two different platforms.
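A small example of why ordering matters, even with everything held in fp32: each intermediate addition rounds to fp32, so regrouping the same operands can change the result. The values here are chosen purely to make the rounding visible:

```python
import numpy as np

# Summing the same fp32 values in different orders gives different
# answers, because each intermediate sum is rounded to fp32.
x = np.float32(1e8)
y = np.float32(-1e8)
z = np.float32(1.0)

left = (x + y) + z   # cancellation happens first, then 1.0 survives
right = x + (y + z)  # 1.0 is absorbed into -1e8 first, then cancels

print(left, right)   # the two groupings disagree
```

Tensor cores and a CPU matmul generally reduce the k-dimension in different orders, so this kind of discrepancy is expected even when both use fp32 accumulation.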