If I use tensor core instructions with float (fp32) accumulators (e.g. m16n8k8), will the numerics match a float matmul on the CPU where the fp16 inputs are first cast to fp32?
In other words,
I have an input A, B, C where A, B are fp16, C is fp32.
On the CPU, I first cast A and B to fp32, then compute D = A * B + C in fp32, and finally cast D to fp16.
On the GPU, I use tensor cores (m16n8k8 instructions) to perform D = A * B + C directly with fp32 accumulators. At the end, I cast D to fp16.
Will these two be numerically equivalent (ignoring the impact of the order of the numerical operations)?
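The CPU reference path described above can be sketched in NumPy (shapes chosen arbitrarily for illustration; a single m16n8k8 tile is 16x8 times 8x8):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 8)).astype(np.float16)  # fp16 input
B = rng.standard_normal((8, 8)).astype(np.float16)   # fp16 input
C = rng.standard_normal((16, 8)).astype(np.float32)  # fp32 accumulator

# CPU reference semantics: widen inputs to fp32, multiply-accumulate
# in fp32, then round the final result once to fp16.
D = (A.astype(np.float32) @ B.astype(np.float32) + C).astype(np.float16)
```

The question is whether the tensor core path implements the same semantics (fp32 products, fp32 accumulation, one final rounding), not whether it reproduces this bit-for-bit under reordering.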
this may be of interest.
I think the results should be comparable. I don’t know if they would be identical in all cases. Usually, for reasons of order of operations and floating point mechanics in general (e.g. rounding), I personally don’t look for identical results between two floating point computation paths, especially when the things being compared are on two different platforms.
Thank you Robert. I am wondering if the underlying semantics are the same.
A = B = 6000 (fp16), C = -6000 * 6000 (fp32).
On the CPU, the result would be D = 0.
On the GPU, what will the result be? Will it be D = C (fp32) + fp16(A * B) << 0?
I think I can test this, but testing with mma instructions requires a little work, hence this question :)
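The CPU side of this worked example can be checked directly in NumPy; it also shows why the hypothetical fp16 intermediate product would behave very differently (it overflows, since fp16 maxes out at 65504):

```python
import numpy as np

a = np.float16(6000.0)
b = np.float16(6000.0)
c = np.float32(-6000.0 * 6000.0)  # -36e6, exactly representable in fp32

# fp32 accumulation: the product 36e6 is formed exactly in fp32,
# so it cancels c exactly.
d_fp32_acc = np.float32(a) * np.float32(b) + c
print(d_fp32_acc)  # 0.0

# If the product were rounded to fp16 first, it would overflow to inf,
# and the subsequent addition could not recover a finite result.
prod_fp16 = np.float16(a * b)
print(prod_fp16)               # inf
print(np.float32(prod_fp16) + c)  # inf
```

So the two hypothesized GPU behaviors are easy to tell apart with this input: fp32 accumulation gives exactly 0, while any fp16 rounding of the product gives inf.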
The tensor core multiplies A and B to create an FP32 result. Take a look at the diagram I linked; it is not what you have shown.
You can test tensor core ops using cuBLAS. You don’t need to write your own MMA code.
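For example, a hedged sketch of such a test (untested here; it needs a CUDA-capable GPU, and assumes `cublasGemmEx` with fp16 inputs plus `CUBLAS_COMPUTE_32F` selects the fp32-accumulator tensor core path):

```cpp
// Sketch only: handle creation, memory allocation, and error checking omitted.
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Computes C = A * B + C with fp16 inputs and fp32 accumulation/output.
void gemm_fp16_in_fp32_acc(cublasHandle_t handle,
                           const __half* A, const __half* B, float* C,
                           int m, int n, int k) {
    const float alpha = 1.0f, beta = 1.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 A, CUDA_R_16F, m,   // fp16 input A
                 B, CUDA_R_16F, k,   // fp16 input B
                 &beta,
                 C, CUDA_R_32F, m,   // fp32 accumulator / output
                 CUBLAS_COMPUTE_32F,
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```

The final cast of the fp32 result D down to fp16 would be a separate elementwise conversion; feeding this routine the A = B = 6000, C = -36e6 example above would distinguish the two hypothesized behaviors.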