I have two programs (LU decomposition). The first one uses mixed precision (FP16 and FP32) and the other one is written entirely in FP64.
In the mixed-precision version I do the panel factorization in FP32 and the trailing update in FP16 via GemmEx (I16_O16_C32: FP16 inputs/outputs with FP32 compute).
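For context, the trailing update is essentially a call like this (a simplified sketch, not my actual code; the pointer names and dimensions are placeholders):

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Trailing-submatrix update A22 <- A22 - A21 * A12 with FP16 inputs/outputs
// and FP32 accumulation (the I16_O16_C32 configuration).
void trailing_update_fp16(cublasHandle_t handle,
                          int m, int n, int k,
                          const __half *d_A21, int lda,
                          const __half *d_A12, int ldb,
                          __half *d_A22, int ldc)
{
    // alpha and beta must be given in the compute type, i.e. FP32 here.
    const float alpha = -1.0f;
    const float beta  =  1.0f;

    // With CUDA 11+, CUBLAS_GEMM_DEFAULT lets cuBLAS pick a Tensor Core
    // kernel automatically when sizes and alignment allow it.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 d_A21, CUDA_R_16F, lda,
                 d_A12, CUDA_R_16F, ldb,
                 &beta,
                 d_A22, CUDA_R_16F, ldc,
                 CUBLAS_COMPUTE_32F,   // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT);
}
```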
I am testing them on two different architectures (A100 and V100). The execution time of the mixed-precision version is the same on both (sometimes even 10% slower on the A100!), while the FP64 version runs twice as fast on the A100.
Are these results acceptable? (I know that the A100 has Tensor Cores for FP64 and the V100 does not.)
If we should expect better performance from the mixed version (FP16 + FP32), what should I look into to make the A100 version faster?
As far as I can see, the peak performance figures for the A100 and V100 are:

A100 (FP16 Tensor Core): 312 TFLOPS    V100 (FP16 Tensor Core): 125 TFLOPS
A100 (FP32): 19.5 TFLOPS               V100 (FP32): 15.7 TFLOPS

So from these numbers alone I would naively expect the FP16 trailing update to run roughly 2.5x faster on the A100.
What is the main change in the Ampere architecture that causes the same code not to deliver the same relative performance? I am not sure, but maybe the problem comes from the trailing update, i.e. the GemmEx (I16_O16_C32) call. Maybe the performance of that GEMM is the same on both architectures. Is that true?
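To check that last point, I could time the GEMM in isolation on each card with something like this (a rough standalone sketch; the size n and repetition count are arbitrary choices):

```cpp
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Times an n x n FP16-in / FP16-out / FP32-compute GEMM so the
// trailing-update kernel can be compared between A100 and V100 in isolation.
int main()
{
    const int n = 8192;   // large enough to be compute-bound
    const int reps = 10;

    __half *A, *B, *C;
    cudaMalloc(&A, sizeof(__half) * n * n);
    cudaMalloc(&B, sizeof(__half) * n * n);
    cudaMalloc(&C, sizeof(__half) * n * n);
    cudaMemset(A, 0, sizeof(__half) * n * n);   // contents don't matter for timing
    cudaMemset(B, 0, sizeof(__half) * n * n);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up call so one-time setup is not measured.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha, A, CUDA_R_16F, n, B, CUDA_R_16F, n,
                 &beta,  C, CUDA_R_16F, n,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                     &alpha, A, CUDA_R_16F, n, B, CUDA_R_16F, n,
                     &beta,  C, CUDA_R_16F, n,
                     CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double tflops = 2.0 * n * n * (double)n * reps / (ms * 1e-3) / 1e12;
    printf("n=%d: %.3f ms per GEMM, %.1f TFLOPS\n", n, ms / reps, tflops);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

(Built with `nvcc bench.cu -lcublas`.) If this kernel alone reaches very different TFLOPS on the two cards while the full factorization does not, the bottleneck is presumably somewhere other than the GEMM itself.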