Performance of A100 vs. V100s for mixed precision

I have two programs (LU decomposition). The first one is mixed precision (FP16 and FP32) and the other one is written purely in FP64.

For mixed precision I do the panel factorisation in FP32 and the trailing update in FP16 (GEMEX_I16_O16_C32).
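For illustration, a trailing update with FP16 inputs/outputs and FP32 accumulation typically maps to a cublasGemmEx call along these lines (a minimal sketch, assuming device pointers in __half and an existing cuBLAS handle; the helper name trailing_update_fp16 and the leading dimensions are placeholders, not the poster's actual code):

```
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <stdio.h>

/* Trailing update C <- C - A*B with FP16 storage and FP32 accumulation,
   i.e. the "I16_O16_C32" configuration described above. */
static void trailing_update_fp16(cublasHandle_t handle,
                                 int m, int n, int k,
                                 const __half *dA, int lda,
                                 const __half *dB, int ldb,
                                 __half *dC, int ldc)
{
    const float alpha = -1.0f;   /* subtract the Schur complement update */
    const float beta  =  1.0f;

    cublasStatus_t st = cublasGemmEx(
        handle,
        CUBLAS_OP_N, CUBLAS_OP_N,
        m, n, k,
        &alpha,
        dA, CUDA_R_16F, lda,     /* FP16 inputs */
        dB, CUDA_R_16F, ldb,
        &beta,
        dC, CUDA_R_16F, ldc,     /* FP16 output */
        CUBLAS_COMPUTE_32F,      /* FP32 accumulation; cuBLAS may use Tensor Cores */
        CUBLAS_GEMM_DEFAULT);
    if (st != CUBLAS_STATUS_SUCCESS)
        fprintf(stderr, "cublasGemmEx failed: %d\n", (int)st);
}
```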

I am testing them on different architectures (A100 and V100s). The execution time for mixed precision is the same on both architectures (sometimes even 10% slower on the A100!), but the FP64 version runs about 2 times faster on the A100.

Are those results acceptable? (I know that the A100 has Tensor Cores for FP64 and the V100s does not.)

If we should expect better performance for the mixed version (FP16+FP32), what should I consider to make the A100 version faster?

As far as I can see, the peak performance for the A100 and V100 is:

A100 (FP16) = 312 TFLOPS, V100 (FP16) = 125 TFLOPS
A100 (FP32) = 19.5 TFLOPS, V100 (FP32) = 15.7 TFLOPS
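Just to make the expectation concrete, here is the back-of-the-envelope ratio implied by those quoted peaks (an illustration only, not a measurement):

```
#include <stdio.h>

int main(void)
{
    /* Peak throughputs quoted above, in TFLOPS. */
    const double a100_fp16 = 312.0, v100_fp16 = 125.0;
    const double a100_fp32 = 19.5,  v100_fp32 = 15.7;

    /* If the trailing update ran near peak on both GPUs, the A100 should be
       roughly 2.5x faster for FP16, while the FP32 panel gains only ~1.24x. */
    printf("FP16 ratio A100/V100: %.2fx\n", a100_fp16 / v100_fp16);
    printf("FP32 ratio A100/V100: %.2fx\n", a100_fp32 / v100_fp32);
    return 0;
}
```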

What is the main change in the Ampere architecture that causes the same code not to have the same performance? I am not sure, but maybe the problem is coming from the trailing update, which is the GEMM_I16_O16_C32. Maybe the performance of GEMM is the same for both architectures. Is that true?

There isn’t any performance metric on which the A100 and V100 are the same. The A100’s memory bandwidth is higher, its compute throughput is higher, and its Tensor Core throughput is higher. None of those three ratios is less than about 1.3x in the A100’s favour.

“What is the main change in the Ampere architecture that causes the same code not to have the same performance?”

I would recommend profiling your code.
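As a quick first step before a full profile, it can help to time the panel factorisation and the trailing update separately with CUDA events and compare the two totals on each GPU. A minimal sketch (the helper time_region_ms and its callback interface are illustrative, not an existing API):

```
#include <cuda_runtime.h>

/* Wrap the panel factorisation and the trailing update in separate calls
   like this, then compare the totals on A100 vs. V100 before diving into
   Nsight Systems / Nsight Compute. */
static float time_region_ms(void (*region)(void *), void *args)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    region(args);                 /* launches the kernels to be measured */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   /* wait until the region has finished */

    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```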