Mixed-Precision Algorithm on Ampere Is Slower Than on Volta

From what I have read about the Ampere architecture, it should be better and faster, but when I run my mixed-precision program on an A100 I see a 30% to 40% slowdown. How can I understand the problem and tune my code?
It seems to me that Tensor Core GEMM is not supported for single precision and that TF32 is used in its place. Should I change the data type? My code contains GEMM, TRSM, and GETRF calls.

To understand your code's behavior better, please profile it with Nsight Systems and Nsight Compute. If you think there is an issue with the math libraries, please provide a simple reproducer.
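
On the data type question, it may help to see how much precision TF32 keeps before deciding. TF32 has the same 8-bit exponent as FP32 but only 10 explicit mantissa bits instead of 23. The sketch below is a CPU-only illustration of that format, not library code: truncate_to_tf32 and tf32_relative_error are hypothetical helper names, and truncation is an assumption made for simplicity (the hardware conversion may round instead).

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not a CUDA or cuBLAS API): keep the sign bit, the
// 8 exponent bits, and the top 10 mantissa bits of an FP32 value, and zero
// the low 13 mantissa bits. Assumption: truncation, not rounding.
float truncate_to_tf32(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFFE000u;  // clear the 13 least significant mantissa bits
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}

// Relative error introduced by the truncation above, for nonzero finite x.
double tf32_relative_error(float x) {
    const double exact = static_cast<double>(x);
    const double reduced = static_cast<double>(truncate_to_tf32(x));
    return std::fabs(exact - reduced) / std::fabs(exact);
}
```

Running tf32_relative_error over values typical of your matrices gives a rough idea of the per-element input error, which stays below 2^-10 under this truncation model. Whether that is acceptable depends on how your GEMM, TRSM, and GETRF steps use the result, so treat it as a first estimate only.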