Mixed-Precision Algorithm on Ampere Is Slower Than on Volta

uniadam · October 6, 2021, 10:00am

From what I have read about the Ampere architecture, it should be better and faster, but when I run my mixed-precision program on an A100 I see a 30% to 40% slowdown. How can I understand the problem and tune my code?
It seems to me that there is no Tensor Core GEMM support for single precision and that TF32 is used in its place. Should I change the data type? My code contains GEMM, TRSM, and GETRF.

mnicely · October 6, 2021, 8:36pm

To understand your code better, please use Nsight Systems and Nsight Compute. If you think there is an issue with the Math Libraries, please provide a simple reproducer.
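
On the data-type question raised above, it may help to see what TF32 gives up relative to single precision. TF32 keeps the 8-bit exponent of FP32 but only 10 explicit mantissa bits. The sketch below emulates that on the CPU in plain Python. The helper name `round_to_tf32` is hypothetical (it is not a CUDA or cuBLAS function), and the rounding rule used here (round half up in magnitude) is an assumption, so the hardware may differ in the last bit. It speaks only to accuracy, not to the slowdown itself.

```python
import struct

def round_to_tf32(x: float) -> float:
    """Emulate TF32 precision: FP32 exponent range, 10 explicit mantissa bits.

    Hypothetical helper for illustration only; the rounding mode is an assumption.
    """
    # Reinterpret the value as an IEEE 754 single-precision bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # FP32 has 23 mantissa bits and TF32 keeps 10, so 13 low bits are dropped.
    # Adding half of the dropped range first rounds to the nearest kept value.
    bits = (bits + 0x1000) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

if __name__ == "__main__":
    for value in (1.0, 0.1, 3.14159265, 1.0 + 2.0 ** -12):
        approx = round_to_tf32(value)
        rel_err = abs(approx - value) / abs(value)
        print(f"{value!r:>22} -> {approx!r:<22} relative error {rel_err:.3e}")
```

The relative error stays around 2^-11 or below, roughly three decimal digits, compared with about seven for FP32. Whether that is acceptable for GEMM, TRSM, and GETRF depends on how the algorithm recovers accuracy (for example, through iterative refinement), which is something a reproducer and profile would help settle.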