We have an A100, a V100, an A2, and a GTX 1050 Ti.
We have tested the V100, A2, and GTX 1050 Ti in different (Windows) systems regarding their FP64 performance.
The V100 and the GTX 1050 Ti behave as expected, but the A2's FP64 performance is lower than that of the GTX 1050 Ti.
According to Wikipedia's "List of Nvidia graphics processing units", the A2 should offer 140 GFLOPS (FP64) and the GTX 1050 Ti ~66 GFLOPS (FP64).
So the A2 should be at least 2 times faster.
But with CUDA-Z and MATLAB, the A2 only reaches ~75% of the GTX 1050 Ti's FP64 performance.
What’s going wrong here?
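To put numbers on the gap described above, here is a small Python sketch of the arithmetic. It uses only the figures quoted in the post (140 GFLOPS, ~66 GFLOPS, ~75%). The estimate of the A2's achieved throughput assumes the GTX 1050 Ti actually reaches its listed figure, which the post does not state explicitly.

```python
# FP64 figures as quoted in the post (GFLOPS, from the Wikipedia list).
A2_SPEC_GFLOPS = 140.0
GTX_1050_TI_SPEC_GFLOPS = 66.0

# Measured: the A2 reaches roughly 75% of the GTX 1050 Ti's FP64 result.
OBSERVED_A2_VS_1050TI = 0.75

# Speedup the spec figures would predict (about 2.12x).
expected_ratio = A2_SPEC_GFLOPS / GTX_1050_TI_SPEC_GFLOPS

# Assumption: the GTX 1050 Ti hits its listed 66 GFLOPS in the benchmark.
a2_estimated_gflops = OBSERVED_A2_VS_1050TI * GTX_1050_TI_SPEC_GFLOPS

# Fraction of the A2's listed FP64 figure that this estimate represents.
a2_fraction_of_spec = a2_estimated_gflops / A2_SPEC_GFLOPS

# How far the measurement falls short of the spec-based expectation.
shortfall_factor = expected_ratio / OBSERVED_A2_VS_1050TI

print(f"expected A2 / 1050 Ti ratio: {expected_ratio:.2f}x")
print(f"estimated A2 throughput:     {a2_estimated_gflops:.1f} GFLOPS")
print(f"fraction of A2 spec figure:  {a2_fraction_of_spec:.0%}")
print(f"shortfall vs. expectation:   {shortfall_factor:.2f}x")
```

Under that assumption, the A2 would be delivering only about a third of its listed FP64 figure, a shortfall of nearly 3x relative to what the listed numbers suggest.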
" Double-Precision Tensor Cores are among a battery of new capabilities in the NVIDIA Ampere architecture, driving HPC performance as well as AI training and inference to new heights."
Is the A2 spec sheet maybe listing Tensor Core performance, while the GTX 1050 Ti figure is for the CUDA cores? If so, you'd need a specialized benchmark that makes use of the Tensor Cores.
Thanks for the reply.
But the same benchmark on our A100 (also Ampere architecture) shows better results than expected, not worse.
The A2 is limited by power, so it's expected that you won't be able to achieve peak performance.
Here are related threads: 1 2 3
While the A2 is running whatever test you are running, you may wish to use nvidia-smi -a and check the “clocks throttle reasons” section to see whether power capping is occurring. If so, that is one possible explanation for not achieving what you expect.
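If you would rather log this while the benchmark runs than read the full nvidia-smi -a dump by hand, a minimal Python sketch along these lines may help. It assumes nvidia-smi is on the PATH and that your driver accepts the query field names listed in FIELDS. Newer drivers may report the throttle fields under a "clocks_event_reasons" prefix instead, so check nvidia-smi --help-query-gpu first. The helper names (parse_line, is_power_capped, query_gpus) are made up for this example.

```python
import subprocess

# Fields passed to --query-gpu. The names are an assumption about your driver
# version; run `nvidia-smi --help-query-gpu` to see what it supports.
FIELDS = [
    "name",
    "clocks.sm",
    "power.draw",
    "power.limit",
    "clocks_throttle_reasons.sw_power_cap",
]


def parse_line(line):
    """Turn one CSV line of nvidia-smi output into a dict keyed by field name.

    Assumes no field value (e.g. the GPU name) contains a comma.
    """
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def is_power_capped(sample):
    """True if the software power cap is reported as an active throttle reason."""
    return sample["clocks_throttle_reasons.sw_power_cap"] == "Active"


def query_gpus():
    """Run nvidia-smi once and return one parsed dict per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return [parse_line(line) for line in out.splitlines() if line.strip()]
```

Calling query_gpus() every second or so in a second terminal while CUDA-Z or MATLAB is running, and printing is_power_capped() together with clocks.sm and power.draw for each GPU, should show whether the A2 sits at its power limit with reduced SM clocks during the FP64 test.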