I have two programs (LU decomposition). The first one uses mixed precision (FP16 and FP32) and the other one is written entirely in FP64.
In the mixed-precision version I do the panel factorization in FP32 and the trailing update in FP16 via GemmEx (I16_O16_C32: FP16 inputs/outputs with FP32 compute).
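For context, the trailing update is essentially a call like this (a simplified sketch, not my actual code; the pointer names and dimensions are placeholders):

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Trailing-submatrix update A22 <- A22 - A21 * A12 with FP16 inputs/outputs
// and FP32 accumulation (the I16_O16_C32 configuration).
void trailing_update_fp16(cublasHandle_t handle,
                          int m, int n, int k,
                          const __half *d_A21, int lda,
                          const __half *d_A12, int ldb,
                          __half *d_A22, int ldc)
{
    // alpha and beta must be given in the compute type, i.e. FP32 here.
    const float alpha = -1.0f;
    const float beta  =  1.0f;

    // With CUDA 11+, CUBLAS_GEMM_DEFAULT lets cuBLAS pick a Tensor Core
    // kernel automatically when sizes and alignment allow it.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 d_A21, CUDA_R_16F, lda,
                 d_A12, CUDA_R_16F, ldb,
                 &beta,
                 d_A22, CUDA_R_16F, ldc,
                 CUBLAS_COMPUTE_32F,   // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT);
}
```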
I am testing them on two different architectures (A100 and V100). The execution time of the mixed-precision version is the same on both (sometimes even 10% slower on the A100!), while the FP64 version runs twice as fast on the A100.
Are these results acceptable? (I know that the A100 has Tensor Cores for FP64 and the V100 does not.)
If we should expect better performance from the mixed version (FP16 + FP32), what should I look into to make the A100 version faster?
As far as I can see, the peak performance figures for the A100 and V100 are:

A100 (FP16 Tensor Core): 312 TFLOPS    V100 (FP16 Tensor Core): 125 TFLOPS
A100 (FP32): 19.5 TFLOPS               V100 (FP32): 15.7 TFLOPS

So from these numbers alone I would naively expect the FP16 trailing update to run roughly 2.5x faster on the A100.
What is the main change in the Ampere architecture that causes the same code not to deliver the same relative performance? I am not sure, but maybe the problem comes from the trailing update, i.e. the GemmEx (I16_O16_C32) call. Maybe the performance of that GEMM is the same on both architectures. Is that true?
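To check that last point, I could time the GEMM in isolation on each card with something like this (a rough standalone sketch; the size n and repetition count are arbitrary choices):

```cpp
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Times an n x n FP16-in / FP16-out / FP32-compute GEMM so the
// trailing-update kernel can be compared between A100 and V100 in isolation.
int main()
{
    const int n = 8192;   // large enough to be compute-bound
    const int reps = 10;

    __half *A, *B, *C;
    cudaMalloc(&A, sizeof(__half) * n * n);
    cudaMalloc(&B, sizeof(__half) * n * n);
    cudaMalloc(&C, sizeof(__half) * n * n);
    cudaMemset(A, 0, sizeof(__half) * n * n);   // contents don't matter for timing
    cudaMemset(B, 0, sizeof(__half) * n * n);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up call so one-time setup is not measured.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha, A, CUDA_R_16F, n, B, CUDA_R_16F, n,
                 &beta,  C, CUDA_R_16F, n,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                     &alpha, A, CUDA_R_16F, n, B, CUDA_R_16F, n,
                     &beta,  C, CUDA_R_16F, n,
                     CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double tflops = 2.0 * n * n * (double)n * reps / (ms * 1e-3) / 1e12;
    printf("n=%d: %.3f ms per GEMM, %.1f TFLOPS\n", n, ms / reps, tflops);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

(Built with `nvcc bench.cu -lcublas`.) If this kernel alone reaches very different TFLOPS on the two cards while the full factorization does not, the bottleneck is presumably somewhere other than the GEMM itself.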