Double precision tensor core performance on A100

Robert_Crovella · July 7, 2023, 7:54pm

If we used your logic to calculate the chip-wide FP16 throughput, we would get 16x16x16x2x432x1410, which yields a number much higher than the actual spec throughput (312 TFLOPs/s).

The assumption that a single TC unit has a throughput of 1 op/clk for each and every combination of TC ops listed in the programming guide cannot be correct.

The A100 whitepaper also says that the throughput of a single TC unit is 256 FMA ops/cycle for FP16, and this number is consistent with the stated chip-wide throughput.
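To make the comparison concrete, here is a rough sketch of both FP16 estimates in Python. It assumes the factors in the product above are 432 tensor core units, a 1410 MHz clock, and 2 FLOPs per FMA; the function name `chip_tflops` is just an illustrative helper, not anything from an NVIDIA tool.

```python
# Assumed reading of the factors: 432 TC units, 1410 MHz clock, 2 FLOPs per FMA.
TC_UNITS = 432
CLOCK_HZ = 1410e6
FLOPS_PER_FMA = 2

def chip_tflops(fma_per_unit_per_clk):
    """Chip-wide TFLOPs/s for a given per-unit, per-clock FMA rate."""
    return fma_per_unit_per_clk * FLOPS_PER_FMA * TC_UNITS * CLOCK_HZ / 1e12

# One full 16x16x16 op per TC unit per clock (the assumption being questioned).
naive_fp16 = chip_tflops(16 * 16 * 16)

# 256 FMA ops per TC unit per clock (the whitepaper figure).
whitepaper_fp16 = chip_tflops(256)

print(f"naive:      {naive_fp16:.0f} TFLOPs/s")
print(f"whitepaper: {whitepaper_fp16:.0f} TFLOPs/s")
```

The first estimate lands near 5000 TFLOPs/s, about 16x the spec, while the second lands at roughly 312 TFLOPs/s.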

So I think you should use the 19.5 TFLOPs/s as the stated chip-wide throughput (for DP TC ops), and if you need a per unit/per clock throughput, do the division. If I do this arithmetic, I reach the conclusion that the A100 TC unit has a throughput of 16 DP FMA ops per unit per cycle:

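19,500,000 / (1410x2x432) = 16.0

(That is 19.5 TFLOPs/s expressed in MFLOPs/s, divided by the clock in MHz x 2 FLOPs per FMA x 432 TC units.)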
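The same division as a small Python sketch, in case you want to repeat it for other data types. The variable names are mine, and it assumes 432 TC units, a 1410 MHz clock, and 2 FLOPs per FMA.

```python
# Assumed inputs: spec DP tensor core throughput, TC unit count, clock, FLOPs per FMA.
DP_TC_FLOPS = 19.5e12   # 19.5 TFLOPs/s chip-wide
TC_UNITS = 432
CLOCK_HZ = 1410e6       # 1410 MHz
FLOPS_PER_FMA = 2

# Work backwards from the chip-wide figure to a per-unit, per-clock FMA rate.
dp_fma_per_unit_per_clk = DP_TC_FLOPS / (FLOPS_PER_FMA * TC_UNITS * CLOCK_HZ)

print(f"{dp_fma_per_unit_per_clk:.1f} DP FMA ops per TC unit per clock")
```

Swapping in a different spec number (for example 312e12 for FP16) gives the corresponding per-unit rate.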

Unlike some other functional units in the GPU, the TC units do not necessarily seem to be able to schedule 1 op per cycle on average.
