Question about tensor cores performance

juliefraysse · May 1, 2021, 8:25pm

Hi,
I have an RTX 4000 (mobile version) GPU.
The “FP32 performance” is 7.98 TFLOPS.
I suppose it refers to the performance for all the GPU CUDA cores (and not TENSOR cores). Am I right?
I would like to get the FP32 performance for all the GPU TENSOR cores.
Considering a Tensor core as a matrix multiplier [4x4] x [4x4] + [4x4], is it right to deduce the Tensor core performance as: (FP32 perf/nb_CUDA_cores) x 64 x nb_TENSOR_cores?
And what about FP16 and TF32 ?

Robert_Crovella · May 1, 2021, 9:26pm

Tensor cores (TC) do not do FP32 calculations. (OK, in FP16 mode, the accumulation can be done into FP32, but the calculation as a whole does not use FP32 throughout). I’m not going to try to calculate the FP32 throughput through TC, as I consider that misleading.

That is a Turing-class GPU; it does not support TF32 calculations. TF32 is a new feature for Ampere GPUs.
For FP16, or performance estimates in general, most of the specs you need are here

For FP16 each TC unit can retire one 4x4 matrix multiply per clock.
The FP16 throughput through TC is:

clk*#TC*4*4*4*2

The clock speed (clk) is something you have to pick. If you want the peak theoretical number, pick the peak boost clock for your GPU.

Using data from the link then, the result is:

1.56*320*4*4*4*2 = ~64TF
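
As a rough sketch of the arithmetic above, the peak figure can be reproduced in Python. The function and parameter names are illustrative only. The 4*4*4 multiply-adds per TC per clock and the factor of 2 (one multiply plus one add) come straight from the formula; the clock is assumed to be in GHz, so the product comes out in GFLOPS before conversion to TFLOPS.

```python
def tc_fp16_peak_tflops(clock_ghz: float, num_tensor_cores: int) -> float:
    """Peak theoretical FP16 tensor core throughput in TFLOPS.

    Hypothetical helper: assumes each tensor core retires one 4x4 by 4x4
    matrix multiply per clock, i.e. 64 multiply-adds = 128 FLOPs.
    """
    fmas_per_clock = 4 * 4 * 4            # multiply-adds in one 4x4 by 4x4 product
    flops_per_clock = fmas_per_clock * 2  # each multiply-add counts as 2 FLOPs
    gflops = clock_ghz * num_tensor_cores * flops_per_clock
    return gflops / 1000.0


if __name__ == "__main__":
    # Values from the calculation above: 1.56 GHz boost clock, 320 tensor cores.
    print(f"{tc_fp16_peak_tflops(1.56, 320):.1f} TFLOPS")
```

Picking a lower clock (for example a sustained rather than peak boost clock) scales the result linearly.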

Since this is a peak theoretical calculation, you should not assume you can actually witness this level of throughput, e.g. via an appropriately crafted cuBLAS GEMM operation. The measured number will be lower.
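
To compare a measurement against the theoretical peak, one common approach is to time a GEMM and convert the elapsed time into achieved throughput. The sketch below assumes the usual FLOP count of 2*M*N*K for an (M x K) by (K x N) matrix multiply; the helper names are hypothetical, and the timing in the usage example is a made-up placeholder, not a measurement from any GPU.

```python
def gemm_achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """TFLOPS achieved by an (m x k) @ (k x n) GEMM that took `seconds`."""
    flops = 2.0 * m * n * k  # one multiply and one add per inner-product term
    return flops / seconds / 1e12


def fraction_of_peak(achieved_tflops: float, peak_tflops: float) -> float:
    """Ratio of measured throughput to the peak theoretical throughput."""
    return achieved_tflops / peak_tflops


if __name__ == "__main__":
    # Placeholder timing purely to exercise the functions; substitute a real
    # measurement (e.g. from CUDA events around a cuBLAS GEMM call).
    placeholder_seconds = 0.05
    achieved = gemm_achieved_tflops(8192, 8192, 8192, placeholder_seconds)
    peak = 1.56 * 320 * 4 * 4 * 4 * 2 / 1000.0  # peak from the formula above
    print(f"achieved {achieved:.1f} TFLOPS, "
          f"{100 * fraction_of_peak(achieved, peak):.0f}% of peak")
```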

juliefraysse · May 6, 2021, 10:15am

Thank you for your answer.
