I have an RTX 4000 (mobile version) GPU.
The “FP32 performance” is 7.98 TFLOPS.
I suppose it refers to the performance of all the GPU's CUDA cores (and not the Tensor cores). Am I right?
I would like to get the FP32 performance for all the GPU TENSOR cores.
Considering a Tensor core as a matrix multiplier, [4x4] x [4x4] + [4x4], is it right to deduce the Tensor core performance as: (FP32 perf / nb_CUDA_cores) x 64 x nb_TENSOR_cores?
And what about FP16 and TF32?
Tensor cores (TC) do not do FP32 calculations. (OK, in FP16 mode, the accumulation can be done into FP32, but the calculation as a whole does not use FP32 throughout). I’m not going to try to calculate the FP32 throughput through TC, as I consider that misleading.
That is a Turing-class GPU; it does not support TF32 calculations. TF32 was introduced with Ampere GPUs.
For FP16, or performance estimates in general, most of the specs you need are here
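For FP16, each TC unit can retire one 4x4 matrix multiply per clock, i.e. 4*4*4 = 64 multiply-adds, each counted as 2 floating-point operations.
The FP16 throughput through TC is:
clk * (number of TC units) * 4*4*4 * 2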
The clock speed (clk) is something you have to pick. If you want the peak theoretical number, pick the peak boost clock for your GPU.
Using data from the link then, the result is:
1.56*320*4*4*4*2 = ~64 TFLOPS (with clk = 1.56 GHz and 320 TC units)
Since this is a peak theoretical calculation, you should not assume you can actually witness this level of throughput, e.g. via an appropriately crafted cuBLAS GEMM operation. The measured number will be lower.
Thank you for your answer.