Ranking GPUs based on their compute performance

I want to compare and rank multiple GPUs by their compute performance.

Targeted models:

  • I am converting TAO models (e.g. YOLOv4, DINO) into TensorRT engines using trtexec and running them on the TensorRT backend. I convert the models with FP16 precision; however, as I can see in the logs, some layers fall back to FP32 during conversion. It seems this is expected in some cases, when the optimiser can't find FP16 kernels or when FP32 is required to maintain accuracy.
  • Therefore my models will run in either FP16 or mixed precision (FP16 + FP32) on TensorRT.
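For reference, the conversion described above might look roughly like this (the file names are placeholders, not my actual model paths):

```shell
# Build a TensorRT engine from an ONNX export with FP16 enabled.
# Even with --fp16, TensorRT may keep individual layers in FP32 when
# no FP16 kernel is available or when accuracy would suffer; --verbose
# shows the per-layer precision decisions in the build log.
trtexec --onnx=model.onnx --saveEngine=model_fp16.engine --fp16 --verbose
```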
  1. Is there a single metric we can use to roughly compare performance between different GPUs based on their datasheets? If it takes multiple metrics, which ones?

  2. Can you please explain the following metrics:

  • Single-precision performance – is this FP32 TFLOPS?
  • Tensor performance – is this FP16 or FP32 TFLOPS? Is there a separate FP16 TFLOPS value and a separate mixed-precision TFLOPS value?
  3. How can we interpret them for inference with a) an FP16-precision model and b) a mixed-precision (FP16 + FP32) model?

(Note: I have tested the models on one GPU; its GPU utilisation is very high and it cannot scale to the expected rate of inputs, so I am looking to compare that known GPU against others on the market to find one that suits my task.)

This recent discussion may be of interest.

For the FP16 case, the most relevant single metric would be FP16 tensor-core throughput. This has variants, such as FP16 with FP16 accumulation and FP16 with FP32 accumulation, but I wouldn't worry too much about those: it's likely to be the same treatment regardless of the GPU, so it should not affect a GPU-to-GPU comparison. Of course, if only one layer gets actual FP16 treatment and many others are converted to FP32, then FP16 tensor-core throughput may not be a particularly good metric/predictor.
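As a sketch of that kind of first-pass comparison, here is a trivial ranking by peak dense FP16 tensor-core TFLOPS. The figures are approximate datasheet values for a few well-known GPUs, used purely for illustration; real inference throughput also depends on memory bandwidth and on how many layers actually run in FP16:

```python
# Rough first-pass ranking by peak dense FP16 tensor-core throughput.
# Values are approximate datasheet numbers (TFLOPS), illustration only.
fp16_tensor_tflops = {
    "Tesla T4": 65,
    "Tesla V100": 125,
    "A100 (SXM)": 312,
}

# Sort descending by throughput; the top entry is the rough "best" candidate.
for gpu, tflops in sorted(fp16_tensor_tflops.items(), key=lambda kv: -kv[1]):
    print(f"{gpu}: {tflops} TFLOPS")
```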

One of the key takeaways from the thread I linked is that such an exercise (reliably predicting GPU performance from a single metric/number) will usually/probably depend on the exact behavior of the code in question.

For tensor cores, there is no "pure" FP32 path. Depending on your GPU, TRT might be "falling back" from FP16 to "ordinary" FP32 (i.e. non-tensor-core), or it may possibly be using TF32 (another tensor-core path) if the underlying GPU supports it – that would probably be a specific question for the TRT forum. You may get better help asking TRT questions there.

Multiply the number in this unofficial table

For your architecture (compute capability; the GPU models are listed in an early chapter of that Wikipedia page) and your data format, multiply the number in the table cell by the number of tensor cores per SM (4 since Ampere), the number of SMs of your GPU, and the base or boost clock to get the number of multiply-add instructions per second.
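The calculation above can be sketched in a few lines. The inputs below are illustrative values for an NVIDIA A100 (SXM): 108 SMs, 4 tensor cores per SM, a 1410 MHz boost clock, and 256 FP16 FMA operations per tensor core per clock (the kind of per-format number that unofficial table provides; double-check the cell for your own architecture):

```python
# Estimate peak FP16 tensor-core throughput from per-clock table values.
sms = 108                    # number of SMs on the GPU
tensor_cores_per_sm = 4      # 4 tensor cores per SM since Ampere
boost_clock_hz = 1410e6      # boost clock in Hz
fma_per_tc_per_clock = 256   # FP16 FMAs per tensor core per clock (table cell)

fma_per_second = sms * tensor_cores_per_sm * boost_clock_hz * fma_per_tc_per_clock
tflops = fma_per_second * 2 / 1e12  # each fused multiply-add counts as 2 FLOPs
print(f"~{tflops:.0f} TFLOPS FP16 tensor")
```

With these inputs the result comes out at roughly 312 TFLOPS, which matches the A100's published dense FP16 tensor figure.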

There are no FP32 tensor cores, only FP16 and smaller formats, TF32 (19 bits of precision), and FP64 (with limited performance on consumer GPUs).