Ranking GPUs based on their compute performance

I want to compare and rank multiple GPUs by their compute performance.

Targeted models:

  • I am converting TAO models (e.g. YOLOv4, DINO) into TensorRT engines using trtexec and running them on the TensorRT backend. I convert the models with FP16 precision; however, as I can see in the logs, some layers fall back to FP32 during conversion. It seems this is expected in some cases, when the optimiser can't find FP16 kernels or when FP32 is required to maintain accuracy.
  • Therefore my models will run in either FP16 or mixed precision (FP16 + FP32) on TensorRT.
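For reference, the conversion described above might look roughly like this (the file names are placeholders, not my actual model paths):

```shell
# Build a TensorRT engine from an ONNX export with FP16 enabled.
# Even with --fp16, TensorRT may keep individual layers in FP32 when
# no FP16 kernel is available or when accuracy would suffer; --verbose
# shows the per-layer precision decisions in the build log.
trtexec --onnx=model.onnx --saveEngine=model_fp16.engine --fp16 --verbose
```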
  1. Is there a single metric we can use to roughly compare performance between different GPUs based on their datasheets? If it takes multiple metrics, which ones?

  2. Can you please explain the following metrics:

  • Single-precision performance – is this FP32 TFLOPS?
  • Tensor performance – is this FP16 or FP32 TFLOPS? Is there a separate FP16 TFLOPS value and a separate mixed-precision TFLOPS value?
  3. How can we interpret them for inference with a) an FP16-precision model and b) a mixed-precision (FP16 + FP32) model?

(Note: I have tested the models on one GPU; its GPU utilisation is very high and it cannot scale to the expected rate of inputs, so I am looking to compare that known GPU against others on the market to find one that suits my task.)

This recent discussion may be of interest.

For the FP16 case, the most relevant single metric would be FP16 tensor-core throughput. This has variants, such as FP16 with FP16 accumulation and FP16 with FP32 accumulation, but I wouldn't worry too much about those: it's likely to be the same treatment regardless of the GPU, so it should not affect a GPU-to-GPU comparison. Of course, if only one layer gets actual FP16 treatment and many others are converted to FP32, then FP16 tensor-core throughput may not be a particularly good metric/predictor.
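As a sketch of that kind of first-pass comparison, here is a trivial ranking by peak dense FP16 tensor-core TFLOPS. The figures are approximate datasheet values for a few well-known GPUs, used purely for illustration; real inference throughput also depends on memory bandwidth and on how many layers actually run in FP16:

```python
# Rough first-pass ranking by peak dense FP16 tensor-core throughput.
# Values are approximate datasheet numbers (TFLOPS), illustration only.
fp16_tensor_tflops = {
    "Tesla T4": 65,
    "Tesla V100": 125,
    "A100 (SXM)": 312,
}

# Sort descending by throughput; the top entry is the rough "best" candidate.
for gpu, tflops in sorted(fp16_tensor_tflops.items(), key=lambda kv: -kv[1]):
    print(f"{gpu}: {tflops} TFLOPS")
```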

One of the key takeaways from the thread I linked is that such an exercise (reliably predicting GPU performance from a single metric/number) will usually/probably depend on the exact behavior of the code in question.

For tensor cores, there is no "pure" FP32 path. Depending on your GPU, TRT might be "falling back" from FP16 to "ordinary" FP32 (i.e. non-tensor-core), or it may possibly be using TF32 (another tensor-core path) if the underlying GPU supports it – that would probably be a specific question for the TRT forum. You may get better help asking TRT questions there.

Multiply the number in this unofficial table

For your architecture (compute capability; the GPU models are listed in an early chapter of that Wikipedia page) and your data format, multiply the number in the table cell by the number of tensor cores per SM (4 since Ampere), the number of SMs of your GPU, and the base or boost clock to get the number of multiply-add instructions per second.
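The calculation above can be sketched in a few lines. The inputs below are illustrative values for an NVIDIA A100 (SXM): 108 SMs, 4 tensor cores per SM, a 1410 MHz boost clock, and 256 FP16 FMA operations per tensor core per clock (the kind of per-format number that unofficial table provides; double-check the cell for your own architecture):

```python
# Estimate peak FP16 tensor-core throughput from per-clock table values.
sms = 108                    # number of SMs on the GPU
tensor_cores_per_sm = 4      # 4 tensor cores per SM since Ampere
boost_clock_hz = 1410e6      # boost clock in Hz
fma_per_tc_per_clock = 256   # FP16 FMAs per tensor core per clock (table cell)

fma_per_second = sms * tensor_cores_per_sm * boost_clock_hz * fma_per_tc_per_clock
tflops = fma_per_second * 2 / 1e12  # each fused multiply-add counts as 2 FLOPs
print(f"~{tflops:.0f} TFLOPS FP16 tensor")
```

With these inputs the result comes out at roughly 312 TFLOPS, which matches the A100's published dense FP16 tensor figure.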

There are no FP32 tensor cores, only FP16 and smaller formats, TF32 (19 bits of precision), and FP64 (with limited performance on consumer GPUs).