I see here that the RTX A5000 has the following specs:
Single-precision performance 27.8 TFLOPS
RT Core performance 54.2 TFLOPS
Tensor performance 222.2 TFLOPS
I’m trying to work out how to compare cards to each other in order to find the most cost-efficient cards for my application. My application is multiple streams (hundreds of cameras) running object detection.
I don’t think we’re interested in RT Cores since they are for raytracing, so we can scrap them.
I think Single-precision is FP32 performance.
Tensor performance confuses me a bit. Given it’s such a big number, it might refer to INT4 performance, which isn’t applicable to me.
If I have a model on an RTX A5000 that gets e.g. 50fps on single-precision, how do we work out how this model will perform on e.g. an RTX A6000 (single-precision 38.7 TFLOPS, Tensor 309.7 TFLOPS)? Does it scale linearly with either spec?
Looking at the GPU support matrix here (Support Matrix :: NVIDIA Deep Learning TensorRT Documentation) and cross-referencing it against CUDA GPUs - Compute Capability | NVIDIA Developer, it seems that the RTX A5000 is capable of half precision (FP16) and also INT8. Where would I find the specs for these? A lot of edge devices (some Jetsons, Google Coral TPU) have specs in TOPS, so it would be good to compare INT8 directly.
TFLOPS refers to floating point calculation. INT4 is not a floating point type, and INT4 calculations are not floating point ops. INT1/4/8 throughput would be reported in TOPS.
To do a proper job of this requires roofline analysis. Find the limiting factor (e.g. math pipe, memory bandwidth, etc.) on the A5000 for that particular workload. Compare that to the same capability on the A6000. Then ask whether, at that level of throughput, any of the other A6000 limiting factors would interfere first (i.e. you hit your head on a different roof). This requires profiler analysis, and is difficult to do for a complex workload such as a full inference pipeline. Instead you would probably want to pick a representative part of the pipeline (e.g. a call to a TRT kernel) and do a roofline analysis on that.
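As a rough illustration of the roofline idea, here is a minimal Python sketch. The function names are hypothetical, and so are the bandwidth (768 GB/s) and the arithmetic intensity (10 FLOP/byte); neither figure is given in this thread. In practice the intensity would come from profiling the kernel, and the bandwidth from the datasheet.

```python
def memory_roof_tflops(mem_bw_gbs, flops_per_byte):
    """Throughput ceiling imposed by memory bandwidth, in TFLOPS."""
    # GB/s * FLOP/byte = GFLOP/s; divide by 1000 to get TFLOP/s.
    return mem_bw_gbs * flops_per_byte / 1000.0


def attainable_tflops(peak_tflops, mem_bw_gbs, flops_per_byte):
    """Roofline model: the lower of the math roof and the memory roof."""
    return min(peak_tflops, memory_roof_tflops(mem_bw_gbs, flops_per_byte))


def limiting_factor(peak_tflops, mem_bw_gbs, flops_per_byte):
    """Name the roof that the kernel hits first."""
    if memory_roof_tflops(mem_bw_gbs, flops_per_byte) < peak_tflops:
        return "memory bandwidth"
    return "math pipe"


# 27.8 TFLOPS is the A5000 FP32 figure quoted in this thread.
# 768 GB/s and 10 FLOP/byte are ASSUMED values for illustration only.
peak = 27.8
bandwidth = 768.0
intensity = 10.0

print(attainable_tflops(peak, bandwidth, intensity))
print(limiting_factor(peak, bandwidth, intensity))
```

Repeating the calculation with the second card's peak and bandwidth shows whether the same roof still limits the kernel there, or whether a different one takes over.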
If your model is being calculated in single-precision (FP32), why not see about inferencing it on the A5000 in another datatype that can take advantage of the Tensor Cores (e.g. TF32, FP16, etc.)?
You can get an estimate of some substance by looking at the factors that you think might be relevant, e.g. fp32 throughput and memory bandwidth, finding out which one represents the lowest percentage improvement going from A5000 to A6000, then use that to scale the perf, as an estimate. If your actual pipeline is bound by something else, like Host->Device data transfer, then this is going to miss the mark.
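A minimal sketch of that estimate in Python follows. The FP32 numbers are the ones quoted in this thread. The memory bandwidth figures are assumptions that are not stated here and should be checked against the datasheets. The function name and dictionary keys are hypothetical.

```python
def estimate_fps(baseline_fps, old_specs, new_specs):
    """Scale baseline_fps by the smallest new/old ratio among the listed specs."""
    ratios = {name: new_specs[name] / old_specs[name] for name in old_specs}
    # The factor that improves the least is treated as the bottleneck.
    bottleneck = min(ratios, key=ratios.get)
    return baseline_fps * ratios[bottleneck], bottleneck, ratios


# FP32 TFLOPS are from the thread; bandwidth values are ASSUMED placeholders.
a5000 = {"fp32_tflops": 27.8, "mem_bw_gbs": 768.0}
a6000 = {"fp32_tflops": 38.7, "mem_bw_gbs": 768.0}

fps, factor, ratios = estimate_fps(50.0, a5000, a6000)
print(f"estimated fps: {fps:.1f} (limited by {factor})")
for name, ratio in ratios.items():
    print(f"  {name}: x{ratio:.2f}")
```

If the assumed bandwidth figures hold, the least-improved factor is memory bandwidth, so this estimate predicts little or no gain despite the higher FP32 number. That is the kind of conclusion worth confirming with a profiler before buying hardware.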
For specs not directly provided by NVIDIA, you can sometimes find them in other places, e.g. TechPowerUp.
If you know one specification, certain others can often be calculated using this table in the programming guide.
Finally, the chip architecture and associated whitepaper can be used to figure some things out. A5000 appears to be based on the GA102 chip, which has an associated whitepaper here.
To work through an example, let’s take a look at Tensor Core (TC) perf on the A5000. The datasheet says 222.2 TFLOPS, and it has a footnote (6) which indicates this assumes sparsity.
Referring to the whitepaper, table 3, we see that for GA102 the highest reported TFLOPS is associated with TC ops that are FP16, FP16 Accumulate, with Sparsity. This is what the 222.2 number corresponds to. Not using or disabling sparsity cuts the relevant numbers in half.
For TF32 calculations, the GA102 delivers 1/2 of the equivalent FP16 number, so 222.2/2 is the TFLOPS throughput of TF32 with sparsity, for A5000.
Hopefully that walkthrough will allow you to answer other similar questions, like what is the INT8 throughput? With sparsity, it should be double the 222.2 number (TOPS, not TFLOPS).
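The arithmetic in this walkthrough can be written out as a short Python sketch. It encodes only the three rules stated above (dense is half of sparse, TF32 is half of FP16, INT8 is double FP16) and starts from the datasheet figure. The function name is hypothetical, and the results are derived estimates, not published specs.

```python
def tensor_core_table(fp16_sparse_tflops):
    """Derive other Tensor Core rates from the FP16-with-sparsity headline figure."""
    fp16_dense = fp16_sparse_tflops / 2  # disabling sparsity halves the rate
    return {
        "FP16 sparse (TFLOPS)": fp16_sparse_tflops,
        "FP16 dense (TFLOPS)": fp16_dense,
        "TF32 sparse (TFLOPS)": fp16_sparse_tflops / 2,  # TF32 is half of FP16
        "TF32 dense (TFLOPS)": fp16_dense / 2,
        "INT8 sparse (TOPS)": fp16_sparse_tflops * 2,  # INT8 is double FP16
        "INT8 dense (TOPS)": fp16_dense * 2,
    }


# 222.2 is the A5000 datasheet Tensor figure quoted in this thread.
table = tensor_core_table(222.2)
for name, value in table.items():
    print(f"{name}: {value:.2f}")
```

For a model that does not use structured sparsity, the dense rows are the ones to compare against the TOPS figures quoted for edge devices.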