Looking for full specs on NVIDIA A5000

brian0b6iu · June 2, 2022, 11:12am

I see here that it has the following specs:

Single-precision performance 27.8 TFLOPS
RT Core performance 54.2 TFLOPS
Tensor performance 222.2 TFLOPS

I’m trying to work out how to compare cards to each other in order to find the most cost-efficient cards for my application. My application is multiple streams (hundreds of cameras) running object detection.

I don’t think we’re interested in RT Cores since they are for raytracing, so we can scrap them.

I think Single-precision is FP32 performance.

Tensor performance confuses me a bit - Given it’s such a big number it might refer to INT4 performance which isn’t applicable to me.

My questions:

If I have a model on an RTX A5000 that gets e.g. 50fps on single-precision, how do we work out how this model will perform on e.g. an RTX A6000 (single 38.7 TFLOPS, Tensor 309.7)? Does it scale linearly with either spec?
Looking at the GPU support matrix here: Support Matrix :: NVIDIA Deep Learning TensorRT Documentation, and cross-referencing against https://developer.nvidia.com/cuda-gpus#compute, it seems that the RTX A5000 is capable of half precision (FP16) and also INT8. Where would I find the specs for these? A lot of edge devices (some Jetsons, Google Coral TPU) have specs in TOPS, so it would be good to compare INT8 directly.

Thanks!

Robert_Crovella · June 16, 2022, 10:50pm

TFLOPS refers to floating point calculation. INT4 is not a floating point type, and INT4 calculations are not floating point ops. INT1/4/8 throughput would be reported in TOPS.

To do a proper job of this requires roofline analysis. Find the limiting factor (e.g. math pipe, memory bandwidth, etc.) on the A5000 for that particular workload. Compare that to the same capability on the A6000. Then check whether, at that level of throughput, any of the other A6000 limiting factors interferes first (i.e. you hit your head on a different roof). This requires profiler analysis, and is difficult to do for a complex workload such as a full inference pipeline. Instead you would probably want to pick a representative part of the pipeline (e.g. a call to a TRT kernel) and do a roofline analysis on that.
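As a rough illustration of the roofline idea, here is a minimal Python sketch. The 27.8 TFLOPS figure is the FP32 number quoted earlier in this thread; the 768 GB/s memory bandwidth is an assumption taken from public spec listings and should be checked against the datasheet. The arithmetic intensities are made-up placeholders; real values have to come from a profiler.

```python
def attainable_tflops(peak_tflops, mem_bw_gbs, flops_per_byte):
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    # GB/s * FLOP/byte = GFLOP/s; divide by 1000 to get TFLOP/s.
    bandwidth_roof_tflops = mem_bw_gbs * flops_per_byte / 1000.0
    return min(peak_tflops, bandwidth_roof_tflops)


def ridge_point(peak_tflops, mem_bw_gbs):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute bound."""
    return peak_tflops * 1000.0 / mem_bw_gbs


A5000_FP32_TFLOPS = 27.8   # from the datasheet numbers quoted in this thread
A5000_MEM_BW_GBS = 768.0   # ASSUMPTION: not from this thread, verify it

if __name__ == "__main__":
    ridge = ridge_point(A5000_FP32_TFLOPS, A5000_MEM_BW_GBS)
    print(f"ridge point: {ridge:.1f} FLOP/byte")
    # Hypothetical kernel intensities, for illustration only.
    for intensity in (4.0, 36.0, 100.0):
        bound = attainable_tflops(A5000_FP32_TFLOPS, A5000_MEM_BW_GBS, intensity)
        kind = "memory bound" if intensity < ridge else "compute bound"
        print(f"{intensity:6.1f} FLOP/byte -> at most {bound:.2f} TFLOPS ({kind})")
```

A kernel whose measured intensity sits below the ridge point is limited by memory bandwidth, so a card with more FP32 TFLOPS but the same bandwidth would not be expected to speed it up.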

If your model is being calculated in single-precision (FP32), why not see about inferencing it on the A5000 in another datatype that can take advantage of the Tensor Cores (e.g. TF32, FP16, etc.)?

You can get an estimate of some substance by looking at the factors that you think might be relevant, e.g. fp32 throughput and memory bandwidth, finding out which one represents the lowest percentage improvement going from A5000 to A6000, then use that to scale the perf, as an estimate. If your actual pipeline is bound by something else, like Host->Device data transfer, then this is going to miss the mark.
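That "lowest percentage improvement" estimate might be sketched like this in Python. The FP32 numbers and the 50 fps baseline come from the question above; the 768 GB/s memory bandwidth for both cards is an assumption from public spec listings, not from this thread, and the function and dictionary key names are just illustrative.

```python
def estimate_scaled_fps(baseline_fps, baseline_specs, target_specs):
    """Scale fps by the smallest target/baseline ratio among the chosen factors."""
    ratios = {
        name: target_specs[name] / baseline_specs[name]
        for name in baseline_specs
    }
    limiting = min(ratios, key=ratios.get)  # factor that improves the least
    return baseline_fps * ratios[limiting], limiting, ratios


# FP32 TFLOPS are from this thread; bandwidth values are ASSUMPTIONS to verify.
a5000 = {"fp32_tflops": 27.8, "mem_bw_gbs": 768.0}
a6000 = {"fp32_tflops": 38.7, "mem_bw_gbs": 768.0}

if __name__ == "__main__":
    fps, limiting, ratios = estimate_scaled_fps(50.0, a5000, a6000)
    for name, ratio in ratios.items():
        print(f"{name}: x{ratio:.2f}")
    print(f"estimated fps on target: {fps:.1f} (limited by {limiting})")
```

Under those assumed numbers the FP32 ratio is about 1.39x but the bandwidth ratio is 1.0x, so the conservative estimate is no speedup at all. Which factors belong in the dictionary depends on what the profiler says the workload is actually bound by.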

For specs not directly provided by NVIDIA, you can sometimes find them in other places, e.g. TechPowerUp.
If you know one specification, certain others can often be calculated using this table in the programming guide.
Finally, the chip architecture and associated whitepaper can be used to figure some things out. A5000 appears to be based on the GA102 chip, which has an associated whitepaper here.

Robert_Crovella · June 16, 2022, 11:07pm

To work through an example, let's take a look at Tensor Core (TC) perf on the A5000. The datasheet says 222.2 TFLOPS, with a footnote (6) indicating that this figure assumes sparsity.

Referring to the whitepaper, table 3, we see that for GA102 the highest reported TFLOPS is associated with TC ops that are FP16, FP16 Accumulate, with Sparsity. This is what the 222.2 number corresponds to. Not using or disabling sparsity cuts the relevant numbers in half.

For TF32 calculations, the GA102 delivers 1/2 of the equivalent FP16 number, so the TF32 throughput with sparsity on the A5000 is 222.2/2 = 111.1 TFLOPS.

Hopefully that walkthrough will allow you to answer other similar questions, like what is the INT8 throughput? With sparsity, it should be double the 222.2 number (TOPS, not TFLOPS).
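The arithmetic in this walkthrough can be written out as a small Python sketch. It only encodes the ratios stated above (dense is half of sparse, TF32 is half of FP16, INT8 is double FP16); the function and key names are made up for illustration, and the results are derived estimates rather than published datasheet values.

```python
def tensor_throughput(fp16_sparse_tflops):
    """Derive GA102 Tensor Core rates from the FP16-with-sparsity headline figure."""
    fp16_dense = fp16_sparse_tflops / 2    # no sparsity: half the rate
    tf32_sparse = fp16_sparse_tflops / 2   # TF32 is 1/2 of the equivalent FP16 rate
    tf32_dense = tf32_sparse / 2
    int8_sparse = fp16_sparse_tflops * 2   # INT8 is double FP16, reported in TOPS
    int8_dense = int8_sparse / 2
    return {
        "fp16_sparse_tflops": fp16_sparse_tflops,
        "fp16_dense_tflops": fp16_dense,
        "tf32_sparse_tflops": tf32_sparse,
        "tf32_dense_tflops": tf32_dense,
        "int8_sparse_tops": int8_sparse,
        "int8_dense_tops": int8_dense,
    }


if __name__ == "__main__":
    # 222.2 is the A5000 datasheet Tensor figure quoted in this thread.
    for name, value in tensor_throughput(222.2).items():
        print(f"{name}: {value:.2f}")
```

Passing 309.7 (the A6000 Tensor figure from the question) instead gives the corresponding estimates for that card, assuming the same GA102 ratios apply.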
