I’ve been comparing the specs of the A10 vs the A30 for AI inference workloads. I really don’t understand how the A30 is faster than the A10 on FP16 Tensor Core compute:

The A30 has 3,584 CUDA cores and 224 third-gen Tensor Cores. The A10 has 9,216 CUDA cores and 288 third-gen Tensor Cores.

Could you explain how the A30 gains its extra performance? I’ve been using the number of CUDA cores and Tensor Cores to estimate inference performance, but it seems I’m missing something.

A30 is from the same product group as A100 and is based on the GA100 chip architecture.

A10 is based on the GA102 chip architecture.

The Tensor Core unit design is not identical between these two chips/product groups.
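That design difference shows up if you divide each whitepaper’s peak FP16 TC figure by SM count and clock. Here is a rough back-of-the-envelope check; the boost clocks (1410 MHz for A100, 1740 MHz for A40) are assumed public spec-sheet values, not numbers from this thread:

```python
# Per-SM, per-clock FP16 Tensor Core throughput, derived from the
# whitepaper peak dense TFLOPS, SM count, and (assumed) boost clock.
def fp16_tc_flops_per_sm_per_clock(peak_tflops, sms, boost_ghz):
    return peak_tflops * 1e12 / (sms * boost_ghz * 1e9)

ga100 = fp16_tc_flops_per_sm_per_clock(312, 108, 1.41)  # A100 (GA100)
ga102 = fp16_tc_flops_per_sm_per_clock(150, 84, 1.74)   # A40  (GA102)

print(f"GA100 SM: ~{ga100:.0f} FLOPs/clock")
print(f"GA102 SM: ~{ga102:.0f} FLOPs/clock")
```

The GA100 SM comes out at roughly twice the FP16 TC throughput per clock of a GA102 SM, which is how the A30 can beat the A10 despite having far fewer CUDA cores.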

The A10’s performance can be derived from the GA102 whitepaper (e.g. Table 3) by scaling down the A6000 (or A40) figures quoted there, based on clocks and SM count. The A40 has 84 GA102 SMs, and the non-sparsity FP16 TC figure quoted there is about 150 TFLOPS. The A10 has 72 GA102 SMs (discoverable from e.g. deviceQuery, or online from e.g. TechPowerUp), so:

72/84 × 150 ≈ 128.6 TFLOPS

and the remaining difference (down to the quoted 125 TFLOPS) comes from clock differences between the A10 and A40.

The A30 is based on the GA100 design, and from that whitepaper (e.g. Table 2) we see that the A100 delivers 312 FP16 TC TFLOPS, non-sparsity.

The A100 has 108 SMs, whereas the A30 has 56 SMs (again discoverable from deviceQuery, or online from e.g. TechPowerUp). Again:

56/108 × 312 ≈ 161.8 TFLOPS

and the remaining difference (up to the quoted 165 TFLOPS) again comes from clock differences.
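Both scaling estimates can be reproduced in a few lines. This is just a sketch using the SM counts and whitepaper figures quoted above; clock scaling is omitted, which is where the small residual gaps come from:

```python
def scaled_fp16_tc_tflops(sms, ref_sms, ref_tflops):
    """Scale a reference GPU's peak dense FP16 Tensor Core TFLOPS by SM count.

    Ignores clock differences, which account for the small residual gap
    to the official figures (125 TFLOPS for A10, 165 TFLOPS for A30).
    """
    return sms / ref_sms * ref_tflops

# A10: scale down from the A40 (84 SMs, ~150 TFLOPS per the GA102 whitepaper)
a10 = scaled_fp16_tc_tflops(sms=72, ref_sms=84, ref_tflops=150)

# A30: scale down from the A100 (108 SMs, 312 TFLOPS per the GA100 whitepaper)
a30 = scaled_fp16_tc_tflops(sms=56, ref_sms=108, ref_tflops=312)

print(f"A10 ~ {a10:.1f} TFLOPS, A30 ~ {a30:.1f} TFLOPS")
# A10 ~ 128.6 TFLOPS, A30 ~ 161.8 TFLOPS
```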

Note that I’m just pointing out how the FP16 TC numbers are calculated; I’m not suggesting that one is faster than the other. To determine that, you would have to look at actual benchmarks for your specific workload.

Note also that whereas the A10 has about 600 GB/s of memory bandwidth, the A30 has about 900 GB/s. For some workloads, that might be the determining factor.
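One rough way to see when bandwidth matters is a roofline-style balance point: peak FLOPs divided by peak bytes/s. This is only a sketch; the bandwidth figures (600 GB/s for the A10, 933 GB/s for the A30) are approximate datasheet values, not measured numbers:

```python
gpus = {
    # name: (peak dense FP16 TC TFLOPS, memory bandwidth in GB/s)
    "A10": (125, 600),
    "A30": (165, 933),
}

# Balance point in FLOPs per byte of DRAM traffic: kernels with lower
# arithmetic intensity than this tend to be memory-bandwidth bound.
balance = {
    name: tflops * 1e12 / (gbps * 1e9)
    for name, (tflops, gbps) in gpus.items()
}

for name, fpb in balance.items():
    print(f"{name}: ~{fpb:.0f} FLOPs/byte balance point")
```

Memory-bound inference cases (e.g. small-batch decoding) sit well below either balance point, so there the A30’s extra bandwidth, rather than its TC throughput, is what helps.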