Tensor Core performance details of Jetson AGX Orin 32GB

I would like to understand the Tensor Core performance figures for the Jetson AGX Orin 32GB in detail.

From the NVIDIA Jetson AGX Orin Series datasheet (v1.2),
I confirmed the following specs for the Jetson AGX Orin 32GB.

• Max operating frequency: 939 MHz
• Tensor Cores: 56
• Tensor Core performance: 54 FP16 TFLOPS
• Sparsity: fine-grained structured sparsity doubles throughput

I also believe a 3rd-generation Tensor Core can execute 128 multiply-adds per cycle.

So I calculated the Tensor Core performance as follows.

939 MHz * 56 (cores) * 2 (sparsity) * 128 (multiply-adds/cycle) * 2 (ops per multiply-add) = 26.9 TFLOPS
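The arithmetic above can be checked with a short Python sketch. The constants are the figures quoted in this post; the value of 128 multiply-adds per cycle is the poster's own assumption, not a datasheet number, and the variable names are purely illustrative.

```python
# Figures quoted in the post; FMA_PER_CYCLE = 128 is the poster's assumption.
CLOCK_HZ = 939e6        # max operating frequency from the datasheet
TENSOR_CORES = 56       # Tensor Core count from the datasheet
SPARSITY_FACTOR = 2     # structured sparsity doubles throughput
FMA_PER_CYCLE = 128     # assumed multiply-adds per cycle per Tensor Core
OPS_PER_FMA = 2         # one multiply plus one add

ops_per_second = (
    CLOCK_HZ * TENSOR_CORES * SPARSITY_FACTOR * FMA_PER_CYCLE * OPS_PER_FMA
)
estimate_tflops = ops_per_second / 1e12

# Ratio between the datasheet figure (54 TFLOPS) and this estimate.
shortfall = 54 / estimate_tflops

print(f"estimate: {estimate_tflops:.1f} TFLOPS")
print(f"datasheet / estimate: {shortfall:.2f}")
```

The estimate falls short of the datasheet figure by a factor of about 2, which suggests that one of the assumed per-core factors is off by 2x.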

This does not reach 54 FP16 TFLOPS,
so I think I am overlooking something.

Could you give me any advice on this matter?

Regards,
hiro

(512 FMA ops * 2 * 0.939 GHz) * 56 tensor cores = 54 dense INT8 TOPS; 54 * 2 = 108 sparse INT8 TOPS.

FP16 throughput is half of INT8, so that gives 54 sparse FP16 TFLOPS.
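As a rough check of the calculation in this reply, the following Python sketch reproduces the dense and sparse figures. It takes the 512 FMA per clock per Tensor Core value and the "FP16 is half of INT8" rule directly from the reply; the helper name `peak_tops` is hypothetical.

```python
CLOCK_GHZ = 0.939          # max operating frequency
TENSOR_CORES = 56
INT8_FMA_PER_CLOCK = 512   # FMA per clock per Tensor Core, as stated in the reply
OPS_PER_FMA = 2            # each FMA counts as a multiply plus an add


def peak_tops(fma_per_clock, sparse=False):
    """Peak throughput in tera-ops/s for a given per-core FMA rate."""
    gops = fma_per_clock * OPS_PER_FMA * CLOCK_GHZ * TENSOR_CORES
    tops = gops / 1000.0   # GHz-based product is in giga-ops/s
    return tops * 2 if sparse else tops


int8_dense = peak_tops(INT8_FMA_PER_CLOCK)
int8_sparse = peak_tops(INT8_FMA_PER_CLOCK, sparse=True)
fp16_sparse = int8_sparse / 2   # the reply states FP16 is half of INT8

print(f"INT8 dense : {int8_dense:.1f} TOPS")
print(f"INT8 sparse: {int8_sparse:.1f} TOPS")
print(f"FP16 sparse: {fp16_sparse:.1f} TFLOPS")
```

Note that the product with the clock in GHz comes out in giga-ops per second, so it has to be divided by 1000 to read as TOPS.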

Dear @kayccc

I think (512 FMA ops * 2 * 0.939 GHz) * 56 tensor cores comes to about 54 TOPS, as follows.

(512 FMA ops * 2 * 0.939 GHz) * 56 tensor cores = 53,846.016 GOPS (about 54 TOPS)

So I do not understand why you multiply the result by 2.

(512 FMA ops * 2 * 0.939 GHz) * 56 tensor cores = 54 dense INT8 TOPS * 2

Could you tell me the reason?

Regards,
hiro

Unfortunately, I still don’t understand the details of the Tensor Core performance.

Regards,
hiro

Dear @kayccc,

I think your calculation is not correct, because (512 FMA ops * 2 * 0.939 GHz) * 56 tensor cores = 53,846.016 GOPS, which is already about 54 TOPS.

So I would like to reconfirm the calculation of the Tensor Core performance.

Regards,
hiro

The 2 is to convert FMA to OPs. Each FMA is a floating point multiply + floating point ADD (two FP ops).

Dear @kayccc,

I understand that the final 2 means multiply-add/cycles.

(512 FMA ops * 2 * 0.939 GHz) * 56 tensor cores = 54 dense INT8 TOPS * 2 (multiply-add/cycles)

If so, could you explain the “* 2” in “512 FMA ops * 2”?

From the following NVIDIA Ampere GA102 GPU Architecture whitepaper, I think
Jetson’s Tensor Core has 512 sparse INT8 FMA per core, because it has 256 FP16 FMA per core.

NVIDIA AMPERE GA102 GPU ARCHITECTURE
(https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf)

If so, what does the 2 mean in “512 FMA ops * 2”?
I think it does not mean 2 (multiply-add/cycles),
and it does not mean 2 (sparsity).

Regards,
hiro

The first 2 is for the Multiply-Add/Cycles, and the second 2 is to convert from Dense to Sparse TOPs.

Dear @kayccc ,

“The first 2 is for the Multiply-Add/Cycles, and the second 2 is to convert from Dense to Sparse TOPs.”

Do you mean that “512 FMA ops” is “512 dense INT8 FMA”?

I thought Jetson’s Tensor Core has 512 sparse INT8 FMA, as follows.
Is this wrong?

Jetson’s Tensor Core has 512 sparse INT8 FMA per core, because it has 256 sparse FP16 FMA per core.

Please reconfirm the Jetson Tensor Core spec.

Regards,
hiro

Dear @kayccc,

Could I ask whether “512 FMA ops” is “512 DENSE INT8 FMA” or “512 Sparse INT8 FMA”?

Regards,
hiro

The architecture for the Ampere GPU in Orin follows 512 Dense INT8 FMA Ops.

Dear @kayccc,

“The architecture for the Ampere GPU in Orin follows 512 Dense INT8 FMA Ops.”

Do you mean that the Ampere GPU in Orin performs 256 dense FP16 FMA ops, because FP16 is half of INT8?
And is Orin’s Tensor Core architecture the GA100 SM type?

I heard that Orin’s Tensor Core architecture is the GA100 SM type in the following post.

Tensor core of Jetson AGX Orin - Jetson & Embedded Systems / Jetson AGX Orin - NVIDIA Developer Forums

Could you confirm Orin’s Tensor Core architecture?

Regards,
hiro

The GA10 version that Orin uses is different from the GA10x in the document you referenced. It is a custom GA10 architecture, not listed in that document, with 512 dense INT8 FMA ops and 256 FP16 FMA ops, which is half of INT8.
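One way to sanity-check the 256 FP16 FMA figure is to work backwards from the datasheet number. The sketch below is only a cross-check: it assumes the datasheet's 54 FP16 TFLOPS is the sparse figure, as stated earlier in this thread, and the variable names are illustrative.

```python
DATASHEET_FP16_SPARSE_TFLOPS = 54.0   # datasheet figure, assumed to be sparse
CLOCK_HZ = 939e6
TENSOR_CORES = 56
OPS_PER_FMA = 2        # multiply plus add
SPARSITY_FACTOR = 2    # dense to sparse

# Dense FP16 FMA per clock per Tensor Core implied by the datasheet figure.
implied_fma = (DATASHEET_FP16_SPARSE_TFLOPS * 1e12) / (
    CLOCK_HZ * TENSOR_CORES * OPS_PER_FMA * SPARSITY_FACTOR
)

# Forward check using the 256 FP16 FMA value given in the reply.
check_tflops = (
    256 * OPS_PER_FMA * CLOCK_HZ * TENSOR_CORES * SPARSITY_FACTOR / 1e12
)

print(f"implied dense FP16 FMA/clock/core: {implied_fma:.1f}")
print(f"sparse FP16 TFLOPS with 256 FMA: {check_tflops:.1f}")
```

The implied rate lands just above 256, which is consistent with 256 dense FP16 FMA per clock per Tensor Core once the datasheet's rounding to 54 TFLOPS is taken into account.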

Dear @kayccc,