Hello,
I’m trying to understand the specs for the Jetson AGX Orin SoC to accurately compare it to an A100 for my research. I’ll be profiling custom kernels with CUTLASS (using dense/sparse tensor cores) and builtin PyTorch ops with TensorRT. I’m looking at the developer datasheet and I see:
JAO 64GB: Ampere GPU, two GPC, eight TPC
Up to 170 INT8 Sparse TOPS or 85 FP16 TFLOPS (Tensor Cores)
Up to 5.32 FP32 TFLOPS or 10.649 FP16 TFLOPS (CUDA cores)
…
Deep Learning Accelerator (DLA)
JAO 64GB: Up to 105 INT8 TOPS (Sparse, Deep Learning Inference)
…
AI Performance
JAO 64GB: Up to 275 Sparse TOPS (INT8)
Three questions:

Can any kernel use both the DLA and the GPU at the same time? Assuming it can’t, the fastest I should expect my kernel to run is with sparse tensor cores on the GPU, at 170 TOPS?

When the above says “85 FP16 TFLOPS” I assume that means sparse tensor cores, so dense tensor cores would be half this at 42.5 FP16 TFLOPS?

The GA100 SM has 64 INT32, 64 FP32 and 32 FP64 CUDA cores. Given the Orin docs say the GPU has “2048 CUDA Cores” that seems to imply the Orin SM has 128 INT32, 64 FP32 and 0 FP64 CUDA cores. Thus a more accurate statement would be that the Orin GPU has 2048 INT32 CUDA Cores, and 1024 FP32 CUDA Cores?
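As a sanity check on the CUDA-core side of the datasheet, the quoted peaks can be re-derived from the core count. This is just the standard cores × clock × ops/FMA arithmetic; the 1.3 GHz max GPU clock is my assumption (it isn’t stated in the snippet above):

```python
# Re-derive the Orin datasheet CUDA-core peaks.
# Assumption: 1.3 GHz max GPU clock (not stated in the datasheet snippet).
CUDA_CORES = 2048   # from the datasheet's "2048 CUDA Cores"
CLOCK_GHZ = 1.3
OPS_PER_FMA = 2     # one fused multiply-add counts as two operations

fp32_tflops = CUDA_CORES * CLOCK_GHZ * OPS_PER_FMA / 1e3  # GHz -> TFLOPS
fp16_tflops = fp32_tflops * 2  # two FP16 FMAs per FP32 core per clock

print(round(fp32_tflops, 3))  # 5.325  -- matches "Up to 5.32 FP32 TFLOPS"
print(round(fp16_tflops, 3))  # 10.65  -- matches "10.649 FP16 TFLOPS"
```

Both datasheet numbers fall out of the 2048-core count directly, so the quoted CUDA-core peaks at least are internally consistent.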
Thank you,
Collin
I’m also curious about this question.
I spent a bit more time calculating some numbers for A100 vs. AGX Orin and for both cases there are some “weird” things not being explained.
For the A100, the whitepaper on page 36 lists 6912 FP32 Cores/GPU which implies a peak TFLOPS of
6912 FP32 Cores * 1.41 GHz * 2 OP/FMA * 1 FMA/clock = 19.49 TFLOPS
which matches the “Peak FP32 TFLOPS (non-Tensor)” value in the table. However, I would expect both the Peak BF16 and Peak FP16 to be double this (e.g. we can launch two FP16 or BF16 FMAs on one FP32 core per clock). But the Peak FP16 is actually 4x the Peak FP32. That implies one of the following:
1. The 3456 FP64 cores can actually be used as an additional set of 6912 FP32 cores for FP16 computations (but they’re not used for FP32 computations on the A100, or that peak value would be higher).
2. Somehow we can launch 4 FP16 FMAs per clock on one FP32 core, but only 2 BF16 FMAs per clock (seems impossible).
3. It’s a typo (seems unlikely).
If the answer were #1 then a similar thing could be happening on the AGX Orin. The FP64 cores are actually there (e.g. both the GA100 SM and the Orin GPU SMs are physically the same, with 64 INT32, 64 FP32, 32 “FP64” cores per SM), but the FP64 cores can be easily switched to permanently run in “FP32” mode for the AGX Orin to essentially double the number of FP32 cores. That would imply the AGX Orin “effectively” has 2048 FP32, 1024 INT32 and 0 FP64 CUDA Cores.
Looking at the H100 whitepaper, the H100 Peak FP16 and BF16 TFLOPS (non-Tensor) are the same, but it still shows that for the A100 the Peak FP16 is 2x the Peak BF16. So whatever is going on to achieve that seems to be specific to Ampere.
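To make the ratios above concrete, here is the same arithmetic for the A100, using the whitepaper core count and clock quoted earlier. The 78 and 39 TFLOPS values are the non-Tensor FP16 and BF16 peaks implied by the 4x/2x ratios described above:

```python
# Re-derive the A100 non-Tensor FP32 peak from the whitepaper numbers.
FP32_CORES = 6912
CLOCK_GHZ = 1.41
OPS_PER_FMA = 2

fp32_tflops = FP32_CORES * CLOCK_GHZ * OPS_PER_FMA / 1e3
print(round(fp32_tflops, 2))  # 19.49 -- matches the whitepaper table

# Non-Tensor FP16 is listed at 78 TFLOPS, non-Tensor BF16 at 39 TFLOPS:
print(round(78 / fp32_tflops, 1))  # 4.0 -- the surprising FP16 ratio
print(round(39 / fp32_tflops, 1))  # 2.0 -- the expected BF16 ratio
```

So the oddity is exactly the factor-of-two gap between FP16 and BF16, which is what options 1–3 above are trying to explain.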
Hi,
1. First, please check whether your kernel can run on the Tensor Cores.
Then launch the kernel with enough work to occupy the full resources, so it uses all the available CUDA cores and Tensor Cores.
2. 85 is the sparse FP16 TFLOPS figure. Dense FP16 TFLOPS would be half of this.
3. AGX Orin has 2048 general CUDA cores and no FP64 CUDA cores.
Thanks.
Hi! I’m very curious about your comment “If the answer were #1 then a similar thing could be happening on the AGX Orin. The FP64 cores are actually there (e.g. both the GA100 SM and the Orin GPU SMs are physically the same, with 64 INT32, 64 FP32, 32 “FP64” cores per SM), but the FP64 cores can be easily switched to permanently run in “FP32” mode for the AGX Orin to essentially double the number of FP32 cores,” since I’ve got a strange result after running a test on an AGX Orin 64GB with PyTorch. The test code is as follows:
import torch
from torch.utils import benchmark

typ = torch.float16
n = 1024 * 16

# Two random n x n half-precision matrices on the GPU
a = torch.randn(n, n).type(typ).cuda()
b = torch.randn(n, n).type(typ).cuda()

# Time the matmul (benchmark.Timer handles CUDA synchronization)
t = benchmark.Timer(
    stmt='a @ b',
    globals={'a': a, 'b': b})
x = t.timeit(50)

# 2*n^3 FLOPs per matmul, divided by the median time -> TFLOPS
print(2 * n**3 / x.median / 1e12)
It is testing the computing power of Orin’s CUDA cores (I think) in FP16 mode. In theory, the result should be less than 10.65 TFLOPS (2048 cores * 1.3 GHz * 2 OP/FMA * 2 FMA/clock). However, the result is actually 23.55. Does anybody have a reasonable explanation?
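One hedged reading of that number: an FP16 matmul in PyTorch normally dispatches to Tensor Core kernels rather than plain CUDA-core FMAs, and 23.55 TFLOPS sits above the CUDA-core FP16 peak but well within the dense Tensor Core FP16 peak (half of the 85 sparse TFLOPS quoted earlier). A quick arithmetic check, assuming those datasheet peaks apply:

```python
# Compare the measured throughput against the two theoretical peaks
# quoted earlier in the thread (datasheet values; whether the benchmark
# actually ran at those clocks is an assumption).
measured_tflops = 23.55

cuda_core_fp16_peak = 10.65       # 2048 cores * 1.3 GHz * 2 OP/FMA * 2 FMA/clk
tensor_core_fp16_dense = 85 / 2   # dense = half of the 85 sparse FP16 TFLOPS

print(measured_tflops > cuda_core_fp16_peak)     # True: too fast for CUDA cores
print(measured_tflops < tensor_core_fp16_dense)  # True: within the dense TC peak
print(round(measured_tflops / tensor_core_fp16_dense, 2))  # 0.55 of dense peak
```

If that reading is right, the benchmark is measuring Tensor Core throughput, not CUDA-core throughput, which would explain exceeding 10.65 without contradicting the datasheet.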
There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks
Hi,
Could you share the document you found with us?
Just want to double-check whether the FP64 there refers to the GPU cores or the Tensor Cores.
Thanks.