Where does the Pegasus dGPU 130 TOPS figure come from?

Software Version: DRIVE OS Linux 5.2.6
Target Operating System: Linux
Hardware Platform: NVIDIA DRIVE™ AGX Pegasus DevKit (E3550)
SDK Manager Version: 1.9.10816
Host Machine Version: native Ubuntu 18.04

Few documents explain where the 130 TOPS figure for the Turing GPU (TU104) comes from, or how many TPCs and Tensor Cores it has. I only found that it has 44 SMs with 64 CUDA cores per SM, for a total of 44 * 64 = 2816 CUDA cores.

Dear @ming.xu4,
The 130 TOPS figure refers to DL TOPS (deep learning tera-operations per second) that the GPU can perform.
It can be calculated as DL TOPS = GPU clock * number of SMs * INT8 ops per clock per SM. With the clock in GHz this gives GigaOps; divide by 1000 to get TOPS.
You can run the CUDA deviceQuery sample to get details such as the number of SMs and the GPU clock. If you want to understand the calculation, please share the deviceQuery output for the TU104 GPU.
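
The formula in this reply can be written as a small helper. This is a minimal sketch, not an official NVIDIA tool: the function name `dl_tops` and its parameter names are made up for illustration, and the inputs in the example call are placeholder values, not real hardware numbers.

```python
def dl_tops(gpu_clock_ghz: float, num_sms: int, int8_ops_per_clock_per_sm: float) -> float:
    """Peak DL TOPS = GPU clock * SMs * INT8 ops per clock per SM.

    With the clock in GHz the product is in GigaOps, so divide by 1000 for TOPS.
    """
    giga_ops = gpu_clock_ghz * num_sms * int8_ops_per_clock_per_sm
    return giga_ops / 1000.0


# Placeholder inputs chosen only to make the units easy to follow.
example = dl_tops(gpu_clock_ghz=1.0, num_sms=10, int8_ops_per_clock_per_sm=1000)
print(f"{example:.1f} TOPS")
```

The result is a theoretical peak; it says nothing about throughput achieved by a real workload.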

Thanks SivaRamaKrishnaNV.

1.5 (GHz) * 44 (SMs) * ? = 130 TOPS
so ? ≈ 1969.7
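
The back-calculation above can be checked in a few lines. This sketch assumes the 130 TOPS target is expressed as 130,000 GigaOps and uses the 1.5 GHz clock and 44 SMs quoted in this thread; the variable names are illustrative.

```python
target_tops = 130.0    # quoted DL TOPS figure
gpu_clock_ghz = 1.5    # GPU max clock from deviceQuery
num_sms = 44           # multiprocessor count from deviceQuery

# Solve clock * SMs * x = target (in GigaOps) for x.
required_ops_per_clock_per_sm = target_tops * 1000.0 / (gpu_clock_ghz * num_sms)
print(round(required_ops_per_clock_per_sm, 1))
```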

Could you help explain how the per-SM figure of 8192 on Orin, and 1969.7 on TU104, is derived?

Device 0: "Graphics Device"
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 7680 MBytes (8052998144 bytes)
(44) Multiprocessors, ( 64) CUDA Cores/MP: 2816 CUDA Cores
GPU Max Clock rate: 1500 MHz (1.50 GHz)
Memory Clock rate: 1440 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 1 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
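
For readers who want to pull the two inputs for the TOPS formula out of a deviceQuery dump like the one above, here is a rough sketch. The regular expressions are written against the exact line layout shown in this post and may need adjusting for other CUDA sample versions; `parse_device_query` is a hypothetical helper name.

```python
import re

# Two lines copied from the deviceQuery output above.
DEVICE_QUERY_SNIPPET = """\
(44) Multiprocessors, ( 64) CUDA Cores/MP: 2816 CUDA Cores
GPU Max Clock rate: 1500 MHz (1.50 GHz)
"""


def parse_device_query(text: str) -> dict:
    """Extract SM count, cores per SM, and max clock from deviceQuery text."""
    sm_match = re.search(
        r"\(\s*(\d+)\)\s*Multiprocessors,\s*\(\s*(\d+)\)\s*CUDA Cores/MP", text
    )
    clock_match = re.search(r"GPU Max Clock rate:\s*(\d+)\s*MHz", text)
    if sm_match is None or clock_match is None:
        raise ValueError("expected deviceQuery fields not found")
    num_sms = int(sm_match.group(1))
    cores_per_sm = int(sm_match.group(2))
    return {
        "num_sms": num_sms,
        "cores_per_sm": cores_per_sm,
        "cuda_cores": num_sms * cores_per_sm,
        "gpu_clock_ghz": int(clock_match.group(1)) / 1000.0,
    }


info = parse_device_query(DEVICE_QUERY_SNIPPET)
print(info)
```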

Dear @ming.xu4,
For the Orin SoC, please check the topic "About Orin SoC Performance".
Similarly, for TU104, the INT8 (IMMA) ops per clock per SM is 2048, so the total is 1.5 * 44 * 2048 GigaOps.
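
Plugging the numbers from this reply into the formula gives a peak slightly above the quoted 130 TOPS. The sketch below only does the arithmetic; the thread does not say why the quoted figure is 130 rather than the computed value, and the "implied clock" it prints is just what the formula would require for exactly 130 TOPS, not a documented hardware setting.

```python
gpu_clock_ghz = 1.5                # from deviceQuery
num_sms = 44                       # from deviceQuery
imma_ops_per_clock_per_sm = 2048   # INT8 IMMA figure given in this reply

peak_giga_ops = gpu_clock_ghz * num_sms * imma_ops_per_clock_per_sm
peak_tops = peak_giga_ops / 1000.0

# Clock at which the same formula would yield exactly 130 TOPS (arithmetic only).
implied_clock_ghz = 130.0 * 1000.0 / (num_sms * imma_ops_per_clock_per_sm)

print(f"{peak_tops:.1f} TOPS at {gpu_clock_ghz} GHz")
print(f"{implied_clock_ghz:.3f} GHz would give exactly 130 TOPS")
```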

Thanks SivaRamaKrishnaNV.
Could you help explain why IMMA is 4 times the FP16 FMA rate per SM?

And on Orin, where 2048 * 4 = 8192, are we talking about the sparse performance figure?

Dear @ming.xu4,
In the case of Orin it is Sparse-IMMA, but in the case of Turing it is IMMA.

Thanks, we can close this topic.