Discrepancy Between Claimed and Actual Sparse INT8 Performance of Tensor Cores on Jetson AGX Orin

Hello, everyone.

Question:

  1. The Jetson AGX Orin's Tensor Cores are advertised to deliver 170 sparse INT8 TOPS. However, in our tests using cuSPARSELt, the measured performance is only 77 TOPS.

Test Environment:

  • Jetson Orin Development Kit version
  • JetPack 6.0
  • The maximum GPU frequency observed via jetson_clocks is 1.3 GHz.
  • At that clock, the theoretical maximum sparse INT8 performance of the Tensor Cores should therefore be around 170 TOPS.

The system is set to Performance Mode 0, and the GPU frequency is set to maximum.
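
For context, a back-of-the-envelope check of that figure: 170 TOPS at 1.3 GHz corresponds to roughly 170e12 / 1.3e9 ≈ 131,000 sparse INT8 operations per GPU clock, i.e. about 8,192 per clock for each of Orin's 16 SMs (and about half that for the 85 TOPS dense INT8 figure).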

Test Code:

Test Results and Analysis:

  1. Sparse INT8 TOPS: Actual performance is 77 TOPS, which is 45% of the theoretical 170 TOPS.
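
For reference, the TOPS figure above follows the usual convention of counting 2·m·n·k operations for an m×n×k GEMM and dividing by the measured kernel time. A minimal sketch of that bookkeeping (the problem size and elapsed time below are placeholders for illustration, not our measured values):

// Sketch only: how a TOPS figure is conventionally derived from a timed GEMM.
// The problem size and elapsed time are placeholders, not measured values.
#include <cstdio>

double effective_tops(double m, double n, double k, double elapsed_ms) {
    // An m x n x k GEMM performs m*n*k multiply-accumulates, counted as 2*m*n*k operations.
    return (2.0 * m * n * k) / (elapsed_ms * 1e-3) / 1e12;
}

int main() {
    // Placeholder example: a 16384^3 GEMM finishing in ~114 ms corresponds to roughly 77 TOPS.
    printf("%.1f TOPS\n", effective_tops(16384, 16384, 16384, 114.0));
    return 0;
}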

We’re curious to know what might be causing this significant discrepancy in performance. Thank you.


Hi,

We need to reproduce this locally and provide more info to you later.

Thanks.

Hi, AastaLLL. Thanks for taking the time to investigate this further. I look forward to your findings, and I’m happy to provide any additional information if needed. Thanks again!

Hi,

The sample has a dependency on cusparseLt.h.
Could you also share with us how you set up the library?

Thanks.

Hello,

I installed it using the instructions from this webpage:

https://developer.nvidia.com/cusparselt-downloads?target_os=Linux&target_arch=aarch64-jetson&Compilation=Native&Distribution=Ubuntu&target_version=22.04&target_type=deb_local

Then I ran the following commands:

wget https://developer.download.nvidia.com/compute/cusparselt/0.6.2/local_installers/cusparselt-local-tegra-repo-ubuntu2204-0.6.2_1.0-1_arm64.deb
sudo dpkg -i cusparselt-local-tegra-repo-ubuntu2204-0.6.2_1.0-1_arm64.deb
sudo cp /var/cusparselt-local-tegra-repo-ubuntu2204-0.6.2/cusparselt-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install libcusparselt0 libcusparselt-dev

For compilation, I used the following command:

nvcc -o sparse_gemm_int8 sparse_gemm_int8.cpp -L/usr/local/cuda-12.2/targets/aarch64-linux/lib/ -lcublas -lcusparse -lcusparseLt_static -lcusparse -ldl -I/usr/include/ -L/usr/lib/aarch64-linux-gnu

Hi,

Thanks for the detailed steps.
Confirmed that we also see ~77 TOPS for sparse GEMM on AGX Orin.

We need to check this issue with our internal team.
Will provide more info to you later.

Thanks.

Hi AastaLLL,

I hope you’re doing well.

I wanted to follow up on the INT8 performance issue that we discussed earlier. I understand that you needed to check with your internal team, and I’m wondering if there have been any updates on this matter.

Your assistance is greatly appreciated, and I look forward to any additional information you can provide.

Thank you again for your support.

Best regards.

Hi,

Thanks for your patience, but our internal team is still working on it.

In the meantime, have you checked the INT8 GEMM with cublas?
If yes, could you share some info with us?

Thanks.

Hi,
We’ve conducted performance tests on the Jetson Orin for Tensor Core dense INT8 GEMM using cuBLAS. The result was 3.27276 TOPS, which is significantly lower than the claimed 85 TOPS for dense INT8. This discrepancy is quite puzzling to me. The source code can be found in the attached file:
dense_gemm_int8.cu.zip (1.5 KB).
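
Since the attachment isn't reproduced inline, here is a minimal illustrative sketch of an INT8 cublasGemmEx call with INT32 accumulation; this is an assumption for readers, not the attached dense_gemm_int8.cu, and the matrix sizes are hypothetical. The CUDA_R_8I inputs, CUDA_R_32I output, and CUBLAS_COMPUTE_32I compute type are what allow cuBLAS to dispatch the integer Tensor Core path; the cuBLAS documentation also lists layout/alignment requirements for that path (the sketch uses the commonly cited "TN" layout with leading dimensions that are multiples of 4).

// Illustrative sketch (not the attached dense_gemm_int8.cu): timing an INT8 GEMM
// through cublasGemmEx with INT32 accumulation. Sizes are hypothetical.
#include <cstdio>
#include <cstdint>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int m = 4096, n = 4096, k = 4096;   // hypothetical sizes, multiples of 4

    int8_t  *dA, *dB;
    int32_t *dC;
    cudaMalloc(&dA, sizeof(int8_t)  * (size_t)k * m);   // A stored k x m (opA = T)
    cudaMalloc(&dB, sizeof(int8_t)  * (size_t)k * n);   // B stored k x n (opB = N)
    cudaMalloc(&dC, sizeof(int32_t) * (size_t)m * n);
    cudaMemset(dA, 1, sizeof(int8_t)  * (size_t)k * m);
    cudaMemset(dB, 1, sizeof(int8_t)  * (size_t)k * n);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const int32_t alpha = 1, beta = 0;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // One warm-up call, then one timed call bracketed by CUDA events.
    cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                 &alpha, dA, CUDA_R_8I, k, dB, CUDA_R_8I, k,
                 &beta,  dC, CUDA_R_32I, m,
                 CUBLAS_COMPUTE_32I, CUBLAS_GEMM_DEFAULT);
    cudaEventRecord(start);
    cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                 &alpha, dA, CUDA_R_8I, k, dB, CUDA_R_8I, k,
                 &beta,  dC, CUDA_R_32I, m,
                 CUBLAS_COMPUTE_32I, CUBLAS_GEMM_DEFAULT);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("INT8 GEMM: %.3f ms, %.2f TOPS\n", ms, 2.0 * m * n * k / (ms * 1e-3) / 1e12);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}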

Additionally, we tested Tensor Core dense FP16 GEMM, achieving 38.1824 TFLOPS, which is relatively close to the claimed 43 TFLOPS for dense FP16.

Thank you for your continued support and assistance!

Hi,

Please try our cutlass library for fast (sparse) GEMM.

$ git clone https://github.com/NVIDIA/cutlass.git
$ cd cutlass/
$ mkdir build && cd build
$ cmake .. -DCUTLASS_NVCC_ARCHS=87
$ make cutlass_profiler -j12

The performance depends on the chosen kernel.
For example:

$ ./tools/profiler/cutlass_profiler --gemm_kind=universal --m=3456 --n=4096 --k=8192 --A=b1:row --B=b1:column --C=s32:column --D=s32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1  \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128  \
                  --min_cc=75 --max_cc=1024



=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128

          Status: Success
    Verification: ON
     Disposition: Not verified

reference_device: Not run
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=universal --m=3456 --n=4096 --k=8192 --A=b1:row --B=b1:column --C=s32:column --D=s32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1  \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128  \
                  --min_cc=75 --max_cc=1024

           Bytes: 64356352  bytes
           FLOPs: 231956545536  flops
           FLOPs/Byte: 3604

         Runtime: 1.01133  ms
          Memory: 59.2652 GiB/s

            Math: 229359 GFLOP/s


=============================

CSV Results:

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128,not_verified,success,universal,3456,4096,8192,b1:row,b1:column,s32:column,s32:column,1,0,serial,1,1,heuristic,1,tensorop,s32,256,128,512,1,1,1,2,4,2,1,8,8,128,75,1024,64356352,231956545536,3604,1.01133,59.2652,229359

Thanks.

Hi,

Thank you for the guidance! I have tested both General GEMM and Sparse GEMM using the CUTLASS library on the Jetson Orin. I am puzzled because the measured performance for both General GEMM and Sparse GEMM is quite similar, around 220+ TOPS. This is quite different from what I expected based on the Jetson AGX Orin datasheet, which lists the performance for Tensor Core Sparse GEMM INT8 as 175 TOPS, and Dense GEMM INT8 as 85 TOPS—almost a twofold difference between the two.

Below are the commands I used with the CUTLASS profiler for Dense GEMM and Sparse GEMM testing, respectively:

./tools/profiler/cutlass_profiler --gemm_kind=universal --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1  \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128  \
                  --min_cc=75 --max_cc=1024



=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128

          Status: Success
    Verification: ON
     Disposition: Not verified

reference_device: Not run
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=universal --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1  \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128  \
                  --min_cc=75 --max_cc=1024

           Bytes: 1140850688  bytes
           FLOPs: 8796629893120  flops
           FLOPs/Byte: 7710

         Runtime: 39.2066  ms
          Memory: 27.1 GiB/s

            Math: 224366 GFLOP/s


=============================

CSV Results:

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128,not_verified,success,universal,16384,16384,16384,b1:row,b1:column,s32:column,s32:column,1,0,serial,1,1,heuristic,1,tensorop,s32,256,128,512,1,1,1,2,4,2,1,8,8,128,75,1024,1140850688,8796629893120,7710,39.2066,27.1,224366

./tools/profiler/cutlass_profiler --gemm_kind=spgemm --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1  \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128  \
                  --min_cc=75 --max_cc=1024



=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128

          Status: Success
    Verification: ON
     Disposition: Not verified

reference_device: Not run
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=universal --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1  \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128  \
                  --min_cc=75 --max_cc=1024

           Bytes: 1140850688  bytes
           FLOPs: 8796629893120  flops
           FLOPs/Byte: 7710

         Runtime: 39.3207  ms
          Memory: 27.0214 GiB/s

            Math: 223715 GFLOP/s


=============================

CSV Results:

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128,not_verified,success,universal,16384,16384,16384,b1:row,b1:column,s32:column,s32:column,1,0,serial,1,1,heuristic,1,tensorop,s32,256,128,512,1,1,1,2,4,2,1,8,8,128,75,1024,1140850688,8796629893120,7710,39.3207,27.0214,223715

As shown, the General GEMM and Sparse GEMM performance figures are very close. Could you provide some insight into why there isn't a more noticeable difference?

Thank you once again for your help and guidance.

Best regards,

Hi,

We need to check with the cutlass team for more info and updates.

In the meantime, you can try profiling with all parameters, for example:

$ ./tools/profiler/cutlass_profiler --kernels=sgemm  # sparse GEMM
$ ./tools/profiler/cutlass_profiler --kernels=gemm   # GEMM

The peak performance of GEMM and sparse GEMM may be reached with different parameters or matrix sizes.

Thanks.

Hi,

Thanks for your patience.

Sorry, there was some incorrect information in the previous suggestion.
The kernel you used before is for 1-bit dense GEMM.

To benchmark INT8 sparse GEMM, please recompile the library with -DCUTLASS_LIBRARY_KERNELS=i16864spgemm.

Since Orin only has 16 SMs, we also recommend testing with smaller problem sizes.
Changing the line below from Identity8 to Identity4 also helps:
https://github.com/NVIDIA/cutlass/blob/main/python/cutlass_library/generator.py#L131

For example, we can get 98.7 TOPS with m=1024, n=1024, k=8192:

=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: spgemm
       Operation: cutlass_tensorop_s8_i16864spgemm_s8_256x128_128x3_tn_align16

          Status: Success
    Verification: ON
     Disposition: Not verified

       Arguments: --gemm_kind=spgemm --m=1024 --n=1024 --k=8192 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1  \
                  --beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128  \
                  --cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1  \
                  --inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024

           Bytes: 14680064  bytes
           FLOPs: 17181966336  flops
           FLOPs/Byte: 1170

         Runtime: 0.17401  ms
          Memory: 78.5694 GiB/s

            Math: 98741.1 GFLOP/s
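
As a quick cross-check of the numbers above: 17,181,966,336 ops / 0.17401 ms ≈ 98.7 TOPS, i.e. roughly 58% of the 170 TOPS sparse INT8 peak.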

We expect sparse INT8 GEMM to reach roughly 60-70% of peak SOL (speed of light) with the public CUTLASS source code and public compiler.
The result shared above is pretty close but you can still play around with the parameters.

Thanks.

Thanks for your continued support.
However, after recompiling the library with -DCUTLASS_LIBRARY_KERNELS=i16864spgemm and benchmarking INT8 sparse GEMM, the performance reached 98,741.1 GFLOP/s (about 98.7 TOPS), which is still significantly below the Jetson AGX Orin's claimed sparse INT8 performance of 170 TOPS.
Appreciate your help!

Hi,

We usually expect 60-70% of the theoretical peak performance (SOL).

The 98.7 TOPS result above is around 58% SOL.
Tweaking some parameters (e.g., m, n, k) might improve it further toward 60%, but it is already quite close.

Thanks.
