Discrepancy Between Claimed and Actual Sparse INT8 Performance of Tensor Cores on Jetson AGX Orin

Hello, everyone.

Question:

  1. The Jetson AGX Orin's Tensor Cores are advertised to deliver 170 sparse INT8 TOPS. However, in our tests using cuSPARSELt, the measured performance is only 77 TOPS.

Test Environment:

  • Jetson Orin Development Kit version
  • JetPack 6.0
  • The maximum GPU frequency observed via jetson_clocks is 1.3 GHz.
  • At that clock, the theoretical maximum sparse INT8 performance of the Tensor Cores should therefore be around 170 TOPS.

The system is set to Performance Mode 0, and the GPU frequency is set to maximum.
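
For context, a back-of-the-envelope check of that figure: 170 TOPS at 1.3 GHz corresponds to roughly 170e12 / 1.3e9 ≈ 131,000 sparse INT8 operations per GPU clock, i.e. about 8,192 per clock for each of Orin's 16 SMs (and about half that for the 85 TOPS dense INT8 figure).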

Test Code:

Test Results and Analysis:

  1. Sparse INT8 TOPS: Actual performance is 77 TOPS, which is 45% of the theoretical 170 TOPS.
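
For reference, the TOPS figure above follows the usual convention of counting 2·m·n·k operations for an m×n×k GEMM and dividing by the measured kernel time. A minimal sketch of that bookkeeping (the problem size and elapsed time below are placeholders for illustration, not our measured values):

// Sketch only: how a TOPS figure is conventionally derived from a timed GEMM.
// The problem size and elapsed time are placeholders, not measured values.
#include <cstdio>

double effective_tops(double m, double n, double k, double elapsed_ms) {
    // An m x n x k GEMM performs m*n*k multiply-accumulates, counted as 2*m*n*k operations.
    return (2.0 * m * n * k) / (elapsed_ms * 1e-3) / 1e12;
}

int main() {
    // Placeholder example: a 16384^3 GEMM finishing in ~114 ms corresponds to roughly 77 TOPS.
    printf("%.1f TOPS\n", effective_tops(16384, 16384, 16384, 114.0));
    return 0;
}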

We’re curious to know what might be causing this significant discrepancy in performance. Thank you.


Hi,

We need to reproduce this locally and provide more info to you later.

Thanks.

Hi, AastaLLL. Thanks for taking the time to investigate this further. I look forward to your findings, and I’m happy to provide any additional information if needed. Thanks again!

Hi,

The sample has a dependency on cusparseLt.h.
Could you also share with us how you set up the library?

Thanks.

Hello,

I installed it using the instructions from this webpage:

https://developer.nvidia.com/cusparselt-downloads?target_os=Linux&target_arch=aarch64-jetson&Compilation=Native&Distribution=Ubuntu&target_version=22.04&target_type=deb_local

Then I ran the following commands:

wget https://developer.download.nvidia.com/compute/cusparselt/0.6.2/local_installers/cusparselt-local-tegra-repo-ubuntu2204-0.6.2_1.0-1_arm64.deb
sudo dpkg -i cusparselt-local-tegra-repo-ubuntu2204-0.6.2_1.0-1_arm64.deb
sudo cp /var/cusparselt-local-tegra-repo-ubuntu2204-0.6.2/cusparselt-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install libcusparselt0 libcusparselt-dev

For compilation, I used the following command:

nvcc -o sparse_gemm_int8 sparse_gemm_int8.cpp -L/usr/local/cuda-12.2/targets/aarch64-linux/lib/ -lcublas -lcusparse -lcusparseLt_static -lcusparse -ldl -I/usr/include/ -L/usr/lib/aarch64-linux-gnu

Hi,

Thanks for the detailed steps.
Confirmed that we also see ~77 TOPS for sparse GEMM on AGX Orin.

We need to check this issue with our internal team.
Will provide more info to you later.

Thanks.

Hi AastaLLL,

I hope you’re doing well.

I wanted to follow up on the INT8 performance issue that we discussed earlier. I understand that you needed to check with your internal team, and I’m wondering if there have been any updates on this matter.

Your assistance is greatly appreciated, and I look forward to any additional information you can provide.

Thank you again for your support.

Best regards.

Hi,

Thanks for your patience, but our internal team is still working on it.

In the meantime, have you checked the INT8 GEMM with cublas?
If yes, could you share some info with us?

Thanks.

Hi,
We’ve conducted performance tests on the Jetson Orin for Tensor Core dense INT8 GEMM using cuBLAS. The result was 3.27276 TOPS, which is significantly lower than the claimed 85 TOPS for dense INT8. This discrepancy is quite puzzling to me. The source code can be found in the attached file:
dense_gemm_int8.cu.zip (1.5 KB).
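
Since the attachment isn't reproduced inline, here is a minimal illustrative sketch of an INT8 cublasGemmEx call with INT32 accumulation; this is an assumption for readers, not the attached dense_gemm_int8.cu, and the matrix sizes are hypothetical. The CUDA_R_8I inputs, CUDA_R_32I output, and CUBLAS_COMPUTE_32I compute type are what allow cuBLAS to dispatch the integer Tensor Core path; the cuBLAS documentation also lists layout/alignment requirements for that path (the sketch uses the commonly cited "TN" layout with leading dimensions that are multiples of 4).

// Illustrative sketch (not the attached dense_gemm_int8.cu): timing an INT8 GEMM
// through cublasGemmEx with INT32 accumulation. Sizes are hypothetical.
#include <cstdio>
#include <cstdint>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int m = 4096, n = 4096, k = 4096;   // hypothetical sizes, multiples of 4

    int8_t  *dA, *dB;
    int32_t *dC;
    cudaMalloc(&dA, sizeof(int8_t)  * (size_t)k * m);   // A stored k x m (opA = T)
    cudaMalloc(&dB, sizeof(int8_t)  * (size_t)k * n);   // B stored k x n (opB = N)
    cudaMalloc(&dC, sizeof(int32_t) * (size_t)m * n);
    cudaMemset(dA, 1, sizeof(int8_t)  * (size_t)k * m);
    cudaMemset(dB, 1, sizeof(int8_t)  * (size_t)k * n);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const int32_t alpha = 1, beta = 0;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // One warm-up call, then one timed call bracketed by CUDA events.
    cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                 &alpha, dA, CUDA_R_8I, k, dB, CUDA_R_8I, k,
                 &beta,  dC, CUDA_R_32I, m,
                 CUBLAS_COMPUTE_32I, CUBLAS_GEMM_DEFAULT);
    cudaEventRecord(start);
    cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                 &alpha, dA, CUDA_R_8I, k, dB, CUDA_R_8I, k,
                 &beta,  dC, CUDA_R_32I, m,
                 CUBLAS_COMPUTE_32I, CUBLAS_GEMM_DEFAULT);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("INT8 GEMM: %.3f ms, %.2f TOPS\n", ms, 2.0 * m * n * k / (ms * 1e-3) / 1e12);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}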

Additionally, we tested Tensor Core dense FP16 GEMM, achieving 38.1824 TFLOPS, which is relatively close to the claimed 43 TFLOPS for dense FP16.

Thank you for your continued support and assistance!

Hi,

Please try our cutlass library for fast (sparse) GEMM.

$ git clone https://github.com/NVIDIA/cutlass.git
$ cd cutlass/
$ mkdir build && cd build
$ cmake .. -DCUTLASS_NVCC_ARCHS=87
$ make cutlass_profiler -j12

The performance depends on the chosen kernel.
For example:

$ ./tools/profiler/cutlass_profiler --gemm_kind=universal --m=3456 --n=4096 --k=8192 --A=b1:row --B=b1:column --C=s32:column --D=s32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1  \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128  \
                  --min_cc=75 --max_cc=1024



=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128

          Status: Success
    Verification: ON
     Disposition: Not verified

reference_device: Not run
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=universal --m=3456 --n=4096 --k=8192 --A=b1:row --B=b1:column --C=s32:column --D=s32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1  \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128  \
                  --min_cc=75 --max_cc=1024

           Bytes: 64356352  bytes
           FLOPs: 231956545536  flops
           FLOPs/Byte: 3604

         Runtime: 1.01133  ms
          Memory: 59.2652 GiB/s

            Math: 229359 GFLOP/s


=============================

CSV Results:

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128,not_verified,success,universal,3456,4096,8192,b1:row,b1:column,s32:column,s32:column,1,0,serial,1,1,heuristic,1,tensorop,s32,256,128,512,1,1,1,2,4,2,1,8,8,128,75,1024,64356352,231956545536,3604,1.01133,59.2652,229359

Thanks.

Hi,

Thank you for the guidance! I have tested both General GEMM and Sparse GEMM using the CUTLASS library on the Jetson Orin. I am puzzled because the measured performance for both General GEMM and Sparse GEMM is quite similar, around 220+ TOPS. This is quite different from what I expected based on the Jetson AGX Orin datasheet, which lists the performance for Tensor Core Sparse GEMM INT8 as 175 TOPS, and Dense GEMM INT8 as 85 TOPS—almost a twofold difference between the two.

Below are the commands I used with the CUTLASS profiler for Dense GEMM and Sparse GEMM testing, respectively:

./tools/profiler/cutlass_profiler --gemm_kind=universal --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1  \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128  \
                  --min_cc=75 --max_cc=1024



=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128

          Status: Success
    Verification: ON
     Disposition: Not verified

reference_device: Not run
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=universal --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1  \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128  \
                  --min_cc=75 --max_cc=1024

           Bytes: 1140850688  bytes
           FLOPs: 8796629893120  flops
           FLOPs/Byte: 7710

         Runtime: 39.2066  ms
          Memory: 27.1 GiB/s

            Math: 224366 GFLOP/s


=============================

CSV Results:

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128,not_verified,success,universal,16384,16384,16384,b1:row,b1:column,s32:column,s32:column,1,0,serial,1,1,heuristic,1,tensorop,s32,256,128,512,1,1,1,2,4,2,1,8,8,128,75,1024,1140850688,8796629893120,7710,39.2066,27.1,224366

./tools/profiler/cutlass_profiler --gemm_kind=spgemm --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1  \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128  \
                  --min_cc=75 --max_cc=1024



=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128

          Status: Success
    Verification: ON
     Disposition: Not verified

reference_device: Not run
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=universal --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1  \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128  \
                  --min_cc=75 --max_cc=1024

           Bytes: 1140850688  bytes
           FLOPs: 8796629893120  flops
           FLOPs/Byte: 7710

         Runtime: 39.3207  ms
          Memory: 27.0214 GiB/s

            Math: 223715 GFLOP/s


=============================

CSV Results:

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128,not_verified,success,universal,16384,16384,16384,b1:row,b1:column,s32:column,s32:column,1,0,serial,1,1,heuristic,1,tensorop,s32,256,128,512,1,1,1,2,4,2,1,8,8,128,75,1024,1140850688,8796629893120,7710,39.3207,27.0214,223715

As shown, the General GEMM and Sparse GEMM performance figures are very close. Could you provide some insight into why there isn't a more noticeable difference?

Thank you once again for your help and guidance.

Best regards,

Hi,

We need to check with the cutlass team for more info and updates.

In the meantime, you can try profiling with all parameters, for example:

$ ./tools/profiler/cutlass_profiler --kernels=sgemm  # sparse GEMM
$ ./tools/profiler/cutlass_profiler --kernels=gemm   # GEMM

The peak performance of GEMM and sparse GEMM may be reached with different parameters or matrix sizes.

Thanks.

Hi,

Thanks for your patience.

Sorry, there was some incorrect information in the previous suggestion.
The kernel you used before is for 1-bit dense GEMM.

To benchmark INT8 sparse GEMM, please recompile the library with -DCUTLASS_LIBRARY_KERNELS=i16864spgemm.

Since Orin only has 16 SMs, we also recommend testing with smaller problem sizes.
Changing the line below from Identity8 to Identity4 also helps:
https://github.com/NVIDIA/cutlass/blob/main/python/cutlass_library/generator.py#L131

For example, we can get 98.7 TOPS with m=1024, n=1024, k=8192:

=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: spgemm
       Operation: cutlass_tensorop_s8_i16864spgemm_s8_256x128_128x3_tn_align16

          Status: Success
    Verification: ON
     Disposition: Not verified

       Arguments: --gemm_kind=spgemm --m=1024 --n=1024 --k=8192 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1  \
                  --beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128  \
                  --cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1  \
                  --inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024

           Bytes: 14680064  bytes
           FLOPs: 17181966336  flops
           FLOPs/Byte: 1170

         Runtime: 0.17401  ms
          Memory: 78.5694 GiB/s

            Math: 98741.1 GFLOP/s
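
As a quick cross-check of the numbers above: 17,181,966,336 ops / 0.17401 ms ≈ 98.7 TOPS, i.e. roughly 58% of the 170 TOPS sparse INT8 peak.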

We expect sparse INT8 GEMM to reach roughly 60-70% of peak SOL (speed of light) with the public CUTLASS source code and public compiler.
The result shared above is pretty close but you can still play around with the parameters.

Thanks.

Thanks for your continued support.
However, after recompiling the library with -DCUTLASS_LIBRARY_KERNELS=i16864spgemm and benchmarking INT8 sparse GEMM, the performance reached 98,741.1 GFLOP/s (about 98.7 TOPS), which is still significantly below the Jetson AGX Orin's claimed sparse INT8 performance of 170 TOPS.
Appreciate your help!

Hi,

We usually expect 60-70% of the theoretical peak performance (SOL).

The 98.7 TOPS result above is around 58% SOL.
Tweaking some parameters (e.g., m, n, k) might improve it further toward 60%, but it is already quite close.

Thanks.
