Hi,
Thanks for your patience.
Sorry, there are some incorrect messages in the previous suggestion.
The kernel you used before is for 1-bit dense GEMM.
To benchmark int8 sparse GEMM, please recompile the library with -DCUTLASS_LIBRARY_KERNELS=i16864spgemm.
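In case it helps, a minimal rebuild-and-run sequence looks roughly like this (the build directory layout and the SM 8.7 arch flag for Orin are assumptions; adjust to your checkout):

```shell
# From a CUTLASS build directory: configure so that only the
# int8 sparse tensor-op GEMM kernels are instantiated
cmake .. -DCUTLASS_NVCC_ARCHS=87 -DCUTLASS_LIBRARY_KERNELS=i16864spgemm

# Build the profiler and run the sparse GEMM benchmark
make cutlass_profiler -j
./tools/profiler/cutlass_profiler --gemm_kind=spgemm --m=1024 --n=1024 --k=8192
```

Restricting CUTLASS_LIBRARY_KERNELS keeps the build time short, since only the matching kernels are compiled into the profiler.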
Since Orin only has 16 SMs, we also recommend testing with smaller problem sizes.
Then changing the line below from Identity8 to Identity4 also helps:
https://github.com/NVIDIA/cutlass/blob/main/python/cutlass_library/generator.py#L131
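Conceptually this is a one-line edit at the linked location; the exact surrounding code depends on your CUTLASS version, so treat this as a sketch:

```
- swizzling_functor = SwizzlingFunctor.Identity8
+ swizzling_functor = SwizzlingFunctor.Identity4
```

A smaller identity swizzle changes how threadblock tiles are rasterized, which can improve load balance on a GPU with only 16 SMs.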
For example, we can get 98.7 TOPS with m=1024, n=1024, k=8192:
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: spgemm
Operation: cutlass_tensorop_s8_i16864spgemm_s8_256x128_128x3_tn_align16
Status: Success
Verification: ON
Arguments: --gemm_kind=spgemm --m=1024 --n=1024 --k=8192 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 \
--cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1 \
--inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024
Bytes: 14680064 bytes
FLOPs: 17181966336 flops
FLOPs/Byte: 1170
Runtime: 0.17401 ms
Memory: 78.5694 GiB/s
Math: 98741.1 GFLOP/s
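As a sanity check, the reported math throughput follows directly from the FLOP count and runtime above. (Attributing the extra 2*m*n operations beyond 2*m*n*k to the epilogue scaling is our assumption.)

```python
m, n, k = 1024, 1024, 8192
runtime_ms = 0.17401  # from the profiler output above

mma_flops = 2 * m * n * k      # one multiply + one add per MAC
epilogue_flops = 2 * m * n     # assumed alpha/beta scaling in the epilogue
total_flops = mma_flops + epilogue_flops
print(total_flops)             # 17181966336, matching the FLOPs line

gflops = total_flops / (runtime_ms * 1e-3) / 1e9
print(round(gflops, 1))        # ~98741 GFLOP/s, i.e. ~98.7 TOPS
```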
We expect sparse int8 GEMM to reach roughly 60-70% of the speed-of-light (SOL) peak with the public CUTLASS source code and the public compiler.
The result shared above is pretty close to that, but you can still experiment with the parameters.
Thanks.