Discrepancy Between Claimed and Actual Sparse INT8 Performance of Tensor Cores on Jetson AGX Orin

Hi,

Thank you for the guidance! I have tested both dense GEMM and sparse GEMM using the CUTLASS library on the Jetson AGX Orin. I am puzzled because the measured performance of the two is nearly identical, both around 224 TOPS. This is quite different from what I expected based on the Jetson AGX Orin datasheet, which lists Tensor Core sparse INT8 GEMM performance as 175 TOPS and dense INT8 GEMM as 85 TOPS, roughly a twofold difference between the two.
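To put numbers on that expectation: a 16384 x 16384 x 16384 GEMM involves about 2 * 16384^3 ≈ 8.8 * 10^12 operations, so at the datasheet rates I would expect roughly 103 ms for the dense case and 50 ms for the sparse case. A rough back-of-the-envelope check (the 85 and 175 TOPS figures are the ones quoted above):

# Expected runtimes at the datasheet rates (rough check).
m = n = k = 16384
ops = 2 * m * n * k                      # ~8.80e12 operations
for label, tops in [("dense INT8", 85), ("sparse INT8", 175)]:
    print(f"{label}: ~{ops / (tops * 1e12) * 1e3:.1f} ms expected")
# dense INT8: ~103.5 ms expected
# sparse INT8: ~50.3 ms expected

Both measured runs below finish in about 39 ms, i.e. about 224 TOPS, faster than even the sparse datasheet figure.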

Below are the commands I used with the CUTLASS profiler for the dense GEMM and sparse GEMM tests, respectively, along with their output:

./tools/profiler/cutlass_profiler --gemm_kind=universal --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1 \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128 \
                  --min_cc=75 --max_cc=1024



=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128

          Status: Success
    Verification: ON
     Disposition: Not verified

reference_device: Not run
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=universal --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1  \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128  \
                  --min_cc=75 --max_cc=1024

           Bytes: 1140850688  bytes
           FLOPs: 8796629893120  flops
           FLOPs/Byte: 7710

         Runtime: 39.2066  ms
          Memory: 27.1 GiB/s

            Math: 224366 GFLOP/s


=============================

CSV Results:

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128,not_verified,success,universal,16384,16384,16384,b1:row,b1:column,s32:column,s32:column,1,0,serial,1,1,heuristic,1,tensorop,s32,256,128,512,1,1,1,2,4,2,1,8,8,128,75,1024,1140850688,8796629893120,7710,39.2066,27.1,224366
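
(A quick sanity check on the profiler's arithmetic: the printed FLOPs value equals 2 * m * n * (k + 1), which I assume counts the mainloop plus the epilogue, and dividing by the runtime reproduces the Math line.)

# Reproduce the profiler's "Math" line from its own reported numbers.
m = n = k = 16384
ops = 2 * m * n * (k + 1)                 # 8796629893120, matches "FLOPs"
runtime_ms = 39.2066                      # dense run
print(f"{ops / (runtime_ms * 1e-3) / 1e9:.0f} GFLOP/s")   # 224366, matches "Math"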
./tools/profiler/cutlass_profiler --gemm_kind=spgemm --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1 \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128 \
                  --min_cc=75 --max_cc=1024



=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128

          Status: Success
    Verification: ON
     Disposition: Not verified

reference_device: Not run
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=universal --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1  \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128  \
                  --min_cc=75 --max_cc=1024

           Bytes: 1140850688  bytes
           FLOPs: 8796629893120  flops
           FLOPs/Byte: 7710

         Runtime: 39.3207  ms
          Memory: 27.0214 GiB/s

            Math: 223715 GFLOP/s


=============================

CSV Results:

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128,not_verified,success,universal,16384,16384,16384,b1:row,b1:column,s32:column,s32:column,1,0,serial,1,1,heuristic,1,tensorop,s32,256,128,512,1,1,1,2,4,2,1,8,8,128,75,1024,1140850688,8796629893120,7710,39.3207,27.0214,223715

As shown, the dense and sparse GEMM results are essentially identical (224,366 vs. 223,715 GFLOP/s). I also notice that both runs report the same operation, cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128, and that the Arguments echo shows --gemm_kind=universal even for the second invocation, which requested --gemm_kind=spgemm, so I am not sure a sparse kernel was actually selected. In addition, the b1 operand type means these are 1-bit XOR GEMMs rather than INT8 GEMMs, so the results may not be directly comparable to the datasheet's INT8 figures. Could you provide some insight into why there isn't a more noticeable difference, and into how to set up a proper INT8 dense-versus-sparse comparison?
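
In case it helps narrow things down, this is the variant I was planning to try next: the same problem with s8 operands instead of b1, so that the run corresponds to the datasheet's INT8 figures. This is only a sketch reusing the flags from the commands above, with the b1-specific tile and instruction overrides dropped; I have not confirmed which sparse s8 kernels are compiled into my build:

./tools/profiler/cutlass_profiler --gemm_kind=spgemm --m=16384 --n=16384 --k=16384 \
                  --A=s8:row --B=s8:column --C=s32:column --D=s32:column \
                  --alpha=1 --beta=0 --op_class=tensorop --accum=s32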

Thank you once again for your help and guidance.

Best regards,