Discrepancy Between Claimed and Actual Sparse INT8 Performance of Tensor Cores on Jetson AGX Orin

Hi,

Thank you for the guidance! I have tested both dense GEMM and sparse GEMM using the CUTLASS library on the Jetson AGX Orin. I am puzzled because the measured performance of the two is nearly identical, both around 224 TOPS. This is quite different from what I expected based on the Jetson AGX Orin datasheet, which lists Tensor Core sparse INT8 GEMM performance as 175 TOPS and dense INT8 GEMM as 85 TOPS, roughly a twofold difference between the two.
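To put numbers on that expectation: a 16384 x 16384 x 16384 GEMM involves about 2 * 16384^3 ≈ 8.8 * 10^12 operations, so at the datasheet rates I would expect roughly 103 ms for the dense case and 50 ms for the sparse case. A rough back-of-the-envelope check (the 85 and 175 TOPS figures are the ones quoted above):

# Expected runtimes at the datasheet rates (rough check).
m = n = k = 16384
ops = 2 * m * n * k                      # ~8.80e12 operations
for label, tops in [("dense INT8", 85), ("sparse INT8", 175)]:
    print(f"{label}: ~{ops / (tops * 1e12) * 1e3:.1f} ms expected")
# dense INT8: ~103.5 ms expected
# sparse INT8: ~50.3 ms expected

Both measured runs below finish in about 39 ms, i.e. about 224 TOPS, faster than even the sparse datasheet figure.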

Below are the commands I used with the CUTLASS profiler for the dense GEMM and sparse GEMM tests, respectively, along with their output:

./tools/profiler/cutlass_profiler --gemm_kind=universal --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1 \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128 \
                  --min_cc=75 --max_cc=1024



=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128

          Status: Success
    Verification: ON
     Disposition: Not verified

reference_device: Not run
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=universal --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1  \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128  \
                  --min_cc=75 --max_cc=1024

           Bytes: 1140850688  bytes
           FLOPs: 8796629893120  flops
           FLOPs/Byte: 7710

         Runtime: 39.2066  ms
          Memory: 27.1 GiB/s

            Math: 224366 GFLOP/s


=============================

CSV Results:

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128,not_verified,success,universal,16384,16384,16384,b1:row,b1:column,s32:column,s32:column,1,0,serial,1,1,heuristic,1,tensorop,s32,256,128,512,1,1,1,2,4,2,1,8,8,128,75,1024,1140850688,8796629893120,7710,39.2066,27.1,224366
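
(A quick sanity check on the profiler's arithmetic: the printed FLOPs value equals 2 * m * n * (k + 1), which I assume counts the mainloop plus the epilogue, and dividing by the runtime reproduces the Math line.)

# Reproduce the profiler's "Math" line from its own reported numbers.
m = n = k = 16384
ops = 2 * m * n * (k + 1)                 # 8796629893120, matches "FLOPs"
runtime_ms = 39.2066                      # dense run
print(f"{ops / (runtime_ms * 1e-3) / 1e9:.0f} GFLOP/s")   # 224366, matches "Math"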
./tools/profiler/cutlass_profiler --gemm_kind=spgemm --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1 \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128 \
                  --min_cc=75 --max_cc=1024



=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128

          Status: Success
    Verification: ON
     Disposition: Not verified

reference_device: Not run
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=universal --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1  \
                  --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128  \
                  --min_cc=75 --max_cc=1024

           Bytes: 1140850688  bytes
           FLOPs: 8796629893120  flops
           FLOPs/Byte: 7710

         Runtime: 39.3207  ms
          Memory: 27.0214 GiB/s

            Math: 223715 GFLOP/s


=============================

CSV Results:

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128,not_verified,success,universal,16384,16384,16384,b1:row,b1:column,s32:column,s32:column,1,0,serial,1,1,heuristic,1,tensorop,s32,256,128,512,1,1,1,2,4,2,1,8,8,128,75,1024,1140850688,8796629893120,7710,39.3207,27.0214,223715

As shown, the dense and sparse GEMM results are essentially identical (224,366 vs. 223,715 GFLOP/s). I also notice that both runs report the same operation, cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128, and that the Arguments echo shows --gemm_kind=universal even for the second invocation, which requested --gemm_kind=spgemm, so I am not sure a sparse kernel was actually selected. In addition, the b1 operand type means these are 1-bit XOR GEMMs rather than INT8 GEMMs, so the results may not be directly comparable to the datasheet's INT8 figures. Could you provide some insight into why there isn't a more noticeable difference, and into how to set up a proper INT8 dense-versus-sparse comparison?
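
In case it helps narrow things down, this is the variant I was planning to try next: the same problem with s8 operands instead of b1, so that the run corresponds to the datasheet's INT8 figures. This is only a sketch reusing the flags from the commands above, with the b1-specific tile and instruction overrides dropped; I have not confirmed which sparse s8 kernels are compiled into my build:

./tools/profiler/cutlass_profiler --gemm_kind=spgemm --m=16384 --n=16384 --k=16384 \
                  --A=s8:row --B=s8:column --C=s32:column --D=s32:column \
                  --alpha=1 --beta=0 --op_class=tensorop --accum=s32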

Thank you once again for your help and guidance.

Best regards,