Discrepancy Between Claimed and Actual Sparse INT8 Performance of Tensor Cores on Jetson AGX Orin

Hi,

Thanks for your patience.

Sorry, there was some incorrect information in the previous suggestion: the kernel you used before is for 1-bit dense GEMM.

To benchmark INT8 sparse GEMM, please recompile the library with -DCUTLASS_LIBRARY_KERNELS=i16864spgemm.
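For reference, a typical build might look like the sketch below (my assumption, not from the original post: a native build on Orin, so -DCUTLASS_NVCC_ARCHS=87 targets its SM 8.7 GPU):

    mkdir build && cd build
    cmake .. -DCUTLASS_NVCC_ARCHS=87 -DCUTLASS_LIBRARY_KERNELS=i16864spgemm
    make cutlass_profiler -j$(nproc)

Restricting the kernel filter to i16864spgemm also keeps compile time manageable, since only the matching sparse GEMM kernels are instantiated.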

Since Orin only has 16 SMs, we also recommend testing with smaller problem sizes.
Changing the line below from Identity8 to Identity4 also helps:
https://github.com/NVIDIA/cutlass/blob/main/python/cutlass_library/generator.py#L131
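The edit itself is a one-token change. A hypothetical sketch (assuming the line passes SwizzlingFunctor.Identity8, as in the cutlass_library generator; the surrounding arguments are elided and may differ):

    # python/cutlass_library/generator.py
    - ..., SwizzlingFunctor.Identity8)
    + ..., SwizzlingFunctor.Identity4)

A smaller identity swizzle rasterizes the threadblock grid in smaller tiles, which tends to map better to a GPU with only 16 SMs.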

For example, we can get 98.7 TOPS with m=1024, n=1024, k=8192:
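A minimal invocation that should reproduce a run like this (a sketch; the binary path assumes the default build layout, and --kernels filters by kernel-name substring):

    ./tools/profiler/cutlass_profiler --kernels=i16864spgemm --m=1024 --n=1024 --k=8192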

=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: spgemm
       Operation: cutlass_tensorop_s8_i16864spgemm_s8_256x128_128x3_tn_align16

          Status: Success
    Verification: ON
     Dis"hljs-comment">--gemm_kind=spgemm --m=1024 --n=1024 --k=8192 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1  \
                  --beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128  \
                  --cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1  \
                  --inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024

           Bytes: 14680064  bytes
           FLOPs: 17181966336  flops
           FLOPs/Byte: 1170

         Runtime: 0.17401  ms
          Memory: 78.5694 GiB/s

            Math: 98741.1 GFLOP/s
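As a sanity check, the reported throughput is simply FLOPs divided by runtime: 17181966336 / 0.17401 ms ≈ 98741 GFLOP/s, i.e. the ~98.7 TOPS quoted above (the FLOP count itself matches 2·m·n·k for the MACs plus 2·m·n for the epilogue).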

We expect sparse INT8 GEMM to reach roughly 60-70% of peak SOL (speed of light) with the public CUTLASS source code and the public compiler.
The result shared above is already close to that range, but you can still experiment with the parameters.

Thanks.