Hi,
Thanks for your patience.
Sorry, there are some incorrect messages in the previous suggestion.
The kernel you used before is for 1-bit dense GEMM.
To benchmark int8 sparse GEMM, please recompile the library with -DCUTLASS_LIBRARY_KERNELS=i16864spgemm.
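In case it helps, a minimal rebuild-and-run sequence looks roughly like this (the build directory layout and the SM 8.7 arch flag for Orin are assumptions; adjust to your checkout):

```shell
# From a CUTLASS build directory: configure so that only the
# int8 sparse tensor-op GEMM kernels are instantiated
cmake .. -DCUTLASS_NVCC_ARCHS=87 -DCUTLASS_LIBRARY_KERNELS=i16864spgemm

# Build the profiler and run the sparse GEMM benchmark
make cutlass_profiler -j
./tools/profiler/cutlass_profiler --gemm_kind=spgemm --m=1024 --n=1024 --k=8192
```

Restricting CUTLASS_LIBRARY_KERNELS keeps the build time short, since only the matching kernels are compiled into the profiler.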
Since Orin only has 16 SMs, we also recommend testing with smaller problem sizes.
Then changing the line below from Identity8 to Identity4 also helps:
https://github.com/NVIDIA/cutlass/blob/main/python/cutlass_library/generator.py#L131
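Conceptually this is a one-line edit at the linked location; the exact surrounding code depends on your CUTLASS version, so treat this as a sketch:

```
- swizzling_functor = SwizzlingFunctor.Identity8
+ swizzling_functor = SwizzlingFunctor.Identity4
```

A smaller identity swizzle changes how threadblock tiles are rasterized, which can improve load balance on a GPU with only 16 SMs.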
For example, we can get 98.7 TOPS with m=1024, n=1024, k=8192:
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: spgemm
Operation: cutlass_tensorop_s8_i16864spgemm_s8_256x128_128x3_tn_align16
Status: Success
Verification: ON
Arguments: --gemm_kind=spgemm --m=1024 --n=1024 --k=8192 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 \
--cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1 \
--inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024
Bytes: 14680064 bytes
FLOPs: 17181966336 flops
FLOPs/Byte: 1170
Runtime: 0.17401 ms
Memory: 78.5694 GiB/s
Math: 98741.1 GFLOP/s
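As a sanity check, the reported math throughput follows directly from the FLOP count and runtime above. (Attributing the extra 2*m*n operations beyond 2*m*n*k to the epilogue scaling is our assumption.)

```python
m, n, k = 1024, 1024, 8192
runtime_ms = 0.17401  # from the profiler output above

mma_flops = 2 * m * n * k      # one multiply + one add per MAC
epilogue_flops = 2 * m * n     # assumed alpha/beta scaling in the epilogue
total_flops = mma_flops + epilogue_flops
print(total_flops)             # 17181966336, matching the FLOPs line

gflops = total_flops / (runtime_ms * 1e-3) / 1e9
print(round(gflops, 1))        # ~98741 GFLOP/s, i.e. ~98.7 TOPS
```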
We expect sparse int8 GEMM to reach roughly 60-70% of the speed-of-light (SOL) peak with the public CUTLASS source code and the public compiler.
The result shared above is pretty close to that, but you can still experiment with the parameters.
Thanks.