Hi,
Thank you for the guidance! I have now tested both dense GEMM and sparse GEMM with the CUTLASS library on the Jetson AGX Orin, and I am puzzled: the measured throughput is nearly identical in both cases, around 224 TOPS. This is quite different from what I expected based on the Jetson AGX Orin datasheet, which lists INT8 Tensor Core performance as 175 TOPS for sparse GEMM and 85 TOPS for dense GEMM, almost a twofold difference between the two.
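As a quick sanity check on the gap I expected, here is a minimal Python sketch; the two datasheet figures quoted above are its only inputs:

sparse_int8_tops = 175.0  # datasheet, INT8 sparse Tensor Core
dense_int8_tops = 85.0    # datasheet, INT8 dense Tensor Core
print(sparse_int8_tops / dense_int8_tops)  # ~2.06, so I expected roughly a 2x gap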
Below are the CUTLASS profiler commands I used for the dense GEMM and sparse GEMM tests, respectively, along with their output:
./tools/profiler/cutlass_profiler --gemm_kind=universal --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128 --min_cc=75 --max_cc=1024
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128
Status: Success
Verification: ON
Disposition: Not verified
reference_device: Not run
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=universal --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column \
--alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic \
--swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1 \
--cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128 \
--min_cc=75 --max_cc=1024
Bytes: 1140850688 bytes
FLOPs: 8796629893120 flops
FLOPs/Byte: 7710
Runtime: 39.2066 ms
Memory: 27.1 GiB/s
Math: 224366 GFLOP/s
=============================
CSV Results:
Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128,not_verified,success,universal,16384,16384,16384,b1:row,b1:column,s32:column,s32:column,1,0,serial,1,1,heuristic,1,tensorop,s32,256,128,512,1,1,1,2,4,2,1,8,8,128,75,1024,1140850688,8796629893120,7710,39.2066,27.1,224366
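To make sure I am reading this report correctly, I reconstructed its Bytes, FLOPs, and Math figures from the problem size. This is a minimal Python sketch; the 2*m*n epilogue term is my assumption about how the profiler counts flops, since it makes the totals match exactly:

m = n = k = 16384
runtime_s = 39.2066e-3

# Flop count: 2*m*n*k for the MACs plus, apparently, 2*m*n for the epilogue.
flops = 2 * m * n * k + 2 * m * n
print(flops)                    # 8796629893120, matches "FLOPs"
print(flops / runtime_s / 1e9)  # ~224366, matches "Math" in GFLOP/s

# Byte count only adds up if A and B are 1-bit (b1) operands:
bytes_total = m * k // 8 + k * n // 8 + m * n * 4  # A (b1) + B (b1) + D (s32)
print(bytes_total)              # 1140850688, matches "Bytes"

In particular, the byte count confirms that both operands are b1, so this run measures 1-bit XOR GEMM throughput (as the kernel name cutlass_tensorop_i88128xorgemm_b1_... also indicates) rather than INT8.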
./tools/profiler/cutlass_profiler --gemm_kind=spgemm --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128 --min_cc=75 --max_cc=1024
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128
Status: Success
Verification: ON
Disposition: Not verified
reference_device: Not run
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=universal --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column \
--alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic \
--swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1 \
--cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128 \
--min_cc=75 --max_cc=1024
Bytes: 1140850688 bytes
FLOPs: 8796629893120 flops
FLOPs/Byte: 7710
Runtime: 39.3207 ms
Memory: 27.0214 GiB/s
Math: 223715 GFLOP/s
=============================
CSV Results:
Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128,not_verified,success,universal,16384,16384,16384,b1:row,b1:column,s32:column,s32:column,1,0,serial,1,1,heuristic,1,tensorop,s32,256,128,512,1,1,1,2,4,2,1,8,8,128,75,1024,1140850688,8796629893120,7710,39.3207,27.0214,223715
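Comparing the two Math figures directly (values copied from the two reports above):

dense_gflops = 224366.0   # "Math" line of the dense run
sparse_gflops = 223715.0  # "Math" line of the spgemm run
print(sparse_gflops / dense_gflops)  # ~0.997, i.e. well under 1% apart,
                                     # nowhere near the ~2x the datasheet suggests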
As shown, the dense GEMM and sparse GEMM results are essentially identical. I also notice that the sparse run reports the same operation name (cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128) and echoes --gemm_kind=universal in its Arguments line, which makes me suspect the profiler dispatched the same dense 1-bit XOR kernel in both cases rather than an INT8 sparse Tensor Core kernel. Could you provide some insight into why there isn't a more noticeable difference?
Thank you once again for your help and guidance.
Best regards,