Hi!
I am trying to verify the officially claimed Tensor Core performance (TOPS) of the Jetson Thor (Blackwell architecture).
I followed the approach that worked perfectly on Jetson AGX Orin in this thread: "Discrepancy Between Claimed and Actual Sparse INT8 Performance of Tensor Cores on Jetson AGX Orin".
Unfortunately, the same method does not work on Thor.
Specifically, the kernel cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128 (the one that gave the best results on Orin for sparse INT8) fails to run when compiled for what I believe is the correct architecture (SM110):
cmake .. -DCUTLASS_NVCC_ARCHS=110 -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128
→ The resulting cutlass_profiler builds successfully, but the kernel is not executed at all (“No results”).
If I force compilation for Ampere (SM80), the kernel compiles and runs, but the achieved performance is extremely low:
cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128
Profiler command and result (16384×16384×16384, sparse INT8 TN GEMM, alpha=1, beta=0):
./tools/profiler/cutlass_profiler --gemm_kind=universal --m=16384 --n=16384 --k=16384 \
--A=b1:row --B=b1:column --C=s32:column --D=s32:column --alpha=1 --beta=0 \
--split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic \
--swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 \
--cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 \
--inst_m=8 --inst_n=8 --inst_k=128 --min_cc=75 --max_cc=1024
Output excerpt:
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128
Status: Success
Verification: ON
Disposition: Not verified
reference_device: Not run
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=universal --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column \
--alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic \
--runtime_input_datatype_a=invalid --runtime_input_datatype_b=invalid --use_pdl=false --enable_sm90_mixed_dtype_shuffle_test=false \
--swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1 \
--cluster_n=1 --cluster_k=1 --cluster_m_fallback=1 --cluster_n_fallback=1 --cluster_k_fallback=1 --stages=2 \
--warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128 --min_cc=75 --max_cc=1024
Bytes: 1140850688 bytes
FLOPs: 8796629893120 flops
FLOPs/Byte: 7710
Runtime: 1272.44 ms
Memory: 0.835007 GiB/s
Math: 6913.18 GFLOP/s
=============================
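As a sanity check on the output above, the derived metrics can be reproduced from the problem size and the reported runtime. This is only a minimal sketch: the two counting formulas are my assumptions, inferred from the numbers in the log rather than taken from the CUTLASS source. They happen to match the reported FLOPs and Bytes exactly, and they reflect that A and B are 1-bit (b1) operands in this run.

```python
# Reproduce the derived metrics from the profiler log.
M = N = K = 16384
RUNTIME_MS = 1272.44  # "Runtime" line from the log (already rounded)

# Assumption: 2*M*N*K for the main loop plus 2*M*N for the epilogue.
flops = 2 * M * N * K + 2 * M * N

# Assumption: A and B are 1-bit operands (8 elements per byte),
# and one s32 output matrix (4 bytes per element) is counted.
bytes_a = M * K // 8
bytes_b = K * N // 8
bytes_d = M * N * 4
total_bytes = bytes_a + bytes_b + bytes_d

runtime_s = RUNTIME_MS / 1e3
gflops = flops / runtime_s / 1e9              # compare with the "Math" line
gib_per_s = total_bytes / runtime_s / 2**30   # compare with the "Memory" line

print(f"FLOPs:      {flops}")
print(f"Bytes:      {total_bytes}")
print(f"FLOPs/Byte: {flops // total_bytes}")
print(f"Math:       {gflops:.2f} GFLOP/s")
print(f"Memory:     {gib_per_s:.6f} GiB/s")
```

The throughput figures differ from the log in the last digits because the runtime printed by the profiler is rounded. In other words, the "Math" line is just the operation count divided by the runtime, so the low number reflects a slow kernel run and not a reporting problem.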
At roughly 6.9 TOPS, this is orders of magnitude below the marketed sparse INT8 TOPS of Jetson Thor.
Questions:
- What is the correct Compute Capability / NVCC arch flag for Jetson Thor (Blackwell) in CUTLASS today? Is SM110 already supported?
- Which CUTLASS kernel should be used to reach peak sparse INT8 Tensor Core performance on Thor?
- Is there an official/recommended way (CUTLASS example, benchmark code, or otherwise) to measure and verify the claimed sparse INT8 TOPS on Jetson Thor?
Any help or example configuration that actually achieves close-to-advertised performance would be greatly appreciated.
Thank you!
