Verifying claimed TOPS performance on Jetson Thor – CUTLASS kernel for SM110 does not run, SM80 gives very low performance (~6.9 TFLOP/s)

Hi!

I am trying to verify the officially claimed Tensor Core performance (TOPS) of the Jetson Thor (Blackwell architecture).

I followed the approach that worked perfectly on Jetson AGX Orin in this thread: Discrepancy Between Claimed and Actual Sparse INT8 Performance of Tensor Cores on Jetson AGX Orin

Unfortunately the same method does not work on Thor.

Specifically, the kernel cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128 (the one that gave the best results on Orin for sparse INT8) fails to run when compiled for the correct architecture:

cmake .. -DCUTLASS_NVCC_ARCHS=110 -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128

→ The resulting cutlass_profiler builds successfully but the kernel is not executed at all (“No results”).

If I force compilation for Ampere (SM80), the kernel compiles and runs, but the achieved performance is extremely low:

cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128

Profiler command and result (16384×16384×16384, sparse INT8 TN gemm, alpha=1, beta=0):

./tools/profiler/cutlass_profiler --gemm_kind=universal --m=16384 --n=16384 --k=16384 \
--A=b1:row --B=b1:column --C=s32:column --D=s32:column --alpha=1 --beta=0 \
--split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic \
--swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 \
--cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 \
--inst_m=8 --inst_n=8 --inst_k=128 --min_cc=75 --max_cc=1024

Output excerpt:

=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128

          Status: Success
    Verification: ON
     Disposition: Not verified

reference_device: Not run
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=universal --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --runtime_input_datatype_a=invalid --runtime_input_datatype_b=invalid --use_pdl=false --enable_sm90_mixed_dtype_shuffle_test=false  \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1  \
                  --cluster_n=1 --cluster_k=1 --cluster_m_fallback=1 --cluster_n_fallback=1 --cluster_k_fallback=1 --stages=2  \
                  --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128 --min_cc=75 --max_cc=1024

           Bytes: 1140850688  bytes
           FLOPs: 8796629893120  flops
           FLOPs/Byte: 7710

         Runtime: 1272.44  ms
          Memory: 0.835007 GiB/s

            Math: 6913.18 GFLOP/s


=============================

This is orders of magnitude below the marketed sparse INT8 TOPS of Jetson Thor.
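As a sanity check, the derived figures in the output above can be reproduced from the problem size and the reported Runtime. The sketch below assumes the profiler counts 2*M*N*K operations for the main loop plus 2*M*N for the epilogue, and counts the bytes of A, B, and C once each (1-bit A/B, 32-bit C). These formulas are inferred from the printed numbers, not taken from the CUTLASS source, and the helper names are hypothetical.

```python
# Reproduce the profiler's derived metrics from problem size and runtime.
# The formulas are inferred from the printed output, not from CUTLASS itself.

def gemm_flops(m: int, n: int, k: int) -> int:
    """Main-loop operations (2*M*N*K) plus an assumed 2*M*N epilogue term."""
    return 2 * m * n * k + 2 * m * n

def gemm_bytes(m: int, n: int, k: int, bits_a: int, bits_b: int, bits_c: int) -> int:
    """Bytes of A (MxK), B (KxN), and C (MxN), each counted once."""
    return (m * k * bits_a + k * n * bits_b + m * n * bits_c) // 8

m = n = k = 16384
runtime_s = 1272.44e-3  # "Runtime: 1272.44 ms" from the output above

flops = gemm_flops(m, n, k)
nbytes = gemm_bytes(m, n, k, bits_a=1, bits_b=1, bits_c=32)
gflops_per_s = flops / runtime_s / 1e9
gib_per_s = nbytes / runtime_s / 2**30

print(flops)    # 8796629893120, matches "FLOPs"
print(nbytes)   # 1140850688, matches "Bytes"
print(f"{gflops_per_s:.1f} GFLOP/s")  # close to the reported 6913.18
print(f"{gib_per_s:.4f} GiB/s")       # close to the reported 0.835007
```

The small differences in the last digits come from the runtime being printed with only two decimals.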

Questions:

  1. What is the correct Compute Capability / NVCC arch flag for Jetson Thor (Blackwell) in CUTLASS today? Is SM110 already supported?
  2. Which CUTLASS kernel should be used to reach peak sparse INT8 Tensor Core performance on Thor?
  3. Is there an official/recommended way (CUTLASS example, benchmark code, or otherwise) to measure and verify the claimed sparse INT8 TOPS on Jetson Thor?

Any help or example configuration that actually achieves close-to-advertised performance would be greatly appreciated.

Thank you!

Hi,

CUTLASS should work on Jetson Thor with the same configuration as DRIVE Thor:

https://github.com/NVIDIA/cutlass/tree/main?tab=readme-ov-file#hardware

$ export CUDACXX=/usr/local/cuda-13.0/bin/nvcc
$ cmake .. -DCUTLASS_NVCC_ARCHS=110a 
...

We will check it and share more information with you later.

Thanks.

Thank you for the quick response and the suggestion!

I tried exactly the DRIVE Thor configuration from the CUTLASS README:

export CUDACXX=/usr/local/cuda-13.0/bin/nvcc
cmake .. -DCUTLASS_NVCC_ARCHS=110a 

but cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128 does not appear. So the old Orin kernel that gave the best sparse INT8 results is not available on Thor (Blackwell).

Could you please tell us which kernel (or kernel name pattern) should be used on Thor/Blackwell to achieve peak sparse INT8/FP8/FP4 Tensor Core performance?

We would appreciate it if you could share more test information. Thank you again for your help!

Hi,

Thor and Orin use different GPU architectures, so the kernel will also be different.

You can try our cutlass_profiler to get the peak results with exhaustive searching:
https://docs.nvidia.com/cutlass/media/docs/cpp/profiler.html#

Thanks.

Thank you for the suggestion! I’ve tried using the cutlass_profiler for exhaustive searching to get the peak results, and I’m now able to achieve 870+ TOPS for Dense FP4 computations. That’s a great start.

  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass3x_sm100_bstensorop_gemm_ue4m3xf4_ue4m3xf4_f32_f16_e5m2_256x256x256_0x0x1_0_tnn_align32_o_vs16_2sm_epi_nosmem

          Status: Success
    Verification: OFF
     Disposition: Not verified


       Arguments: --gemm_kind=universal --m=4096 --n=8192 --k=8192 --A=f4:row --B=f4:column --C=f16:column --D=fe5m2:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --runtime_input_datatype_a=invalid  \
                  --runtime_input_datatype_b=invalid --raster_order=along_m --swizzle_size=1 --use_pdl=false --op_class=invalid  \
                  --accum=f32 --cta_m=256 --cta_n=256 --cta_k=256 --cluster_m=4 --cluster_n=1 --cluster_k=1 --cluster_m_fallback=2  \
                  --cluster_n_fallback=1 --cluster_k_fallback=1 --stages=5 --warps_m=4 --warps_n=1 --warps_k=1 --inst_m=256  \
                  --inst_n=256 --inst_k=64 --min_cc=100 --max_cc=1024

           Bytes: 123731968  bytes
           FLOPs: 549822922752  flops
           FLOPs/Byte: 4443

         Runtime: 0.626187  ms
          Memory: 184.026 GiB/s

            Math: 878050 GFLOP/s

However, I’m wondering how to get closer to the theoretical performance for Sparse FP4. Do you have any additional tips or methods for testing and optimizing in that scenario?

I’d appreciate any further advice or examples you can share.

Best regards

Hi,

Good to know you get the expected performance with dense gemm.
Could you check whether you can get sparse results with --operation=spgemm?

Thanks.

Hi bro, I’m also evaluating FP4/FP8 capability on Thor. Could you share your CUTLASS build command and how you run cutlass_profiler? I’m curious about your command-line arguments. Here is the command I used to test CUTLASS:

./tools/profiler/cutlass_profiler --operation=SparseGemm --m=4096 --n=8192 --k=8192 --A=f4:row --B=f4:column --C=f16:column --D=fe5m2:column --enable-best-kernel-for-fixed-shape --sort-results-flops-per-sec

But this command produced no output and no error.

Hi, you can try this compile command:

cmake .. -DCUTLASS_NVCC_ARCHS="110a" -DCUTLASS_LIBRARY_KERNELS=all  -DCUTLASS_UNITY_BUILD_ENABLED=ON

I used CUDA 13.0 at 130 W power.

By the way, could you please post your cutlass_profiler command line? I have tried several combinations but still get no results.

./tools/profiler/cutlass_profiler   --m=4096 --n=8192 --k=8192   --verification-enabled=false   --kernels="cutlass3x_sm100_bstensorop_*"  --enable-kernel-performance-search --sort-results-flops-per-sec --enable-best-kernel-for-fixed-shape
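When a search like this prints many result blocks, the fastest kernel can be picked out of the saved text output. Below is a minimal sketch, assuming each block contains the "Operation:" and "Math: ... GFLOP/s" lines shown earlier in this thread. `parse_results` and `best_kernel` are hypothetical helpers, not part of CUTLASS, and the sample text just reuses the two results posted above.

```python
import re

def parse_results(text: str):
    """Return (operation, gflops) pairs found in cutlass_profiler text output."""
    results = []
    current_op = None
    for line in text.splitlines():
        # "Operation:" must be followed directly by ':' so "OperationKind:" is skipped.
        op = re.match(r"\s*Operation:\s*(\S+)", line)
        if op:
            current_op = op.group(1)
            continue
        math = re.match(r"\s*Math:\s*([0-9.eE+-]+)\s*GFLOP/s", line)
        if math and current_op is not None:
            results.append((current_op, float(math.group(1))))
            current_op = None
    return results

def best_kernel(text: str):
    """Return the (operation, gflops) pair with the highest throughput, or None."""
    results = parse_results(text)
    return max(results, key=lambda r: r[1]) if results else None

sample = """
   OperationKind: gemm
       Operation: cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128
         Runtime: 1272.44  ms
            Math: 6913.18 GFLOP/s

   OperationKind: gemm
       Operation: cutlass3x_sm100_bstensorop_gemm_ue4m3xf4_ue4m3xf4_f32_f16_e5m2_256x256x256_0x0x1_0_tnn_align32_o_vs16_2sm_epi_nosmem
         Runtime: 0.626187  ms
            Math: 878050 GFLOP/s
"""

name, gflops = best_kernel(sample)
print(f"{name}: {gflops / 1000:.2f} TFLOP/s")
```

In practice the sample string would be replaced by the profiler's captured stdout.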

Hi bro, I also tested dense FP4 and it reaches 875 TFLOPS, but NVIDIA says it can reach 1035 TFLOPS. Can your tests run that fast?

Congratulations, but I think the 1035 TFLOPS officially advertised by NVIDIA is a theoretical value obtained by calculation rather than from actual testing.
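To put the dense FP4 numbers in this thread side by side, the measured throughput can be expressed as a fraction of the advertised figure. This is only a sketch of the arithmetic: it takes the 1035 TFLOPS value quoted above at face value, and `efficiency` is a hypothetical helper name.

```python
# Measured dense FP4 throughput as a fraction of the advertised peak.
ADVERTISED_TFLOPS = 1035.0  # figure quoted in this thread, taken at face value

def efficiency(measured_tflops: float, peak_tflops: float = ADVERTISED_TFLOPS) -> float:
    """Fraction of the advertised peak that was actually measured."""
    return measured_tflops / peak_tflops

measurements = [
    ("profiler result posted above (878050 GFLOP/s)", 878.05),
    ("second report in this thread", 875.0),
]

for label, tflops in measurements:
    print(f"{label}: {tflops:.2f} TFLOP/s = {efficiency(tflops):.1%} of advertised")
```

Both reports land at roughly 85% of the advertised dense FP4 figure.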

Hi bro, may I ask whether Thor supports NVFP4, MXFP4, or both?

Hi,

Thor can run NVFP4.

Thanks.

Thanks for your reply.
I still have two questions.

  1. Do you mean that Thor doesn’t support MXFP4?
  2. I am reading the PTX docs. The entry for tcgen05.mma.cta_group.kind.block_scale{.scale_vectorsize} says that .scale_vectorsize can only be used with sm_100a, sm_100f, and sm_110f, but Thor is sm_110a. But when the data type is .kind::mxf4nvf4, K is at least 64. I want to confirm whether the .scale_vectorsize qualifier is available on Thor. Thank you again.
    Here are the docs:
    https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#tcgen05-mma-instructions-mma

Hi,

Thor is a Blackwell chip, so it has instruction support for MXFP4.
Thanks.

I wrote a kernel using nvcc 13.0.8 with asm("tcgen05.xxx") and compiled it with nvcc -arch=sm_110 test.cu. The compiler reports that tcgen05.xxx is not supported on target sm_110.

Hi,

Could you upgrade your CUDA to 13.1 and try it again?

If the issue remains, please share a sample to reproduce the error so we can check it in our environment.

Thanks.

Use sm_110a or sm_110f instead of sm_110.

Hi bro, have you solved this problem?