Verifying claimed TOPS performance on Jetson Thor – CUTLASS kernel for SM110 does not run, SM80 gives very low performance (~6.9 TFLOP/s)

john_c · November 20, 2025, 10:08am

Hi!

I am trying to verify the officially claimed Tensor Core performance (TOPS) of the Jetson Thor (Blackwell architecture).

I followed the approach that worked perfectly on Jetson AGX Orin in this thread: Discrepancy Between Claimed and Actual Sparse INT8 Performance of Tensor Cores on Jetson AGX Orin

Unfortunately the same method does not work on Thor.

Specifically, the kernel cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128 (the one that gave the best results on Orin for sparse INT8) fails to run when compiled for the correct architecture:

cmake .. -DCUTLASS_NVCC_ARCHS=110 -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128

→ The resulting cutlass_profiler builds successfully but the kernel is not executed at all (“No results”).

If I force compilation for Ampere (SM80), the kernel compiles and runs, but the achieved performance is extremely low:

cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128

Profiler command and result (16384×16384×16384, sparse INT8 TN gemm, alpha=1, beta=0):

./tools/profiler/cutlass_profiler --gemm_kind=universal --m=16384 --n=16384 --k=16384 \
--A=b1:row --B=b1:column --C=s32:column --D=s32:column --alpha=1 --beta=0 \
--split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic \
--swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 \
--cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 \
--inst_m=8 --inst_n=8 --inst_k=128 --min_cc=75 --max_cc=1024

Output excerpt:

=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128

          Status: Success
    Verification: ON
     Disposition: Not verified

reference_device: Not run
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=universal --m=16384 --n=16384 --k=16384 --A=b1:row --B=b1:column --C=s32:column --D=s32:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
                  --runtime_input_datatype_a=invalid --runtime_input_datatype_b=invalid --use_pdl=false --enable_sm90_mixed_dtype_shuffle_test=false  \
                  --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=512 --cluster_m=1  \
                  --cluster_n=1 --cluster_k=1 --cluster_m_fallback=1 --cluster_n_fallback=1 --cluster_k_fallback=1 --stages=2  \
                  --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=128 --min_cc=75 --max_cc=1024

           Bytes: 1140850688  bytes
           FLOPs: 8796629893120  flops
           FLOPs/Byte: 7710

         Runtime: 1272.44  ms
          Memory: 0.835007 GiB/s

            Math: 6913.18 GFLOP/s


=============================

This is orders of magnitude below the marketed sparse INT8 TOPS of Jetson Thor.
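
For reference, the figures in the profiler output above are internally consistent and can be re-derived by hand. Below is a minimal Python sketch. It assumes the profiler counts 2*M*N*K + 2*M*N operations for this GEMM (inferred from the reported FLOPs value, not taken from the CUTLASS source), that b1 operands occupy one bit per element, and that with beta=0 only D is written.

```python
# Re-derive the cutlass_profiler summary for the 16384^3 b1 GEMM above.
M = N = K = 16384
runtime_s = 1272.44e-3  # "Runtime" from the output excerpt, in seconds

# Assumed operation count: 2*M*N*K for the mainloop plus 2*M*N for the epilogue.
flops = 2 * M * N * K + 2 * M * N

# Assumed traffic: A and B are 1-bit per element, D is s32 (4 bytes);
# with beta=0 the C matrix is not read.
bytes_moved = (M * K) // 8 + (K * N) // 8 + M * N * 4

gflops = flops / runtime_s / 1e9            # "Math" line, in GFLOP/s
gib_per_s = bytes_moved / runtime_s / 2**30  # "Memory" line, in GiB/s

print(flops)        # 8796629893120
print(bytes_moved)  # 1140850688
print(f"{gflops:.1f} GFLOP/s, {gib_per_s:.3f} GiB/s")
```

The result matches the reported Math and Memory lines to within the rounding of the runtime, so the low number comes from the kernel itself and not from a reporting error.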

Questions:

1. What is the correct Compute Capability / NVCC arch flag for Jetson Thor (Blackwell) in CUTLASS today? Is SM110 already supported?
2. Which CUTLASS kernel should be used to reach peak sparse INT8 Tensor Core performance on Thor?
3. Is there an official/recommended way (CUTLASS example, benchmark code, or otherwise) to measure and verify the claimed sparse INT8 TOPS on Jetson Thor?

Any help or example configuration that actually achieves close-to-advertised performance would be greatly appreciated.

Thank you!

AastaLLL · November 21, 2025, 3:03am

Hi,

CUTLASS should work on Jetson Thor with the same configuration as DRIVE Thor:

https://github.com/NVIDIA/cutlass/tree/main?tab=readme-ov-file#hardware

$ export CUDACXX=/usr/local/cuda-13.0/bin/nvcc
$ cmake .. -DCUTLASS_NVCC_ARCHS=110a 
...

We will check it and share more information with you later.

Thanks.

john_c · November 21, 2025, 7:26am

Thank you for the quick response and the suggestion!

I tried exactly the DRIVE Thor configuration from the CUTLASS README:

export CUDACXX=/usr/local/cuda-13.0/bin/nvcc
cmake .. -DCUTLASS_NVCC_ARCHS=110a

but cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128 does not appear. So the old Orin kernel that gave the best sparse INT8 results is not available on Thor (Blackwell).

Could you please tell us which kernel (or kernel name pattern) should be used on Thor/Blackwell to achieve peak sparse INT8/FP8/FP4 Tensor Core performance?

We would appreciate it if you could share more test information. Thank you again for your help!

AastaLLL · November 24, 2025, 8:47am

Hi,

Thor and Orin use different GPU architectures, so the kernel will also be different.

You can try our cutlass_profiler to get the peak results with exhaustive searching:
https://docs.nvidia.com/cutlass/media/docs/cpp/profiler.html#

Thanks.

john_c · November 25, 2025, 8:20am

Thank you for the suggestion! I’ve tried using the cutlass_profiler for exhaustive searching to get the peak results, and I’m now able to achieve 870+ TOPS for Dense FP4 computations. That’s a great start.

  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass3x_sm100_bstensorop_gemm_ue4m3xf4_ue4m3xf4_f32_f16_e5m2_256x256x256_0x0x1_0_tnn_align32_o_vs16_2sm_epi_nosmem

          Status: Success
    Verification: OFF
     Disposition: Not verified


       Arguments: --gemm_kind=universal --m=4096 --n=8192 --k=8192 --A=f4:row --B=f4:column --C=f16:column --D=fe5m2:column  \
                  --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --runtime_input_datatype_a=invalid  \
                  --runtime_input_datatype_b=invalid --raster_order=along_m --swizzle_size=1 --use_pdl=false --op_class=invalid  \
                  --accum=f32 --cta_m=256 --cta_n=256 --cta_k=256 --cluster_m=4 --cluster_n=1 --cluster_k=1 --cluster_m_fallback=2  \
                  --cluster_n_fallback=1 --cluster_k_fallback=1 --stages=5 --warps_m=4 --warps_n=1 --warps_k=1 --inst_m=256  \
                  --inst_n=256 --inst_k=64 --min_cc=100 --max_cc=1024

           Bytes: 123731968  bytes
           FLOPs: 549822922752  flops
           FLOPs/Byte: 4443

         Runtime: 0.626187  ms
          Memory: 184.026 GiB/s

            Math: 878050 GFLOP/s

However, I’m wondering how to get closer to the theoretical performance for Sparse FP4. Do you have any additional tips or methods for testing and optimizing in that scenario?

I’d appreciate any further advice or examples you can share.

Best regards

AastaLLL · November 26, 2025, 7:33am

Hi,

Good to know you get the expected performance with dense gemm.
Could you check whether you can get sparse results with --operation=spgemm?

Thanks.

laixuejin31 · December 4, 2025, 3:28am

Hi bro, I’m also evaluating FP4/FP8 capability on Thor. Could you share your CUTLASS compile command and how you run cutlass_profiler? I’m curious about your command-line arguments. Here is the command I used to test CUTLASS:

./tools/profiler/cutlass_profiler --operation=SparseGemm --m=4096 --n=8192 --k=8192 --A=f4:row --B=f4:column --C=f16:column --D=fe5m2:column --enable-best-kernel-for-fixed-shape --sort-results-flops-per-sec

But I got no output and no error from this command.

john_c · December 8, 2025, 1:15am

Hi, you can try this compile command:

cmake .. -DCUTLASS_NVCC_ARCHS="110a" -DCUTLASS_LIBRARY_KERNELS=all  -DCUTLASS_UNITY_BUILD_ENABLED=ON

This is with CUDA 13.0 and 130 W power.

laixuejin31 · December 8, 2025, 10:49am

By the way, could you please post your cutlass_profiler command line? I have tried several combinations but still get no results.

john_c · December 9, 2025, 8:28am

./tools/profiler/cutlass_profiler   --m=4096 --n=8192 --k=8192   --verification-enabled=false   --kernels="cutlass3x_sm100_bstensorop_*"  --enable-kernel-performance-search --sort-results-flops-per-sec --enable-best-kernel-for-fixed-shape

653576489 · December 12, 2025, 2:47am

Hi bro, I also tested dense FP4 computation and it reaches 875 TFLOPS, but NVIDIA says it can reach 1035 TFLOPS. Can your tests run that fast?

john_c · December 12, 2025, 3:15am

Congratulations! I think the 1035 TFLOPS officially advertised by NVIDIA is a theoretical value obtained through calculation, not from actual testing.
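
To put the gap in numbers, the measured dense FP4 figures in this thread can be compared with the advertised 1035 TFLOPS. A small Python sketch using only values quoted above (the helper name `fraction_of_peak` is hypothetical, introduced just for illustration):

```python
def fraction_of_peak(measured_tflops: float, advertised_tflops: float) -> float:
    """Return measured throughput as a fraction of the advertised peak."""
    return measured_tflops / advertised_tflops


ADVERTISED_DENSE_FP4_TFLOPS = 1035.0  # figure quoted in this thread

# 878050 GFLOP/s from the cutlass_profiler run, and the 875 TFLOPS reported above.
measurements = [
    ("cutlass_profiler best kernel", 878.050),
    ("second report", 875.0),
]

for label, measured in measurements:
    frac = fraction_of_peak(measured, ADVERTISED_DENSE_FP4_TFLOPS)
    print(f"{label}: {frac:.1%} of advertised peak")
```

Both measurements land at roughly 85% of the advertised figure.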

1457689744 · December 16, 2025, 8:12am

Hi bro, may I ask whether Thor supports NVFP4, MXFP4, or both?

AastaLLL · December 17, 2025, 7:13am

Hi,

Thor can run NVFP4.

Thanks.

1457689744 · December 17, 2025, 7:31am

Thanks for your reply.
I still have two questions.

1. Do you mean that Thor doesn’t support MXFP4?
2. I am reading the PTX docs, and the entry for tcgen05.mma.cta_group.kind.block_scale{.scale_vectorsize} indicates that .scale_vectorsize can only be used with sm_100a, sm_100f, and sm_110f, but Thor is sm_110a. But when the data type is .kind::mxf4nvf4, K is at least 64. I want to confirm whether the .scale_vectorsize qualifier is available on Thor. Thank you again.

Here are the docs:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#tcgen05-mma-instructions-mma

AastaLLL · December 18, 2025, 5:35am

Hi,

Thor is a Blackwell chip so it has instruction support for mxfp4.
Thanks.

309850350 · December 22, 2025, 8:18am

I wrote a kernel that uses asm("tcgen05.xxx") and compiled it with nvcc 13.0.8 using nvcc -arch=sm_110 test.cu. It reports that tcgen05.xxx is not supported on target sm_110.

AastaLLL · December 24, 2025, 6:18am

Hi,

Could you upgrade your CUDA to 13.1 and try it again?

If the issue remains, please share a sample to reproduce the error so we can check it in our environment.

Thanks.

never_released · December 24, 2025, 8:09am

use sm_110a or sm_110f

1457689744 · December 31, 2025, 7:00am

Hi bro, have you solved this problem?
