Verifying claimed TOPS performance on Jetson Thor – CUTLASS kernel for SM110 does not run, SM80 gives very low performance (~6.9 TFLOP/s)

1457689744 · December 17, 2025, 7:31am

Thanks for your reply.
I still have two questions.

Do you mean that Thor doesn’t support MXFP4?
I am reading the PTX docs, and the entry for tcgen05.mma.cta_group.kind.block_scale{.scale_vectorsize} indicates that .scale_vectorsize can only be used with sm_100a, sm_100f, and sm_110f, but Thor is sm_110a. However, when the data type is .kind::mxf4nvf4, K is at least 64. Can you confirm whether the .scale_vectorsize qualifier is available on Thor? Thank you again.
Here are the docs:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#tcgen05-mma-instructions-mma
