As the title says, does anyone know how to benchmark on Thor to get the real FP4/FP8 performance? I tried compiling NVIDIA CUTLASS and running the profiler, but got no output. Here are my steps:
git clone -b v4.3.1 https://github.com/NVIDIA/cutlass.git
cd cutlass
mkdir build && cd build
cmake .. -DCMAKE_CUDA_ARCHITECTURES="110" -DCUTLASS_NVCC_ARCHS="110"
make cutlass_profiler -j
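One thing I'm unsure about is the architecture flag: on other Blackwell parts the narrow-precision block-scaled kernels only seem to be generated when the feature-specific arch suffix is used (e.g. 100a on B200). Do I need the same for Thor? That is, something like the below — note that "110a" is my guess here, I haven't confirmed it is accepted for Thor:

```shell
# Reconfigure with the feature-specific arch suffix (unverified guess for Thor)
cmake .. -DCMAKE_CUDA_ARCHITECTURES="110a" -DCUTLASS_NVCC_ARCHS="110a"
make cutlass_profiler -j
```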
Then I ran cutlass_profiler to do the benchmark, as below:
./tools/profiler/cutlass_profiler --operation=Gemm --A=f4:row --B=f4:column --enable-best-kernel-for-fixed-shape --sort-results-flops-per-sec
./tools/profiler/cutlass_profiler --operation=Gemm --A=e2m1:row --B=e2m1:column --enable-best-kernel-for-fixed-shape --sort-results-flops-per-sec
./tools/profiler/cutlass_profiler --operation=Gemm --A=f8:row --B=f8:column --enable-best-kernel-for-fixed-shape --sort-results-flops-per-sec
./tools/profiler/cutlass_profiler --operation=Gemm --A=e4m3:row --B=e4m3:column --enable-best-kernel-for-fixed-shape --sort-results-flops-per-sec
But I got no output at all, and no error either.
When I try f16 or f32 instead, the command is:
./tools/profiler/cutlass_profiler --operation=Gemm --A=f16:row --B=f16:column --enable-best-kernel-for-fixed-shape --sort-results-flops-per-sec
and I get results, as below:
CSV Results:
Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,runtime_input_datatype_a,runtime_input_datatype_b,use_pdl,enable_sm90_mixed_dtype_shuffle_test,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,cluster_m_fallback,cluster_n_fallback,cluster_k_fallback,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass_tensorop_f16_s16816gemm_f16_256x128_32x3_tn_align8,passed,success,universal,1024,1024,1024,f16:row,f16:column,f16:column,f16:column,1,0,serial,1,1,along_m,invalid,invalid,false,false,1,tensorop,f32,256,128,32,1,1,1,1,1,1,3,4,2,1,16,8,16,80,1024,6291456,2149580800,341,0.0397382,147.449,54093.5
1,CUTLASS,gemm,cutlass_tensorop_h16816gemm_256x128_32x3_tn_align8,passed,success,universal,1024,1024,1024,f16:row,f16:column,f16:column,f16:column,1,0,serial,1,1,along_n,invalid,invalid,false,false,8,tensorop,f16,256,128,32,1,1,1,1,1,1,3,4,2,1,16,8,16,80,1024,6291456,2149580800,341,0.0406298,144.214,52906.6
1,CUTLASS,gemm,cutlass_tensorop_s16816gemm_f16_256x128_32x3_tn_align8,passed,success,universal,1024,1024,1024,f16:row,f16:column,f32:column,f32:column,1,0,serial,1,1,along_m,invalid,invalid,false,false,8,tensorop,f32,256,128,32,1,1,1,1,1,1,3,4,2,1,16,8,16,80,1024,8388608,2149580800,256,0.0484224,161.341,44392.3
1,CUTLASS,gemm,cutlass_tensorop_h1688gemm_256x128_32x2_tn_align8,passed,success,universal,1024,1024,1024,f16:row,f16:column,f16:column,f16:column,1,0,serial,1,1,along_m,invalid,invalid,false,false,8,tensorop,f16,256,128,32,1,1,1,1,1,1,2,4,2,1,16,8,8,75,1024,6291456,2149580800,341,0.0609462,96.1401,35270.1
1,CUTLASS,gemm,cutlass_tensorop_f16_s1688gemm_f16_256x128_32x2_tn_align8,passed,success,universal,1024,1024,1024,f16:row,f16:column,f16:column,f16:column,1,0,serial,1,1,along_n,invalid,invalid,false,false,8,tensorop,f32,256,128,32,1,1,1,1,1,1,2,4,2,1,16,8,8,75,1024,6291456,2149580800,341,0.0610381,95.9954,35217
1,CUTLASS,gemm,cutlass_tensorop_s1688gemm_f16_256x128_32x2_tn_align8,passed,success,universal,1024,1024,1024,f16:row,f16:column,f32:column,f32:column,1,0,serial,1,1,along_m,invalid,invalid,false,false,8,tensorop,f32,256,128,32,1,1,1,1,1,1,2,4,2,1,16,8,8,75,1024,8388608,2149580800,256,0.0707862,110.367,30367.2
1,CUTLASS,gemm,cutlass_simt_hgemm_256x128_8x2_tn_align1,passed,success,universal,1024,1024,1024,f16:row,f16:column,f16:column,f16:column,1,0,serial,1,1,along_n,invalid,invalid,false,false,8,simt,f16,256,128,8,1,1,1,1,1,1,2,4,2,1,1,1,1,50,1024,6291456,2149580800,341,0.147884,39.6215,14535.6
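(For anyone checking my numbers: the GFLOPs column is just the Flops column divided by runtime. CUTLASS appears to count 2·M·N·K multiply-accumulate flops plus a 2·M·N epilogue term — that inference is mine, but it matches the CSV exactly. A quick Python sanity check against the first row:)

```python
# Reproduce the Flops and GFLOPs columns of the profiler CSV for the first row.
M = N = K = 1024
runtime_ms = 0.0397382  # "Runtime" column, in milliseconds

# 2*M*N*K MAC flops plus an (assumed) 2*M*N epilogue term
flops = 2 * M * N * K + 2 * M * N
gflops = flops / (runtime_ms * 1e-3) / 1e9

print(flops)   # 2149580800, matching the "Flops" column
print(gflops)  # ~54,094, matching the "GFLOPs" column up to rounding
```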
So why are the FP4/FP8 kernels not available in cutlass_profiler? Is something wrong with my build configuration or my benchmark arguments?
