How to benchmark on Thor to get the real FP4/FP8 performance (TFLOPS)

As the title says, does anyone know how to benchmark on Thor to get the real FP4/FP8 performance? I compiled NVIDIA CUTLASS and ran the profiler, but it produced no output. Here is my procedure:

git clone -b v4.3.1 https://github.com/NVIDIA/cutlass.git
cd cutlass
mkdir build && cd build
cmake .. -DCMAKE_CUDA_ARCHITECTURES="110" -DCUTLASS_NVCC_ARCHS="110"
make cutlass_profiler -j

Then I ran cutlass_profiler to do the benchmark as follows:

./tools/profiler/cutlass_profiler --operation=Gemm --A=f4:row --B=f4:column --enable-best-kernel-for-fixed-shape --sort-results-flops-per-sec

./tools/profiler/cutlass_profiler --operation=Gemm --A=e2m1:row --B=e2m1:column --enable-best-kernel-for-fixed-shape --sort-results-flops-per-sec

./tools/profiler/cutlass_profiler --operation=Gemm --A=f8:row --B=f8:column --enable-best-kernel-for-fixed-shape --sort-results-flops-per-sec

./tools/profiler/cutlass_profiler --operation=Gemm --A=e4m3:row --B=e4m3:column --enable-best-kernel-for-fixed-shape --sort-results-flops-per-sec

All four commands produce no output and no error.

When I try f16 or f32 instead, the command is:

./tools/profiler/cutlass_profiler --operation=Gemm --A=f16:row --B=f16:column --enable-best-kernel-for-fixed-shape --sort-results-flops-per-sec

I get the following results:

CSV Results:

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,runtime_input_datatype_a,runtime_input_datatype_b,use_pdl,enable_sm90_mixed_dtype_shuffle_test,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,cluster_m_fallback,cluster_n_fallback,cluster_k_fallback,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass_tensorop_f16_s16816gemm_f16_256x128_32x3_tn_align8,passed,success,universal,1024,1024,1024,f16:row,f16:column,f16:column,f16:column,1,0,serial,1,1,along_m,invalid,invalid,false,false,1,tensorop,f32,256,128,32,1,1,1,1,1,1,3,4,2,1,16,8,16,80,1024,6291456,2149580800,341,0.0397382,147.449,54093.5
1,CUTLASS,gemm,cutlass_tensorop_h16816gemm_256x128_32x3_tn_align8,passed,success,universal,1024,1024,1024,f16:row,f16:column,f16:column,f16:column,1,0,serial,1,1,along_n,invalid,invalid,false,false,8,tensorop,f16,256,128,32,1,1,1,1,1,1,3,4,2,1,16,8,16,80,1024,6291456,2149580800,341,0.0406298,144.214,52906.6
1,CUTLASS,gemm,cutlass_tensorop_s16816gemm_f16_256x128_32x3_tn_align8,passed,success,universal,1024,1024,1024,f16:row,f16:column,f32:column,f32:column,1,0,serial,1,1,along_m,invalid,invalid,false,false,8,tensorop,f32,256,128,32,1,1,1,1,1,1,3,4,2,1,16,8,16,80,1024,8388608,2149580800,256,0.0484224,161.341,44392.3
1,CUTLASS,gemm,cutlass_tensorop_h1688gemm_256x128_32x2_tn_align8,passed,success,universal,1024,1024,1024,f16:row,f16:column,f16:column,f16:column,1,0,serial,1,1,along_m,invalid,invalid,false,false,8,tensorop,f16,256,128,32,1,1,1,1,1,1,2,4,2,1,16,8,8,75,1024,6291456,2149580800,341,0.0609462,96.1401,35270.1
1,CUTLASS,gemm,cutlass_tensorop_f16_s1688gemm_f16_256x128_32x2_tn_align8,passed,success,universal,1024,1024,1024,f16:row,f16:column,f16:column,f16:column,1,0,serial,1,1,along_n,invalid,invalid,false,false,8,tensorop,f32,256,128,32,1,1,1,1,1,1,2,4,2,1,16,8,8,75,1024,6291456,2149580800,341,0.0610381,95.9954,35217
1,CUTLASS,gemm,cutlass_tensorop_s1688gemm_f16_256x128_32x2_tn_align8,passed,success,universal,1024,1024,1024,f16:row,f16:column,f32:column,f32:column,1,0,serial,1,1,along_m,invalid,invalid,false,false,8,tensorop,f32,256,128,32,1,1,1,1,1,1,2,4,2,1,16,8,8,75,1024,8388608,2149580800,256,0.0707862,110.367,30367.2
1,CUTLASS,gemm,cutlass_simt_hgemm_256x128_8x2_tn_align1,passed,success,universal,1024,1024,1024,f16:row,f16:column,f16:column,f16:column,1,0,serial,1,1,along_n,invalid,invalid,false,false,8,simt,f16,256,128,8,1,1,1,1,1,1,2,4,2,1,1,1,1,50,1024,6291456,2149580800,341,0.147884,39.6215,14535.6
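A minimal sketch for reading results like these, assuming the CSV section has been saved as text. The helper name `best_kernel` and the trimmed `SAMPLE` string are hypothetical and only for illustration; the two sample rows reuse values from the f16 output above, keeping only the `Operation`, `Runtime`, and `GFLOPs` columns.

```python
import csv
import io

# Trimmed copy of two rows from the f16 output above. Only the columns
# used here are kept; the real profiler CSV has many more.
SAMPLE = """Operation,Runtime,GFLOPs
cutlass_tensorop_f16_s16816gemm_f16_256x128_32x3_tn_align8,0.0397382,54093.5
cutlass_simt_hgemm_256x128_8x2_tn_align1,0.147884,14535.6
"""


def best_kernel(csv_text):
    """Return (operation, tflops) for the row with the highest GFLOPs.

    Returns None when the CSV has a header but no data rows, which is
    what an empty profiler run would look like after saving.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return None
    top = max(rows, key=lambda r: float(r["GFLOPs"]))
    # The profiler reports GFLOPs; divide by 1000 to get TFLOPS.
    return top["Operation"], float(top["GFLOPs"]) / 1000.0


if __name__ == "__main__":
    print(best_kernel(SAMPLE))
```

With the sample rows, the fastest f16 kernel comes out at roughly 54 TFLOPS for the 1024x1024x1024 problem.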

So why are FP4/FP8 not available in cutlass_profiler? Is something wrong with my build or with my benchmark arguments?

Hi,

Please check the steps shared in the link below:

Thanks.

This is strange. I followed the steps, but there is still no output when measuring FP4 TOPS.

I am using CUDA 13.0.

Hi,

Have you rebuilt the library? Could you share your steps with us so we can check them further?

Thanks.

Yes, I have rebuilt CUTLASS. Here are my testing steps:

git clone -b v4.3.1 https://github.com/NVIDIA/cutlass.git

cd cutlass

mkdir build && cd build

cmake .. -DCUTLASS_NVCC_ARCHS="110a" -DCUTLASS_LIBRARY_KERNELS=all  -DCUTLASS_UNITY_BUILD_ENABLED=ON

make cutlass_profiler -j12

./tools/profiler/cutlass_profiler --operation=SparseGemm --m=4096 --n=8192 --k=8192 --A=f4:row --B=f4:column --C=f16:column --D=fe5m2:column --enable-best-kernel-for-fixed-shape --sort-results-flops-per-sec

./tools/profiler/cutlass_profiler --operation=Gemm --m=4096 --n=8192 --k=8192 --A=f4:row --B=f4:column --C=f16:column --D=fe5m2:column --enable-best-kernel-for-fixed-shape --sort-results-flops-per-sec

./tools/profiler/cutlass_profiler --m=4096 --n=8192 --k=8192 --verification-enabled=false --kernels="cutlass3x_sm100_bstensorop_*" --enable-kernel-performance-search --sort-results-flops-per-sec --enable-best-kernel-for-fixed-shape

None of these cutlass_profiler runs produces any output for FP4.
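As a reference for interpreting results once a dense Gemm run does report something, here is a small sketch of the arithmetic. The formula is an assumption inferred from the f16 output earlier in this thread, where `Flops` is 2149580800 for m=n=k=1024, which equals 2*m*n*k + 2*m*n. It is also an inference that `Runtime` is in milliseconds, since that makes the reported `GFLOPs` consistent. The function names are hypothetical, and SparseGemm may count operations differently.

```python
def gemm_flops(m, n, k):
    # Assumed counting, inferred from the f16 CSV row in this thread:
    # 2*m*n*k for the multiply-accumulates plus 2*m*n for the epilogue.
    return 2 * m * n * k + 2 * m * n


def tflops(flops, runtime_ms):
    # Runtime is assumed to be in milliseconds.
    return flops / (runtime_ms * 1e-3) / 1e12


# Sanity check against the f16 row: Flops=2149580800, Runtime=0.0397382.
assert gemm_flops(1024, 1024, 1024) == 2149580800

if __name__ == "__main__":
    # Reproduces the reported 54093.5 GFLOPs to within rounding.
    print(tflops(gemm_flops(1024, 1024, 1024), 0.0397382))
    # Operation count for the 4096 x 8192 x 8192 problem used above.
    print(gemm_flops(4096, 8192, 8192))
```

Dividing the operation count for the 4096x8192x8192 shape by a measured runtime gives the achieved TFLOPS, which can then be compared against the figure the hardware is rated for.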

Hi,

Thanks for sharing the steps.
We are testing this issue with your commands (it takes some time to compile the cutlass library).

Will update with more information later.

Thanks.

Is there any update on the test?

Hi,

It looks like you can get the FP4 results already?

Thanks.

No, cutlass_profiler still produces no output when I run it for FP4 TOPS. That was someone else. I followed the steps but still get no output.

Hi,

We are checking this internally.
Will get back to you later.

Thanks.