As the title says, does anyone know how to benchmark on Thor to get the real FP4/FP8 performance? I tried compiling NVIDIA CUTLASS and running the profiler, but got no output. Here are my steps:
git clone -b v4.3.1 https://github.com/NVIDIA/cutlass.git
cd cutlass
mkdir build && cd build
cmake .. -DCMAKE_CUDA_ARCHITECTURES="110" -DCUTLASS_NVCC_ARCHS="110"
make cutlass_profiler -j
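One thing I'm unsure about is the architecture flag: on other Blackwell parts the narrow-precision block-scaled kernels only seem to be generated when the feature-specific arch suffix is used (e.g. 100a on B200). Do I need the same for Thor? That is, something like the below — note that "110a" is my guess here, I haven't confirmed it is accepted for Thor:

```shell
# Reconfigure with the feature-specific arch suffix (unverified guess for Thor)
cmake .. -DCMAKE_CUDA_ARCHITECTURES="110a" -DCUTLASS_NVCC_ARCHS="110a"
make cutlass_profiler -j
```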
Then I ran cutlass_profiler to do the benchmark, as below:
./tools/profiler/cutlass_profiler --operation=Gemm --A=f4:row --B=f4:column --enable-best-kernel-for-fixed-shape --sort-results-flops-per-sec
./tools/profiler/cutlass_profiler --operation=Gemm --A=e2m1:row --B=e2m1:column --enable-best-kernel-for-fixed-shape --sort-results-flops-per-sec
./tools/profiler/cutlass_profiler --operation=Gemm --A=f8:row --B=f8:column --enable-best-kernel-for-fixed-shape --sort-results-flops-per-sec
./tools/profiler/cutlass_profiler --operation=Gemm --A=e4m3:row --B=e4m3:column --enable-best-kernel-for-fixed-shape --sort-results-flops-per-sec
But I got no output at all, and no error either.
When I try f16 or f32 instead, the command is:
./tools/profiler/cutlass_profiler --operation=Gemm --A=f16:row --B=f16:column --enable-best-kernel-for-fixed-shape --sort-results-flops-per-sec
and I get results, as below:
CSV Results:
Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,runtime_input_datatype_a,runtime_input_datatype_b,use_pdl,enable_sm90_mixed_dtype_shuffle_test,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,cluster_m_fallback,cluster_n_fallback,cluster_k_fallback,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass_tensorop_f16_s16816gemm_f16_256x128_32x3_tn_align8,passed,success,universal,1024,1024,1024,f16:row,f16:column,f16:column,f16:column,1,0,serial,1,1,along_m,invalid,invalid,false,false,1,tensorop,f32,256,128,32,1,1,1,1,1,1,3,4,2,1,16,8,16,80,1024,6291456,2149580800,341,0.0397382,147.449,54093.5
1,CUTLASS,gemm,cutlass_tensorop_h16816gemm_256x128_32x3_tn_align8,passed,success,universal,1024,1024,1024,f16:row,f16:column,f16:column,f16:column,1,0,serial,1,1,along_n,invalid,invalid,false,false,8,tensorop,f16,256,128,32,1,1,1,1,1,1,3,4,2,1,16,8,16,80,1024,6291456,2149580800,341,0.0406298,144.214,52906.6
1,CUTLASS,gemm,cutlass_tensorop_s16816gemm_f16_256x128_32x3_tn_align8,passed,success,universal,1024,1024,1024,f16:row,f16:column,f32:column,f32:column,1,0,serial,1,1,along_m,invalid,invalid,false,false,8,tensorop,f32,256,128,32,1,1,1,1,1,1,3,4,2,1,16,8,16,80,1024,8388608,2149580800,256,0.0484224,161.341,44392.3
1,CUTLASS,gemm,cutlass_tensorop_h1688gemm_256x128_32x2_tn_align8,passed,success,universal,1024,1024,1024,f16:row,f16:column,f16:column,f16:column,1,0,serial,1,1,along_m,invalid,invalid,false,false,8,tensorop,f16,256,128,32,1,1,1,1,1,1,2,4,2,1,16,8,8,75,1024,6291456,2149580800,341,0.0609462,96.1401,35270.1
1,CUTLASS,gemm,cutlass_tensorop_f16_s1688gemm_f16_256x128_32x2_tn_align8,passed,success,universal,1024,1024,1024,f16:row,f16:column,f16:column,f16:column,1,0,serial,1,1,along_n,invalid,invalid,false,false,8,tensorop,f32,256,128,32,1,1,1,1,1,1,2,4,2,1,16,8,8,75,1024,6291456,2149580800,341,0.0610381,95.9954,35217
1,CUTLASS,gemm,cutlass_tensorop_s1688gemm_f16_256x128_32x2_tn_align8,passed,success,universal,1024,1024,1024,f16:row,f16:column,f32:column,f32:column,1,0,serial,1,1,along_m,invalid,invalid,false,false,8,tensorop,f32,256,128,32,1,1,1,1,1,1,2,4,2,1,16,8,8,75,1024,8388608,2149580800,256,0.0707862,110.367,30367.2
1,CUTLASS,gemm,cutlass_simt_hgemm_256x128_8x2_tn_align1,passed,success,universal,1024,1024,1024,f16:row,f16:column,f16:column,f16:column,1,0,serial,1,1,along_n,invalid,invalid,false,false,8,simt,f16,256,128,8,1,1,1,1,1,1,2,4,2,1,1,1,1,50,1024,6291456,2149580800,341,0.147884,39.6215,14535.6
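(For anyone checking my numbers: the GFLOPs column is just the Flops column divided by runtime. CUTLASS appears to count 2·M·N·K multiply-accumulate flops plus a 2·M·N epilogue term — that inference is mine, but it matches the CSV exactly. A quick Python sanity check against the first row:)

```python
# Reproduce the Flops and GFLOPs columns of the profiler CSV for the first row.
M = N = K = 1024
runtime_ms = 0.0397382  # "Runtime" column, in milliseconds

# 2*M*N*K MAC flops plus an (assumed) 2*M*N epilogue term
flops = 2 * M * N * K + 2 * M * N
gflops = flops / (runtime_ms * 1e-3) / 1e9

print(flops)   # 2149580800, matching the "Flops" column
print(gflops)  # ~54,094, matching the "GFLOPs" column up to rounding
```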
So why are the FP4/FP8 kernels not available in cutlass_profiler? Is something wrong with my build configuration or my benchmark arguments?
