Hi team,
I flashed an Orin Nano in Super mode with the following command:
sudo ./tools/kernel_flash/l4t_initrd_flash.sh --external-device nvme0n1p1 \
  -c tools/kernel_flash/flash_l4t_t234_nvme.xml -p "-c bootloader/generic/cfg/flash_t234_qspi.xml" \
  --showlogs --network usb0 jetson-orin-nano-devkit-super internal
I then built CUTLASS overnight and ran the profiler on an INT8 GEMM. The result is about 15.35 TOPS (15350.3 GFLOP/s), far below 67 TOPS. Why?
eyecloud@eyecloud-desktop:~/duke/cutlass-main/build$ sudo nvpmodel -m 2
eyecloud@eyecloud-desktop:~/duke/cutlass-main/build$ sudo jetson_clocks
eyecloud@eyecloud-desktop:~/duke/cutlass-main/build$ ./tools/profiler/cutlass_profiler --operation=gemm --m=8192 --n=8192 --k=8192 --kernels=cutlass_tensorop_s8_i8816gemm_s8_256x128_64x2_tn_align16
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_tensorop_s8_i8816gemm_s8_256x128_64x2_tn_align16
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=universal --m=8192 --n=8192 --k=8192 --A=s8:row --B=s8:column --C=s8:column --D=s8:column \
--alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic \
--runtime_input_datatype_a=invalid --runtime_input_datatype_b=invalid --use_pdl=false --enable_sm90_mixed_dtype_shuffle_test=false \
--swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=64 --cluster_m=1 --cluster_n=1 \
--cluster_k=1 --cluster_m_fallback=1 --cluster_n_fallback=1 --cluster_k_fallback=1 --stages=2 --warps_m=4 \
--warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=16 --min_cc=75 --max_cc=1024
Bytes: 201326592 bytes
FLOPs: 1099645845504 flops
FLOPs/Byte: 5462
Runtime: 71.6366 ms
Memory: 2.61738 GiB/s
Math: 15350.3 GFLOP/s
=============================
CSV Results:
Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,runtime_input_datatype_a,runtime_input_datatype_b,use_pdl,enable_sm90_mixed_dtype_shuffle_test,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,cluster_m_fallback,cluster_n_fallback,cluster_k_fallback,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass_tensorop_s8_i8816gemm_s8_256x128_64x2_tn_align16,passed,success,universal,8192,8192,8192,s8:row,s8:column,s8:column,s8:column,1,0,serial,1,1,heuristic,invalid,invalid,false,false,1,tensorop,s32,256,128,64,1,1,1,1,1,1,2,4,2,1,8,8,16,75,1024,201326592,1099645845504,5462,71.6366,2.61738,15350.3
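
For reference, here is a small sketch that re-derives the profiler's numbers from the log above, so the gap to 67 TOPS is explicit. The FLOP formula (2·m·n·k plus 2·m·n for the epilogue) is my assumption; it happens to reproduce the reported `FLOPs` value exactly. The function and variable names are only illustrative.

```python
# Values copied from the cutlass_profiler output above.
M = N = K = 8192
REPORTED_FLOPS = 1_099_645_845_504
RUNTIME_MS = 71.6366
TARGET_TOPS = 67.0  # the figure quoted in the question


def gemm_flops(m: int, n: int, k: int) -> int:
    """Operation count for one GEMM.

    Assumption: 2*m*n*k for the multiply-accumulates plus 2*m*n for the
    alpha/beta epilogue. This matches the FLOPs line in the log.
    """
    return 2 * m * n * k + 2 * m * n


def throughput_gops(flops: int, runtime_ms: float) -> float:
    """Convert an operation count and a runtime in ms to giga-ops per second."""
    return flops / (runtime_ms * 1e-3) / 1e9


flops = gemm_flops(M, N, K)
gops = throughput_gops(flops, RUNTIME_MS)
fraction = (gops / 1e3) / TARGET_TOPS  # measured TOPS divided by target TOPS

print(f"ops per GEMM    : {flops}")
print(f"measured        : {gops:.1f} GOP/s ({gops / 1e3:.2f} TOPS)")
print(f"share of target : {fraction:.1%}")
```

So this kernel reaches roughly 15.35 TOPS, a little under a quarter of the 67 TOPS figure.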