Verifying AI performance with cutlass_profiler, but it is much slower than expected. Why?

Hi team,
I flashed the Orin Nano with Super mode enabled (sudo ./tools/kernel_flash/l4t_initrd_flash.sh --external-device nvme0n1p1
-c tools/kernel_flash/flash_l4t_t234_nvme.xml -p "-c bootloader/generic/cfg/flash_t234_qspi.xml"
--showlogs --network usb0 jetson-orin-nano-devkit-super internal), then built CUTLASS overnight and ran the profiler. The result below is much less than 67 TOPS. Why?
eyecloud@eyecloud-desktop:~/duke/cutlass-main/build$ sudo nvpmodel -m 2
eyecloud@eyecloud-desktop:~/duke/cutlass-main/build$ sudo jetson_clocks
eyecloud@eyecloud-desktop:~/duke/cutlass-main/build$ ./tools/profiler/cutlass_profiler --operation=gemm --m=8192 --n=8192 --k=8192 --kernels=cutlass_tensorop_s8_i8816gemm_s8_256x128_64x2_tn_align16

=============================
Problem ID: 1

    Provider: CUTLASS

OperationKind: gemm
Operation: cutlass_tensorop_s8_i8816gemm_s8_256x128_64x2_tn_align16

      Status: Success
Verification: ON
 Disposition: Passed

reference_device: Passed
cuBLAS: Not run
cuDNN: Not run

   Arguments: --gemm_kind=universal --m=8192 --n=8192 --k=8192 --A=s8:row --B=s8:column --C=s8:column --D=s8:column  \
              --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 --batch_count=1 --raster_order=heuristic  \
              --runtime_input_datatype_a=invalid --runtime_input_datatype_b=invalid --use_pdl=false --enable_sm90_mixed_dtype_shuffle_test=false  \
              --swizzle_size=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=64 --cluster_m=1 --cluster_n=1  \
              --cluster_k=1 --cluster_m_fallback=1 --cluster_n_fallback=1 --cluster_k_fallback=1 --stages=2 --warps_m=4  \
              --warps_n=2 --warps_k=1 --inst_m=8 --inst_n=8 --inst_k=16 --min_cc=75 --max_cc=1024

       Bytes: 201326592  bytes
       FLOPs: 1099645845504  flops
       FLOPs/Byte: 5462

     Runtime: 71.6366  ms
      Memory: 2.61738 GiB/s

        Math: 15350.3 GFLOP/s

=============================

CSV Results:

Problem,Provider,OperationKind,Operation,Disposition,Status,gemm_kind,m,n,k,A,B,C,D,alpha,beta,split_k_mode,split_k_slices,batch_count,raster_order,runtime_input_datatype_a,runtime_input_datatype_b,use_pdl,enable_sm90_mixed_dtype_shuffle_test,swizzle_size,op_class,accum,cta_m,cta_n,cta_k,cluster_m,cluster_n,cluster_k,cluster_m_fallback,cluster_n_fallback,cluster_k_fallback,stages,warps_m,warps_n,warps_k,inst_m,inst_n,inst_k,min_cc,max_cc,Bytes,Flops,Flops/Byte,Runtime,GB/s,GFLOPs
1,CUTLASS,gemm,cutlass_tensorop_s8_i8816gemm_s8_256x128_64x2_tn_align16,passed,success,universal,8192,8192,8192,s8:row,s8:column,s8:column,s8:column,1,0,serial,1,1,heuristic,invalid,invalid,false,false,1,tensorop,s32,256,128,64,1,1,1,1,1,1,2,4,2,1,8,8,16,75,1024,201326592,1099645845504,5462,71.6366,2.61738,15350.3
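As a sanity check, the reported Math throughput follows directly from the problem size and runtime above. A minimal sketch (the extra 2·m·n epilogue term is inferred from the reported FLOP count, not from CUTLASS documentation):

```python
# Recompute the profiler's reported figures from m/n/k and runtime.
m = n = k = 8192
runtime_s = 71.6366e-3                 # reported Runtime in seconds

flops = 2 * m * n * k + 2 * m * n      # GEMM MACs plus an epilogue term (matches 1099645845504)
bytes_moved = m * k + k * n + m * n    # s8 A, B, and D; beta=0, so C is not read (matches 201326592)

gflops = flops / runtime_s / 1e9
print(f"{gflops:.1f} GFLOP/s")         # prints 15350.3, i.e. about 15.4 INT8 TOPS
```

So the measured number is roughly 15.4 TOPS, which is indeed far below the 67 TOPS headline figure.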

Hi,

Please test different m/n/k values to find the sweet spot for this benchmark; sweep a range of sizes and compare the measured throughput.
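One way to run such a sweep (a sketch; the size list is arbitrary and the kernel name is simply the one profiled above):

```python
# Hypothetical sweep: build one cutlass_profiler invocation per square GEMM size.
KERNEL = "cutlass_tensorop_s8_i8816gemm_s8_256x128_64x2_tn_align16"
SIZES = [1024, 2048, 4096, 8192]       # arbitrary sizes to probe

cmds = [
    ["./tools/profiler/cutlass_profiler", "--operation=gemm",
     f"--m={s}", f"--n={s}", f"--k={s}", f"--kernels={KERNEL}"]
    for s in SIZES
]
for cmd in cmds:
    print(" ".join(cmd))               # paste into a shell, or run with subprocess
```

Compare the GFLOP/s column across sizes to see where the kernel peaks on your board.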

Thanks.