This post is a follow-up to How to verify Orin the TOPS performance - #13
I’d like to reproduce the maximum performance stated in the product description.
The best result I have achieved so far is shown below.
./tools/profiler/cutlass_profiler --gemm_kind=spgemm --m=512 --n=512 --k=8192 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1 --beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024
Problem ID: 1
Provider: CUTLASS
OperationKind: spgemm
Operation: cutlass_tensorop_s8_i16864spgemm_s8_256x128_128x3_tn_align16
Status: Success
Verification: ON
Disposition: Not verified
reference_device: Not run
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=spgemm --m=512 --n=512 --k=8192 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 \
--cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1 \
--inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024
Bytes: 7077888 bytes
FLOPs: 4295491584 flops
FLOPs/Byte: 606
Runtime: 0.126824 ms
Memory: 51.9758 GiB/s
Math: 33869.6 GFLOP/s
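As a sanity check on the figures above, here is a minimal sketch that recomputes the FLOPs and Math lines from the problem size and runtime. The op-count formula (2*m*n*k plus 2*m*n) is my assumption, inferred from the fact that it reproduces the reported FLOPs value exactly; the function names are hypothetical. Under that assumption the count uses the full k, so it is not halved for the 2:4 sparsity of A.

```python
# Problem size and runtime copied from the profiler output above.
M, N, K = 512, 512, 8192
RUNTIME_MS = 0.126824


def gemm_ops(m, n, k):
    """Assumed op count: 2*m*n*k for the multiply-accumulates
    plus 2*m*n for the alpha/beta epilogue (inferred, not documented here)."""
    return 2 * m * n * k + 2 * m * n


def throughput_gops(ops, runtime_ms):
    """Operations per second, expressed in units of 1e9 ops/s."""
    return ops / (runtime_ms * 1e-3) / 1e9


ops = gemm_ops(M, N, K)
print("ops:", ops)  # 4295491584, the same as the reported FLOPs line
print("G ops/s:", round(throughput_gops(ops, RUNTIME_MS), 1))
```

The recomputed throughput differs from the reported Math value only in the last digit or so, which is consistent with the runtime being printed to six significant figures.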
I’m using a Jetson Orin NX 8GB with JetPack 6.1, installed by following the Getting Started guide: Jetson Orin Nano Developer Kit Getting Started Guide | NVIDIA Developer
System information
Kernel: 5.15.148-tegra
JetPack Version: 36.4.2-20241212160716
Commands to reproduce environment
# clone source and prepare to build
git clone https://github.com/NVIDIA/cutlass.git
cd cutlass/
mkdir build && cd build
# bump configs for better perf
sudo nvpmodel -m 2 # set to MAXN mode
sudo jetson_clocks
# build
CUDACXX=/usr/local/cuda-12.6/bin/nvcc cmake .. -DCUTLASS_NVCC_ARCHS=87 -DCUTLASS_LIBRARY_KERNELS=i16864spgemm
make cutlass_profiler -j12
# run all profiles
./tools/profiler/cutlass_profiler
Is it possible to verify the 40 TOPS performance?
33836.9 GFLOP/s is ~84.5% of 40 TOPS, assuming that GFLOP/s and TOPS refer to the same metric (1 TOPS = 1000 G ops/s). However, I may be confused about the units.
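To make the unit assumption explicit, here is a small sketch of the conversion behind that percentage. It assumes that the profiler's "GFLOP/s" for this s8 kernel means 1e9 integer operations per second, and that the 40 TOPS product figure counts operations the same way. Neither assumption is confirmed in this post, and the helper names are hypothetical.

```python
def gops_to_tops(gops):
    """Convert 1e9 ops/s to 1e12 ops/s."""
    return gops / 1000.0


def percent_of_peak(gops, peak_tops):
    """Measured throughput as a percentage of a quoted peak in TOPS."""
    return 100.0 * gops_to_tops(gops) / peak_tops


measured_gops = 33836.9  # measured value quoted above
peak_tops = 40.0         # product-description figure being checked

print(f"{gops_to_tops(measured_gops):.2f} TOPS")
print(f"{percent_of_peak(measured_gops, peak_tops):.1f}% of {peak_tops:.0f} TOPS")
```

If the quoted peak counts operations differently (for example dense versus sparse), the percentage would change accordingly.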
Please let me know if there is any way to improve the performance further or to verify the stated figure.
Thank you!