Verifying TOPS with Jetson Orin Nano

This post is a follow-up to “How to verify Orin the TOPS performance - #13”.

I’d like to reproduce the maximum performance stated in the product description.

The best result I have achieved so far is shown below.

./tools/profiler/cutlass_profiler --gemm_kind=spgemm --m=512 --n=512 --k=8192 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1 --beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 --cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: spgemm
       Operation: cutlass_tensorop_s8_i16864spgemm_s8_256x128_128x3_tn_align16

          Status: Success
    Verification: ON
     Disposition: Not verified

reference_device: Not run
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=spgemm --m=512 --n=512 --k=8192 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1  \
                  --beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128  \
                  --cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1  \
                  --inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024

           Bytes: 7077888  bytes
           FLOPs: 4295491584  flops
           FLOPs/Byte: 606

         Runtime: 0.126824  ms
          Memory: 51.9758 GiB/s

            Math: 33869.6 GFLOP/s

I’m using a Jetson Orin NX 8GB with JetPack 6.1, installed by following the instructions in the Getting Started guide: Jetson Orin Nano Developer Kit Getting Started Guide | NVIDIA Developer

System information

Kernel: 5.15.148-tegra
JetPack Version: 36.4.2-20241212160716

Commands to reproduce environment

# clone source and prepare to build
git clone https://github.com/NVIDIA/cutlass.git
cd cutlass/
mkdir build && cd build

# bump configs for better perf
sudo nvpmodel -m 2 # set to MAXN mode
sudo jetson_clocks

# build
CUDACXX=/usr/local/cuda-12.6/bin/nvcc cmake .. -DCUTLASS_NVCC_ARCHS=87 -DCUTLASS_LIBRARY_KERNELS=i16864spgemm
make cutlass_profiler -j12

# run all profiles
./tools/profiler/cutlass_profiler

Is it possible to verify the 40 TOPS performance?

33869.6 GFLOP/s is ~84.7% of 40 TOPS, assuming that GFLOP/s and TOPS refer to the same metric; however, I may be confused about the units…
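To make the unit question concrete, here is a rough Python sketch of the arithmetic, using only the FLOPs and Runtime values from the profiler output above. The 40 TOPS target is the figure quoted in this post, and treating the profiler's "FLOPs" as the same operation count that the TOPS rating uses is an assumption, not something confirmed by the output. The check that the FLOPs figure equals 2*m*n*k + 2*m*n (a dense-equivalent count for the sparse GEMM) is simply an observation about these particular numbers.

```python
# Values copied from the cutlass_profiler output in this post.
m, n, k = 512, 512, 8192
flops = 4295491584        # "FLOPs" line
runtime_ms = 0.126824     # "Runtime" line

# Assumption: the 40 TOPS rating counts operations the same way the profiler does.
target_tops = 40.0

# Observation: the reported count matches a dense-equivalent GEMM,
# 2*m*n*k for the multiply-adds plus 2*m*n for the epilogue.
dense_equivalent = 2 * m * n * k + 2 * m * n

# Operations per second, expressed in giga- and tera-operations.
gops = flops / (runtime_ms * 1e-3) / 1e9
tops = gops / 1e3
fraction = tops / target_tops

print(f"dense-equivalent op count matches: {dense_equivalent == flops}")
print(f"throughput: {gops:.1f} GOP/s = {tops:.2f} TOPS")
print(f"fraction of target: {fraction:.1%}")
```

Under that assumption, 1 TOPS is simply 1000 GOP/s, so the profiler's "Math" line can be divided by 1000 and compared against the rated figure directly.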

Please let me know if there is any way to improve the performance further or to verify the stated figure.

Thank you!

Hi,

Do you want to get the TOPS value for “Orin Nano” (as the title mentions)?
We need to confirm the device first, since the board described here is an Orin NX.

Thanks.
