How to verify the Orin TOPS performance

Continuing the discussion from The performance of the Jetson Orin Nano module does not match the data provided on the official website:

Hi,
I read the topic The performance of the Jetson Orin Nano module does not match the data provided on the official website, but I am still confused by the last reply: "get the #operations per cycle and the #cycles per nsecond from the profiler".
Could someone explain how to calculate this from the profiler or Nsight?

I’ve run CUDA samples like matrixMulCUBLAS on the Orin NX platform, but the log didn’t show it hitting the max TOPS listed in the Orin datasheet.
Per that topic, I think there is a method to calculate the real-time TOPS from the Nsight log or something else.

I basically know how to calculate the TOPS from the topic The tensor core performance detail of Jetson AGX Orin 32GB,
but I’d like to see the real measured results.
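For context, my understanding of the theoretical peak from that topic is #tensor cores × ops per tensor core per cycle × GPU clock. A rough sketch (the core count, per-cycle rate, and clock below are my own assumptions for the Orin NX 8GB GPU, please correct me if they are wrong):

```python
# Theoretical peak sparse INT8 TOPS = tensor cores * ops/cycle/core * clock.
# Assumed figures for the Orin NX 8GB GPU (verify against your own device):
num_tensor_cores = 32             # assuming 1024 CUDA cores -> 32 Ampere tensor cores
sparse_int8_ops_per_cycle = 2048  # per tensor core: 256 FMA * 2 (INT8) * 2 (sparsity) * 2 ops/FMA
gpu_clock_hz = 765e6              # assumed max GPU clock

peak_tops = num_tensor_cores * sparse_int8_ops_per_cycle * gpu_clock_hz / 1e12
print(f"theoretical peak: {peak_tops:.1f} sparse INT8 TOPS")  # ~50 TOPS
```

With these assumptions the result lands at the 50 sparse INT8 TOPS expected for the GPU alone (70 total minus 20 for the DLA).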

Many thanks for answering the question.

Hi,

We recommend trying our CUTLASS library to benchmark the peak performance.
You can find more details in the topic below:

Thanks.

Hi Aastalll,
Thanks for your help. After reading the topic, I ran CUTLASS as below:

$ git clone https://github.com/NVIDIA/cutlass.git
$ cd cutlass/
$ mkdir build && cd build
$ cmake .. -DCUTLASS_NVCC_ARCHS=87 -DCUTLASS_LIBRARY_KERNELS=i16864spgemm
$ make cutlass_profiler -j12

The result is below:


=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: spgemm
       Operation: cutlass_tensorop_s8_i16864spgemm_s8_256x128_128x3_tn_align16

          Status: Success
    Verification: ON
     Disposition: Not verified

reference_device: Not run
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=spgemm --m=16384 --n=16384 --k=16384 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1  \
                  --beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128  \
                  --cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1  \
                  --inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024

           Bytes: 704643072  bytes
           FLOPs: 8796629893120  flops
           FLOPs/Byte: 12483

         Runtime: 473.512  ms
          Memory: 1.38592 GiB/s

            Math: 18577.4 GFLOP/s


=============================

CSV Results:

The hardware I’m using is a Jetson Orin NX 8GB with JetPack 6.0. Per the datasheet, “Jetson Orin NX 8GB: Up to 70 (Sparse) INT8 TOPs and 35 (Dense) INT8 TOPs”, so after subtracting the DLA’s 20 TOPS, the GPU should reach 50 sparse INT8 TOPS. I tried other m, n, k values, but 18577.4 GFLOP/s is the maximum I can get.
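To double-check my reading of the profiler output, I reproduced the reported FLOPs and GFLOP/s from m, n, k and the runtime. The count only matches the log if CUTLASS tallies 2·m·n·k for the GEMM plus 2·m·n for the epilogue, which is my assumption here:

```python
# Reproduce the profiler's FLOPs and GFLOP/s figures from the log above.
m = n = k = 16384
flops = 2 * m * n * k + 2 * m * n   # assumed count: GEMM + alpha/beta epilogue
runtime_s = 473.512e-3              # "Runtime: 473.512 ms" from the log

gflops = flops / runtime_s / 1e9
print(flops, round(gflops, 1))      # ~8796629893120, ~18577.4 GFLOP/s
```

So the Math line is simply FLOPs divided by runtime, i.e. about 18.6 sparse INT8 TOPS here.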

Hi,

Just want to double-confirm: have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Since the Orin NX has less memory, could you try Identity4, Identity2, and Identity1 in the line below to see if it helps?

Thanks.

Hi Aastalll,
Sorry, I forgot to run sudo jetson_clocks.
After sudo jetson_clocks, the max TOPS is higher: 20456.4 GFLOP/s.
I also tried Identity8, Identity4, Identity2, and Identity1, running make cutlass_profiler -j12 again after each change.
Now 21060.5 GFLOP/s is the maximum I can get.

Hi,

Thanks for the info.
We will give it a check and provide more info to you later.

Thanks.

Hi,

We tested this with an Orin NX 16GB, which is expected to reach 60 TOPS on the GPU.
(100 TOPS in total, and the 2x DLA can reach 40 TOPS.)

The peak sparse INT8 performance we can get is 34.122 TOPS, around 57% SOL.

m=512, n=512, k=8192 with Identity2

=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: spgemm
       Operation: cutlass_tensorop_s8_i16864spgemm_s8_256x128_128x3_tn_align16

          Status: Success
    Verification: ON
     Disposition: Not verified

reference_device: Not run
          cuBLAS: Not run
           cuDNN: Not run

       Arguments: --gemm_kind=spgemm --m=512 --n=512 --k=8192 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1  \
                  --beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128  \
                  --cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1  \
                  --inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024

           Bytes: 7077888  bytes
           FLOPs: 4295491584  flops
           FLOPs/Byte: 606

         Runtime: 0.125884  ms
          Memory: 52.364 GiB/s

            Math: 34122.6 GFLOP/s

Thanks.
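As a sanity check, the achieved TOPS and %SOL figures above can be re-derived from the FLOPs and runtime in the log:

```python
# Re-derive achieved TOPS and %SOL from the profiler log above.
flops = 4295491584          # "FLOPs: 4295491584" for m=512, n=512, k=8192
runtime_s = 0.125884e-3     # "Runtime: 0.125884 ms"

achieved_tops = flops / runtime_s / 1e12
sol_pct = achieved_tops / 60 * 100   # vs. the 60 TOPS GPU peak for Orin NX 16GB
print(round(achieved_tops, 1), round(sol_pct, 1))  # ~34.1 TOPS, ~57% SOL
```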