ztm
August 31, 2024, 9:15am
1
Continuing the discussion from The performance of the Jetson Orin Nano module does not match the data provided on the official website:
Hi
I read the topic The performance of the Jetson Orin Nano module does not match the data provided on the official website
But I'm still confused by the last reply: "get the #operations per cycle and the #cycles per nsecond from the profiler".
Could someone explain how to calculate this from the profiler or Nsight?
I've run CUDA samples like matrixMulCUBLAS on the Orin NX platform, but the log didn't show it hitting the max TOPS listed in the Orin datasheet.
Per that topic, I think there is a method to calculate the real-time TOPS from the Nsight log or something else.
I basically know how to calculate the TOPS from the topic The tensor core performance detail of Jetson AGX Orin 32GB
But I'd like to see the real results.
Many thanks for answering the question.
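As far as I understand it, combining those two metrics would look something like the sketch below. All numbers here are placeholders for illustration, not real profiler output; the actual values would come from the profiler.

```python
# Sketch: deriving TOPS from "#operations per cycle" and "#cycles per nsecond".
# Both values below are hypothetical placeholders, not measured numbers.
ops_per_cycle = 32768      # tensor-core INT8 ops retired per GPU cycle (hypothetical)
cycles_per_ns = 0.918      # GPU clock: 918 MHz = 0.918 cycles per nanosecond (hypothetical)

gops = ops_per_cycle * cycles_per_ns   # ops per nanosecond = giga-ops per second
tops = gops / 1000.0                   # 1 TOPS = 1000 GOPS
print(f"achieved throughput: {tops:.2f} TOPS")
```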
Hi,
We recommend trying our CUTLASS library to benchmark peak performance.
You can find more details in the topic below:
Hi,
Thanks for your patience.
Sorry, there was some incorrect information in the previous suggestion.
The kernel you used before is for 1-bit dense GEMM.
To benchmark INT8 sparse GEMM, please recompile the library with -DCUTLASS_LIBRARY_KERNELS=i16864spgemm.
Since Orin only has 16 SMs, we also recommend testing this with smaller problem sizes.
Changing the line below from Identity8 to Identity4 also helps:
https://github.com/NVIDIA/cutlass/blob/main/python/cutlass_library/generator.py#L…
Thanks.
ztm
September 3, 2024, 3:19am
4
Hi AastaLLL,
Thanks for your help. After reading the topic, I ran CUTLASS as below:
$ git clone https://github.com/NVIDIA/cutlass.git
$ cd cutlass/
$ mkdir build && cd build
$ cmake .. -DCUTLASS_NVCC_ARCHS=87 -DCUTLASS_LIBRARY_KERNELS=i16864spgemm
$ make cutlass_profiler -j12
The result is below:
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: spgemm
Operation: cutlass_tensorop_s8_i16864spgemm_s8_256x128_128x3_tn_align16
Status: Success
Verification: ON
Disposition: Not verified
reference_device: Not run
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=spgemm --m=16384 --n=16384 --k=16384 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 \
--cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1 \
--inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024
Bytes: 704643072 bytes
FLOPs: 8796629893120 flops
FLOPs/Byte: 12483
Runtime: 473.512 ms
Memory: 1.38592 GiB/s
Math: 18577.4 GFLOP/s
=============================
CSV Results:
The hardware I'm using is a Jetson Orin NX 8GB with JetPack 6.0. Per the datasheet, "Jetson Orin NX 8GB: Up to 70 (Sparse) INT8 TOPs and 35 (Dense) INT8 TOPs", so after subtracting the DLA's 20 TOPs, the GPU should reach 50 sparse INT8 TOPs. I tried other m, n, k values; 18577.4 GFLOP/s is the maximum I can get.
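For reference, the efficiency relative to that 50 TOPs budget can be worked out directly from the profiler's Math line; a quick sketch using the numbers above:

```python
# Efficiency vs. the datasheet peak, using the numbers from the run above.
math_gflops = 18577.4    # "Math" line from the cutlass_profiler output
gpu_peak_tops = 50.0     # 70 sparse INT8 TOPs total minus 20 TOPs for the DLA

achieved_tops = math_gflops / 1000.0          # 1 TOPS = 1000 GFLOP/s
sol = achieved_tops / gpu_peak_tops * 100.0   # % speed-of-light
print(f"{achieved_tops:.1f} TOPS -> {sol:.1f}% of peak")
```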
Hi,
Just want to double-confirm, have you maximized the device performance first?
$ sudo nvpmodel -m 0
$ sudo jetson_clocks
Since the Orin NX has less memory, could you try Identity4, Identity2, and Identity1 in the line below to see if it helps?
def product(X, identity = 1):
  result = identity
  for item in X:
    result *= item
  return result

  elements_per_thread = product(tile.threadblock_shape[:-1]) // product(tile.warp_count) // 32 // epilogue_steps
  return min(max_alignment, elements_per_thread)

def DefaultSwizzlingFunctor():
  return SwizzlingFunctor.Identity8
  # To use StreamK decomposition for basic GEMMs, set `swizzling_functor = SwizzlingFunctor.StreamK`
#
def CreateGemmOperator(manifest, layouts, tile_descriptions, data_type, \
  alignment_constraints, complex_transforms = None, epilogue_functor = EpilogueFunctor.LinearCombination, \
  swizzling_functor = DefaultSwizzlingFunctor()):
  if complex_transforms is None:
    complex_transforms = [(ComplexTransform.none, ComplexTransform.none),]
Thanks.
ztm
September 4, 2024, 8:22am
6
Hi AastaLLL,
Sorry, I forgot to run sudo jetson_clocks.
After running sudo jetson_clocks, the max TOPS is higher: 20456.4 GFLOP/s.
I also tried Identity8, Identity4, Identity2, and Identity1, running make cutlass_profiler -j12 after each change.
Now 21060.5 GFLOP/s is the maximum I can get.
Hi,
Thanks for the info.
We will give it a check and provide more info to you later.
Thanks.
Hi
We tested this with an Orin NX 16GB, which is expected to reach 60 TOPs on the GPU
(100 TOPs total, with the 2x DLA accounting for 40 TOPs).
The peak sparse INT8 performance we can get is 34.122 TOPs, around 56% SOL, using:
m=512, n=512, k=8192 with Identity2
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: spgemm
Operation: cutlass_tensorop_s8_i16864spgemm_s8_256x128_128x3_tn_align16
Status: Success
Verification: ON
Disposition: Not verified
reference_device: Not run
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=spgemm --m=512 --n=512 --k=8192 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 \
--cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1 \
--inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024
Bytes: 7077888 bytes
FLOPs: 4295491584 flops
FLOPs/Byte: 606
Runtime: 0.125884 ms
Memory: 52.364 GiB/s
Math: 34122.6 GFLOP/s
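That Math figure is just the FLOP count divided by the runtime, and the FLOP count is close to the textbook 2·m·n·k for a GEMM (it is slightly larger, presumably because the epilogue is counted too). A quick cross-check with the numbers from the log above:

```python
# Cross-check the profiler's Math line: GFLOP/s = FLOPs / runtime.
flops = 4295491584       # "FLOPs" line from the run above
runtime_ms = 0.125884    # "Runtime" line from the run above

gflops = flops / (runtime_ms * 1e-3) / 1e9
print(f"{gflops:.1f} GFLOP/s")   # matches the reported 34122.6 GFLOP/s

# The count is close to the textbook 2*m*n*k for this problem size:
m = n = 512
k = 8192
print(2 * m * n * k)             # 4294967296
```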
Thanks.
ztm
September 10, 2024, 12:27am
9
Thanks AastaLLL.
I tested it on an Orin NX 8GB, which is expected to reach 50 TOPs. With m=512, n=512, k=8192 and Identity2, the result is 28.5 TOPs, about 57% SOL.
Problem ID: 1
Provider: CUTLASS
OperationKind: spgemm
Operation: cutlass_tensorop_s8_i16864spgemm_s8_256x128_128x3_tn_align16
Status: Success
Verification: ON
Disposition: Not verified
reference_device: Not run
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=spgemm --m=512 --n=512 --k=8192 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 \
--cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1 \
--inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024
Bytes: 7077888 bytes
FLOPs: 4295491584 flops
FLOPs/Byte: 606
Runtime: 0.150457 ms
Memory: 43.8118 GiB/s
Math: 28549.6 GFLOP/s
May I know how to test the DLA performance, which is also rated in TOPs, so I can add the DLA and GPU numbers together?
Hi,
Since you have less memory, a smaller matrix size (e.g. m=256, n=256) may help.
The DLA doesn't support CUTLASS; please use the TensorRT API instead.
Below is some DLA sample for your reference:
NVIDIA DLA-SW, the recipes and tools for running deep learning workloads on NVIDIA DLA cores for inference applications.
The Orin NX 8GB has only one DLA, but it runs at the same frequency as on the Orin NX 16GB.
So the DLA's sparse INT8 peak performance should be 20 TOPs.
https://docs.nvidia.com/jetson/archives/r36.3/DeveloperGuide/SD/PlatformPowerAndPerformance/JetsonOrinNanoSeriesJetsonOrinNxSeriesAndJetsonAgxOrinSeries.html#supported-modes-and-power-efficiency
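The per-module budget works out consistently from the datasheet figures quoted in this thread; a quick sanity check:

```python
# Sanity check of the TOPs budget from the datasheet figures quoted above.
orin_nx_16gb_total = 100   # sparse INT8 TOPs, GPU + 2x DLA
orin_nx_16gb_dla = 40      # 2x DLA combined
per_dla = orin_nx_16gb_dla // 2                    # one DLA
gpu_16gb = orin_nx_16gb_total - orin_nx_16gb_dla   # GPU share on 16GB

orin_nx_8gb_total = 70     # sparse INT8 TOPs, GPU + 1x DLA
gpu_8gb = orin_nx_8gb_total - per_dla              # GPU share on 8GB
print(per_dla, gpu_16gb, gpu_8gb)                  # 20 60 50
```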
Thanks.
ztm
September 10, 2024, 4:19am
11
Thanks so much!
On my board, m=256, n=256 didn't help:
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: spgemm
Operation: cutlass_tensorop_s8_i16864spgemm_s8_256x128_128x3_tn_align16
Status: Success
Verification: ON
Disposition: Not verified
reference_device: Not run
cuBLAS: Not run
cuDNN: Not run
Arguments: --gemm_kind=spgemm --m=256 --n=256 --k=8192 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=s32 --cta_m=256 --cta_n=128 \
--cta_k=128 --cluster_m=1 --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2 --warps_k=1 \
--inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024
Bytes: 3473408 bytes
FLOPs: 1073872896 flops
FLOPs/Byte: 309
Runtime: 0.126499 ms
Memory: 25.5723 GiB/s
Math: 8489.21 GFLOP/s
=============================
system
Closed
October 9, 2024, 5:35am
13
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.