GPU stress test for Orin NX

• Hardware Platform (Jetson / GPU)
Jetson Orin NX 16GB
• JetPack Version (valid for Jetson only)
6.2

Hi,

I ran the GPU stress test for 2 minutes on my custom carrier board.

https://elinux.org/Jetson/L4T/TRT_Customized_Example#GPU_Stress_Test

Power draw reaches 37 W (super mode) and the temperature gets close to 80-90 °C. However, I only got 0.137 TOPS, which is far less than the officially announced 157 TOPS. Could you give me some advice?

Performance= 375417490.63 GFlop/s, Time= 0.000 msec, Size= 137438953472 Ops

Hi,

How do you calculate the GFlop?
For TOPS benchmarking, it’s recommended to try our CUTLASS library.

Thanks.

Hi AastaLLL,

I took the stress test tool from the topic below.

The program calculates GFlop/s as follows:
GFlop = (2* $Matrix_size *10e-9) / (operation time / 10e-3)

Oh, it seems that Ops is derived from the matrix size: 2 * 4096 * 4096 * 4096 = 137438953472 Ops. I misunderstood the value …
But the 375417490.63 GFlop/s figure is also far too large, right?
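
A quick sanity check of the numbers in that log line can be sketched as below. It assumes the usual GEMM operation count of 2 * m * n * k and uses 1e-9 and 1e-3 as the giga and milli scale factors, which is an assumption about what the formula above intends. The 10 ms timing is purely hypothetical and only illustrates the scale.

```python
def gemm_ops(m: int, n: int, k: int) -> int:
    """Operation count of an (m x k) by (k x n) multiply: one multiply and one add per term."""
    return 2 * m * n * k


def gflops(ops: int, time_ms: float) -> float:
    """Throughput in GFLOP/s for an operation count and an elapsed time in milliseconds."""
    return (ops * 1e-9) / (time_ms * 1e-3)


ops = gemm_ops(4096, 4096, 4096)   # matches the "Size" field in the log
reported_gflops = 375417490.63     # the "Performance" field in the log

# Elapsed time (in ms) that the reported throughput would imply.
implied_time_ms = ops / (reported_gflops * 1e9) * 1e3

print(f"ops          = {ops}")
print(f"implied time = {implied_time_ms:.6f} ms")
# Hypothetical timing, for scale only: 10 ms per multiply.
print(f"at 10 ms     = {gflops(ops, 10.0) / 1e3:.2f} TFLOP/s")
```

The implied time is well under a microsecond, which is consistent with the log printing "Time= 0.000 msec" and suggests the timing, not the GPU, produced the huge figure.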

Hi AastaLLL,

I found the error in my modified code.
The result is now correct (13-15 TOPS) in the float16 case.

Could you provide a tool to stress-test the GPU in INT8?
Can the CUTLASS library be used for that?

Hi,

Yes, please see the topic below for some info:

Thanks.

Hi AastaLLL,

I tried the method you provided. The steps were:

  1. Change the line below from Identity8 to Identity4:
    cutlass/python/cutlass_library/generator.py at main · NVIDIA/cutlass · GitHub

  2. Build the CUTLASS library
    $ git clone https://github.com/NVIDIA/cutlass.git
    $ cd cutlass/
    $ mkdir build && cd build
    $ cmake .. -DCUTLASS_NVCC_ARCHS=87 \
    -DCUTLASS_LIBRARY_KERNELS=i16864spgemm
    $ make cutlass_profiler -j12

  3. ./tools/profiler/cutlass_profiler --gemm_kind=sgemm --m=1024 --n=1024
    --k=8192 --A=s8:row --B=s8:column --C=s8:row --E=u32:nk2 --alpha=1
    --beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop
    --accum=s32 --cta_m=256 --cta_n=128 --cta_k=128 --cluster_m=1
    --cluster_n=1 --cluster_k=1 --stages=3 --warps_m=4 --warps_n=2
    --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=64 --min_cc=80 --max_cc=1024

And I got the result:

jetson_clocks is running, and the power mode is already MAXN_SUPER.
However, the performance is still only 37 TOPS, significantly below the sparse INT8 figure (100 TOPS) given in the datasheet.
Could you give me some advice?
BTW, my Jetson is an Orin NX, not an AGX Orin. Should the parameters be changed for my case?
Thanks!

Hi,

To measure TOPS, you need a test in which computation far outweighs memory transfer (computation >> memory transfer).
So would you mind trying different (k, m, n)?

You can test it with an argument like k=8192:16384:128.
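
To see how the choice of (m, n, k) shifts the balance between computation and memory transfer, a rough sketch is given below. It uses an idealized dense-GEMM model that is an assumption, not the profiler's own accounting: A and B are each read once, C is written once, all elements are 1 byte (int8), and sparsity metadata, caching, and tiling are ignored. The function name `gemm_intensity` is hypothetical.

```python
def gemm_intensity(m: int, n: int, k: int, bytes_per_elem: int = 1) -> float:
    """Operations per byte moved for an idealized dense GEMM (read A and B once, write C once)."""
    ops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return ops / bytes_moved


# Sweep k with m = n = 1024 (coarser step than the profiler sweep, for brevity).
for k in range(8192, 16384 + 1, 4096):
    print(f"m=n=1024, k={k:5d}: {gemm_intensity(1024, 1024, k):8.1f} ops/byte")

# Larger m and n at a fixed k.
print(f"m=n=4096, k= 8192: {gemm_intensity(4096, 4096, 8192):8.1f} ops/byte")
```

Under this model, raising k alone changes the ratio only slightly once k is much larger than m and n, while raising m and n increases it substantially, so it may be worth sweeping all three dimensions.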

Thanks.

Hi,

I tested k=8192:16384:128, and I got 39 TOPS.

I also saw your response and tried: How to verify Orin the TOPS performance - #8 by AastaLLL

I got 46909 GFLOP/s = 46.9 TOPS with m=512, n=512, k=16256, and Identity2.
But I can’t tell whether the SOL is 46.9/60 or 46.9/100.
BTW, I don’t know how to test the DLA’s TOPS. Does NVIDIA provide a test tool for that?
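
For reference, the two candidate ratios from the SOL question work out as in the sketch below. It takes SOL to mean the fraction of theoretical peak ("speed of light"), which is an assumption, and it does not settle which peak figure applies. The helper `sol_fraction` is hypothetical.

```python
def sol_fraction(measured_tops: float, peak_tops: float) -> float:
    """Measured throughput as a fraction of a given peak figure."""
    return measured_tops / peak_tops


measured_tops = 46909 / 1000  # 46909 GFLOP/s reported by the profiler

# Candidate peak figures quoted in the post; which one applies is the open question.
for peak_tops in (60, 100):
    ratio = sol_fraction(measured_tops, peak_tops)
    print(f"{measured_tops:.1f} / {peak_tops} TOPS = {ratio:.1%} of peak")
```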

Hi,

You will need to use TensorRT to run operations on the DLA.
Please check the below link for more information:

Thanks.