Hello, everyone.
We ordered several Jetson Orin Nano modules and conducted performance testing on their TOPS. According to the official website, the Jetson Orin Nano should deliver 20 TOPS. However, when we used the CUDA sample code “immaTensorCoreGemm” for testing, the achieved TOPS ranged only from 2.5 to 4.5. We’re curious to know what might be causing this significant discrepancy in performance.
Thank you.
Hi,
Have you maximized the device performance first?
$ sudo nvpmodel -m 0
$ sudo jetson_clocks
If the above command doesn’t help, please share your testing source and command with us.
Thanks.
The above commands don’t help.
This is our testing source and command:
$ make
$ ./immaTensorCoreGemm
This is all we used.
Hi,
Could you also share how you calculate the TOPs?
In general, we recommend MLPerf or jetson_benchmark to test the Jetson performance.
Jetson_benchmark GitHub:
MLPerf results:
Thanks.
Hi,
You can also try cuBLAS sample below:
https://elinux.org/Jetson/L4T/TRT_Customized_Example#GPU_Stress_Test
Thanks.
Hello,
We have previously tried the benchmarks and confirmed that the Jetson Orin Nano does meet them; we also agree that they effectively evaluate the Jetson’s performance. However, while reviewing NVIDIA’s technical specifications on the website (Jetson Orin for Next-Gen Robotics | NVIDIA), particularly the section titled “See the Jetson Orin compute performance comparison,” we noticed that the “Jetson Orin Nano 8GB” is listed with “40 SPARSE INT8 TOPs” and “20 DENSE INT8 TOPs” under “GPU Tensor Core INT8 Performance.”
This sparked our interest in understanding how the Orin Nano achieves that level of performance. We came across the CUDA sample “immaTensorCoreGemm,” which is intended to report the TOPS of the device in use. I’m not very familiar with CUDA, so I may have misunderstood the code, but as I understand it, the TOPS figure is computed by multiplying two int8 matrices and counting the number of operations executed per second.
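For reference, here is a minimal sketch of how a GEMM benchmark of this kind typically converts matrix sizes and elapsed time into a TOPS figure; the sizes and timing below are made-up placeholder values, not measurements from the sample:

#include <stdio.h>

int main(void)
{
    /* Illustrative values only; substitute the sizes and kernel time from your own run. */
    const double M = 4096, N = 4096, K = 4096;  /* assumed GEMM dimensions: (M x K) * (K x N) */
    const double milliseconds = 50.0;           /* assumed measured kernel time */

    /* One multiply-accumulate counts as two operations, so a GEMM performs 2*M*N*K ops. */
    const double ops  = 2.0 * M * N * K;
    const double tops = ops / (milliseconds / 1000.0) / 1e12;  /* tera-operations per second */

    printf("TOPS: %.2f\n", tops);
    return 0;
}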
However, our own attempts in this direction have yielded results of only 2.5 to 4.5 TOPS, which differs from the figures presented on the website. This suggests the possibility of an error somewhere. As a result, we are keen to understand Nvidia’s approach to calculating TOPS and how we can achieve the benchmark they’ve provided.
Thank you.
Hi,
When running the benchmark, our profiler (e.g., Nsight Compute) can generate the TOPs score at runtime.
Is this acceptable to you?
If not, please try the cuBLAS result shared above.
Thanks.
Hi,
I have been trying the code above. It seems that cuBLAS only supports float32 and float16 matrix multiplication, and the output is in FLOPS.
I have tried both codes and made some changes to them.
The first is the original cuBLAS output, and the performance is very low. This issue may be caused by the small matrix size.
In the second code, I changed the original cuBLAS matrix size from MatrixA: 640x480, MatrixB: 480x320 to MatrixA: 2560x1920, MatrixB: 1920x1280. The performance was improved, but it still remains below 1 TFLOPS.
For the third, I ran the “half” cuBLAS sample shown in your link and made three changes (a sketch of the modified benchmark is at the end of this post). First, I reduced the iteration count from the original 9999999 to a more manageable 10 for testing. Second, I adjusted the matrix sizes, setting both matrices to 8192x8192, since matrix size has a significant impact on the measured performance. Third, I changed the correctness check: instead of comparing the results against the CPU, which is time-consuming, I skipped the CPU-based check entirely, having already confirmed basic correctness with small matrix multiplications, so we could focus purely on compute performance.
The output is shown below.
None of these three tests displayed TOPS data. I attempted to use Nsight Compute to examine the output of immaTensorCoreGemm, and the figures from immaTensorCoreGemm match those from Nsight. However, the TOPS performance remains in the range of only 2.5 to 4.5.
Finally, I am really interested in how NVIDIA measures the “OPS” capability quoted on the website. 20 TOPS seems very hard to achieve, so please explain how we can reach that benchmark on the Orin Nano, rather than just how to measure FLOPS.
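For completeness, here is a rough, self-contained sketch of the modified FP16 test described above, assuming cublasHgemm, uninitialized 8192x8192 matrices, and 10 timed iterations; it is not the exact code from the eLinux page (assumed build command: nvcc -O2 -o hgemm_bench hgemm_bench.cu -lcublas):

#include <cstdio>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cublas_v2.h>

int main()
{
    const int M = 8192, N = 8192, K = 8192;  // large sizes, as matrix size strongly affects throughput
    const int iters = 10;                    // reduced iteration count for a quick test

    // Device buffers are left uninitialized: only throughput is measured here,
    // so no CPU-based correctness check is performed.
    __half *A, *B, *C;
    cudaMalloc(&A, sizeof(__half) * M * K);
    cudaMalloc(&B, sizeof(__half) * K * N);
    cudaMalloc(&C, sizeof(__half) * M * N);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const __half alpha = __float2half(1.0f), beta = __float2half(0.0f);

    // Warm-up launch so one-time initialization cost is not timed.
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                &alpha, A, M, B, K, &beta, C, M);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                    &alpha, A, M, B, K, &beta, C, M);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // 2*M*N*K floating-point operations per GEMM call.
    const double tflops = 2.0 * M * N * K * iters / (ms / 1000.0) / 1e12;
    printf("FP16 GEMM: %.2f TFLOPS over %d iterations\n", tflops, iters);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}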
Thanks.
Hi,
Have you checked smsp__inst_executed.sum, which is the total value across all SM sub-partitions?
Thanks.
I’m not sure what “smsp__inst_executed.sum” is. If it’s a parameter to Nsight, I’ve tried it, but the report doesn’t change.
Like this:
sudo nsys nvprof --matrics smsp__inst_executed.sum ./immaTensorCoreGemm
Could you please explain how to use this command (parameter)?
Hi,
Sorry for the unclear comment.
The metric can be found in Nsight Compute:
Thanks.
Hi
My Orin might have some issues. It has the Nsight Compute CLI but no Nsight Compute GUI, so I can’t easily find where “smsp__inst_executed.sum” is. Is there any way to find it using the Nsight Compute CLI or other tools?
Thanks.
Hi,
Sorry for the late update.
You can get it like below:
$ sudo /tmp/var/target/linux-v4l_l4t-t210-a64/ncu --metrics sm__inst_executed_pipe_tensor.sum.peak_sustained /usr/local/cuda-11.4/samples/0_Simple/vectorAdd/vectorAdd
[Vector addition of 50000 elements]
==PROF== Connected to process 9786 (/usr/local/cuda-11.4/samples/0_Simple/vectorAdd/vectorAdd)
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
==PROF== Profiling "vectorAdd" - 0: 0%....50%....100% - 1 pass
Copy output data from the CUDA device to the host memory
Test PASSED
Done
==PROF== Disconnected from process 9786
[9786] vectorAdd@127.0.0.1
vectorAdd(const float *, const float *, float *, int), 2023-Aug-08 15:40:31, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
sm__inst_executed_pipe_tensor.sum.peak_sustained inst/cycle 32
---------------------------------------------------------------------- --------------- ------------------------------
Thanks.
Hi,
Thanks for the reply. I received the message below:
---------------------------------------------------------------------- --------------- ------------------------------
sm__inst_executed_pipe_tensor.sum.peak_sustained inst/cycle 16
---------------------------------------------------------------------- --------------- ------------------------------
However, I’m not sure how to convert this information into TOPS (tera operations per second). Is it calculated as (GPU Max Frequency) * (GPU cores) * (inst/cycle)?
Additionally, should we include tensor cores and CUDA cores in this calculation, or should they not be counted in this way?
Thank you.
Hi,
Sorry for the late update.
In general, you can get the number of operations per cycle and the number of cycles per nanosecond (i.e., the clock rate) from the profiler.
From those two numbers you can calculate the TOPs value.
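As a rough, hedged sanity check only (not an official NVIDIA calculation): taking the published Orin Nano 8GB figures of 32 Tensor Cores and a 625 MHz maximum GPU clock, and assuming each Tensor Core can issue 1024 dense INT8 operations per cycle,

TOPS ≈ (operations per cycle) x (clock in Hz) / 1e12
     ≈ (32 x 1024) x 0.625e9 / 1e12 ≈ 20.5

which lines up with the 20 dense INT8 TOPS on the website without counting the CUDA cores.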
TOPs here indicates Tensor Operations rather than tera operations.
Thanks