Regarding the issue of the GPU compute power test results being significantly lower than expected

qinqin52 · August 22, 2024, 3:06am

I am using a Jetson Orin Nano 8GB device. The product's compute specification is 40 TOPS, but my test measures only 0.737 TOPS. The test data is below. Could you please help analyze why the measured compute power is so far below the specification? If my testing method is incorrect, could you provide a correct and effective one?

[2024-08-12 12:21:02] root@edge-computer:/home/ec5000/gpu-burn# ./gpu_burn 7200
[2024-08-12 12:21:05]
Using compare file: compare.ptx
[2024-08-12 12:21:05] Burning for 7200 seconds.
[2024-08-12 12:21:05] GPU 0: Orin (nvgpu) (UUID: 08a045c1-dd56-52d6-8474-7004270a7311)
[2024-08-12 12:21:05] Initialized device 0 with 7620 MB of memory (5083 MB available, using 4575 MB of it), using FLOATS
[2024-08-12 12:21:11] Results are 268435456 bytes each, thus performing 15 iterations
[2024-08-12 12:21:11]
0.3% proc'd: 15 (695 Gflop/s) errors: 0 temps: --
0.3% proc'd: 15 (695 Gflop/s) errors: 0 temps: --
0.3% proc'd: 15 (695 Gflop/s) errors: 0 temps: --
0.3% proc'd: 15 (695 Gflop/s) errors: 0 temps: --
0.3% proc'd: 15 (695 Gflop/s) errors: 0 temps: --
0.3% proc'd: 15 (695 Gflop/s) errors: 0 temps: --
0.3% proc'd: 15 (695 Gflop/s) errors: 0 temps: --
0.3% proc'd: 15 (695 Gflop/s) errors: 0 temps: --
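
(Editor's note) To make the gpu_burn figures easier to interpret, here is a minimal Python sketch that pulls the two numbers out of the log above and works out how much arithmetic they imply. It assumes, and this is an assumption rather than something the log states, that each result buffer is a square matrix of 4-byte floats produced by one n x n x n matrix multiply, counted as 2 floating-point operations per multiply-add. The variable names are illustrative only.

```python
import math
import re

# Two lines copied from the gpu_burn output above.
log = (
    "Results are 268435456 bytes each, thus performing 15 iterations\n"
    "15 (695 Gflop/s) errors: 0\n"
)

result_bytes = int(re.search(r"Results are (\d+) bytes", log).group(1))
gflops = float(re.search(r"\((\d+(?:\.\d+)?) Gflop/s\)", log).group(1))

BYTES_PER_FLOAT = 4  # the log says "using FLOATS", i.e. FP32

# Assumption: the result is a square n x n matrix of floats.
n = math.isqrt(result_bytes // BYTES_PER_FLOAT)

# Assumption: one n x n x n GEMM, 2 FLOP (multiply + add) per inner step.
flop_per_gemm = 2 * n**3

# Time one such multiply would take at the reported rate.
seconds_per_gemm = flop_per_gemm / (gflops * 1e9)

print(f"matrix dimension n      : {n}")
print(f"FLOP per multiply       : {flop_per_gemm:.3e}")
print(f"reported rate           : {gflops:.0f} GFLOP/s (FP32)")
print(f"seconds per multiply    : {seconds_per_gemm:.2f}")
```

Under those assumptions the reported rate is a single-precision floating-point figure, which is a different quantity from the TOPS number on the product page.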

[2024-08-05 10:27:28] root@edge-computer:/usr/local/cuda/cuda-samples/Samples/1_Utilities/deviceQuery# ./deviceQuery
[2024-08-05 10:27:30]
./deviceQuery Starting…
[2024-08-05 10:27:30]
[2024-08-05 10:27:30] CUDA Device Query (Runtime API) version (CUDART static linking)
[2024-08-05 10:27:30]
[2024-08-05 10:27:30] Detected 1 CUDA Capable device(s)
[2024-08-05 10:27:30]
[2024-08-05 10:27:30] Device 0: "Orin"
[2024-08-05 10:27:30] CUDA Driver Version / Runtime Version 12.2 / 12.6
[2024-08-05 10:27:30] CUDA Capability Major/Minor version number: 8.7
[2024-08-05 10:27:30] Total amount of global memory: 15656 MBytes (16416673792 bytes)
[2024-08-05 10:27:30] (008) Multiprocessors, (128) CUDA Cores/MP: 1024 CUDA Cores
[2024-08-05 10:27:30] GPU Max Clock rate: 918 MHz (0.92 GHz)
[2024-08-05 10:27:30] Memory Clock rate: 408 Mhz
[2024-08-05 10:27:30] Memory Bus Width: 256-bit
[2024-08-05 10:27:30] L2 Cache Size: 2097152 bytes
[2024-08-05 10:27:30] Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
[2024-08-05 10:27:30] Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
[2024-08-05 10:27:30] Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
[2024-08-05 10:27:30] Total amount of constant memory: 65536 bytes
[2024-08-05 10:27:30] Total amount of shared memory per block: 49152 bytes
[2024-08-05 10:27:30] Total shared memory per multiprocessor: 167936 bytes
[2024-08-05 10:27:30] Total number of registers available per block: 65536
[2024-08-05 10:27:30] Warp size: 32
[2024-08-05 10:27:30] Maximum number of threads per multiprocessor: 1536
[2024-08-05 10:27:30] Maximum number of threads per block: 1024
[2024-08-05 10:27:30] Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
[2024-08-05 10:27:30] Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
[2024-08-05 10:27:30] Maximum memory pitch: 2147483647 bytes
[2024-08-05 10:27:30] Texture alignment: 512 bytes
[2024-08-05 10:27:30] Concurrent copy and kernel execution: Yes with 2 copy engine(s)
[2024-08-05 10:27:30] Run time limit on kernels: No
[2024-08-05 10:27:30] Integrated GPU sharing Host Memory: Yes
[2024-08-05 10:27:30] Support host page-locked memory mapping: Yes
[2024-08-05 10:27:30] Alignment requirement for Surfaces: Yes
[2024-08-05 10:27:30] Device has ECC support: Disabled
[2024-08-05 10:27:30] Device supports Unified Addressing (UVA): Yes
[2024-08-05 10:27:30] Device supports Managed Memory: Yes
[2024-08-05 10:27:30] Device supports Compute Preemption: Yes
[2024-08-05 10:27:30] Supports Cooperative Kernel Launch: Yes
[2024-08-05 10:27:30] Supports MultiDevice Co-op Kernel Launch: Yes
[2024-08-05 10:27:30] Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
[2024-08-05 10:27:30] Compute Mode:
[2024-08-05 10:27:30] < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
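
(Editor's note) The deviceQuery output above is enough for a rough estimate of the theoretical FP32 peak of the CUDA cores. The sketch below uses the SM count, cores per SM, and max clock exactly as reported, and assumes each CUDA core retires one fused multiply-add (2 FLOP) per cycle. It deliberately ignores the Tensor Cores, so it is not an estimate of the advertised TOPS figure.

```python
# Values copied from the deviceQuery output above.
sm_count = 8            # "(008) Multiprocessors"
cores_per_sm = 128      # "(128) CUDA Cores/MP"
max_clock_hz = 918e6    # "GPU Max Clock rate: 918 MHz"

# Assumption: one fused multiply-add per core per cycle = 2 FLOP.
FLOP_PER_CORE_PER_CYCLE = 2

cuda_cores = sm_count * cores_per_sm
peak_fp32_gflops = cuda_cores * FLOP_PER_CORE_PER_CYCLE * max_clock_hz / 1e9

print(f"CUDA cores              : {cuda_cores}")
print(f"theoretical FP32 peak   : {peak_fp32_gflops:.0f} GFLOP/s")
```

This is an upper bound at the reported maximum clock. The clock actually in effect during a run depends on the active power mode, which the logs here do not show.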

root@edge-computer:/home/edge/cuda-samples/Samples/1_Utilities/bandwidthTest# ./bandwidthTest
[2024-08-05 10:42:40]
[CUDA Bandwidth Test] - Starting…
[2024-08-05 10:42:40] Running on…
[2024-08-05 10:42:40]
[2024-08-05 10:42:40] Device 0: Orin
[2024-08-05 10:42:40] Quick Mode
[2024-08-05 10:42:40]
[2024-08-05 10:42:40] Host to Device Bandwidth, 1 Device(s)
[2024-08-05 10:42:41] PINNED Memory Transfers
[2024-08-05 10:42:41] Transfer Size (Bytes)Bandwidth(GB/s)
[2024-08-05 10:42:41] 32000000 11.2
[2024-08-05 10:42:41]
[2024-08-05 10:42:41] Device to Host Bandwidth, 1 Device(s)
[2024-08-05 10:42:41] PINNED Memory Transfers
[2024-08-05 10:42:41] Transfer Size (Bytes)Bandwidth(GB/s)
[2024-08-05 10:42:41] 32000000 11.1
[2024-08-05 10:42:41]
[2024-08-05 10:42:41] Device to Device Bandwidth, 1 Device(s)
[2024-08-05 10:42:41] PINNED Memory Transfers
[2024-08-05 10:42:41] Transfer Size (Bytes)Bandwidth(GB/s)
[2024-08-05 10:42:41] 32000000 42.8
[2024-08-05 10:42:41]
[2024-08-05 10:42:41] Result = PASS
[2024-08-05 10:42:41]
[2024-08-05 10:42:41] NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Robert_Crovella · August 22, 2024, 3:48pm

I suggest asking this question on the relevant Orin Nano forum. Note that TOPS and GFLOP/s are not the same thing. Based on the specifications, I would expect that achieving the highest TOPS requires an operation that uses the Tensor Cores, probably at a low precision such as INT8 (or perhaps INT4). Such a test would not be measuring FLOPs. You might want to pass the -tc option to gpu_burn, but I would not expect that to perform INT8 operations.
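
(Editor's note) The point that TOPS and GFLOP/s are different quantities can be made concrete with a small Python sketch. It tags each throughput figure with the kind of operation it counts and refuses to form a ratio across kinds. The `Throughput` class and `ratio` helper are hypothetical names invented for this illustration. Labelling the 40 TOPS specification as INT8 follows the expectation stated in the reply and is an assumption, not a figure confirmed in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Throughput:
    tera_ops_per_s: float  # 1e12 operations per second
    op_kind: str           # what is being counted, e.g. "FP32" or "INT8"


def ratio(measured: Throughput, reference: Throughput) -> float:
    """Return measured / reference, but only for the same kind of operation."""
    if measured.op_kind != reference.op_kind:
        raise ValueError(
            f"cannot compare {measured.op_kind} against {reference.op_kind}"
        )
    return measured.tera_ops_per_s / reference.tera_ops_per_s


# gpu_burn reported 695 Gflop/s "using FLOATS", i.e. FP32.
measured = Throughput(tera_ops_per_s=695 / 1000, op_kind="FP32")

# Assumption: the 40 TOPS specification counts low-precision INT8 operations.
spec = Throughput(tera_ops_per_s=40.0, op_kind="INT8")

try:
    ratio(measured, spec)
except ValueError as err:
    print("not comparable:", err)
```

Dividing 0.695 by 40 gives a number, but it does not describe how close the device is to its specification, because the numerator and denominator count different things.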
