GPU compute test results are significantly lower than expected

I am using a Nano 8G device. The product's compute specification is 40 TOPS, but my test measures only 0.737 TOPS. The test data is below. Could you please help analyze why the measured compute is so far below the specification? If my testing method is incorrect, could you suggest a correct and effective one?

[2024-08-12 12:21:02] root@edge-computer:/home/ec5000/gpu-burn# ./gpu_burn 7200
[2024-08-12 12:21:05]
Using compare file: compare.ptx
[2024-08-12 12:21:05] Burning for 7200 seconds.
[2024-08-12 12:21:05] GPU 0: Orin (nvgpu) (UUID: 08a045c1-dd56-52d6-8474-7004270a7311)
[2024-08-12 12:21:05] Initialized device 0 with 7620 MB of memory (5083 MB available, using 4575 MB of it), using FLOATS
[2024-08-12 12:21:11] Results are 268435456 bytes each, thus performing 15 iterations
[2024-08-12 12:21:11]
0.3% proc’d: 15 (695 Gflop/s) errors: 0 temps: –
0.3% proc’d: 15 (695 Gflop/s) errors: 0 temps: –
0.3% proc’d: 15 (695 Gflop/s) errors: 0 temps: –
0.3% proc’d: 15 (695 Gflop/s) errors: 0 temps: –
0.3% proc’d: 15 (695 Gflop/s) errors: 0 temps: –
0.3% proc’d: 15 (695 Gflop/s) errors: 0 temps: –
0.3% proc’d: 15 (695 Gflop/s) errors: 0 temps: –
0.3% proc’d: 15 (695 Gflop/s) errors: 0 temps: –

[2024-08-05 10:27:28] root@edge-computer:/usr/local/cuda/cuda-samples/Samples/1_Utilities/deviceQuery# ./deviceQuery
[2024-08-05 10:27:30]
./deviceQuery Starting…
[2024-08-05 10:27:30]
[2024-08-05 10:27:30] CUDA Device Query (Runtime API) version (CUDART static linking)
[2024-08-05 10:27:30]
[2024-08-05 10:27:30] Detected 1 CUDA Capable device(s)
[2024-08-05 10:27:30]
[2024-08-05 10:27:30] Device 0: “Orin”
[2024-08-05 10:27:30] CUDA Driver Version / Runtime Version 12.2 / 12.6
[2024-08-05 10:27:30] CUDA Capability Major/Minor version number: 8.7
[2024-08-05 10:27:30] Total amount of global memory: 15656 MBytes (16416673792 bytes)
[2024-08-05 10:27:30] (008) Multiprocessors, (128) CUDA Cores/MP: 1024 CUDA Cores
[2024-08-05 10:27:30] GPU Max Clock rate: 918 MHz (0.92 GHz)
[2024-08-05 10:27:30] Memory Clock rate: 408 Mhz
[2024-08-05 10:27:30] Memory Bus Width: 256-bit
[2024-08-05 10:27:30] L2 Cache Size: 2097152 bytes
[2024-08-05 10:27:30] Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
[2024-08-05 10:27:30] Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
[2024-08-05 10:27:30] Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
[2024-08-05 10:27:30] Total amount of constant memory: 65536 bytes
[2024-08-05 10:27:30] Total amount of shared memory per block: 49152 bytes
[2024-08-05 10:27:30] Total shared memory per multiprocessor: 167936 bytes
[2024-08-05 10:27:30] Total number of registers available per block: 65536
[2024-08-05 10:27:30] Warp size: 32
[2024-08-05 10:27:30] Maximum number of threads per multiprocessor: 1536
[2024-08-05 10:27:30] Maximum number of threads per block: 1024
[2024-08-05 10:27:30] Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
[2024-08-05 10:27:30] Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
[2024-08-05 10:27:30] Maximum memory pitch: 2147483647 bytes
[2024-08-05 10:27:30] Texture alignment: 512 bytes
[2024-08-05 10:27:30] Concurrent copy and kernel execution: Yes with 2 copy engine(s)
[2024-08-05 10:27:30] Run time limit on kernels: No
[2024-08-05 10:27:30] Integrated GPU sharing Host Memory: Yes
[2024-08-05 10:27:30] Support host page-locked memory mapping: Yes
[2024-08-05 10:27:30] Alignment requirement for Surfaces: Yes
[2024-08-05 10:27:30] Device has ECC support: Disabled
[2024-08-05 10:27:30] Device supports Unified Addressing (UVA): Yes
[2024-08-05 10:27:30] Device supports Managed Memory: Yes
[2024-08-05 10:27:30] Device supports Compute Preemption: Yes
[2024-08-05 10:27:30] Supports Cooperative Kernel Launch: Yes
[2024-08-05 10:27:30] Supports MultiDevice Co-op Kernel Launch: Yes
[2024-08-05 10:27:30] Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
[2024-08-05 10:27:30] Compute Mode:
[2024-08-05 10:27:30] < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

root@edge-computer:/home/edge/cuda-samples/Samples/1_Utilities/bandwidthTest# ./bandwidthTest
[2024-08-05 10:42:40]
[CUDA Bandwidth Test] - Starting…
[2024-08-05 10:42:40] Running on…
[2024-08-05 10:42:40]
[2024-08-05 10:42:40] Device 0: Orin
[2024-08-05 10:42:40] Quick Mode
[2024-08-05 10:42:40]
[2024-08-05 10:42:40] Host to Device Bandwidth, 1 Device(s)
[2024-08-05 10:42:41] PINNED Memory Transfers
[2024-08-05 10:42:41] Transfer Size (Bytes)Bandwidth(GB/s)
[2024-08-05 10:42:41] 32000000 11.2
[2024-08-05 10:42:41]
[2024-08-05 10:42:41] Device to Host Bandwidth, 1 Device(s)
[2024-08-05 10:42:41] PINNED Memory Transfers
[2024-08-05 10:42:41] Transfer Size (Bytes)Bandwidth(GB/s)
[2024-08-05 10:42:41] 32000000 11.1
[2024-08-05 10:42:41]
[2024-08-05 10:42:41] Device to Device Bandwidth, 1 Device(s)
[2024-08-05 10:42:41] PINNED Memory Transfers
[2024-08-05 10:42:41] Transfer Size (Bytes)Bandwidth(GB/s)
[2024-08-05 10:42:41] 32000000 42.8
[2024-08-05 10:42:41]
[2024-08-05 10:42:41] Result = PASS
[2024-08-05 10:42:41]
[2024-08-05 10:42:41] NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

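I suggest asking this question on the relevant Orin Nano forum. Note that TOPS and GFLOP/s are not the same thing: GFLOP/s counts floating-point operations only, while a TOPS figure counts operations of whatever type the spec is quoted for. Based on the specifications, I would expect that reaching the highest TOPS requires an operation that uses the Tensor Cores, probably at a low precision such as INT8 (or perhaps INT4). Such a test would not be measuring flops at all. Your gpu_burn log says "using FLOATS", so the 695 Gflop/s it reports is floating-point throughput and cannot be compared directly with the 40 TOPS figure. You might want to pass the -tc option to gpu_burn, but I would not expect that to be doing INT8 ops.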
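
As a rough sanity check, the gpu_burn figure can be compared against a theoretical FP32 peak derived from the deviceQuery output above, rather than against the TOPS spec. The sketch below is only an estimate: it assumes each CUDA core completes one fused multiply-add per cycle (counted as 2 floating-point operations) and that the GPU actually runs at the reported maximum clock, neither of which the logs confirm. The function and variable names are my own, chosen for illustration.

```python
# Figures copied from the logs in the question.
CUDA_CORES = 1024        # deviceQuery: (008) Multiprocessors x (128) CUDA Cores/MP
MAX_CLOCK_GHZ = 0.918    # deviceQuery: GPU Max Clock rate 918 MHz
MEASURED_GFLOPS = 695.0  # gpu_burn: "695 Gflop/s", run with FLOATS

# Assumption (not stated in the logs): one FP32 fused multiply-add
# per CUDA core per cycle, counted as 2 floating-point operations.
FLOPS_PER_CORE_PER_CYCLE = 2


def peak_fp32_gflops(cores, clock_ghz, flops_per_cycle=FLOPS_PER_CORE_PER_CYCLE):
    """Theoretical FP32 peak in GFLOP/s for the given core count and clock."""
    return cores * clock_ghz * flops_per_cycle


peak = peak_fp32_gflops(CUDA_CORES, MAX_CLOCK_GHZ)
fraction = MEASURED_GFLOPS / peak

print(f"Estimated FP32 peak: {peak:.0f} GFLOP/s")
print(f"gpu_burn measured:   {MEASURED_GFLOPS:.0f} GFLOP/s ({fraction:.0%} of estimate)")
```

Under these assumptions the estimated FP32 peak is roughly 1880 GFLOP/s, so the measured 695 GFLOP/s is a bit over a third of it. That is the like-for-like comparison for this test. The 40 TOPS number describes a different kind of operation, so it would need a different benchmark.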