Jetson AGX Orin 32GB: Measured Memory Bandwidth Much Lower Than Theoretical Spec

Hi:
During our evaluation of the Jetson AGX Orin 32GB platform, we observed that the actual measured memory bandwidth is significantly lower than the official specification of 204.8 GB/s. The specific test results are as follows:

  • When testing memory copy performance using the STREAM benchmark, the measured bandwidth was approximately 54.5 GB/s;
  • When using the bandwidthTest tool:
    • Device-to-Host (D2H) bandwidth was measured at approximately 26.3 GB/s,
    • Host-to-Device (H2D) bandwidth was measured at approximately 26.2 GB/s,
    • Device-to-Device (D2D) bandwidth was measured at approximately 151.7 GB/s.

Given the noticeable gap between the measured results and the theoretical bandwidth, we would like to know whether there are optimizations, system configurations, or other adjustments we should consider in order to achieve performance closer to the expected specification.
Thanks!

We are using Jetson Linux R36.3.0, and the detailed system information is as follows:

# cat /etc/nv_tegra_release
# R36 (release), REVISION: 3.0, GCID: 36191598, BOARD: generic, EABI: aarch64, DATE: Mon May  6 17:34:21 UTC 2024
# KERNEL_VARIANT: oot
TARGET_USERSPACE_LIB_DIR=nvidia
TARGET_USERSPACE_LIB_DIR_PATH=usr/lib/aarch64-linux-gnu/nvidia

# cat /proc/cmdline
root=/dev/mmcblk0p1 rw rootwait rootfstype=ext4 mminit_loglevel=4 console=ttyTCU0,115200 console=ttyAMA0,115200 firmware_class.path=/etc/firmware fbcon=map:0 net.ifnames=0 nospectre_bhb video=efifb:off console=tty0 iommu.strict=0 arm-smmu.disable_bypass=0 arm-smmu.force_stage=1 bl_prof_dataptr=2031616@0x82C610000 bl_prof_ro_ptr=65536@0x82C600000

# uname -a
Linux test 5.15.136-tegra #1 SMP PREEMPT Tue Apr 22 03:03:52 Asia 2025 aarch64 aarch64 aarch64 GNU/Linux

# ./deviceQuery/deviceQuery
./deviceQuery/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          12.2 / 12.8
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 30692 MBytes (32182566912 bytes)
  (014) Multiprocessors, (128) CUDA Cores/MP:    1792 CUDA Cores
  GPU Max Clock rate:                            930 MHz (0.93 GHz)
  Memory Clock rate:                             930 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 12.8, NumDevs = 1
Result = PASS

We are using the latest version of the cuda-samples tools, and the test commands are as follows:

# jetson_clocks
# ./bandwidthTest -memory pinned
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Orin
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     26.2

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     26.3

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     151.7

Result = PASS

Hi,

The spec is the theoretical peak bandwidth.
Real memory performance depends on the memory type and usage.

For example, in your test the device-to-device transfer (151.7 GB/s) is already much higher than the H2D/D2H results.
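
For reference, the 204.8 GB/s figure is just the raw LPDDR5 interface rate: a 256-bit (32-byte-wide) interface at 6400 MT/s works out to 32 B × 6.4 GT/s = 204.8 GB/s, before any memory-controller efficiency, arbitration, or software overhead is taken into account.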

Please also note that you can maximize the EMC clock with the below commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
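
To confirm that the EMC clock is actually pinned at its maximum, the current clock settings can also be checked with:

$ sudo jetson_clocks --show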

Thanks.

Hi,

Thanks for your reply.

Before running the tests, we had already applied the suggested settings to maximize EMC clock:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

However, even with those settings, the highest bandwidth we observed, device-to-device (D2D) at 151.7 GB/s, is only about 74% of the theoretical 204.8 GB/s.

We’d like to understand whether this level of performance is expected under typical conditions, or if there are further tuning methods or considerations we should explore to get closer to the theoretical peak.

Hi,

Sorry for the late update.

We don’t have public H2D, D2H, or D2D memory bandwidth values, only the theoretical spec.
It’s expected that the GPU does not get 100% of the bandwidth.
The memory controller has an arbiter that also grants other clients, such as the CPU cores and the USB/PCIe root complexes, access to memory.

Thanks.

Hi,

According to the official specifications of the Jetson AGX Orin published by NVIDIA:

🔗 NVIDIA Jetson AGX Orin

The listed peak memory bandwidth is 204.8 GB/s, but in our actual measurements using the STREAM benchmark, the highest observed bandwidth is only about 54.5 GB/s, which is significantly lower than the theoretical value.

We understand that part of the bandwidth may be consumed by other system components—such as the CPU, USB, or PCIe Root Complex—due to arbitration by the memory controller. However, could you please provide more insight into why the observed bandwidth is consistently much lower than the spec?

Additionally, are there any recommended optimizations or tuning methods to improve the achievable bandwidth?

Thank you.

Hi,

Could you share how you measure the memory bandwidth?
We will check with our internal team to gather more info.

Thanks.

Hi,

Our internal team needs more information first.

Could you share which bandwidth app you used for testing?
Is the 54 GB/s a D2D, H2D, or D2H bandwidth?

Thanks.

Hi
We used the STREAM benchmark to test memory copy performance, and the result was approximately 54 GB/s.

The detailed test steps are as follows:

# nvpmodel -m 0
# jetson_clocks
# git clone https://github.com/jeffhammond/STREAM.git
# cd STREAM
# gcc -O3 -fopenmp -march=native -DNTIMES=20 stream.c -o stream
# export OMP_NUM_THREADS=8
# ./stream
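
For reference, STREAM’s “Copy” test is essentially the loop c[j] = a[j] over the whole array, and it counts 2 × sizeof(double) × N bytes per pass (one read plus one write), so the reported ~54 GB/s already includes traffic in both directions.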

Thanks.

Hi,

We tested the sample and we can get 72 GB/s for the “Copy” test.

But please note that the sample is an H2H (host-to-host) memory copy, so it does not involve CUDA or GPU buffers.

Thanks.

Hi,

Thank you for your explanation.

We ran the same sample but only measured around 54 GB/s in the “Copy” test.
Could you please let us know if there were any specific optimizations or system configurations applied on your side?

Also, regarding the officially stated 204.8 GB/s bandwidth, may I ask what type of memory transfer this refers to (e.g., D2D, H2H, or D2H)? Even at 72 GB/s, there is still a considerable gap compared to the theoretical 204.8 GB/s. We’re trying to better understand this difference and would appreciate any insights or recommendations for more accurate testing.

Thanks again!

Hi

The 204.8 GB/s is the theoretical bandwidth, not a D2D, H2D, or D2H transfer rate, as those transfers are all affected by software overhead.

Based on the testing data you shared above, would you mind finding some benchmarks for D2D transfer?
While we are still discussing this with our internal team, a benchmark score between GPU buffers should provide some useful information as well.
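
For example, a minimal sketch along those lines (just an illustration, with arbitrarily chosen buffer size and iteration count, not an official benchmark) would time repeated device-to-device cudaMemcpy calls with CUDA events:

// d2d_memcpy_bw.cu - rough sketch only; the file name is illustrative
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;   // 256 MiB per buffer (arbitrary choice)
    const int iters = 100;

    float *src = nullptr, *dst = nullptr;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);
    cudaMemset(src, 1, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);   // warm-up
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each copy reads 'bytes' and writes 'bytes', so both directions are counted.
    printf("D2D memcpy: %.1f GB/s\n", 2.0 * bytes * iters / (ms / 1e3) / 1e9);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}

Build with something like nvcc -O3 d2d_memcpy_bw.cu -o d2d_memcpy_bw and run it after jetson_clocks.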

Thanks.

Hi,

Thanks for your feedback!

For measuring device-to-device (D2D) bandwidth, you might consider using the bandwidthTest utility from the CUDA SDK — it could provide a more direct measurement of D2D transfers.
Also, you mentioned achieving 72 GB/s using STREAM. I’m curious — was that result due to any specific optimizations on your side, or could there be differences in how we’re testing?

Looking forward to your insights. Appreciate your help!

Hi,

Unfortunately, the CUDA bandwidthTest sample app is mainly for demonstration and is not a performance benchmark, so its values cannot be relied upon.

The performance on our side was measured on an AGX Orin 64GB.
Although the memory bandwidth specs of the 32GB and 64GB modules are the same, the memory size seems to have some impact on this OpenMP H2H copy test.

Thanks.

Hi

Thanks for the reply.

Would it be possible to run the same test on an AGX Orin 32GB device as well?

Although the AGX Orin 64GB and 32GB versions share the same memory bandwidth specifications, there are still some hardware differences — such as the CPU model, CPU frequency, and the GPU’s maximum frequency — which might affect the performance results.

Of course, this is just our assumption at this point, but running the test on the 32GB version could help us better understand its achievable bandwidth limits.

Appreciate your support!

Hi,

The sample uses OpenMP instead of CUDA.

To measure the practical memory bandwidth, we recommend a D2D CUDA bandwidth app.
We do have an internal app to test this, but it cannot be shared publicly.
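
That said, a rough, unofficial sketch of a kernel-driven D2D measurement (with arbitrarily chosen sizes, and not the internal tool) is to stream between two GPU buffers and time the kernel with CUDA events:

// d2d_kernel_bw.cu - rough sketch only, not NVIDIA's internal tool
#include <cstdio>
#include <cuda_runtime.h>

// One read and one write per element, using 16-byte vector loads/stores.
__global__ void copy_kernel(const float4 *in, float4 *out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        out[i] = in[i];
}

int main() {
    const size_t bytes = 512ull << 20;            // 512 MiB per buffer (arbitrary)
    const size_t n = bytes / sizeof(float4);
    const int iters = 50;

    float4 *in = nullptr, *out = nullptr;
    cudaMalloc(&in, bytes);
    cudaMalloc(&out, bytes);
    cudaMemset(in, 0, bytes);

    copy_kernel<<<1024, 256>>>(in, out, n);       // warm-up

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        copy_kernel<<<1024, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each pass reads 'bytes' and writes 'bytes'.
    printf("Kernel D2D copy: %.1f GB/s\n", 2.0 * bytes * iters / (ms / 1e3) / 1e9);

    cudaFree(in);
    cudaFree(out);
    return 0;
}

It simply counts the bytes read and written by the kernel between two GPU buffers, independent of how bandwidthTest reports its numbers.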

Thanks.

Hi:

Could you help run the same STREAM benchmark on the AGX Orin 32GB?

We’d like to compare the results and see where the performance differences come from.

Thanks

Hi,

The sample uses OpenMP.
To measure the memory bandwidth, a D2D CUDA bandwidth app is recommended.

Thanks.

Hi,

Thanks for your explanation.

We would like to compare the results and evaluate whether there are any differences in memory bandwidth between our measurements and yours.

Thanks again!

Hi,

We got the results below on an AGX Orin 32GB kit:

Function    Best Rate MB/s
Copy:           55075.0
Copy:           55247.3
Copy:           55192.7
Avg:             55 GB/s

Thanks.
