Memory bandwidth on Orin

Hi.
The advertised memory bandwidth on Orin is 204.8 GB/s, per my understanding of Orin’s documentation.

When I measure it using the NVIDIA bandwidthTest CUDA sample, I see a huge difference between host<->device memory throughput and device<->device memory throughput.

For host-to-device (and the other way around), the throughput is ~35 GB/s, far from the advertised figure.
For device-to-device, the measured throughput is ~170 GB/s, which is in line with the advertised number.

What can explain this huge difference, given that the CPU and GPU use the same memory?
I have seen this question asked several times on the forum, but haven’t seen a clear answer or explanation so far.

Thank you.

Hi,
Do you mean that the throughput of cudaMemcpy() does not meet the target performance? We would like to confirm what issue you are facing.

Also, the latest JetPack release is 5.1.2. It would be great if you could use the latest version.

Hi.
I’m referring to the “bandwidthTest” in the CUDA samples. I’m currently using JetPack 5.1.1.

As I understand it, this test measures (async) copy throughput: from host to device, from device to host, and from device to device.

My question is: why do I see such a ~5x throughput difference between host<->device transfers and device<->device transfers?

Moreover, looking only at the host<->device copies, the throughput is on the order of a PCIe Gen4 transfer, far from what I would expect from the on-SoM memory. Why is that?
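For reference, this is roughly the kind of measurement I mean; a minimal sketch using plain cudaMemcpy and CUDA events (not the actual bandwidthTest source; the buffer size and iteration count are arbitrary, and error checking is omitted):

 // Minimal sketch: time H2D (pageable) and D2D copies with CUDA events.
 // Not the bandwidthTest source; sizes/iterations arbitrary, no error checks.
 #include <cstdio>
 #include <cstdlib>
 #include <cuda_runtime.h>

 static double timed_copy_gbs(void *dst, const void *src, size_t bytes,
                              cudaMemcpyKind kind, int iters) {
     cudaEvent_t start, stop;
     cudaEventCreate(&start);
     cudaEventCreate(&stop);
     cudaEventRecord(start);
     for (int i = 0; i < iters; ++i)
         cudaMemcpy(dst, src, bytes, kind);
     cudaEventRecord(stop);
     cudaEventSynchronize(stop);
     float ms = 0.0f;
     cudaEventElapsedTime(&ms, start, stop);
     cudaEventDestroy(start);
     cudaEventDestroy(stop);
     return ((double)bytes * iters / 1e9) / (ms / 1000.0);  // GB/s
 }

 int main() {
     const size_t bytes = 256ull << 20;   // 256 MiB per buffer
     const int iters = 20;

     void *h_pageable = malloc(bytes);    // regular pageable host allocation
     void *d_a = nullptr, *d_b = nullptr;
     cudaMalloc(&d_a, bytes);
     cudaMalloc(&d_b, bytes);

     printf("H2D (pageable): %.1f GB/s\n",
            timed_copy_gbs(d_a, h_pageable, bytes, cudaMemcpyHostToDevice, iters));
     // Note: bandwidthTest, as far as I can tell, counts a D2D copy as a read
     // plus a write (2x the buffer size); this sketch counts the bytes once.
     printf("D2D           : %.1f GB/s\n",
            timed_copy_gbs(d_b, d_a, bytes, cudaMemcpyDeviceToDevice, iters));

     free(h_pageable);
     cudaFree(d_a);
     cudaFree(d_b);
     return 0;
 }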

Thank you.

Is this still an issue that needs support? Is there any result you can share?

Hi. It is not a support issue, but rather a request for information/explanation. I still cannot explain the findings above; I hoped NVIDIA or another forum member could.
Thanks.

Hi,

Please see below for some explanation.

First, since Jetson is a shared-memory system, pinned memory bandwidth is much better than pageable memory bandwidth.
This can be tested via the commands below:

 $ ./bandwidthTest -memory pinned
 $ ./bandwidthTest -memory pageable
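
As a rough illustration (a minimal sketch, not taken from the bandwidthTest source; the buffer size and iteration count are arbitrary and error checking is omitted), the difference between the two modes comes down to whether the host buffer is allocated with malloc() (pageable) or cudaHostAlloc() (pinned/page-locked):

 // Minimal sketch: H2D copy throughput from a pageable vs. a pinned host buffer.
 #include <cstdio>
 #include <cstdlib>
 #include <cuda_runtime.h>

 int main() {
     const size_t bytes = 256ull << 20;   // 256 MiB
     const int iters = 20;

     void *h_pageable = malloc(bytes);                        // pageable
     void *h_pinned = nullptr;
     cudaHostAlloc(&h_pinned, bytes, cudaHostAllocDefault);   // page-locked

     void *d_buf = nullptr;
     cudaMalloc(&d_buf, bytes);

     cudaEvent_t start, stop;
     cudaEventCreate(&start);
     cudaEventCreate(&stop);

     const void *srcs[2]  = { h_pageable, h_pinned };
     const char *names[2] = { "pageable", "pinned" };
     for (int s = 0; s < 2; ++s) {
         cudaEventRecord(start);
         for (int i = 0; i < iters; ++i)
             cudaMemcpy(d_buf, srcs[s], bytes, cudaMemcpyHostToDevice);
         cudaEventRecord(stop);
         cudaEventSynchronize(stop);
         float ms = 0.0f;
         cudaEventElapsedTime(&ms, start, stop);
         printf("H2D from %s host memory: %.1f GB/s\n", names[s],
                (double)bytes * iters / 1e9 / (ms / 1000.0));
     }

     cudaEventDestroy(start);
     cudaEventDestroy(stop);
     free(h_pageable);
     cudaFreeHost(h_pinned);
     cudaFree(d_buf);
     return 0;
 }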

However, the buffer is accessible by the CPU, so it won’t be as fast as a GPU buffer that is owned entirely by the GPU itself.
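
To illustrate that distinction (again only a minimal sketch using standard CUDA runtime APIs, not how the bandwidthTest sample allocates its buffers): on a shared-memory system, a pinned, mapped host buffer can be touched by both the CPU and the GPU, while a cudaMalloc() buffer is owned by the GPU alone.

 // Minimal sketch: a CPU-and-GPU-accessible mapped buffer vs. a GPU-only buffer.
 #include <cstdio>
 #include <cuda_runtime.h>

 __global__ void fill(int *p, int n, int v) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n) p[i] = v;
 }

 int main() {
     const int n = 1 << 20;
     cudaSetDeviceFlags(cudaDeviceMapHost);   // allow mapping host memory

     // Buffer visible to both sides: pinned + mapped host memory.
     int *h_shared = nullptr, *d_shared = nullptr;
     cudaHostAlloc((void **)&h_shared, n * sizeof(int), cudaHostAllocMapped);
     cudaHostGetDevicePointer((void **)&d_shared, h_shared, 0);

     // Buffer owned by the GPU only.
     int *d_only = nullptr;
     cudaMalloc((void **)&d_only, n * sizeof(int));

     fill<<<(n + 255) / 256, 256>>>(d_shared, n, 42);  // GPU writes the shared buffer
     fill<<<(n + 255) / 256, 256>>>(d_only, n, 7);     // GPU writes its own buffer
     cudaDeviceSynchronize();

     printf("CPU reads the mapped buffer directly: %d\n", h_shared[0]);
     // d_only cannot be dereferenced by the CPU; it must be copied out first.

     cudaFreeHost(h_shared);
     cudaFree(d_only);
     return 0;
 }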

Thanks.

Thank you for the detailed answer. Is it possible to elaborate on the part below, please?
“However, the buffer is accessible by the CPU, so it won’t be as fast as a GPU buffer that is owned entirely by the GPU itself.”

Hi,

D2H is a different task from D2D.
D2D is a pure GPU task, so it can be done quickly when GPU resources are available.

Unfortunately, we are not able to disclose further implementation details here.

Thanks.


Thank you for the explanation.
