Hi.
The advertised memory bandwidth on Orin is 204.8 GB/s, per my understanding of the Orin documentation.
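(If I read the datasheet correctly, that figure is simply the 256-bit LPDDR5 interface at 6400 MT/s: 32 bytes per transfer × 6.4 GT/s = 204.8 GB/s.)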
When I measure it using NVIDIA's bandwidthTest sample code, I see a huge difference between host<->device throughput and device<->device throughput.
For host-to-device (and the other way around), the throughput is ~35 GB/s, far from the advertised figure.
For device-to-device, the measured throughput is ~170 GB/s, which is in line with the advertised figure.
What can explain this huge difference, given that both the CPU and the GPU use the same memory?
I have seen this question asked several times on the forum, but I haven't seen a clear answer/explanation so far.
Hi.
I’m referring to the “bandwidthTest” in the CUDA samples. I’m currently using JetPack 5.1.1.
As I understand it, this test measures (async) copy throughput: host to device, device to host, and device to device, roughly as in the sketch at the end of this post.
My question is: why do I see such a ~5x throughput difference between host<->device transfers and device<->device transfers?
Moreover, when looking only at the host<->device copies, the throughput is on the order of a PCIe Gen4 link, far below what I would expect from on-module memory. Why is that?
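For reference, here is my mental model of what the host-to-device part of the test does (just a sketch of the general idea, not the actual sample code; the buffer size and iteration count below are arbitrary):

```cpp
// Sketch of an async H2D copy-throughput measurement, similar in spirit to
// what I assume bandwidthTest does (not the real sample code).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 32 << 20;   // 32 MB test buffer (arbitrary size)
    const int iters = 100;           // arbitrary repeat count

    void *h_buf = nullptr, *d_buf = nullptr;
    cudaHostAlloc(&h_buf, bytes, cudaHostAllocDefault);  // pinned host buffer
    cudaMalloc(&d_buf, bytes);                           // device buffer

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time a batch of async host-to-device copies on the default stream.
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D throughput: %.1f GB/s\n", (double)bytes * iters / (ms * 1e6));

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

The device-to-device case, as I understand it, just replaces the pinned host buffer with a second cudaMalloc buffer and uses cudaMemcpyDeviceToDevice, so I would expect both paths to end up hitting the same LPDDR5.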
Hi. This is not a support issue but rather a request for information/explanation. I still cannot explain the findings above, and I was hoping NVIDIA or another forum member could.
Thanks.
Thank you for the detailed answer. Could you please elaborate on the part below?
“However, the buffer is accessible by the CPU so it won’t be as fast as the GPU buffer that all owned by the GPU itself.”
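Just to make sure I understand the distinction: is the comparison between something like the two allocations in the sketch below (my own illustration, not code from the sample or from your answer)?

```cpp
// My guess at the two kinds of buffers being compared (assumption on my part):
// a pinned host allocation that the CPU can also touch, vs. a plain device
// allocation that only the GPU ever accesses.
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;

    // CPU-accessible buffer: pinned host memory, also mapped into the GPU's
    // address space (zero-copy style access on Jetson).
    void *h_ptr = nullptr, *d_view = nullptr;
    cudaHostAlloc(&h_ptr, bytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_view, h_ptr, 0);

    // GPU-owned buffer: nothing on the CPU side ever reads or writes it.
    void *d_only = nullptr;
    cudaMalloc(&d_only, bytes);

    cudaFree(d_only);
    cudaFreeHost(h_ptr);
    return 0;
}
```

If so, what exactly makes the CPU-accessible buffer slower for the GPU to reach, given that both buffers sit in the same physical DRAM?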