Unified Memory has poor performance on Jetson AGX Xavier

I ran the CUDA sample “UnifiedMemoryPerf”, located in “/usr/local/cuda/samples/1_Utilities/”, to test unified memory on a Jetson AGX Xavier 32GB in mode 0 (MAXN). Here is the result:


As you can see, CpPglAs (host page-locked memory and device memory, async) is the fastest and UMeasy (unified memory with no hints) is the slowest. I thought this might be because the test buffers are too small, so I tried another benchmark, cuda-benchmarks. The following picture shows the result:

Here “simpleMemcpy” uses ordinary “cudaMemcpy”, “simpleDMA” uses pinned memory (a.k.a. zero-copy memory), “simpleManaged” uses unified memory, and “400000000” means a 400 MB buffer. It shows that unified memory performs well in the compute phase but takes much more time to access all the arrays than “simpleMemcpy” and “simpleDMA”. However, according to the official NVIDIA manual (CUDA for Tegra :: CUDA Toolkit Documentation) and some answers in this forum, people always recommend using unified memory. Can anyone explain these results? Thanks
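For readers who want to see the three strategies side by side, here is a minimal sketch. It is not the code of cuda-benchmarks: the function names `viaMemcpy`, `viaZeroCopy`, `viaManaged` and the kernel `scale` are hypothetical, and it assumes that “simpleDMA” corresponds to mapped pinned memory. It shows only the allocation and access patterns and does no timing.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical kernel: multiply every element by a factor.
__global__ void scale(float* data, size_t n, float factor) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

static void fill(float* p, size_t n) {
    for (size_t i = 0; i < n; ++i) p[i] = 1.0f;
}

// Launch the kernel and wait, so the CPU may safely touch the data afterwards.
static void runKernel(float* devPtr, size_t n) {
    unsigned int blocks = (unsigned int)((n + 255) / 256);
    scale<<<blocks, 256>>>(devPtr, n, 2.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());
}

// 1. Pageable host buffer plus device buffer, explicit cudaMemcpy both ways.
float viaMemcpy(size_t n) {
    size_t bytes = n * sizeof(float);
    float* h = (float*)malloc(bytes);
    if (h == NULL) exit(EXIT_FAILURE);
    float* d = NULL;
    fill(h, n);
    CHECK(cudaMalloc((void**)&d, bytes));
    CHECK(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));
    runKernel(d, n);
    CHECK(cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost));
    float last = h[n - 1];
    CHECK(cudaFree(d));
    free(h);
    return last;
}

// 2. Mapped pinned (zero-copy) memory: the GPU accesses the host buffer directly.
float viaZeroCopy(size_t n) {
    size_t bytes = n * sizeof(float);
    float* h = NULL;
    float* d = NULL;
    CHECK(cudaHostAlloc((void**)&h, bytes, cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void**)&d, h, 0));
    fill(h, n);
    runKernel(d, n);
    float last = h[n - 1];
    CHECK(cudaFreeHost(h));
    return last;
}

// 3. Unified (managed) memory with no hints: one pointer for CPU and GPU.
float viaManaged(size_t n) {
    size_t bytes = n * sizeof(float);
    float* m = NULL;
    CHECK(cudaMallocManaged((void**)&m, bytes));
    fill(m, n);
    runKernel(m, n);
    float last = m[n - 1];
    CHECK(cudaFree(m));
    return last;
}

int main() {
    size_t n = 1 << 20;  // small buffer; the benchmark above used far larger ones
    printf("memcpy:    %f\n", viaMemcpy(n));
    printf("zero-copy: %f\n", viaZeroCopy(n));
    printf("managed:   %f\n", viaManaged(n));
    return 0;
}
```

Wrapping each function in `cudaEvent` or host-side timers would give a rough per-strategy comparison, but the numbers depend heavily on buffer size and on which side touches the data first.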

Here is another test. I ran the same sample on a desktop computer with an i5-8500 (3.00 GHz, 6 cores), 16 GB of RAM, a GTX 1660, and Ubuntu 18.04:


In this case, zero-copy is the slowest, and unified memory without hints performs reasonably well. So I am very confused: since the CPU and GPU share the same physical memory on Jetson, shouldn't unified memory perform better there than on an ordinary computer with separate CPU and GPU memory?
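One way to see how the two machines differ from the CUDA runtime's point of view is to query a few device attributes. The sketch below is only illustrative: the helper `queryAttr` is a hypothetical name, and the choice of attributes is an assumption about what is relevant here. No expected values are shown, since they depend on the device and CUDA version.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: return one device attribute or abort on error.
int queryAttr(cudaDeviceAttr attr, int device) {
    int value = 0;
    cudaError_t err = cudaDeviceGetAttribute(&value, attr, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetAttribute: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    return value;
}

int main() {
    int device = 0;  // assumes the GPU of interest is device 0
    // 1 if the GPU shares physical memory with the host (integrated GPU).
    printf("integrated:              %d\n", queryAttr(cudaDevAttrIntegrated, device));
    // 1 if the device can map host memory into its address space (zero-copy).
    printf("canMapHostMemory:        %d\n", queryAttr(cudaDevAttrCanMapHostMemory, device));
    // 1 if managed memory is supported at all.
    printf("managedMemory:           %d\n", queryAttr(cudaDevAttrManagedMemory, device));
    // 1 if CPU and GPU may access managed memory concurrently.
    printf("concurrentManagedAccess: %d\n", queryAttr(cudaDevAttrConcurrentManagedAccess, device));
    return 0;
}
```

Running this on both the Jetson and the desktop makes it easier to tell which unified memory behavior each platform actually offers before comparing benchmark numbers.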

Hi,

Have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

On earlier Jetson devices, pinned memory was not ideal because it had no cache support.
Starting with Xavier, there is an I/O coherency feature that can improve its performance.
So you may find that pinned memory is competitive with unified memory.

But please note that I/O coherency is a one-way feature: the GPU can read the CPU caches, but not the other way around.
In some cases, for example when the GPU uses the buffer as output (writes data), unified memory will be a better choice, although it does incur some initial overhead.
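As an illustration of the GPU-writes-output case, here is a minimal sketch of managed memory used with `cudaStreamAttachMemAsync` hints, a pattern that, as far as I recall, the CUDA for Tegra documentation describes. The kernel `fillOutput` and the function `gpuWritesManaged` are hypothetical names, and this is not the code of the UnifiedMemoryPerf sample. Whether these hints close the gap measured above is not established in this thread.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical kernel: the GPU produces the buffer contents.
__global__ void fillOutput(float* out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = 3.0f;
}

// Allocate a managed output buffer, let the GPU write it, read it on the CPU.
float gpuWritesManaged(size_t n) {
    size_t bytes = n * sizeof(float);
    float* out = NULL;
    cudaStream_t stream = 0;  // default stream, used here for simplicity
    CHECK(cudaMallocManaged((void**)&out, bytes));

    // Hint: the buffer is about to be used by the GPU.
    CHECK(cudaStreamAttachMemAsync(stream, out, 0, cudaMemAttachGlobal));
    unsigned int blocks = (unsigned int)((n + 255) / 256);
    fillOutput<<<blocks, 256, 0, stream>>>(out, n);
    CHECK(cudaGetLastError());

    // Hint: after the kernel, the buffer will be accessed by the CPU.
    CHECK(cudaStreamAttachMemAsync(stream, out, 0, cudaMemAttachHost));
    CHECK(cudaStreamSynchronize(stream));

    float last = out[n - 1];  // CPU read of GPU-written data
    CHECK(cudaFree(out));
    return last;
}

int main() {
    printf("last element: %f\n", gpuWritesManaged(1 << 20));
    return 0;
}
```

The attach calls are ordered in the stream, so the host attach takes effect only after the kernel has finished, and the CPU read happens after the stream has been synchronized.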

Thanks.

Yes, I am sure the Jetson is running in mode 0, and I have turned on the fan.

Just in case, I ran these commands as you mentioned and tested the benchmark again, but the output was the same as before.

Thanks for the explanation! I would like to know why unified memory doesn’t work well on my Jetson even though it has the advantages you mentioned.

Does anybody else have an answer to my questions?

Hi,

Please note that pinned memory is zero-copy memory.
That’s why you can get much better performance for zero-copy memory.

Thanks.