I tried running the CUDA sample “UnifiedMemoryPerf”, which is in the folder “/usr/local/cuda/samples/1_Utilities/”, to test unified memory on a Jetson AGX Xavier 32GB in mode 0 (MAXN). Here is the result:
As you can see, CpPglAs (host page-locked memory plus device memory, async) is the fastest and UMeasy (unified memory with no hints) is the slowest. I thought this might be because the test buffers are too small, so I tried another benchmark, cuda-benchmarks. The following picture shows the result:
where “simpleMemcpy” uses an ordinary “cudaMemcpy”, “simpleDMA” uses pinned, mapped memory (AKA zero-copy memory), “simpleManaged” uses unified memory, and “400000000” means a 400 MB buffer. It shows that unified memory performs well in computing but costs much more time accessing all the arrays than “simpleMemcpy” and “simpleDMA”. But according to the official NVIDIA manual https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#pinned-memory and some answers in this forum, people usually recommend using unified memory. Can anyone explain these results? Thanks
In this condition, zero-copy is the slowest, and unified memory without hints is not bad. So I am confused: considering that the CPU and GPU share the same physical memory on Jetson, shouldn’t unified memory perform better there than on a typical machine with separate CPU and GPU memory?
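For context, here is a minimal sketch of the three allocation paths those benchmark names appear to refer to. This is my own reading of the names, not the benchmark’s actual code, and error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t bytes = 400000000;  // 400 MB, matching the benchmark size
    float *h_a, *d_a, *zc, *um;

    // 1) "simpleMemcpy": pageable host buffer + explicit cudaMemcpy
    h_a = (float *)malloc(bytes);
    cudaMalloc(&d_a, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);

    // 2) "simpleDMA": pinned, mapped (zero-copy) memory -- the GPU
    //    accesses the host allocation directly, with no copy step
    cudaHostAlloc(&zc, bytes, cudaHostAllocMapped);

    // 3) "simpleManaged": unified (managed) memory -- a single pointer
    //    valid on both the CPU and the GPU
    cudaMallocManaged(&um, bytes);

    cudaFree(d_a);
    free(h_a);
    cudaFreeHost(zc);
    cudaFree(um);
    return 0;
}
```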
On previous Jetson modules, pinned memory was not ideal since it had no cache support.
Starting with Xavier, there is an I/O coherency feature that can improve performance.
So you may find that pinned memory is now competitive with unified memory.
But please note that I/O coherency is a one-way feature: the GPU can read the CPU caches, but the CPU cannot read the GPU caches.
In some cases, for example when the GPU uses the buffer as output (writes data), unified memory will be the better choice, although it does incur some overhead at initialization.
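A sketch of that GPU-writes-the-output case with unified memory. The hint APIs (`cudaMemAdvise`, `cudaMemPrefetchAsync`) are standard CUDA runtime calls that can reduce the “no hints” overhead the UMeasy case shows; whether prefetching actually helps on a given Jetson software release is worth measuring, since Tegra’s shared physical memory is handled differently from discrete GPUs:

```cuda
#include <cuda_runtime.h>

// GPU writes the buffer -- the output case where unified memory avoids
// the CPU-cannot-read-GPU-caches limitation of I/O coherency
__global__ void fill(float *p, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = 1.0f;
}

int main() {
    const size_t n = 1 << 24;
    const size_t bytes = n * sizeof(float);
    float *um;
    cudaMallocManaged(&um, bytes);

    int dev;
    cudaGetDevice(&dev);
    // Optional hints: tell the driver the GPU will touch this range,
    // and ask it to place the data there ahead of the launch
    cudaMemAdvise(um, bytes, cudaMemAdviseSetPreferredLocation, dev);
    cudaMemPrefetchAsync(um, bytes, dev);

    fill<<<(unsigned)((n + 255) / 256), 256>>>(um, n);
    cudaDeviceSynchronize();  // after this, the CPU can safely read um

    cudaFree(um);
    return 0;
}
```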