Unified memory and CUDA-aware MPI


I have been reading about unified memory for a while. It seems to me that unified memory can greatly reduce programming effort by eliminating explicit memory transfers between host and device. Is there any performance benefit to using unified memory? Have you ever seen cases where unified memory yields faster code than explicit memory management? For example, the reply in the following post indicates that unified memory is for fast prototyping rather than for performance.

I have also been reading about CUDA-aware MPI for a while. It seems to me that unified memory will be faster only when it is used together with CUDA-aware MPI.

This post suggests that inter-node device-to-device transfers can avoid staging through host and network-fabric buffers only when the program uses unified memory.


Note that there is a difference between the unified virtual address space and managed memory, even though the appendix describing the latter is headlined “Unified Memory Programming”.

In my experience managed memory is mostly for convenience, but with data prefetching and data usage hints its performance can sometimes approach that of a program that explicitly manages memory allocation and data movement.
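As a rough sketch of what those prefetching and usage hints look like (the device ID and buffer size here are placeholders, assuming a single-GPU system):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    float *data;
    int device = 0;  // assumed single-GPU system

    cudaMallocManaged(&data, n * sizeof(float));

    // Hint: this buffer will mostly be read from the device.
    cudaMemAdvise(data, n * sizeof(float), cudaMemAdviseSetReadMostly, device);

    // Prefetch to the GPU before launching kernels, avoiding
    // on-demand page faults during kernel execution.
    cudaMemPrefetchAsync(data, n * sizeof(float), device);

    // ... launch kernels operating on data ...

    // Prefetch back to the host before the CPU touches the data again.
    cudaMemPrefetchAsync(data, n * sizeof(float), cudaCpuDeviceId);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
```

With the prefetches in place, page migration happens in bulk ahead of time rather than fault-by-fault, which is where most of the managed-memory overhead usually comes from.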

In contrast, I don’t believe CUDA-aware MPI has anything to do with managed memory. It does, however, rely on the unified address space.
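To illustrate the point, here is a minimal sketch assuming an MPI library built with CUDA support and at least two ranks. Note that the buffer is a plain device allocation, not managed memory; UVA is what lets the MPI library recognize the pointer as GPU memory:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));  // plain device memory, not managed

    // With a CUDA-aware MPI build, the device pointer is passed directly;
    // thanks to UVA the library can detect that it points to GPU memory
    // and choose a suitable transfer path (e.g. GPUDirect RDMA if available).
    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

Without CUDA awareness, the same program would need an explicit `cudaMemcpy` to a host buffer around each MPI call.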


The unified memory model was originally conceived for the programmer’s convenience. It handles paging and copies between CPU and GPU automatically, so that you effectively see a single address space. However, the physical memory still resides on the respective devices. NVIDIA embedded devices (TX1, TX2, Xavier, and Nano) are the exception: there, a physically unified memory space is shared between CPU and GPU.
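The single-address-space model boils down to something like this (a minimal sketch; sizes and kernel are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1024;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));  // one pointer, valid on CPU and GPU

    for (int i = 0; i < n; ++i) x[i] = 1.0f;   // CPU writes, no explicit copy

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);  // GPU uses the same pointer
    cudaDeviceSynchronize();                   // required before the CPU reads x again

    printf("x[0] = %f\n", x[0]);               // pages migrate back on demand
    cudaFree(x);
    return 0;
}
```

On a discrete GPU the driver migrates pages behind the scenes at each CPU/GPU transition; on the embedded devices mentioned above, both processors genuinely address the same physical memory, so no migration occurs.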

In our experience, unified memory yields better performance on NVIDIA embedded platforms. But this is also due to the fact that you can capture directly into unified memory and display from it as well, achieving a zero-memcpy pipeline from capture to display while processing the buffers with CUDA.

If one of those elements is missing, for example if you can’t capture directly into unified memory, performance drops significantly. In this case, a multi-threaded approach that overlaps memcpy and processing is faster than unified memory.
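One common way to get that overlap is pinned host memory plus CUDA streams, so the copy of one chunk runs concurrently with the processing of another. A hedged sketch (the `process` kernel and chunk sizes are placeholders):

```cuda
#include <cuda_runtime.h>

__global__ void process(float *buf, int n) { /* placeholder workload */ }

int main() {
    const int n = 1 << 20, chunks = 4, chunk = n / chunks;
    float *h_buf, *d_buf;
    cudaHostAlloc(&h_buf, n * sizeof(float), cudaHostAllocDefault);  // pinned, enables async copy
    cudaMalloc(&d_buf, n * sizeof(float));

    cudaStream_t streams[chunks];
    for (int i = 0; i < chunks; ++i) cudaStreamCreate(&streams[i]);

    // The copy of chunk i can overlap with the processing of earlier chunks,
    // because each chunk lives in its own stream.
    for (int i = 0; i < chunks; ++i) {
        size_t off = (size_t)i * chunk;
        cudaMemcpyAsync(d_buf + off, h_buf + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[i]);
        process<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(d_buf + off, chunk);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < chunks; ++i) cudaStreamDestroy(streams[i]);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```

The same pattern works with host threads feeding the streams, which is essentially the multi-threaded approach described above.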

Is there any CUDA-aware MPI for NVIDIA embedded devices?

Hi Tera and Miguel,

Thanks very much for pointing out the difference between UVA and managed memory, and for explaining when the unified memory model has a performance advantage over explicit memory management!


Jumping in one Year Later…

Hi tera,

TL;DR from what miguel said:

CUDA-aware MPI for NVIDIA embedded devices isn’t supported, because the hardware already does what the API emulates, i.e. the embedded devices natively have the best possible access to GPU memory.

That had been a rhetorical question.