I have been reading about unified memory for a while. It seems to me that unified memory can greatly reduce programming effort by eliminating explicit memory transfers between host and device. Does using unified memory provide any performance benefit as well? Have you ever seen a case where unified memory yields faster code than explicit memory management? For example, the reply in the following post indicates that unified memory is meant for fast prototyping rather than for performance.
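To make the comparison concrete, here is a minimal sketch of the two styles I am asking about. The kernel `scale` and the helpers `run_explicit` and `run_managed` are hypothetical names I made up for illustration, and error checking is kept to a minimum. It only shows the difference in programming effort; it says nothing by itself about which version is faster.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: doubles every element.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Style A: separate host and device buffers with explicit cudaMemcpy calls.
// Returns 0 if every element ends up equal to 2.0f, 1 otherwise.
int run_explicit(int n) {
    size_t bytes = (size_t)n * sizeof(float);
    float *h = (float *)malloc(bytes);
    float *d = nullptr;
    if (!h) return 1;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    if (cudaMalloc(&d, bytes) != cudaSuccess) { free(h); return 1; }

    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n);
    // A blocking cudaMemcpy on the default stream waits for the kernel.
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int i = 0; i < n; ++i) if (h[i] != 2.0f) bad = 1;
    cudaFree(d);
    free(h);
    return bad;
}

// Style B: one managed allocation, visible to both host and device.
// Returns 0 if every element ends up equal to 2.0f, 1 otherwise.
int run_managed(int n) {
    size_t bytes = (size_t)n * sizeof(float);
    float *x = nullptr;
    if (cudaMallocManaged(&x, bytes) != cudaSuccess) return 1;
    for (int i = 0; i < n; ++i) x[i] = 1.0f;   // written on the host

    scale<<<(n + 255) / 256, 256>>>(x, n);     // same pointer on the device
    cudaDeviceSynchronize();                   // required before host access

    int bad = 0;
    for (int i = 0; i < n; ++i) if (x[i] != 2.0f) bad = 1;
    cudaFree(x);
    return bad;
}

int main() {
    const int n = 1 << 20;
    int failed = run_explicit(n) | run_managed(n);
    printf("%s\n", failed ? "mismatch" : "both versions agree");
    return failed;
}
```

In style B the data moves when it is first touched, unless the program gives hints such as `cudaMemPrefetchAsync`, which I have left out of the sketch.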
I have also been reading about CUDA-aware MPI for a while. It seems to me that unified memory will be faster only when it is used together with CUDA-aware MPI.
This post suggests that an inter-node device-to-device transfer can avoid allocating the host buffer and the network fabric buffer only when the program uses unified memory.
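For reference, this is the kind of transfer I have in mind, as a minimal sketch. It assumes an MPI library built with CUDA support (for example, a CUDA-aware build of Open MPI or MVAPICH2) and at least two ranks. The buffer name `d_buf` and the message size are made up for illustration. The sketch uses a plain `cudaMalloc` pointer; whether the same call works with a `cudaMallocManaged` pointer, and how efficiently, depends on the MPI implementation, which is part of what I am asking.

```cuda
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    const int n = 1 << 20;                 // illustrative message size
    float *d_buf = nullptr;                // device buffer, no host staging copy
    if (cudaMalloc(&d_buf, n * sizeof(float)) != cudaSuccess) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    if (rank == 0) {
        cudaMemset(d_buf, 0, n * sizeof(float));
        cudaDeviceSynchronize();           // make sure the data is ready
        // ASSUMPTION: the MPI build is CUDA-aware, so a device pointer
        // is accepted here. With a non-CUDA-aware MPI this call is invalid.
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

Without CUDA-aware MPI, rank 0 would first have to `cudaMemcpy` the data into a host buffer and pass that buffer to `MPI_Send`, and rank 1 would do the reverse after `MPI_Recv`.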