Direct GPU-to-GPU data transfer with OpenACC+managed+MPI


I am exploring OpenACC with managed memory; specifically, I am compiling with NVHPC using the flags "-acc -ta=tesla:managed -Minfo=all,intensity". In general, demand paging works as expected and is very convenient. I have run into one issue, though, that I cannot figure out when using this in combination with MPI. The system I use has OpenMPI 4.1.1, which I believe is CUDA-aware.

Basically, the code uses MPI_Send(A,...)/MPI_Recv(B,...) to send data from array A on one task to array B on another task. Both arrays A and B were first (and last) touched by an !$acc kernels loop, and I build with the above flags for managed memory. Each task is mapped to a separate GPU, and all of this runs on a single compute node (in case that matters).

The code works fine and delivers correct results; however, I see a large performance hit during Send/Recv compared to when arrays A/B are on the host. I also see large times for the “Gpu page fault groups” in the nvprof output, on both GPUs, as well as “Host To Device” and “Device To Host” times under the “== Unified Memory profiling result:” section.

I am assuming that what is happening is that MPI_Send() and MPI_Recv() trigger page faults on the host, and thus the data is migrated through the managed memory mechanism. But I have also read that CUDA-aware MPI allows “zero copy” transfers between GPUs. I tried using !$acc host_data use_device(A) on the send side, and the same for B on the receive side, but that had no effect. I am not sure it is even honored with managed memory.
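For reference, here is a minimal sketch of the pattern described above (array names, sizes, and ranks are illustrative, not the actual code):

```fortran
! Sketch of the pattern: managed allocations, device first touch,
! then MPI Send/Recv between two ranks (one GPU per rank).
program acc_mpi_sketch
  use mpi
  implicit none
  integer, parameter :: n = 1024*1024
  real, allocatable :: A(:), B(:)
  integer :: rank, ierr, i

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  allocate(A(n), B(n))          ! managed allocations under -ta=tesla:managed

  !$acc kernels
  do i = 1, n
     A(i) = real(i)             ! first touch on the device
  end do
  !$acc end kernels

  ! host_data passes the device address of A/B to MPI; with managed
  ! memory that is the same pointer, which may be why it had no effect.
  !$acc host_data use_device(A, B)
  if (rank == 0) then
     call MPI_Send(A, n, MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     call MPI_Recv(B, n, MPI_REAL, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
  end if
  !$acc end host_data

  call MPI_Finalize(ierr)
end program acc_mpi_sketch
```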

Is there a way to do direct GPU-to-GPU data transfers when using managed memory under OpenACC?


It’s not specific to OpenACC, but rather a limitation of the current MPI implementations: they don’t handle unified memory in a performant manner. NVIDIA’s UCX team is aware of the issue and is investigating fixes, but I have no idea when/if a fix will be available.

I’ve been told that recent releases of MVAPICH2-GDR have this fixed, but I have not confirmed or tested it myself, so I don’t know for sure…

For now, you’ll need to port your code to use data directives to manage memory and then encapsulate the MPI calls within a “host_data” region. I use CUDA-aware MPI quite often myself and know this setup works well. Though if using a UCX-based MPI (like the OpenMPI we ship), I sometimes need to set the following environment variables to prevent a runtime error:

UCX_MEMTYPE_CACHE=n
UCX_TLS=self,shm,cuda_copy

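With explicit data directives, the port looks roughly like this (a sketch only; it assumes a 1-D array A of size n and a destination rank dest, which are illustrative names):

```fortran
! Sketch: explicit OpenACC data management instead of managed memory.
!$acc enter data create(A)

!$acc kernels present(A)
do i = 1, n
   A(i) = real(i)
end do
!$acc end kernels

! host_data hands the device copy's address to the CUDA-aware MPI
! library, so the transfer can go GPU-to-GPU without host staging.
!$acc host_data use_device(A)
call MPI_Send(A, n, MPI_REAL, dest, 0, MPI_COMM_WORLD, ierr)
!$acc end host_data

!$acc exit data delete(A)
```

The receive side is symmetric: wrap MPI_Recv(B,...) in its own host_data use_device(B) region while B is present on the device.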
Hi Mat,

My tests are on UCAR’s Casper system (just in case you are familiar with it). From what I understand, it is using InfiniBand. I tried setting the UCX_ environment variable, but it then failed with

[casper25:11597] PML ucx cannot be selected

which makes sense because I don’t think it is going through UCX.

Just to be sure I understand: when you say “port your code to use data directives to manage memory”, you are basically saying it won’t work with managed memory, right? Unless maybe I end up on an installation that has one of the few fixed MPI implementations, like the MVAPICH2-GDR you referenced.

If the above is correct, then I wonder, is it possible to compile some files of a project with “managed memory” and some other portions without? Can that even make sense? Just wondering out loud.


It works, it just isn’t performant. To get the performance out of CUDA-aware MPI, you’ll want to pass a CUDA device pointer to the MPI calls. To do that, you need to move to using OpenACC data and host_data directives.

is it possible to compile some files of a project with “managed memory” and some other portions without?

When using CUDA Unified Memory (-gpu=managed), the compiler essentially replaces ALLOCATE calls with calls to cudaMallocManaged. You can compile on a file-by-file basis so that only the allocations within a given file are managed.
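A sketch of such a mixed build (the file names here are hypothetical):

```shell
# Only the allocations in alloc_managed.f90 become managed;
# solver.f90 keeps ordinary host allocations.
nvfortran -c -acc -gpu=managed -Minfo=accel alloc_managed.f90
nvfortran -c -acc -Minfo=accel solver.f90
nvfortran -acc -gpu=managed alloc_managed.o solver.o -o app
```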

However, the better way to selectively use UM is to add the CUDA Fortran “managed” attribute to the allocatable arrays you wish to be managed. You’ll also need to remove “-gpu=managed”, else all allocations will be managed.
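A minimal sketch of the attribute, assuming CUDA Fortran is enabled (e.g. with -cuda) so that “managed” is recognized; module and array names are illustrative:

```fortran
! Sketch: selective unified memory via the CUDA Fortran "managed"
! attribute. Compile WITHOUT -gpu=managed so other allocations stay
! ordinary host memory.
module fields
  implicit none
  real, allocatable, managed :: A(:)   ! ALLOCATE(A(...)) becomes managed
  real, allocatable          :: w(:)   ! plain host allocation
end module fields
```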


Okay, I have this working now. And it turns out I did need the UCX_MEMTYPE_CACHE=n environment setting. However, if I also set UCX_TLS=self,shm,cuda_copy then I get the error quoted earlier. Anyway, great to have it working. Thanks again for all the valuable advice!