Direct GPU-to-GPU data transfer with OpenACC+managed+MPI

Hi,

I am exploring OpenACC with managed memory; specifically, I am compiling with NVHPC using the flags "-acc -ta=tesla:managed -Minfo=all,intensity". In general, demand paging works as expected and is very convenient. However, I have run into one issue I cannot figure out when using this in combination with MPI. The system I am on uses OpenMPI 4.1.1, which I believe is CUDA-aware.

Basically, the code uses MPI_Send(A,...)/MPI_Recv(B,...) to send data from array A on one task to array B on another task. Both arrays A and B are first (and last) touched by an !$acc kernels loop, and I build with the flags above for managed memory. Each task is mapped to a separate GPU, and everything runs on a single compute node (in case that matters).

The code works fine and delivers correct results; however, there is a large performance hit during Send/Recv compared to when arrays A/B are on the host. I also see large times for the “Gpu page fault groups” in the nvprof output, on both GPUs, as well as “Host To Device” and “Device To Host” times under the “== Unified Memory profiling result:” section.

I am assuming that MPI_Send() and MPI_Recv() trigger page faults on the host, so the data is migrated back through the managed memory mechanism. But I have also read that CUDA-aware MPI allows “zero copy” transfers between GPUs. I tried using !$acc host_data use_device(A) on the send side, and the same for B on the receive side, but that had no effect. I am not sure that is even honored with managed memory.

Is there a way to do direct GPU-to-GPU data transfers when using managed memory under OpenACC?

Thanks!
-Gerhard

It’s not specific to OpenACC but rather a limitation of current MPI implementations, which don’t handle unified memory in a performant manner. NVIDIA’s UCX team is aware of the issue and is investigating fixes, but I have no idea when, or if, a fix will be available.

I’ve been told that recent releases of MVAPICH2-GDR have this fixed, but I have not confirmed or tested it myself, so I don’t know for sure…

For now, you’ll need to port your code to use data directives to manage memory and then encapsulate the MPI calls within a “host_data” region (there’s a sketch of this after the environment settings below). I use CUDA-aware MPI quite often myself and know this setup works well. Though when using a UCX-based MPI (like the OpenMPI we ship), I sometimes need to set the following environment variables to prevent a runtime error:

UCX_MEMTYPE_CACHE=n
UCX_TLS=self,shm,cuda_copy
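
To make that concrete, here’s a minimal two-rank sketch of the data + host_data pattern. It is not your actual code; the array size, tag, and the one-GPU-per-rank binding are placeholders/assumptions:

! Minimal sketch: explicit OpenACC data management plus host_data so the
! CUDA-aware MPI calls see device pointers. Size, tag, and GPU binding
! are placeholders, not taken from the original code.
program acc_mpi_device_transfer
  use mpi
  implicit none
  integer, parameter :: n = 1000000
  real(8), allocatable :: A(:), B(:)
  integer :: rank, ierr, i
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  allocate(A(n), B(n))

  !$acc data create(A, B)

  ! First (and last) touch on the device, as in the original code.
  !$acc kernels
  do i = 1, n
     A(i) = real(rank, 8)
     B(i) = 0.0d0
  end do
  !$acc end kernels

  ! host_data exposes the device addresses of A and B to the MPI library,
  ! so a CUDA-aware MPI can transfer directly between the GPUs.
  !$acc host_data use_device(A, B)
  if (rank == 0) then
     call MPI_Send(A, n, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     call MPI_Recv(B, n, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, status, ierr)
  end if
  !$acc end host_data

  if (rank == 1) then
     !$acc update host(B(1:1))
     print *, 'rank 1 received B(1) = ', B(1)
  end if

  !$acc end data

  call MPI_Finalize(ierr)
end program acc_mpi_device_transfer

Build with something like “mpif90 -acc -Minfo=accel” (i.e. without the managed suboption) and run with one rank per GPU.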

-Mat

Hi Mat,

My tests are on UCAR’s Casper system (just in case you are familiar with it). From what I understand, it uses InfiniBand. I tried setting the UCX_ environment variables, but it then failed with

[casper25:11597] PML ucx cannot be selected

which makes sense because I don’t think it is going through UCX.

Just to be sure I understand: when you say “port your code to use data directives to manage memory”, you are basically saying it won’t work with managed memory, right? Unless maybe I end up on a system that has one of the few fixed MPI installations, like the MVAPICH2-GDR you reference.

If the above is correct, then I wonder: is it possible to compile some files of a project with “managed memory” and other portions without it? Does that even make sense? Just wondering out loud.

-Gerhard

It works, it just isn’t performant. To get the performance out of CUDA-aware MPI, you’ll want to pass a CUDA device pointer to the MPI calls. To do that, you need to move to using the OpenACC data and host_data directives.

is it possible to compile some files of a project with “managed memory” and other portions without it?

When using CUDA Unified Memory (-gpu=managed), the compiler basically replaces ALLOCATE calls with calls to cudaMallocManaged. You can therefore apply the flag on a file-by-file basis so that only the allocates within those files are managed.
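
For example (hypothetical file and routine names; the only thing that matters is which compile line each file gets):

! alloc_managed.f90 -- compile with:  nvfortran -acc -gpu=managed -c alloc_managed.f90
module managed_alloc
  implicit none
contains
  subroutine make_A(A, n)
    real(8), allocatable, intent(out) :: A(:)
    integer, intent(in) :: n
    allocate(A(n))   ! becomes a managed (unified memory) allocation
  end subroutine make_A
end module managed_alloc

! alloc_host.f90 -- compile with:  nvfortran -acc -c alloc_host.f90
module host_alloc
  implicit none
contains
  subroutine make_B(B, n)
    real(8), allocatable, intent(out) :: B(:)
    integer, intent(in) :: n
    allocate(B(n))   ! ordinary host allocation; manage it with OpenACC data directives
  end subroutine make_B
end module host_alloc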

However, the better way to selectively use UM is to add the CUDA Fortran “managed” attribute to the allocatable arrays you wish to be managed. Be sure to also remove “-gpu=managed”, or else all allocates will be managed.
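
A minimal sketch of what that looks like (assuming CUDA Fortran is enabled, e.g. with -cuda, and -gpu=managed has been dropped; the module and array names are just examples):

module fields
  use cudafor
  implicit none
  ! Only A lives in unified (managed) memory; B remains a normal host
  ! allocatable that is handled with OpenACC data / host_data directives.
  real(8), allocatable, managed :: A(:)
  real(8), allocatable          :: B(:)
end module fields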

-Mat

Okay, I have this working now. It turns out I did need the UCX_MEMTYPE_CACHE=n environment setting; however, if I also set UCX_TLS=self,shm,cuda_copy, I get the error quoted earlier. Anyway, it’s great to have it working. Thanks again for all the valuable advice!
-Gerhard