Possible direct memcpy between CPU (multiple processes on one node) and GPU (unified memory on one card) under MPI?

Can allocations made with cudaMallocManaged benefit from an MPI + CUDA setup on one node with one GPU?

Memory Copy Model

Memory Initialization: Unified memory on the GPU, with corresponding host buffers on the CPU for the multiple MPI processes.

Memory Space: Only one PC involved (one GPU: RTX 4070 Ti, and an Intel CPU).

OS: Ubuntu 20.04 (64-bit), CUDA 12.2, GCC 9.4, OpenMPI 4.0.3

Task:

Ranks from 0 to N are initialized by MPI.

a. Ranks 1 to N copy their data to Rank 0 (using MPI reduce or gather).

b. Rank 0 copies the data from CPU to GPU and runs some calculations.

c. Rank 0 copies the new results from GPU back to CPU.

d. Rank 0 distributes the results to Ranks 1 to N (using MPI scatter).

Solution Candidates

My current implementation follows the steps above [a - d], but I believe a better approach may exist.
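For reference, here is a minimal sketch of what I do now. Buffer names, sizes, and the kernel are placeholders for my real code; the GPU-side buffer uses cudaMallocManaged as in my setup.

```cpp
// current_flow_sketch.cu -- hypothetical sketch of steps a-d
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

__global__ void compute(float* data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder calculation
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int chunk = 1 << 20;                    // elements per rank (placeholder)
    std::vector<float> local(chunk, float(rank)); // each rank's host data

    std::vector<float> gathered;                  // host counterpart on rank 0
    float* d_buf = nullptr;                       // unified (managed) buffer on rank 0
    const size_t total = size_t(chunk) * nprocs;
    if (rank == 0) {
        gathered.resize(total);
        cudaMallocManaged(&d_buf, total * sizeof(float));
    }

    // a. collect every rank's data on rank 0 (host-to-host through MPI)
    MPI_Gather(local.data(), chunk, MPI_FLOAT,
               gathered.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        // b. CPU -> GPU copy, then the calculation
        cudaMemcpy(d_buf, gathered.data(), total * sizeof(float), cudaMemcpyHostToDevice);
        compute<<<(unsigned)((total + 255) / 256), 256>>>(d_buf, total);
        // c. GPU -> CPU copy of the new results (synchronizes with the kernel)
        cudaMemcpy(gathered.data(), d_buf, total * sizeof(float), cudaMemcpyDeviceToHost);
    }

    // d. rank 0 scatters the results back to ranks 1..N
    MPI_Scatter(gathered.data(), chunk, MPI_FLOAT,
                local.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (rank == 0) cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```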

Some rough guesses include:

a. CUDA IPC (unfortunately not supported for unified memory).

b. CUDA-aware MPI (I haven't tried this yet; the default apt-get OpenMPI build is not CUDA-aware and does not support GPUDirect). A rough idea of how this would be used is sketched after this list.

c. Perhaps some third-party package could help?
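My (untested) understanding is that a CUDA-aware build would let rank 0 pass device pointers straight to the MPI collectives, so the explicit staging copies in steps b/c would disappear. Roughly:

```cpp
// cuda_aware_sketch.cu -- untested sketch, assuming a CUDA-aware OpenMPI build
// (I believe CUDA support of the build can be checked with:
//   ompi_info --parsable --all | grep mpi_built_with_cuda_support)
#include <mpi.h>
#include <cuda_runtime.h>

// d_local: a device buffer on every rank; d_gathered: a device buffer on rank 0.
// With a CUDA-aware MPI, device pointers can be handed to the collectives directly.
void exchange(float* d_local, float* d_gathered, int chunk) {
    MPI_Gather(d_local, chunk, MPI_FLOAT,
               d_gathered, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

    // ... rank 0 launches its kernel on d_gathered here,
    //     followed by cudaDeviceSynchronize() ...

    MPI_Scatter(d_gathered, chunk, MPI_FLOAT,
                d_local, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
}
```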

Would you mind providing some suggestions on whether I should switch to a CUDA-aware OpenMPI build, or change the unified allocations to cudaMalloc so that I can leverage CUDA IPC for performance?

Thank you!

Does the CUDA code access the input more than once? If not, zero-copy with pinned memory can be an option. Apart from that, cudaMallocManaged tends to be much slower than explicit copies.
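For illustration, a minimal zero-copy setup with mapped pinned host memory could look like the following (names and sizes are placeholders):

```cpp
// zero_copy_sketch.cu -- mapped pinned (zero-copy) host memory, read once by the kernel
#include <cuda_runtime.h>

__global__ void consume(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;  // each input element crosses the bus exactly once
}

int main() {
    const int n = 1 << 20;
    float *h_in = nullptr, *d_in = nullptr, *d_out = nullptr;

    // Pinned, mapped host allocation: the GPU reads it directly over PCIe.
    cudaHostAlloc((void**)&h_in, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_in, h_in, 0);
    cudaMalloc((void**)&d_out, n * sizeof(float));

    for (int i = 0; i < n; ++i) h_in[i] = float(i);    // filled on the CPU

    consume<<<(n + 255) / 256, 256>>>(d_in, d_out, n); // no explicit H2D copy
    cudaDeviceSynchronize();

    cudaFree(d_out);
    cudaFreeHost(h_in);
    return 0;
}
```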

In the program, there is a loop where CPU-GPU-CPU data transfer is required every few seconds.

The total elapsed time of the program can be up to days.

My concern is the high frequency of the data transfer. Maybe a more direct approach, say integrating GPUDirect into MPI, would be better?

Every few seconds is not a particularly high frequency.

If the data is actually being accessed on both the CPU and the GPU each iteration, and it is not accessed in a very sparse way (e.g. only 10 bytes per MB per iteration), I would just use cudaMalloc and normal block copies. For a small speedup, the host (CPU) buffer should be the same buffer in every iteration and should be pinned beforehand.
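As a sketch of that pattern (one pinned host buffer allocated once and reused; names and sizes are placeholders):

```cpp
// pinned_reuse_sketch.cu -- cudaMalloc on the device plus one pinned host buffer
// that is allocated once up front and reused in every iteration
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    float *h_buf = nullptr, *d_buf = nullptr;

    cudaMallocHost((void**)&h_buf, n * sizeof(float));  // pinned once, up front
    cudaMalloc((void**)&d_buf, n * sizeof(float));

    for (int iter = 0; iter < 1000; ++iter) {
        // ... the application fills h_buf on the CPU ...
        cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
        // ... kernels run on d_buf ...
        cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
        // ... the application consumes h_buf on the CPU ...
    }

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```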

Under some conditions you could also use the unified zero-copy access method (which is also different from cudaMallocManaged).

Depending on the parallelism of your overall task, you can use two streams to overlap data transfers with computations (and perhaps overlap CPU computations with GPU computations). In this case you would use asynchronous memory copies. But your task could also be strictly serial.
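A rough two-stream overlap pattern, assuming the work can be split into independent halves (all names and the kernel are placeholders):

```cpp
// overlap_sketch.cu -- split the work across two streams to overlap copies and compute
#include <cuda_runtime.h>

__global__ void work(float* data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;  // placeholder computation
}

int main() {
    const size_t n = 1 << 20, half = n / 2;
    float *h = nullptr, *d = nullptr;
    cudaMallocHost((void**)&h, n * sizeof(float));   // pinned: required for true async copies
    cudaMalloc((void**)&d, n * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int i = 0; i < 2; ++i) {
        size_t off = i * half;
        cudaMemcpyAsync(d + off, h + off, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        work<<<(unsigned)((half + 255) / 256), 256, 0, s[i]>>>(d + off, half);
        cudaMemcpyAsync(h + off, d + off, half * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();  // wait for both streams

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```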

Thank you, Curefab, for the detailed reply.

Currently, I might not be able to rearrange my CPU code to use pinned or mapped pinned memory (which I believe is what you mean by the zero-copy plan via cudaHostAlloc with the mapped flag). It is a long story involving a collaborator's code shared with my colleagues, and the code structure is too complicated to wrap in a pinned-memory abstraction. The data counterpart in the GPU kernel uses raw types (float2/3/4), whereas the CPU side uses a templated C++ class.

Does this mean the C++ data situation limits me to asynchronous memcpy? My guess (I am not sure) is that a host-to-device cudaMemcpyAsync from pageable CPU memory may not actually behave asynchronously as expected, and can still block the host and device in some situations.

Some documentation also mentions that if memory allocated with cudaMallocManaged is not touched alternately by the CPU and the GPU, but instead resides on the GPU the whole time, the performance of unified memory may not degrade much compared to native cudaMalloc allocations. (In my case, the CPU-side access to the unified memory is only used in the debug stage; in a release deployment it is treated entirely as device memory.)
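If I keep the managed allocations, my understanding is that I could also hint the driver that the data should stay resident on the GPU, along these lines (device ID and size are placeholders):

```cpp
// managed_resident_sketch.cu -- keep a managed allocation resident on the device
#include <cuda_runtime.h>

int main() {
    const size_t bytes = (1 << 20) * sizeof(float);
    float* m = nullptr;
    int dev = 0;
    cudaGetDevice(&dev);

    cudaMallocManaged(&m, bytes);
    // Prefer the GPU as the home of this allocation and prefetch it there once.
    cudaMemAdvise(m, bytes, cudaMemAdviseSetPreferredLocation, dev);
    cudaMemPrefetchAsync(m, bytes, dev, 0);

    // ... kernels operate on m; as long as the CPU does not touch it,
    //     no migrations back to host memory should occur ...

    cudaDeviceSynchronize();
    cudaFree(m);
    return 0;
}
```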

So far, I am a little worried about making big changes to the CPU side of the code. Both pinned zero-copy and async memcpy would require me to maintain proper data consistency between the CPU and GPU.

Just (manually) copying 'normal' memory between host and device for each iteration (only in the debug stage) is probably faster than cudaMallocManaged. Have you measured the time needed for the copies?

No profiling analysis for the memcpy part has been done yet.

I will minimize the code and benchmark the differences.
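For the benchmark, I am planning something along these lines (cudaEvent-based timing of a single host-to-device copy; the buffer size is a placeholder):

```cpp
// copy_timing_sketch.cu -- time one H2D copy for pageable vs. pinned host memory
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

static float timeCopy(float* dst, const float* src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t n = 1 << 24;                  // placeholder size (64 MB of floats)
    const size_t bytes = n * sizeof(float);

    float* d = nullptr;
    cudaMalloc((void**)&d, bytes);

    float* pageable = (float*)malloc(bytes);   // ordinary host allocation
    float* pinned = nullptr;
    cudaMallocHost((void**)&pinned, bytes);    // pinned host allocation

    printf("pageable H2D: %.3f ms\n", timeCopy(d, pageable, bytes));
    printf("pinned   H2D: %.3f ms\n", timeCopy(d, pinned, bytes));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(d);
    return 0;
}
```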