Memory increase in GPU-aware non-blocking MPI communications

Dear staff,

I am working on a code that has two versions: one GPU-aware and one not. In the first case, the device buffers are passed directly to non-blocking communications of the kind MPI_Isend + MPI_Irecv + MPI_Waitall; in the second case, the code copies the buffers to/from the GPU and communicates the data on the host. These communications are performed inside an iterative loop.
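For context, the two variants differ only in which buffers are handed to MPI. Below is a minimal sketch of one exchange step, with hypothetical names (nneigh, neigh, count, d_send, h_send, ...) that are not from the real code:

#include <mpi.h>
#include <cuda_runtime.h>

/* Sketch only, not the real code: one exchange step of the
 * non-GPU-aware variant, which stages the device buffers through the
 * host. In the GPU-aware variant the cudaMalloc'd pointers
 * d_send[n]/d_recv[n] are passed to MPI_Isend/MPI_Irecv directly and
 * the two cudaMemcpy loops disappear. */
void exchange_step(int nneigh, const int *neigh, int count,
                   double **d_send, double **d_recv,
                   double **h_send, double **h_recv,
                   MPI_Request *reqs)
{
    int nreq = 0;
    for (int n = 0; n < nneigh; ++n)                      /* device -> host */
        cudaMemcpy(h_send[n], d_send[n], count * sizeof(double),
                   cudaMemcpyDeviceToHost);
    for (int n = 0; n < nneigh; ++n) {
        MPI_Irecv(h_recv[n], count, MPI_DOUBLE, neigh[n], 0,
                  MPI_COMM_WORLD, &reqs[nreq++]);
        MPI_Isend(h_send[n], count, MPI_DOUBLE, neigh[n], 0,
                  MPI_COMM_WORLD, &reqs[nreq++]);
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    for (int n = 0; n < nneigh; ++n)                      /* host -> device */
        cudaMemcpy(d_recv[n], h_recv[n], count * sizeof(double),
                   cudaMemcpyHostToDevice);
}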

We are using both OpenACC and CUDA, so I am setting PGI_ACC_MEMORY_MANAGE=0 to prevent the OpenACC runtime from applying its own memory-management optimizations.

If I compare the GPU memory used on the device in the two cases, GPU-aware and not, I see the memory used by the GPU increase at each step of my iteration, but only in the GPU-aware case. I first noticed this with nvidia-smi and confirmed it with Nsight Systems using --cuda-memory-usage=true.
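For reference, this is roughly how I monitored the device memory; the executable name, rank count, and per-rank output naming are just examples:

# poll the used device memory once per second
nvidia-smi --query-gpu=memory.used --format=csv -l 1

# trace a run with CUDA memory-usage recording enabled (one report per rank)
mpirun -np 4 nsys profile --cuda-memory-usage=true \
       -o report_rank%q{OMPI_COMM_WORLD_RANK} ./my_app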

As a result, the GPU-aware version is prone to crashing with an out-of-memory error after some number of steps. This is a pity because it prevents running the whole simulation (up to convergence) on a smaller number of nodes, unless GPU awareness is switched off.

By comparing the traces for the two cases, I tried to understand which event triggers the increase in GPU memory. As far as I can see, there are no differences in the allocations/deallocations managed explicitly by the code. The increase coincides with calls to cuIpcOpenMemHandle, and from the call stack (picture below) it appears that these are triggered by the rendezvous protocol of the non-blocking communications. Is this expected, or are we missing something in our use of CUDA-aware MPI?

I am using nvhpc/23.1, OpenMPI 4.1.4, and cuda/11.8.

Thank you for your time,

Laura

Hello,

I tested a small reproducer that performs a number of MPI_Isend + MPI_Irecv + MPI_Wait exchanges between all processes, as in the code, and I see the same cuIpcOpenMemHandle routines consuming GPU memory, called from the MPI_Wait.
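For completeness, the reproducer is essentially along the lines below; the buffer size and iteration count are placeholders, and I use MPI_Waitall here instead of individual MPI_Wait calls just for brevity:

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count  = 1 << 20;   /* placeholder message size (doubles) */
    const int nsteps = 100;       /* placeholder iteration count */

    /* one send/recv slot per peer, allocated on the device */
    double *d_send, *d_recv;
    cudaMalloc((void **)&d_send, (size_t)count * size * sizeof(double));
    cudaMalloc((void **)&d_recv, (size_t)count * size * sizeof(double));

    MPI_Request *reqs = malloc(2 * (size_t)size * sizeof(MPI_Request));

    for (int step = 0; step < nsteps; ++step) {
        int nreq = 0;
        for (int p = 0; p < size; ++p) {
            if (p == rank) continue;
            /* device pointers are passed directly to MPI (GPU-aware path) */
            MPI_Irecv(d_recv + (size_t)p * count, count, MPI_DOUBLE, p, 0,
                      MPI_COMM_WORLD, &reqs[nreq++]);
            MPI_Isend(d_send + (size_t)p * count, count, MPI_DOUBLE, p, 0,
                      MPI_COMM_WORLD, &reqs[nreq++]);
        }
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    }

    free(reqs);
    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}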

I then compiled the reproducer with the openmpi/3.1.5 shipped inside nvhpc/23.1, and in this case there are also cuIpcCloseMemHandle routines, called from MPI_Finalize, which deallocate the memory on the GPU.

Thank you,

Laura

Hello,
I think I’m encountering the same issue in my code. Is there any update on this one?

Hello Timothee,

I have started using export UCX_TLS=^cuda_ipc when more than one node is used, as long as the time spent in intra-node communication is not a bottleneck with respect to the inter-node time. However, this is probably not the best thing to do, and I hope NVIDIA support can provide a better answer :)
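For reference, in my job script this amounts to something like the following; the rank count and binary name are just examples:

export UCX_TLS=^cuda_ipc
mpirun -np 8 -x UCX_TLS ./my_app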

By the way, you can also try building against hpcx; I have seen better performance with it, without needing all these exports.

Let me know if this helps!

See you,

Laura

Hi,

I’ve just tried with UCX_TLS=^cuda_ipc, but I observe the same memory growth.
And just to confirm that this is indeed the same issue, here are the logs from Open MPI:

[dgx:524865] CUDA: cuMemGetAddressRange passed: addr=0x7f1454000000, size=867297128, pbase=0x7f1454000000, psize=867297128 
[dgx:524865] CUDA: cuMemGetAddressRange passed: addr=0x7f12bc000000, size=861938456, pbase=0x7f12bc000000, psize=861938456 
[dgx:524872] CUDA: cuMemGetAddressRange passed: addr=0x7fb1e2000000, size=867297128, pbase=0x7fb1e2000000, psize=867297128 
[dgx:524872] CUDA: cuIpcOpenMemHandle passed: base=0x7fb1ae000000 (remote base=0x7f1454000000,size=867297128)
[dgx:524872] CUDA: cuMemGetAddressRange passed: addr=0x7fb17a000000, size=861938456, pbase=0x7fb17a000000, psize=861938456 
[dgx:524872] CUDA: cuIpcOpenMemHandle passed: base=0x7fb146000000 (remote base=0x7f12bc000000,size=861938456)
[dgx:524872] CUDA: cuEventQuery returned 0
[dgx:524872] CUDA: cuda_ungetmemhandle (no-op): base=0x7fb1e2000000
[dgx:524865] CUDA: cuda_ungetmemhandle (no-op): base=0x7f1454000000
[dgx:524872] CUDA: cuEventQuery returned 0
[dgx:524872] CUDA: cuda_ungetmemhandle (no-op): base=0x7fb17a000000
[dgx:524865] CUDA: cuda_ungetmemhandle (no-op): base=0x7f12bc000000

Similarly to the OP’s case, cuIpcCloseMemHandle is rarely called, resulting in memory growth.

You might get better help by posting on the HPC compilers forum. CUDA-aware MPI is generally provided by the MPI implementation, not by the CUDA toolkit directly, but Laura/the OP mentions using nvhpc, which is our HPC SDK and includes an MPI curated by NVIDIA; support for that will probably be better on the HPC compilers forum.