Memory increase in GPU-aware non-blocking MPI communications

Dear staff,

I am working on a code that has two versions: one GPU-aware and one not. In the first case, the device buffers are passed directly to non-blocking communications of the kind MPI_Isend + MPI_Irecv + MPI_Waitall; in the second case, the code copies the buffers to/from the GPU and communicates the data on the host. These communications are performed inside an iterative loop.
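For context, the two variants differ only in which buffers are handed to MPI. Below is a minimal sketch of one exchange step, with hypothetical names (nneigh, neigh, count, d_send, h_send, ...) that are not from the real code:

#include <mpi.h>
#include <cuda_runtime.h>

/* Sketch only, not the real code: one exchange step of the
 * non-GPU-aware variant, which stages the device buffers through the
 * host. In the GPU-aware variant the cudaMalloc'd pointers
 * d_send[n]/d_recv[n] are passed to MPI_Isend/MPI_Irecv directly and
 * the two cudaMemcpy loops disappear. */
void exchange_step(int nneigh, const int *neigh, int count,
                   double **d_send, double **d_recv,
                   double **h_send, double **h_recv,
                   MPI_Request *reqs)
{
    int nreq = 0;
    for (int n = 0; n < nneigh; ++n)                      /* device -> host */
        cudaMemcpy(h_send[n], d_send[n], count * sizeof(double),
                   cudaMemcpyDeviceToHost);
    for (int n = 0; n < nneigh; ++n) {
        MPI_Irecv(h_recv[n], count, MPI_DOUBLE, neigh[n], 0,
                  MPI_COMM_WORLD, &reqs[nreq++]);
        MPI_Isend(h_send[n], count, MPI_DOUBLE, neigh[n], 0,
                  MPI_COMM_WORLD, &reqs[nreq++]);
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    for (int n = 0; n < nneigh; ++n)                      /* host -> device */
        cudaMemcpy(d_recv[n], h_recv[n], count * sizeof(double),
                   cudaMemcpyHostToDevice);
}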

We are using both OpenACC and CUDA, so I am setting PGI_ACC_MEMORY_MANAGE=0 to prevent the OpenACC runtime from applying its own memory-management optimizations.

If I compare the GPU memory used on the device in the two cases, GPU-aware and not, I see the memory used by the GPU increase at each step of my iteration, but only in the GPU-aware case. I first noticed this with nvidia-smi and confirmed it with Nsight Systems using --cuda-memory-usage=true.
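For reference, this is roughly how I monitored the device memory; the executable name, rank count, and per-rank output naming are just examples:

# poll the used device memory once per second
nvidia-smi --query-gpu=memory.used --format=csv -l 1

# trace a run with CUDA memory-usage recording enabled (one report per rank)
mpirun -np 4 nsys profile --cuda-memory-usage=true \
       -o report_rank%q{OMPI_COMM_WORLD_RANK} ./my_app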

As a result, the GPU-aware version is prone to crashing with an out-of-memory error after some number of steps. This is a pity because it prevents running the whole simulation (up to convergence) on a smaller number of nodes, unless GPU awareness is switched off.

By comparing the traces for the two cases, I tried to understand which event triggers the increase in GPU memory. As far as I can see, there are no differences in the allocations/deallocations managed explicitly by the code. The increase coincides with calls to cuIpcOpenMemHandle, and from the call stack (picture below) it appears that these are triggered by the rendezvous protocol of the non-blocking communications. Is this expected, or are we missing something in our use of CUDA-aware MPI?

I am using nvhpc/23.1, OpenMPI 4.1.4, and cuda/11.8.

Thank you for your time,

Laura

Hello,

I tested a small reproducer that performs a number of MPI_Isend + MPI_Irecv + MPI_Wait exchanges between all processes, as in the code, and I see the same cuIpcOpenMemHandle routines consuming GPU memory, called from the MPI_Wait.
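For completeness, the reproducer is essentially along the lines below; the buffer size and iteration count are placeholders, and I use MPI_Waitall here instead of individual MPI_Wait calls just for brevity:

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count  = 1 << 20;   /* placeholder message size (doubles) */
    const int nsteps = 100;       /* placeholder iteration count */

    /* one send/recv slot per peer, allocated on the device */
    double *d_send, *d_recv;
    cudaMalloc((void **)&d_send, (size_t)count * size * sizeof(double));
    cudaMalloc((void **)&d_recv, (size_t)count * size * sizeof(double));

    MPI_Request *reqs = malloc(2 * (size_t)size * sizeof(MPI_Request));

    for (int step = 0; step < nsteps; ++step) {
        int nreq = 0;
        for (int p = 0; p < size; ++p) {
            if (p == rank) continue;
            /* device pointers are passed directly to MPI (GPU-aware path) */
            MPI_Irecv(d_recv + (size_t)p * count, count, MPI_DOUBLE, p, 0,
                      MPI_COMM_WORLD, &reqs[nreq++]);
            MPI_Isend(d_send + (size_t)p * count, count, MPI_DOUBLE, p, 0,
                      MPI_COMM_WORLD, &reqs[nreq++]);
        }
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    }

    free(reqs);
    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}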

I then compiled the reproducer with the openmpi/3.1.5 shipped inside nvhpc/23.1, and in this case there are also cuIpcCloseMemHandle routines, called from MPI_Finalize, which deallocate the memory on the GPU.

Thank you,

Laura

Hello,
I think I’m encountering the same issue in my code. Is there any update on this one?

Hello Timothee,

I have started using export UCX_TLS=^cuda_ipc when more than one node is used, as long as the time spent in intra-node communication is not a bottleneck with respect to the inter-node time. However, this is probably not the best thing to do, and I hope NVIDIA support can provide a better answer :)
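For reference, in my job script this amounts to something like the following; the rank count and binary name are just examples:

export UCX_TLS=^cuda_ipc
mpirun -np 8 -x UCX_TLS ./my_app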

By the way, you can also try building against hpcx; I have seen better performance with it, without needing all these exports.

Let me know if this helps!

See you,

Laura

Hi,

I’ve just tried with UCX_TLS=^cuda_ipc, but I observe the same memory growth.
And just to confirm that this is indeed the same issue, here are the logs from Open MPI:

[dgx:524865] CUDA: cuMemGetAddressRange passed: addr=0x7f1454000000, size=867297128, pbase=0x7f1454000000, psize=867297128 
[dgx:524865] CUDA: cuMemGetAddressRange passed: addr=0x7f12bc000000, size=861938456, pbase=0x7f12bc000000, psize=861938456 
[dgx:524872] CUDA: cuMemGetAddressRange passed: addr=0x7fb1e2000000, size=867297128, pbase=0x7fb1e2000000, psize=867297128 
[dgx:524872] CUDA: cuIpcOpenMemHandle passed: base=0x7fb1ae000000 (remote base=0x7f1454000000,size=867297128)
[dgx:524872] CUDA: cuMemGetAddressRange passed: addr=0x7fb17a000000, size=861938456, pbase=0x7fb17a000000, psize=861938456 
[dgx:524872] CUDA: cuIpcOpenMemHandle passed: base=0x7fb146000000 (remote base=0x7f12bc000000,size=861938456)
[dgx:524872] CUDA: cuEventQuery returned 0
[dgx:524872] CUDA: cuda_ungetmemhandle (no-op): base=0x7fb1e2000000
[dgx:524865] CUDA: cuda_ungetmemhandle (no-op): base=0x7f1454000000
[dgx:524872] CUDA: cuEventQuery returned 0
[dgx:524872] CUDA: cuda_ungetmemhandle (no-op): base=0x7fb17a000000
[dgx:524865] CUDA: cuda_ungetmemhandle (no-op): base=0x7f12bc000000

Similarly to the OP’s case, cuIpcCloseMemHandle is rarely called, resulting in memory growth.

You might get better help by posting on the HPC compilers forum. CUDA-aware MPI is generally provided by the MPI implementation, not by the CUDA toolkit directly, but Laura/the OP mentions using nvhpc, which is our HPC SDK and includes an MPI curated by NVIDIA; support for that will probably be better on the HPC compilers forum.