Nvshmem4py Buffer cannot be freed by Python object lifetime

Here I am using nvshmem_python-source-0.1.0.36132199_cuda12-archive. With this implementation, a Buffer allocated through NvshmemResource is not released when the buffer goes out of scope, because NvshmemResource keeps an internal reference to it.

My question is: is this by design? I have to release the nvshmem tensor manually, which is not very pythonic.

Here is a sample to verify this (a modified copy of nvshmem_python-source-0.1.0.36132199_cuda12-archive/examples/torch_triton_interop.py):

import torch
import nvshmem.core

# torchrun_uid_init() and the rest of the setup are defined earlier in the
# original torch_triton_interop.py example and are unchanged here.

if __name__ == '__main__':
    torchrun_uid_init()

    # Allocate a tensor on the NVSHMEM symmetric heap on every iteration and
    # let it go out of scope; no deallocation ever shows up in the debug log.
    n_elements = 867530

    nvshmem.core.utils._configure_logging(level="DEBUG")

    for n in range(10):
        print(f"iter {n}", flush=True)
        tensor = nvshmem.core.tensor((n_elements,), dtype=torch.float32)

the log:

$ LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data01/houqi.1993/micromamba/envs/houqi/lib/python3.11/site-packages/nvidia/nvshmem/lib torchrun --node_rank=0 --nproc_per_node=8 --nnodes=1 --rdzv_endpoint=127.0.0.1:12345 ~/ProgramFiles/nvshmem_python-source-0.1.0.36132199_cuda12-archive/examples/torch_triton_interop.py
[W703 10:53:52.783941591 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W703 10:53:52.788321305 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W703 10:53:52.803465628 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W703 10:53:52.803650140 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W703 10:53:52.807265264 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W703 10:53:52.811320509 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W703 10:53:52.821334481 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W703 10:53:52.824206420 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
iter 0
iter 0
iter 0
iter 0
iter 0
H800-1-docker-n122-200-178:2227842:2227842 [3] NVSHMEM DEBUG : Creating NvshmemResource for device 3
H800-1-docker-n122-200-178:2227846:2227846 [7] NVSHMEM DEBUG : Creating NvshmemResource for device 7
H800-1-docker-n122-200-178:2227844:2227844 [5] NVSHMEM DEBUG : Creating NvshmemResource for device 5
H800-1-docker-n122-200-178:2227841:2227841 [2] NVSHMEM DEBUG : Creating NvshmemResource for device 2
H800-1-docker-n122-200-178:2227840:2227840 [1] NVSHMEM DEBUG : Creating NvshmemResource for device 1
iter 0
H800-1-docker-n122-200-178:2227845:2227845 [6] NVSHMEM DEBUG : Creating NvshmemResource for device 6
iter 0
H800-1-docker-n122-200-178:2227843:2227843 [4] NVSHMEM DEBUG : Creating NvshmemResource for device 4
iter 0
H800-1-docker-n122-200-178:2227839:2227839 [0] NVSHMEM DEBUG : Creating NvshmemResource for device 0
H800-1-docker-n122-200-178:2227845:2227845 [6] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 6 (NVIDIA H800)>> at address 1440699386368 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227839:2227839 [0] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 0 (NVIDIA H800)>> at address 1405802777088 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227844:2227844 [5] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 5 (NVIDIA H800)>> at address 1440699386368 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227840:2227840 [1] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 1 (NVIDIA H800)>> at address 1440699386368 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227843:2227843 [4] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 4 (NVIDIA H800)>> at address 1440699386368 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227841:2227841 [2] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 2 (NVIDIA H800)>> at address 1440699386368 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227846:2227846 [7] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 7 (NVIDIA H800)>> at address 1440699386368 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227842:2227842 [3] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 3 (NVIDIA H800)>> at address 1440699386368 with size 3470120 on stream None
iter 1
iter 1
iter 1
iter 1
iter 1
iter 1
iter 1
iter 1
H800-1-docker-n122-200-178:2227841:2227841 [2] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 2 (NVIDIA H800)>> at address 1440702856704 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227845:2227845 [6] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 6 (NVIDIA H800)>> at address 1440702856704 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227843:2227843 [4] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 4 (NVIDIA H800)>> at address 1440702856704 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227844:2227844 [5] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 5 (NVIDIA H800)>> at address 1440702856704 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227840:2227840 [1] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 1 (NVIDIA H800)>> at address 1440702856704 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227839:2227839 [0] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 0 (NVIDIA H800)>> at address 1405806247424 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227846:2227846 [7] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 7 (NVIDIA H800)>> at address 1440702856704 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227842:2227842 [3] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 3 (NVIDIA H800)>> at address 1440702856704 with size 3470120 on stream None
iter 2
iter 2
iter 2
iter 2
iter 2
iter 2
iter 2
iter 2
..... many more similar lines follow, but nothing from the deallocation logic appears until nvshmem.core.finalize() and free() are called .....

Hi, thank you for your comment. Yes, this is by design.

The short version of why we decided to hold internal references to NVSHMEM symmetric memory is that it prevents deadlocks. As you probably know, NVSHMEM requires a total global ordering of allocations and frees, and nvshmem_malloc and nvshmem_free are blocking operations with an internal barrier.

Unfortunately, because the order in which Python's garbage collector reclaims objects is not deterministic, you can end up in situations where, for example, one PE calls free on buffer A before buffer B while another PE calls free on buffer B before buffer A. That would deadlock.
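
To make that concrete, here is a hypothetical sketch (not taken from the nvshmem4py examples; the per-PE branch and the idea that del would trigger a collective free are illustrative assumptions) of how lifetime-driven frees could be issued in different orders on different PEs:

import torch
import nvshmem.core

# Two symmetric-heap tensors, allocated in the same order on every PE.
buf_a = nvshmem.core.tensor((1024,), dtype=torch.float32)
buf_b = nvshmem.core.tensor((1024,), dtype=torch.float32)

if nvshmem.core.my_pe() == 0:
    del buf_a  # PE 0 drops its reference to A first, then B
    del buf_b
else:
    del buf_b  # every other PE drops B first, then A
    del buf_a

# If dropping the last Python reference issued a blocking collective
# nvshmem_free, PE 0 would sit in free(A) while the others sit in free(B):
# a deadlock. With the actual design the dels are harmless, because the
# library keeps its own reference until free()/finalize().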

We decided to rule out any possibility of deadlock by holding references to the objects until you explicitly call free() on them. Alternatively, if objects are simply left alive until the Python process ends, nvshmem.core.finalize() will take care of the freeing for you. However, if you never call free, those buffers are not returned to the NVSHMEM symmetric heap and cannot be reused for another allocation.
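
For reference, a minimal sketch of the explicit pattern, assuming the free entry point is nvshmem.core.free_tensor() (an assumption on my part; check the API of the version you have installed):

import torch
import nvshmem.core

tensor = nvshmem.core.tensor((867530,), dtype=torch.float32)
# ... use the tensor ...

# Explicitly return the symmetric memory to the heap. Like the allocation,
# this is collective, so every PE must free the same buffers in the same order.
nvshmem.core.free_tensor(tensor)  # assumed name; verify against your install

# Anything still held when the program ends is released here instead.
nvshmem.core.finalize()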

Thanks, got it.

I'm not an expert on Python object lifetimes. I rely on Python variable lifetime to free the memory allocated by nvshmem_malloc, with a torch.cuda.synchronize() before/after the malloc/dealloc, and it all seems to work. Can you share an example where this approach will fail?

Because the Python garbage collector is non-deterministic, it is very hard to produce a sample that will always trigger this issue, and very easy to produce one that will sometimes trigger it. Essentially any non-trivial Python program that allocates symmetric memory and then lets it go out of scope can show this problem.
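
As a hedged illustration of "goes out of scope" (hypothetical code; the reference cycle and the per-PE gc.collect() timing are just one way collection order can diverge between processes):

import gc
import torch
import nvshmem.core

def step(n_elements):
    # Temporary symmetric tensor that goes out of scope when we return.
    tmp = nvshmem.core.tensor((n_elements,), dtype=torch.float32)
    holder = {"tensor": tmp}
    holder["self"] = holder  # reference cycle: reclaimed only by the cyclic collector
    return n_elements

for n in range(10):
    step(867530)
    if n % 3 == 0:
        gc.collect()  # each PE's collector runs at its own, unsynchronized time

# If the underlying Buffer were freed by garbage collection, the collective
# frees could fire in different iterations on different PEs and hang; holding
# internal references until an explicit free()/finalize() avoids that.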