Nvshmem4py Buffer cannot be freed by Python object lifetime

Here I am using nvshmem_python-source-0.1.0.36132199_cuda12-archive. With this implementation, a Buffer allocated through NvshmemResource is not released when the buffer goes out of scope, because NvshmemResource keeps an internal reference to it.

My question is: is this by design? I have to release the nvshmem tensor manually, which is not very pythonic.

Here is a sample to verify this (a modified copy of nvshmem_python-source-0.1.0.36132199_cuda12-archive/examples/torch_triton_interop.py):

import torch
import nvshmem.core

# torchrun_uid_init() and the rest of the setup are defined earlier in the
# original torch_triton_interop.py example and are unchanged here.

if __name__ == '__main__':
    torchrun_uid_init()

    # Allocate a tensor on the NVSHMEM symmetric heap on every iteration and
    # let it go out of scope; no deallocation ever shows up in the debug log.
    n_elements = 867530

    nvshmem.core.utils._configure_logging(level="DEBUG")

    for n in range(10):
        print(f"iter {n}", flush=True)
        tensor = nvshmem.core.tensor((n_elements,), dtype=torch.float32)

the log:

$ LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data01/houqi.1993/micromamba/envs/houqi/lib/python3.11/site-packages/nvidia/nvshmem/lib torchrun --node_rank=0 --nproc_per_node=8 --nnodes=1 --rdzv_endpoint=127.0.0.1:12345 ~/ProgramFiles/nvshmem_python-source-0.1.0.36132199_cuda12-archive/examples/torch_triton_interop.py
[W703 10:53:52.783941591 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W703 10:53:52.788321305 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W703 10:53:52.803465628 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W703 10:53:52.803650140 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W703 10:53:52.807265264 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W703 10:53:52.811320509 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W703 10:53:52.821334481 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[W703 10:53:52.824206420 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
iter 0
iter 0
iter 0
iter 0
iter 0
H800-1-docker-n122-200-178:2227842:2227842 [3] NVSHMEM DEBUG : Creating NvshmemResource for device 3
H800-1-docker-n122-200-178:2227846:2227846 [7] NVSHMEM DEBUG : Creating NvshmemResource for device 7
H800-1-docker-n122-200-178:2227844:2227844 [5] NVSHMEM DEBUG : Creating NvshmemResource for device 5
H800-1-docker-n122-200-178:2227841:2227841 [2] NVSHMEM DEBUG : Creating NvshmemResource for device 2
H800-1-docker-n122-200-178:2227840:2227840 [1] NVSHMEM DEBUG : Creating NvshmemResource for device 1
iter 0
H800-1-docker-n122-200-178:2227845:2227845 [6] NVSHMEM DEBUG : Creating NvshmemResource for device 6
iter 0
H800-1-docker-n122-200-178:2227843:2227843 [4] NVSHMEM DEBUG : Creating NvshmemResource for device 4
iter 0
H800-1-docker-n122-200-178:2227839:2227839 [0] NVSHMEM DEBUG : Creating NvshmemResource for device 0
H800-1-docker-n122-200-178:2227845:2227845 [6] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 6 (NVIDIA H800)>> at address 1440699386368 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227839:2227839 [0] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 0 (NVIDIA H800)>> at address 1405802777088 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227844:2227844 [5] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 5 (NVIDIA H800)>> at address 1440699386368 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227840:2227840 [1] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 1 (NVIDIA H800)>> at address 1440699386368 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227843:2227843 [4] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 4 (NVIDIA H800)>> at address 1440699386368 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227841:2227841 [2] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 2 (NVIDIA H800)>> at address 1440699386368 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227846:2227846 [7] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 7 (NVIDIA H800)>> at address 1440699386368 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227842:2227842 [3] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 3 (NVIDIA H800)>> at address 1440699386368 with size 3470120 on stream None
iter 1
iter 1
iter 1
iter 1
iter 1
iter 1
iter 1
iter 1
H800-1-docker-n122-200-178:2227841:2227841 [2] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 2 (NVIDIA H800)>> at address 1440702856704 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227845:2227845 [6] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 6 (NVIDIA H800)>> at address 1440702856704 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227843:2227843 [4] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 4 (NVIDIA H800)>> at address 1440702856704 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227844:2227844 [5] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 5 (NVIDIA H800)>> at address 1440702856704 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227840:2227840 [1] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 1 (NVIDIA H800)>> at address 1440702856704 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227839:2227839 [0] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 0 (NVIDIA H800)>> at address 1405806247424 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227846:2227846 [7] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 7 (NVIDIA H800)>> at address 1440702856704 with size 3470120 on stream None
H800-1-docker-n122-200-178:2227842:2227842 [3] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 3 (NVIDIA H800)>> at address 1440702856704 with size 3470120 on stream None
iter 2
iter 2
iter 2
iter 2
iter 2
iter 2
iter 2
iter 2
..... many more similar lines follow, but nothing from the deallocation logic appears until nvshmem.core.finalize() and free() are called .....

Hi, thank you for your comment. Yes, this is by design.

The short version of why we decided to hold internal references to NVSHMEM symmetric memory is that it prevents deadlocks. As you probably know, NVSHMEM requires a total global ordering of allocations and frees, and nvshmem_malloc and nvshmem_free are blocking operations with an internal barrier.

Unfortunately, because the order in which Python's garbage collector reclaims objects is not deterministic, you can end up in situations where, for example, one PE calls free on buffer A before buffer B while another PE calls free on buffer B before buffer A. That would deadlock.
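
To make that concrete, here is a hypothetical sketch (not taken from the nvshmem4py examples; the per-PE branch and the idea that del would trigger a collective free are illustrative assumptions) of how lifetime-driven frees could be issued in different orders on different PEs:

import torch
import nvshmem.core

# Two symmetric-heap tensors, allocated in the same order on every PE.
buf_a = nvshmem.core.tensor((1024,), dtype=torch.float32)
buf_b = nvshmem.core.tensor((1024,), dtype=torch.float32)

if nvshmem.core.my_pe() == 0:
    del buf_a  # PE 0 drops its reference to A first, then B
    del buf_b
else:
    del buf_b  # every other PE drops B first, then A
    del buf_a

# If dropping the last Python reference issued a blocking collective
# nvshmem_free, PE 0 would sit in free(A) while the others sit in free(B):
# a deadlock. With the actual design the dels are harmless, because the
# library keeps its own reference until free()/finalize().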

We decided to rule out any possibility of deadlock by holding references to the objects until you explicitly call free() on them. Alternatively, if objects are simply left alive until the Python process ends, nvshmem.core.finalize() will take care of the freeing for you. However, if you never call free, those buffers are not returned to the NVSHMEM symmetric heap and cannot be reused for another allocation.
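
For reference, a minimal sketch of the explicit pattern, assuming the free entry point is nvshmem.core.free_tensor() (an assumption on my part; check the API of the version you have installed):

import torch
import nvshmem.core

tensor = nvshmem.core.tensor((867530,), dtype=torch.float32)
# ... use the tensor ...

# Explicitly return the symmetric memory to the heap. Like the allocation,
# this is collective, so every PE must free the same buffers in the same order.
nvshmem.core.free_tensor(tensor)  # assumed name; verify against your install

# Anything still held when the program ends is released here instead.
nvshmem.core.finalize()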

Thanks, got it.

I'm not an expert on Python object lifetimes. I rely on Python variable lifetime to free the memory allocated by nvshmem_malloc, with a torch.cuda.synchronize() before/after the malloc/dealloc, and it all seems to work. Can you share an example where this approach will fail?

Because the Python garbage collector is non-deterministic, it is very hard to produce a sample that will always trigger this issue, and very easy to produce one that will sometimes trigger it. Essentially any non-trivial Python program that allocates symmetric memory and then lets it go out of scope can show this problem.
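
As a hedged illustration of "goes out of scope" (hypothetical code; the reference cycle and the per-PE gc.collect() timing are just one way collection order can diverge between processes):

import gc
import torch
import nvshmem.core

def step(n_elements):
    # Temporary symmetric tensor that goes out of scope when we return.
    tmp = nvshmem.core.tensor((n_elements,), dtype=torch.float32)
    holder = {"tensor": tmp}
    holder["self"] = holder  # reference cycle: reclaimed only by the cyclic collector
    return n_elements

for n in range(10):
    step(867530)
    if n % 3 == 0:
        gc.collect()  # each PE's collector runs at its own, unsynchronized time

# If the underlying Buffer were freed by garbage collection, the collective
# frees could fire in different iterations on different PEs and hang; holding
# internal references until an explicit free()/finalize() avoids that.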