What I expected from nvshmem.core.finalize: it finds every nvshmem.core.tensor that has not been released and releases it, so that nvshmem.core.finalize takes care of everything without an error or exception.
But that is not the case.
If you create
A = nvshmem.core.tensor(...)
and then call
B = nvshmem.core.get_peer_tensor(A, rank)
this can increase the reference count on A's buffer: in the log below, one of the get_peer_tensor calls finds the already tracked buffer at address 1406077503488 and returns it with ref count 2. If you then forget to call nvshmem.core.free_tensor(A) and just call nvshmem.core.finalize(), nvshmem4py punishes you: nvshmem.core.finalize calls deallocate on the buffer only once, but the buffer has 2 references, so the count only drops to 1 and the buffer is not freed.
After nvshmem.core.finalize() thinks it has freed all buffers, it calls nvshmem_finalize.
Python teardown then continues and cleans up the buffers that were never destroyed, deallocating each of them again. This time the ref count reaches 0, so it tries to call nvshmem_free, but NVSHMEM is already finalized, and we get a panic (see the last two lines of the log).
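To make the sequence above concrete, here is a toy model in plain Python. It is only a sketch of the behavior I described, not nvshmem4py's actual implementation: ToyHeap and ToyTracker are hypothetical names I made up, and the "finalize" here is assumed to call deallocate exactly once per tracked buffer, which is what the log suggests.

```python
class ToyHeap:
    """Hypothetical stand-in for the NVSHMEM runtime: free() is only legal while initialized."""

    def __init__(self):
        self.initialized = True
        self.freed = []

    def free(self, addr):
        if not self.initialized:
            raise RuntimeError("free called after finalize")
        self.freed.append(addr)


class ToyTracker:
    """Hypothetical stand-in for the per-buffer reference counting seen in the debug log."""

    def __init__(self, heap):
        self.heap = heap
        self.refs = {}

    def alloc(self, addr):
        self.refs[addr] = 1
        return addr

    def get_peer(self, addr):
        # Assumption: asking for a view of an already tracked buffer bumps its count.
        self.refs[addr] += 1
        return addr

    def deallocate(self, addr):
        if self.refs[addr] == 0:
            return  # already freed, nothing to do
        self.refs[addr] -= 1
        if self.refs[addr] == 0:
            self.heap.free(addr)

    def finalize(self):
        # Assumption: one deallocate per tracked buffer, regardless of its count.
        for addr in list(self.refs):
            if self.refs[addr] > 0:
                self.deallocate(addr)
        self.heap.initialized = False


heap = ToyHeap()
tracker = ToyTracker(heap)
a = tracker.alloc(0x1000)
b = tracker.get_peer(a)      # ref count is now 2
tracker.finalize()           # only drops it to 1, then shuts the heap down
refs_after_finalize = tracker.refs[a]

try:
    tracker.deallocate(a)    # what interpreter teardown does later
    late_error = None
except RuntimeError as exc:
    late_error = str(exc)
```

In this model the buffer still has one reference after finalize, and the late deallocate is the call that reaches free() on a heap that is already shut down.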
This is the sample code:
import torch.distributed as dist
import torch
import nvshmem.core
import os
from cuda.core.experimental import Device, system
import nvshmem.core.utils


class PyTorchStreamWrapper:
    def __init__(self, pt_stream):
        self.pt_stream = pt_stream
        self.handle = pt_stream.cuda_stream

    def __cuda_stream__(self):
        stream_id = self.pt_stream.cuda_stream
        return (0, stream_id)  # Return format required by CUDA Python


def torchrun_uid_init():
    """
    Initialize NVSHMEM using UniqueID with `torchrun` as the launcher
    """
    # Set Torch device
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)

    # nvshmem4py requires a cuda.core Device at init time
    global dev
    dev = Device(device.index)
    dev.set_current()
    global stream
    # Get PyTorch's current stream
    pt_stream = torch.cuda.current_stream()
    stream = PyTorchStreamWrapper(pt_stream)

    # Initialize torch.distributed process group
    world_size = torch.cuda.device_count()
    dist.init_process_group(
        backend="cpu:gloo,cuda:nccl",
        rank=local_rank,
        world_size=world_size,
        device_id=device
    )

    # Extract rank, nranks from process group
    num_ranks = dist.get_world_size()
    rank_id = dist.get_rank()

    # Create an empty uniqueid for all ranks
    uniqueid = nvshmem.core.get_unique_id(empty=True)
    if rank_id == 0:
        # Rank 0 gets a real uniqueid
        uniqueid = nvshmem.core.get_unique_id()
        broadcast_objects = [uniqueid]
    else:
        broadcast_objects = [None]

    # We use torch.distributed.broadcast_object_list to send the UID to all ranks
    dist.broadcast_object_list(broadcast_objects, src=0)
    dist.barrier()

    nvshmem.core.init(device=dev, uid=broadcast_objects[0], rank=rank_id, nranks=num_ranks, initializer_method="uid")


if __name__ == '__main__':
    torchrun_uid_init()
    n_elements = 867530
    nvshmem.core.utils._configure_logging(level="DEBUG")

    tensor_out = nvshmem.core.tensor((n_elements,), dtype=torch.float32)
    ts = [nvshmem.core.get_peer_tensor(tensor_out, peer) for peer in range(nvshmem.core.n_pes())]
    # nvshmem.core.free_tensor(tensor_out)
    nvshmem.core.finalize()
    dist.destroy_process_group()

This is the log:
[W704 06:55:14.322782761 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Creating NvshmemResource for device 0
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Created Buffer on resource <NvshmemResource device=<Device 0 (NVIDIA H800)>> at address 1406077503488 with size 3470120 on stream None
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Found already tracked peer buffer with address 1406077503488. Returning it. Ref count 2
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Did not find peer buffer with address 1543516456960. Creating a new one.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Did not find peer buffer with address 1680955410432. Creating a new one.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Did not find peer buffer with address 1818394363904. Creating a new one.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Did not find peer buffer with address 1955833317376. Creating a new one.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Did not find peer buffer with address 2093272270848. Creating a new one.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Did not find peer buffer with address 2230711224320. Creating a new one.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Did not find peer buffer with address 2368150177792. Creating a new one.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : nvshmem_finalize() called
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM ERROR : Found un-freed memory object with address 1406077503488 at fini time
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM ERROR : Found 1 un-freed memory objects at fini time
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM INFO : Found object open at pointer 1406077503488 and ref count 2. Freeing it.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Free called on buffer with address 1406077503488
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : New ref count on buf 1406077503488 1
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM INFO : Found object open at pointer 1543516456960 and ref count 1. Freeing it.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Free called on buffer with address 1543516456960
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : New ref count on peer buf 1543516456960 0
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : free() requested on a peer buffer. Not calling free()
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM INFO : Found object open at pointer 1680955410432 and ref count 1. Freeing it.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Free called on buffer with address 1680955410432
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : New ref count on peer buf 1680955410432 0
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : free() requested on a peer buffer. Not calling free()
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM INFO : Found object open at pointer 1818394363904 and ref count 1. Freeing it.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Free called on buffer with address 1818394363904
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : New ref count on peer buf 1818394363904 0
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : free() requested on a peer buffer. Not calling free()
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM INFO : Found object open at pointer 1955833317376 and ref count 1. Freeing it.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Free called on buffer with address 1955833317376
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : New ref count on peer buf 1955833317376 0
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : free() requested on a peer buffer. Not calling free()
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM INFO : Found object open at pointer 2093272270848 and ref count 1. Freeing it.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Free called on buffer with address 2093272270848
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : New ref count on peer buf 2093272270848 0
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : free() requested on a peer buffer. Not calling free()
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM INFO : Found object open at pointer 2230711224320 and ref count 1. Freeing it.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Free called on buffer with address 2230711224320
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : New ref count on peer buf 2230711224320 0
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : free() requested on a peer buffer. Not calling free()
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM INFO : Found object open at pointer 2368150177792 and ref count 1. Freeing it.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Free called on buffer with address 2368150177792
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : New ref count on peer buf 2368150177792 0
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : free() requested on a peer buffer. Not calling free()
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Free called on buffer with address 2368150177792
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Ref count on 2368150177792 is already 0. Already freed.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Free called on buffer with address 2230711224320
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Ref count on 2230711224320 is already 0. Already freed.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Free called on buffer with address 2093272270848
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Ref count on 2093272270848 is already 0. Already freed.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Free called on buffer with address 1955833317376
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Ref count on 1955833317376 is already 0. Already freed.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Free called on buffer with address 1818394363904
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Ref count on 1818394363904 is already 0. Already freed.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Free called on buffer with address 1680955410432
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Ref count on 1680955410432 is already 0. Already freed.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Free called on buffer with address 1543516456960
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Ref count on 1543516456960 is already 0. Already freed.
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : Free called on buffer with address 1406077503488
H800-1-docker-n122-200-178:1585472:1585472 [0] NVSHMEM DEBUG : New ref count on buf 1406077503488 0
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/host/mem/mem_heap.cpp:nvshmem_free:1702: NVSHMEM API called before NVSHMEM initialization has completed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/host/util/cs.cpp:21: non-zero status: 16: Device or resource busy, exiting... mutex destroy failed
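If it helps with triage, a small script can pull the ref count updates out of a log like the one above and show which buffer needed more than one decrement. This is just a sketch: it assumes the "New ref count on buf ADDR N" / "New ref count on peer buf ADDR N" wording shown in this log stays the same, ref_count_history is a helper name I made up, and the sample lines are shortened excerpts of the log above.

```python
import re

# Matches "New ref count on buf <addr> <count>" and the "peer buf" variant.
REF_LINE = re.compile(r"New ref count on (?:peer )?buf (\d+) (\d+)")


def ref_count_history(log_text):
    """Return {address: [ref counts in the order they were logged]}."""
    history = {}
    for line in log_text.splitlines():
        match = REF_LINE.search(line)
        if match:
            addr, count = int(match.group(1)), int(match.group(2))
            history.setdefault(addr, []).append(count)
    return history


# Shortened excerpts of the log above (hostname and PID prefix dropped).
SAMPLE = """\
[0] NVSHMEM DEBUG : New ref count on buf 1406077503488 1
[0] NVSHMEM DEBUG : New ref count on peer buf 1543516456960 0
[0] NVSHMEM DEBUG : New ref count on buf 1406077503488 0
"""

history = ref_count_history(SAMPLE)
# Buffers whose count was decremented more than once.
multi_step = [addr for addr, counts in history.items() if len(counts) > 1]
print(history)
print(multi_step)
```

On the excerpt, only the buffer at 1406077503488 shows two updates (1, then 0), which is the one that is still referenced when finalize returns.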