Description
PyTorch RPC RuntimeError: CUDA IPC UUID Check Failure on Jetson AGX Orin
Environment
GPU Type: Orin
Nvidia Driver Version: 540.4.0
CUDA Version: V12.6.68
CUDNN Version: 9.3.0
Operating System + Version: Ubuntu 22.04.5 LTS
Python Version (if applicable): Python 3.10.18
PyTorch Version (if applicable): 2.8.0
Baremetal or Container (if container which image + tag): Baremetal
Relevant Files
import os
import sys
import time
import torch
import torch.distributed.rpc as rpc
MASTER_ADDR = "192.168.1.XXX"  # Use your actual master IP
MASTER_PORT = "29500"
WORLD_SIZE = 2

os.environ["MASTER_ADDR"] = MASTER_ADDR
os.environ["MASTER_PORT"] = MASTER_PORT
os.environ["WORLD_SIZE"] = str(WORLD_SIZE)
os.environ["GLOO_SOCKET_IFNAME"] = "eno1"
os.environ["GLOO_SOCKET_FAMILY"] = "AF_INET"

def remote_double(x):
    return x * 2

def main():
    if len(sys.argv) != 2 or sys.argv[1].lower() not in ("a", "b"):
        sys.exit(1)
    role = sys.argv[1].lower()
    rank = 0 if role == "a" else 1
    name = f"device{role.upper()}"

    # Initialize RPC with the default (TensorPipe) backend
    rpc.init_rpc(
        name=name,
        rank=rank,
        world_size=WORLD_SIZE,
    )

    if role == "a":
        # Rank 0: perform a remote call on rank 1
        result = rpc.rpc_sync("deviceB", remote_double, args=(7,))
        print(f"remote_double(7) = {result}")
    else:
        # Rank 1: wait so rank 0 can reach us before shutdown
        time.sleep(5)

    rpc.shutdown()

if __name__ == "__main__":
    main()
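Both machines run the same script, with only the role argument differing, e.g. (assuming the file is saved as rpc_test.py, a name used here just for illustration):

python3 rpc_test.py a   # on the master (rank 0 / deviceA, the host at MASTER_ADDR)
python3 rpc_test.py b   # on the second device (rank 1 / deviceB)

Rank 0 fails immediately, and rank 1 then loses the connection.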
The following is the error message from rank 0:
RuntimeError: In getGlobalUuidsAndP2pSupport at tensorpipe/channel/cuda_ipc/context_impl.cc:65 "uuidStr.substr(0, 4) != "GPU-"" Couldn't obtain valid UUID for GPU #0 from CUDA driver. Got: f16b04b3-9cdd-57b1-81bd-7830cf48c42a
The following is the error message from rank 1:
torch.distributed.DistNetworkError: Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
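The failing assertion is in TensorPipe's cuda_ipc channel: per the rank 0 message, it expects the UUID string obtained from the CUDA driver to start with "GPU-", but on the Orin it gets the bare UUID f16b04b3-9cdd-57b1-81bd-7830cf48c42a. As a quick check of what the local device reports, the following diagnostic sketch can be run on the Orin (it assumes this PyTorch build exposes a uuid attribute on the device-properties object, which recent 2.x releases do):

import torch

# Diagnostic sketch: print the UUID PyTorch reports for GPU 0 on this machine.
# Assumption: the `uuid` attribute on the device-properties object exists in
# this build; compare the printed value with the one in the rank 0 error.
props = torch.cuda.get_device_properties(0)
print("name:", props.name)
print("uuid:", props.uuid)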