Jetson AGX Orin 64G: PyTorch RPC RuntimeError: CUDA IPC UUID Check Failure

Description

Running a minimal two-rank PyTorch RPC script on Jetson AGX Orin fails with a RuntimeError: the CUDA IPC UUID check fails.

Environment

GPU Type: Orin
Nvidia Driver Version: 540.4.0
CUDA Version: V12.6.68
CUDNN Version: 9.3.0
Operating System + Version: Ubuntu 22.04.5 LTS
Python Version (if applicable): Python 3.10.18
PyTorch Version (if applicable): 2.8.0
Baremetal or Container (if container which image + tag): Baremetal

Relevant Files

import os
import sys
import time

import torch
import torch.distributed.rpc as rpc

MASTER_ADDR = "192.168.1.XXX"  # Use your actual Master IP
MASTER_PORT = "29500"
WORLD_SIZE = 2

os.environ["MASTER_ADDR"] = MASTER_ADDR
os.environ["MASTER_PORT"] = MASTER_PORT
os.environ["WORLD_SIZE"] = str(WORLD_SIZE)
os.environ["GLOO_SOCKET_IFNAME"] = "eno1"
os.environ["GLOO_SOCKET_FAMILY"] = "AF_INET"


def remote_double(x):
    return x * 2


def main():
    # Expect exactly one argument: "a" (rank 0) or "b" (rank 1)
    if len(sys.argv) != 2 or sys.argv[1].lower() not in ("a", "b"):
        sys.exit(1)

    role = sys.argv[1].lower()
    rank = 0 if role == "a" else 1
    name = f"device{role.upper()}"

    # Initialize RPC with default (TensorPipe) backend
    rpc.init_rpc(
        name=name,
        rank=rank,
        world_size=WORLD_SIZE,
    )

    if role == "a":
        # Rank 0: Perform remote call
        result = rpc.rpc_sync("deviceB", remote_double, args=(7,))
    else:
        # Rank 1: Wait to serve requests
        time.sleep(5)

    rpc.shutdown()


if __name__ == "__main__":
    main()

The following is the error message from rank 0:

RuntimeError: In getGlobalUuidsAndP2pSupport at tensorpipe/channel/cuda_ipc/context_impl.cc:65 "uuidStr.substr(0, 4) != "GPU-"Couldn't obtain valid UUID for GPU #0 from CUDA driver. Got: f16b04b3-9cdd-57b1-81bd-7830cf48c42a"
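
Reading the message literally, the failing condition appears to be a prefix check: the first four characters of the UUID string are compared against "GPU-", and the string reported here has no such prefix. Below is a minimal Python sketch of that comparison, only to make the condition concrete. The function name has_gpu_prefix is hypothetical, and the snippet does not call TensorPipe or the CUDA driver; it assumes nothing beyond the condition and the UUID quoted in the error.

```python
# Illustrative only: mirrors the condition quoted in the rank 0 error,
#   uuidStr.substr(0, 4) != "GPU-"
# It does not call TensorPipe or CUDA. has_gpu_prefix is a hypothetical name.

def has_gpu_prefix(uuid_str: str) -> bool:
    """Return True if the first four characters of uuid_str are "GPU-"."""
    return uuid_str[:4] == "GPU-"


# UUID string copied from the rank 0 error message
reported = "f16b04b3-9cdd-57b1-81bd-7830cf48c42a"

print(has_gpu_prefix(reported))           # False: no "GPU-" prefix
print(has_gpu_prefix("GPU-" + reported))  # True: prefix present
```

So the string returned on this device fails the comparison simply because it starts with the raw UUID rather than "GPU-".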

The following is the error message from rank 1:

torch.distributed.DistNetworkError: Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?

Hi, I am running into a similar issue. Have you found a way to solve it? Thanks.

@Cragrape