Collective operations timeout on dual spark during distributed training

Setup

I have two DGX Sparks (Gigabyte model) connected with a QSFP56 cable. I am running distributed training (DDP) with the Sparks in a Ray cluster. Repo.

Problem

I am encountering an NCCL timeout error when running distributed training across the cluster:

    (RayTrainWorker pid=327294, ip=192.168.200.13) [rank0]:[E409 04:26:47.344717386 ProcessGroupNCCL.cpp:689] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1155874, OpType=ALLREDUCE, NumelIn=9445376, NumelOut=9445376, Timeout(ms)=1800000) ran for 1800002 milliseconds before timing out.
    (RayTrainWorker pid=327294, ip=192.168.200.13) ount 11a31e sendbuff 0xfd35fb41a000 recvbuff 0xfd35fb41a000 count 7351296 datatype 7 op 0 root 0 comm 0xfd3c95b04b00 [nranks=2] stream 0xfd3c96ba5e10
    (RayTrainWorker pid=327294, ip=192.168.200.13) aitopatom-0512:327294:329134 [0] NCCL INFO AllReduce: opCount 11a32f sendbuff 0xfd36285c6000 recvbuff 0xfd36285c6000 count 52516864 datatype 7 op 0 root 0 comm 0xfd3c95b04b00 [nranks=2] stream 0xfd3c96ba5e10 [repeated 36x across cluster]
    (RayTrainWorker pid=327294, ip=192.168.200.13) [rank0]:[E409 04:26:47.352963809 ProcessGroupNCCL.cpp:2495] Rank 0
    (RayTrainWorker pid=327294, ip=192.168.200.13) 	 - [0] Timeout at collective: ALLREDUCE, #1155888
    (RayTrainWorker pid=327294, ip=192.168.200.13) 	 - To our best knowledge, the lagging/dead/mismatched ranks that caused the desync are:
    (RayTrainWorker pid=327294, ip=192.168.200.13) 	     [1] finished collective #1155873, but didn't join collective #1155874 (count from 1)
    (RayTrainWorker pid=327294, ip=192.168.200.13) 	 - Snapshot of ranks' latest states:
    (RayTrainWorker pid=327294, ip=192.168.200.13) 	   #1155873 finished ranks:
    (RayTrainWorker pid=327294, ip=192.168.200.13) 	     [1] finished ALLREDUCE
    (RayTrainWorker pid=327294, ip=192.168.200.13) 	   #1155888 started ranks:
    (RayTrainWorker pid=327294, ip=192.168.200.13) 	     [0] started ALLREDUCE
    ...

This occurs mid-epoch, after training has been running successfully for 36–72 hours and has already completed a few epochs. It is not tied to a rank or a physical host; either rank can fail to join a collective operation, regardless of which host is assigned rank 0 or 1.

Logs:

NCCL startup/NET/IB logs - nccl_timeout_logs_head.txt (35.6 KB)
Full error logs - nccl_timeout_logs_trunk.txt (48.0 KB)

Notes:

The NCCL playbook says to ignore enP2p<...> interfaces. Does that mean I need to set NCCL_SOCKET_IFNAME directly to the enp1<...> interface, instead of allowing NCCL to discover and use both interfaces?
I did not follow the playbook to install NCCL; it was already installed. Do I need to reinstall it using the instructions provided?

Thanks in advance!

Edit: I have already kicked off another run with "NCCL_SOCKET_IFNAME": "enp1s0f1np1" and "NCCL_IB_HCA": "rocep1s0f1", but it will be at least a few days before I know whether it still crashes.

Edit 2: I am setting these env vars:

    "TORCH_FR_BUFFER_SIZE": "1048576",
    "TORCH_NCCL_TRACE_BUFFER_SIZE": "1048576",
    "TORCH_NCCL_DUMP_ON_TIMEOUT": "1",
    "TORCH_NCCL_DESYNC_DEBUG": "1",
    "NCCL_DEBUG": "INFO",
    "NCCL_DEBUG_SUBSYS": "INIT,NET,COLL",

but still seeing

    Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.

I am not sure why the buffer-size env vars are not enabling FlightRecorder, or whether the stack trace was not found for another reason.
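One thing worth ruling out: if those variables are exported in the driver shell, they do not automatically reach the Ray worker processes on the remote node; Ray generally propagates env vars only through runtime_env. A quick in-worker sanity check (helper name and structure are mine, values are the ones from the edit above):

```python
import os

# Flight Recorder-related variables from the edit above (values as strings).
REQUIRED = {
    "TORCH_NCCL_TRACE_BUFFER_SIZE": "1048576",
    "TORCH_NCCL_DUMP_ON_TIMEOUT": "1",
}

def missing_env(required, env=None):
    """Return the subset of `required` that is absent or mismatched in `env`.

    Call this from inside the training loop so it runs in the Ray worker
    process, confirming the variables actually reached that process.
    """
    env = os.environ if env is None else env
    return {k: v for k, v in required.items() if env.get(k) != v}

# Simulate a worker environment that never received the variables:
print(missing_env(REQUIRED, env={}))  # both variables reported missing
```

If this reports missing variables inside a worker, the FlightRecorder settings were never active there, regardless of what the driver shell shows.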

The root cause is visible in both log files.

GPU Direct RDMA disabled (init log):

    NET/IB : GPU Direct RDMA Disabled for HCA 0 'rocep1s0f1'
    NET/IB : GPU Direct RDMA Disabled for HCA 1 'roceP2p1s0f1'

Two missing MLX5 symbols on both nodes:

    dlvsym failed on mlx5dv_reg_dmabuf_mr (MLX5_1.25)
    dlvsym failed on mlx5dv_get_data_direct_sysfs_path

mlx5dv_reg_dmabuf_mr is the DMA-BUF GPU-memory registration call that NCCL requires for GPU Direct RDMA. Without it, NCCL falls back to the CPU path for all collectives. Confirmed:

    Connected all rings, use ring PXN 0 GDR 0

GDR=0. All 1,155,873 collectives ran over the CPU path.
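To check this programmatically on future runs, you can scan the NCCL INFO log for the ring-connect line quoted above (the exact wording is taken from this log excerpt; other NCCL versions may format it differently):

```python
import re

def gdr_enabled(nccl_log_text):
    """Scan an NCCL INFO log for 'Connected all rings, use ring PXN <p> GDR <g>'
    lines and report whether GPU Direct RDMA was on (GDR 1) for all of them.
    Returns None if no such line is present."""
    hits = re.findall(r"Connected all rings.*?GDR (\d)", nccl_log_text)
    if not hits:
        return None
    return all(h == "1" for h in hits)

log_line = "NCCL INFO Connected all rings, use ring PXN 0 GDR 0"
print(gdr_enabled(log_line))  # False
```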

The crash is not a GIL deadlock (trunk log):

Rank 1 completed collective #1155873 and never joined #1155874. Rank 1 was still alive: it received the dump signal from rank 0. The failure is a SIGSEGV inside NCCL's CPU-path proxy:

    ncclLocalOpAppend()
      → SaveProxy()
      → ncclProxySaveOp()
      → uploadProxyOps()
      → hostStreamPlanTask()
      → ncclLaunchKernelAfter_NoCuda()
      → groupLaunch()
      → ncclGroupEndInternal()
      → ncclEnqueueCheck()
      → pncclAllReduce
      → c10d::ProcessGroupNCCL::allreduce()
      → c10d::Reducer::autograd_hook()
      → torch::autograd::Engine::thread_main()
PyTorch’s DDP autograd hook triggers an AllReduce; NCCL enqueues it through the CPU-path proxy; ncclLocalOpAppend writes into a proxy-op buffer that has become corrupted after ~1.1M collectives. Memory corruption. Rank 1 crashes. Rank 0’s watchdog fires 61 seconds later.
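The chain above can be pictured with a toy model of DDP’s bucketed reducer (pure Python, no torch; all names are illustrative, not PyTorch internals). Each parameter has a post-gradient hook; when every gradient in a bucket is ready, one AllReduce is enqueued and the collective sequence number (NCCL’s SeqNum) advances. The real crash happened inside that enqueue step, after roughly 1.1M such sequence numbers:

```python
class ToyReducer:
    """Simplified stand-in for c10d::Reducer's bucketed allreduce logic."""

    def __init__(self, buckets):
        self.buckets = [set(b) for b in buckets]   # parameters per bucket
        self.ready = [set() for _ in buckets]      # grads seen so far
        self.seq_num = 0                           # like WorkNCCL SeqNum

    def autograd_hook(self, param):
        """Fires once per parameter when its gradient is produced."""
        for i, bucket in enumerate(self.buckets):
            if param in bucket:
                self.ready[i].add(param)
                if self.ready[i] == bucket:  # bucket full:
                    self.seq_num += 1        # enqueue one ALLREDUCE
                    self.ready[i] = set()
        return self.seq_num

reducer = ToyReducer([{"w1", "w2"}, {"w3"}])
for p in ["w3", "w2", "w1"]:   # gradients arrive in reverse layer order
    reducer.autograd_hook(p)
print(reducer.seq_num)  # 2: one allreduce per bucket per backward pass
```

The point of the model: the sequence number grows by a few collectives per step, every step, so a buffer that degrades with op count will fail only after days of training, which matches the 36–72 hour failure window.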

FlightRecorder not capturing:

FlightRecorder cannot capture a SIGSEGV. The crash at ncclLocalOpAppend happens at the C level; the process dies before PyTorch’s ring buffer can write anything. The stack trace in the trunk log is the capture: it came from the aarch64 signal handler (absl::AbslFailureSignalHandler). No FlightRecorder configuration will change this.
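If you also want the Python-side stacks alongside that native dump on future crashes, the stdlib faulthandler module can be enabled in each worker; it registers a SIGSEGV handler that writes the Python traceback of every thread to stderr. This is purely additional diagnostics, not a fix:

```python
import faulthandler
import sys

# Enable early in each worker process. On SIGSEGV (and SIGFPE, SIGABRT, etc.)
# it dumps the Python-level traceback of all threads to stderr, complementing
# the native C stack from the absl signal handler. It cannot prevent the crash.
faulthandler.enable(file=sys.stderr, all_threads=True)
print(faulthandler.is_enabled())  # True
```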

Platform limitation (missing MLX5 symbols):

The missing symbols are not a configuration error. MLNX_OFED installation is not supported on DGX Spark; an NVIDIA moderator confirmed this on March 9 in: DGX Spark Mini – Missing Mellanox OFED drivers and DGX Spark repo for Ubuntu 24.04 (ARM64)

“installing the drivers requires Secure Boot to be disabled so the drivers are not officially supported on DGX Spark platforms”

The mlx5 userspace library shipped with 6.17.0-1008-nvidia does not export mlx5dv_reg_dmabuf_mr. The CPU path is the currently supported NCCL transport on dual DGX Spark.
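You can verify the two missing symbols directly with a small dlopen/dlsym check from Python (the library name `libmlx5.so.1` is an assumption; adjust it for your system):

```python
import ctypes

def has_symbol(libname, symbol):
    """True iff `libname` can be loaded and exports `symbol`.

    ctypes attribute lookup performs a dlsym under the hood, so a missing
    symbol raises AttributeError and hasattr() returns False.
    """
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return False  # library not present / not loadable
    return hasattr(lib, symbol)

# On a DGX Spark node (library name is an assumption; adjust if needed):
for sym in ("mlx5dv_reg_dmabuf_mr", "mlx5dv_get_data_direct_sysfs_path"):
    print(sym, has_symbol("libmlx5.so.1", sym))
```

If both print False on your nodes, it confirms the dlvsym failures in the init log independently of NCCL.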

NCCL_SOCKET_IFNAME:

Setting NCCL_SOCKET_IFNAME=enp1s0f1np1 and excluding roceP2p1s0f1 aligns with the NVIDIA NCCL playbook for dual DGX Spark: NCCL for Two Sparks | DGX Spark

“You can disregard interfaces starting with the prefix enP2p<…> and only consider interfaces starting with enp1<…> instead.”
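One way to make sure every Ray worker (not just the driver) sees the interface pinning is to pass the variables through Ray’s runtime_env env_vars mechanism; the values below are the ones from the edit above, and the ray.init call is a sketch:

```python
# Pin NCCL to the enp1 interface/HCA for every Ray worker via runtime_env.
nccl_env = {
    "NCCL_SOCKET_IFNAME": "enp1s0f1np1",
    "NCCL_IB_HCA": "rocep1s0f1",
}

# Sketch of how this would be passed when starting the job:
#   import ray
#   ray.init(runtime_env={"env_vars": nccl_env})

print(sorted(nccl_env))
```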

Bug report:

The SIGSEGV is a joint PyTorch/NCCL bug: memory corruption in the CPU-path proxy-op buffer under extended operation. Worth filing against both NVIDIA/nccl and pytorch/pytorch on GitHub.

Include the full stack trace and the ~1.1M collective count as the reproduction condition.