Collective operations timeout on dual spark during distributed training

Setup

I have two DGX Sparks (Gigabyte model) connected with a QSFP56 cable. I am running distributed training (DDP) with the Sparks in a Ray cluster. Repo.

Problem

I am encountering an NCCL timeout error when running distributed training across the cluster:

    (RayTrainWorker pid=327294, ip=192.168.200.13) [rank0]:[E409 04:26:47.344717386 ProcessGroupNCCL.cpp:689] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1155874, OpType=ALLREDUCE, NumelIn=9445376, NumelOut=9445376, Timeout(ms)=1800000) ran for 1800002 milliseconds before timing out.
    (RayTrainWorker pid=327294, ip=192.168.200.13) ount 11a31e sendbuff 0xfd35fb41a000 recvbuff 0xfd35fb41a000 count 7351296 datatype 7 op 0 root 0 comm 0xfd3c95b04b00 [nranks=2] stream 0xfd3c96ba5e10
    (RayTrainWorker pid=327294, ip=192.168.200.13) aitopatom-0512:327294:329134 [0] NCCL INFO AllReduce: opCount 11a32f sendbuff 0xfd36285c6000 recvbuff 0xfd36285c6000 count 52516864 datatype 7 op 0 root 0 comm 0xfd3c95b04b00 [nranks=2] stream 0xfd3c96ba5e10 [repeated 36x across cluster]
    (RayTrainWorker pid=327294, ip=192.168.200.13) [rank0]:[E409 04:26:47.352963809 ProcessGroupNCCL.cpp:2495] Rank 0
    (RayTrainWorker pid=327294, ip=192.168.200.13) 	 - [0] Timeout at collective: ALLREDUCE, #1155888
    (RayTrainWorker pid=327294, ip=192.168.200.13) 	 - To our best knowledge, the lagging/dead/mismatched ranks that caused the desync are:
    (RayTrainWorker pid=327294, ip=192.168.200.13) 	     [1] finished collective #1155873, but didn't join collective #1155874 (count from 1)
    (RayTrainWorker pid=327294, ip=192.168.200.13) 	 - Snapshot of ranks' latest states:
    (RayTrainWorker pid=327294, ip=192.168.200.13) 	   #1155873 finished ranks:
    (RayTrainWorker pid=327294, ip=192.168.200.13) 	     [1] finished ALLREDUCE
    (RayTrainWorker pid=327294, ip=192.168.200.13) 	   #1155888 started ranks:
    (RayTrainWorker pid=327294, ip=192.168.200.13) 	     [0] started ALLREDUCE
    ...

This occurs mid-epoch, after training has been running successfully for 36–72 hours and has already completed a few epochs. It is not tied to a rank or a physical host; either rank can fail to join a collective operation, regardless of which host is assigned rank 0 or 1.

Logs:

NCCL startup/NET/IB logs - nccl_timeout_logs_head.txt (35.6 KB)
Full error logs - nccl_timeout_logs_trunk.txt (48.0 KB)

Notes:

The NCCL playbook says to ignore enP2p<...> interfaces. Does that mean I need to set NCCL_SOCKET_IFNAME directly to the enp1<...> interface, instead of allowing NCCL to discover and use both interfaces?
I did not follow the playbook to install NCCL; it was already installed. Do I need to reinstall it using the instructions provided?

Thanks in advance!

Edit: I have already kicked off another run with "NCCL_SOCKET_IFNAME": "enp1s0f1np1" and "NCCL_IB_HCA": "rocep1s0f1", but it will be at least a few days before I know whether it still crashes.

Edit 2: I am setting these env vars:

    "TORCH_FR_BUFFER_SIZE": "1048576",
    "TORCH_NCCL_TRACE_BUFFER_SIZE": "1048576",
    "TORCH_NCCL_DUMP_ON_TIMEOUT": "1",
    "TORCH_NCCL_DESYNC_DEBUG": "1",
    "NCCL_DEBUG": "INFO",
    "NCCL_DEBUG_SUBSYS": "INIT,NET,COLL",

but still seeing

    Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.

I am not sure why the buffer-size env vars are not enabling FlightRecorder, or whether the stack trace was not found for another reason.
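One thing worth ruling out: if those variables are exported in the driver shell, they do not automatically reach the Ray worker processes on the remote node; Ray generally propagates env vars only through runtime_env. A quick in-worker sanity check (helper name and structure are mine, values are the ones from the edit above):

```python
import os

# Flight Recorder-related variables from the edit above (values as strings).
REQUIRED = {
    "TORCH_NCCL_TRACE_BUFFER_SIZE": "1048576",
    "TORCH_NCCL_DUMP_ON_TIMEOUT": "1",
}

def missing_env(required, env=None):
    """Return the subset of `required` that is absent or mismatched in `env`.

    Call this from inside the training loop so it runs in the Ray worker
    process, confirming the variables actually reached that process.
    """
    env = os.environ if env is None else env
    return {k: v for k, v in required.items() if env.get(k) != v}

# Simulate a worker environment that never received the variables:
print(missing_env(REQUIRED, env={}))  # both variables reported missing
```

If this reports missing variables inside a worker, the FlightRecorder settings were never active there, regardless of what the driver shell shows.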

The root cause is visible in both log files.

GPU Direct RDMA disabled (init log):

    NET/IB : GPU Direct RDMA Disabled for HCA 0 'rocep1s0f1'
    NET/IB : GPU Direct RDMA Disabled for HCA 1 'roceP2p1s0f1'

Two missing MLX5 symbols on both nodes:

    dlvsym failed on mlx5dv_reg_dmabuf_mr (MLX5_1.25)
    dlvsym failed on mlx5dv_get_data_direct_sysfs_path

mlx5dv_reg_dmabuf_mr is the DMA-BUF GPU-memory registration call that NCCL requires for GPU Direct RDMA. Without it, NCCL falls back to the CPU path for all collectives. Confirmed:

    Connected all rings, use ring PXN 0 GDR 0

GDR=0. All 1,155,873 collectives ran over the CPU path.
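To check this programmatically on future runs, you can scan the NCCL INFO log for the ring-connect line quoted above (the exact wording is taken from this log excerpt; other NCCL versions may format it differently):

```python
import re

def gdr_enabled(nccl_log_text):
    """Scan an NCCL INFO log for 'Connected all rings, use ring PXN <p> GDR <g>'
    lines and report whether GPU Direct RDMA was on (GDR 1) for all of them.
    Returns None if no such line is present."""
    hits = re.findall(r"Connected all rings.*?GDR (\d)", nccl_log_text)
    if not hits:
        return None
    return all(h == "1" for h in hits)

log_line = "NCCL INFO Connected all rings, use ring PXN 0 GDR 0"
print(gdr_enabled(log_line))  # False
```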

The crash is not a GIL deadlock (trunk log):

Rank 1 completed collective #1155873 and never joined #1155874. Rank 1 was still alive: it received the dump signal from rank 0. The failure is a SIGSEGV inside NCCL's CPU-path proxy:

    ncclLocalOpAppend()
      → SaveProxy()
      → ncclProxySaveOp()
      → uploadProxyOps()
      → hostStreamPlanTask()
      → ncclLaunchKernelAfter_NoCuda()
      → groupLaunch()
      → ncclGroupEndInternal()
      → ncclEnqueueCheck()
      → pncclAllReduce
      → c10d::ProcessGroupNCCL::allreduce()
      → c10d::Reducer::autograd_hook()
      → torch::autograd::Engine::thread_main()
PyTorch’s DDP autograd hook triggers an AllReduce; NCCL enqueues it through the CPU-path proxy; ncclLocalOpAppend writes into a proxy-op buffer that has become corrupted after ~1.1M collectives. Memory corruption. Rank 1 crashes. Rank 0’s watchdog fires 61 seconds later.
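The chain above can be pictured with a toy model of DDP’s bucketed reducer (pure Python, no torch; all names are illustrative, not PyTorch internals). Each parameter has a post-gradient hook; when every gradient in a bucket is ready, one AllReduce is enqueued and the collective sequence number (NCCL’s SeqNum) advances. The real crash happened inside that enqueue step, after roughly 1.1M such sequence numbers:

```python
class ToyReducer:
    """Simplified stand-in for c10d::Reducer's bucketed allreduce logic."""

    def __init__(self, buckets):
        self.buckets = [set(b) for b in buckets]   # parameters per bucket
        self.ready = [set() for _ in buckets]      # grads seen so far
        self.seq_num = 0                           # like WorkNCCL SeqNum

    def autograd_hook(self, param):
        """Fires once per parameter when its gradient is produced."""
        for i, bucket in enumerate(self.buckets):
            if param in bucket:
                self.ready[i].add(param)
                if self.ready[i] == bucket:  # bucket full:
                    self.seq_num += 1        # enqueue one ALLREDUCE
                    self.ready[i] = set()
        return self.seq_num

reducer = ToyReducer([{"w1", "w2"}, {"w3"}])
for p in ["w3", "w2", "w1"]:   # gradients arrive in reverse layer order
    reducer.autograd_hook(p)
print(reducer.seq_num)  # 2: one allreduce per bucket per backward pass
```

The point of the model: the sequence number grows by a few collectives per step, every step, so a buffer that degrades with op count will fail only after days of training, which matches the 36–72 hour failure window.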

FlightRecorder not capturing:

FlightRecorder cannot capture a SIGSEGV. The crash at ncclLocalOpAppend happens at the C level; the process dies before PyTorch’s ring buffer can write anything. The stack trace in the trunk log is the capture: it came from the aarch64 signal handler (absl::AbslFailureSignalHandler). No FlightRecorder configuration will change this.
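If you also want the Python-side stacks alongside that native dump on future crashes, the stdlib faulthandler module can be enabled in each worker; it registers a SIGSEGV handler that writes the Python traceback of every thread to stderr. This is purely additional diagnostics, not a fix:

```python
import faulthandler
import sys

# Enable early in each worker process. On SIGSEGV (and SIGFPE, SIGABRT, etc.)
# it dumps the Python-level traceback of all threads to stderr, complementing
# the native C stack from the absl signal handler. It cannot prevent the crash.
faulthandler.enable(file=sys.stderr, all_threads=True)
print(faulthandler.is_enabled())  # True
```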

Platform limitation (missing MLX5 symbols):

The missing symbols are not a configuration error. MLNX_OFED installation is not supported on DGX Spark; an NVIDIA moderator confirmed this on March 9 in: DGX Spark Mini – Missing Mellanox OFED drivers and DGX Spark repo for Ubuntu 24.04 (ARM64)

“installing the drivers requires Secure Boot to be disabled so the drivers are not officially supported on DGX Spark platforms”

The mlx5 userspace library shipped with 6.17.0-1008-nvidia does not export mlx5dv_reg_dmabuf_mr. The CPU path is the currently supported NCCL transport on dual DGX Spark.
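You can verify the two missing symbols directly with a small dlopen/dlsym check from Python (the library name `libmlx5.so.1` is an assumption; adjust it for your system):

```python
import ctypes

def has_symbol(libname, symbol):
    """True iff `libname` can be loaded and exports `symbol`.

    ctypes attribute lookup performs a dlsym under the hood, so a missing
    symbol raises AttributeError and hasattr() returns False.
    """
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return False  # library not present / not loadable
    return hasattr(lib, symbol)

# On a DGX Spark node (library name is an assumption; adjust if needed):
for sym in ("mlx5dv_reg_dmabuf_mr", "mlx5dv_get_data_direct_sysfs_path"):
    print(sym, has_symbol("libmlx5.so.1", sym))
```

If both print False on your nodes, it confirms the dlvsym failures in the init log independently of NCCL.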

NCCL_SOCKET_IFNAME:

Setting NCCL_SOCKET_IFNAME=enp1s0f1np1 and excluding roceP2p1s0f1 aligns with the NVIDIA NCCL playbook for dual DGX Spark: NCCL for Two Sparks | DGX Spark

“You can disregard interfaces starting with the prefix enP2p<…> and only consider interfaces starting with enp1<…> instead.”
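One way to make sure every Ray worker (not just the driver) sees the interface pinning is to pass the variables through Ray’s runtime_env env_vars mechanism; the values below are the ones from the edit above, and the ray.init call is a sketch:

```python
# Pin NCCL to the enp1 interface/HCA for every Ray worker via runtime_env.
nccl_env = {
    "NCCL_SOCKET_IFNAME": "enp1s0f1np1",
    "NCCL_IB_HCA": "rocep1s0f1",
}

# Sketch of how this would be passed when starting the job:
#   import ray
#   ray.init(runtime_env={"env_vars": nccl_env})

print(sorted(nccl_env))
```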

Bug report:

The SIGSEGV is a joint PyTorch/NCCL bug: memory corruption in the CPU-path proxy-op buffer under extended operation. Worth filing against both NVIDIA/nccl and pytorch/pytorch on GitHub.

Include the full stack trace and the ~1.1M collective count as the reproduction condition.