Hello everyone. I’ll get right to it.
Problem
Tensor parallelism (TP=2) across two DGX Sparks connected via the NVIDIA-shipped Amphenol QSFP112 DAC cable consistently deadlocks during the first NCCL all-reduce operation. This affects both vLLM and TensorRT-LLM identically, suggesting a systemic NCCL issue rather than a runtime bug.
Both Sparks and the Amphenol cable were purchased together directly from NVIDIA in a single order.
Hardware
- 2x DGX Spark (GB10 Grace Blackwell), 128GB LPDDR5x each
- ConnectX-7 DAC cable: Amphenol (shipped with Sparks by NVIDIA)
- CX7 interface:
enp1s0f0np0, MTU 9000, IPs: 192.168.200.1/24 and 192.168.200.2/24 - Ping latency: < 1ms
- Driver: 580.126.09, CUDA: 13.0
What works
- CX7 link up, ping < 1ms, rsync at ~260 MB/s
- Passwordless SSH between nodes via CX7
- Docker Swarm forms correctly (both nodes visible)
- Ray cluster forms correctly (2 nodes, 2 GPUs, 217 GiB memory)
- NCCL detects RoCE devices:
NET/IB : Using [0]rocep1s0f0:1/RoCE [1]roceP2p1s0f0:1/RoCE - NCCL establishes channels: “Connected all rings, Connected all trees, Connected binomial trees”
- Model weights load successfully across both GPUs (tested with 32B, 120B, and 235B models)
Where it hangs
Immediately after weight loading completes. The last log output is:
[TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=1, draft_len=0
[TRT-LLM] [RANK 0] [I] Memory used after loading model weights (inside torch): 62.66 GiB
[TRT-LLM] [RANK 0] [I] Memory used after loading model weights (outside torch): 15.42 GiB
Then silence. GPU utilization jumps to 96% on the head node and stays there indefinitely. No errors, no timeouts — just a permanent hang. The health endpoint never becomes available.
With vLLM, the same hang occurs during KV cache profiling (also an all-reduce operation):
Loading safetensors checkpoint shards: 100% Completed | 17/17
[NCCL INFO] Channel 00/0 : 0[0] -> 1[0] [send] via NET/Socket/0
[NCCL INFO] Connected all rings, Connected all trees, Connected binomial trees
[NCCL INFO] Loading weights took 116.59 seconds
Then silence. Same 96% GPU hang.
Runtimes and NCCL versions tested
| Runtime | Container | NCCL | Result |
|---|---|---|---|
| vLLM 0.10.1 | nvcr.io/nvidia/vllm:25.09-py3 | 2.27.7 | Hang (TCP sockets) |
| vLLM 0.11.0 | nvcr.io/nvidia/vllm:25.11-py3 | 2.28.8 | Hang (RDMA/RoCE) |
| vLLM 0.17.1 | nvcr.io/nvidia/vllm:26.03-py3 | 2.29.7 | Hang (RDMA/RoCE) |
| TRT-LLM 1.1.0rc3 | spark-single-gpu-dev | 2.27.7 | Hang |
Models tested
- Qwen2.5-32B-Instruct (standard Transformer)
- nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 (hybrid Mamba/Transformer/MoE)
- nvidia/Qwen3-235B-A22B-FP4 (MoE — the documented dual-Spark TRT-LLM model)
All three models hang at the same point. Single-Spark inference works perfectly for all of them.
Container configurations tried
--privileged(IB devices visible: uverbs0-3, umad0-3, rdma_cm)--device=/dev/infiniband/NCCL_IB_DISABLE=1(forces TCP sockets — still hangs)NCCL_SOCKET_IFNAME=enp1s0f0np0--enforce-eager(disables CUDA graphs in vLLM — still hangs)--disable-custom-all-reduce(vLLM — still hangs)use_cuda_graph: falsein TRT-LLM config--max_batch_size 1(minimal CUDA graph capture — still hangs)- Playbook-exact config from dgx-spark-playbooks TRT-LLM “Run on two Sparks”
Key observation
The hang occurs with both RDMA/RoCE AND TCP socket transport. This rules out an IB/RDMA-specific issue. The problem appears to be in NCCL’s all-reduce collective operation itself when executing across two GB10 GPUs over any network transport.
DGX OS
Ubuntu 25.10, kernel 6.17.0-1014-nvidia (aarch64)
Direct ask
The DGX Spark TRT-LLM playbook (dgx-spark-playbooks/nvidia/trt-llm/README.md, “Run on two Sparks”) documents nvidia/Qwen3-235B-A22B-FP4 as a supported dual-Spark model with --tp_size 2. We followed that playbook exactly on NVIDIA-shipped hardware with the NVIDIA-shipped cable. It hangs.
-
Has NVIDIA validated TP=2 on dual DGX Spark with driver 580.126.09 and DGX OS kernel 6.17.0-1014-nvidia? If yes, what specific software versions (NCCL, container image, DGX OS) were used in that validation?
-
Is there a known NCCL issue with the GB10 unified memory architecture during cross-node all-reduce? The hang is transport-agnostic (occurs with both RDMA/RoCE and TCP sockets) and runtime-agnostic (occurs in both vLLM and TRT-LLM), which points to the NCCL collective layer, not device initialization.
-
What is the confirmed working software stack for dual-Spark TP? We will match it exactly if it differs from what we are running.
-
Bottom line: how do we make this work?
Prior support contact
Filed a ticket with NVIDIA Customer Care (Zainab). Was directed to review the clustering docs, the NCCL troubleshooting page, and to post here. We reviewed all referenced resources before posting — the linked forum thread (356280) describes a device detection issue, not our scenario (our devices are detected and channels are established). The build.nvidia.com troubleshooting pages are inaccessible (timeout). The cable is the NVIDIA-shipped Amphenol that came with the Sparks in the same order.