NCCL all-reduce deadlock on dual DGX Spark after successful channel establishment — affects both vLLM and TRT-LLM

Hello everyone. I’ll get right to it.

Problem

Tensor parallelism (TP=2) across two DGX Sparks connected via the NVIDIA-shipped Amphenol QSFP112 DAC cable consistently deadlocks during the first NCCL all-reduce operation. This affects both vLLM and TensorRT-LLM identically, suggesting a systemic NCCL issue rather than a runtime bug.

Both Sparks and the Amphenol cable were purchased together directly from NVIDIA in a single order.

Hardware

  • 2x DGX Spark (GB10 Grace Blackwell), 128GB LPDDR5x each
  • ConnectX-7 DAC cable: Amphenol (shipped with Sparks by NVIDIA)
  • CX7 interface: enp1s0f0np0, MTU 9000, IPs: 192.168.200.1/24 and 192.168.200.2/24
  • Ping latency: < 1ms
  • Driver: 580.126.09, CUDA: 13.0

What works

  • CX7 link up, ping < 1ms, rsync at ~260 MB/s
  • Passwordless SSH between nodes via CX7
  • Docker Swarm forms correctly (both nodes visible)
  • Ray cluster forms correctly (2 nodes, 2 GPUs, 217 GiB memory)
  • NCCL detects RoCE devices: NET/IB : Using [0]rocep1s0f0:1/RoCE [1]roceP2p1s0f0:1/RoCE
  • NCCL establishes channels: “Connected all rings, Connected all trees, Connected binomial trees”
  • Model weights load successfully across both GPUs (tested with 32B, 120B, and 235B models)

Where it hangs

Immediately after weight loading completes. The last log output is:

[TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=1, draft_len=0
[TRT-LLM] [RANK 0] [I] Memory used after loading model weights (inside torch): 62.66 GiB
[TRT-LLM] [RANK 0] [I] Memory used after loading model weights (outside torch): 15.42 GiB

Then silence. GPU utilization jumps to 96% on the head node and stays there indefinitely. No errors, no timeouts — just a permanent hang. The health endpoint never becomes available.
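
Because the hang produces neither an error nor a timeout, it can help to wrap the launch in an explicit deadline so that a stalled warmup fails loudly. A minimal Python sketch, assuming the server exposes an HTTP health route: port 8355 matches the launch commands used in this thread, while the /health path and the helper names http_probe and wait_until_healthy are illustrative assumptions, not part of TRT-LLM or vLLM.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout_s=5.0):
    """Return True if the URL answers with HTTP 200 within timeout_s."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not ready".
        return False


def wait_until_healthy(probe, deadline_s, interval_s=10.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll probe() until it returns True or deadline_s elapses.

    Returns the elapsed seconds on success and raises TimeoutError otherwise,
    so a silent hang becomes an explicit failure.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= deadline_s:
            raise TimeoutError(f"server not healthy after {deadline_s:.0f}s")
        sleep(interval_s)


# Example call (assumed URL; adjust the path to whatever your runtime serves):
#   wait_until_healthy(lambda: http_probe("http://localhost:8355/health"),
#                      deadline_s=1800)
```

The probe is passed in as a callable so the same loop can watch a log file or a process state instead of an HTTP route.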

With vLLM, the same hang occurs during KV cache profiling (also an all-reduce operation):

Loading safetensors checkpoint shards: 100% Completed | 17/17
[NCCL INFO] Channel 00/0 : 0[0] -> 1[0] [send] via NET/Socket/0
[NCCL INFO] Connected all rings, Connected all trees, Connected binomial trees
[NCCL INFO] Loading weights took 116.59 seconds

Then silence. Same 96% GPU hang.

Runtimes and NCCL versions tested

Runtime            Container                        NCCL     Result
vLLM 0.10.1        nvcr.io/nvidia/vllm:25.09-py3    2.27.7   Hang (TCP sockets)
vLLM 0.11.0        nvcr.io/nvidia/vllm:25.11-py3    2.28.8   Hang (RDMA/RoCE)
vLLM 0.17.1        nvcr.io/nvidia/vllm:26.03-py3    2.29.7   Hang (RDMA/RoCE)
TRT-LLM 1.1.0rc3   spark-single-gpu-dev             2.27.7   Hang

Models tested

  • Qwen2.5-32B-Instruct (standard Transformer)
  • nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 (hybrid Mamba/Transformer/MoE)
  • nvidia/Qwen3-235B-A22B-FP4 (MoE — the documented dual-Spark TRT-LLM model)

All three models hang at the same point. Single-Spark inference works perfectly for all of them.

Container configurations tried

  • --privileged (IB devices visible: uverbs0-3, umad0-3, rdma_cm)
  • --device=/dev/infiniband/
  • NCCL_IB_DISABLE=1 (forces TCP sockets — still hangs)
  • NCCL_SOCKET_IFNAME=enp1s0f0np0
  • --enforce-eager (disables CUDA graphs in vLLM — still hangs)
  • --disable-custom-all-reduce (vLLM — still hangs)
  • use_cuda_graph: false in TRT-LLM config
  • --max_batch_size 1 (minimal CUDA graph capture — still hangs)
  • Playbook-exact config from dgx-spark-playbooks TRT-LLM “Run on two Sparks”

Key observation

The hang occurs with both RDMA/RoCE AND TCP socket transport. This rules out an IB/RDMA-specific issue. The problem appears to be in NCCL’s all-reduce collective operation itself when executing across two GB10 GPUs over any network transport.

DGX OS

Ubuntu 25.10, kernel 6.17.0-1014-nvidia (aarch64)

Direct ask

The DGX Spark TRT-LLM playbook (dgx-spark-playbooks/nvidia/trt-llm/README.md, “Run on two Sparks”) documents nvidia/Qwen3-235B-A22B-FP4 as a supported dual-Spark model with --tp_size 2. We followed that playbook exactly on NVIDIA-shipped hardware with the NVIDIA-shipped cable. It hangs.

  1. Has NVIDIA validated TP=2 on dual DGX Spark with driver 580.126.09 and DGX OS kernel 6.17.0-1014-nvidia? If yes, what specific software versions (NCCL, container image, DGX OS) were used in that validation?

  2. Is there a known NCCL issue with the GB10 unified memory architecture during cross-node all-reduce? The hang is transport-agnostic (occurs with both RDMA/RoCE and TCP sockets) and runtime-agnostic (occurs in both vLLM and TRT-LLM), which points to the NCCL collective layer, not device initialization.

  3. What is the confirmed working software stack for dual-Spark TP? We will match it exactly if it differs from what we are running.

  4. Bottom line: how do we make this work?

Prior support contact

Filed a ticket with NVIDIA Customer Care (Zainab). Was directed to review the clustering docs, the NCCL troubleshooting page, and to post here. We reviewed all referenced resources before posting — the linked forum thread (356280) describes a device detection issue, not our scenario (our devices are detected and channels are established). The build.nvidia.com troubleshooting pages are inaccessible (timeout). The cable is the NVIDIA-shipped Amphenol that came with the Sparks in the same order.

This is irregular behavior. The software you need should come with the system and can be updated through DGX Dashboard.
Can you try following the NCCL playbook and testing the NCCL communication without trt-llm or vllm?

Hello aniculescu, and thank you for your response.

We followed the NCCL stacked-sparks playbook exactly as directed. N1 and N2 below refer to Spark 1 and Spark 2. Here’s what happened:

What we did

  1. Built NCCL v2.28.9-1 from source on both nodes per the playbook (make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121"). Note: This required patching /usr/local/cuda/include/crt/math_functions.h on both nodes — the shipping DGX OS (Ubuntu 25.10, kernel 6.17.0-1014-nvidia) has a glibc rsqrt exception specification that is incompatible with CUDA 13.0’s math_functions.h. Without patching lines 629 and 653 to add noexcept, the NCCL build fails with:
/usr/include/aarch64-linux-gnu/bits/mathcalls.h(206): error: exception specification is incompatible with that of previous function "rsqrt"

  2. Built nccl-tests with make MPI=1 on both nodes; the build succeeded.

  3. Verified CX7 connectivity: enp1s0f0np0 UP on both nodes, IPs 192.168.200.1/24 and 192.168.200.2/24, ping < 1ms, passwordless SSH working (ssh 192.168.200.2 hostname returns “N2” correctly).

  4. Ran the playbook’s cross-node test exactly as specified:

export UCX_NET_DEVICES=enp1s0f0np0
export NCCL_SOCKET_IFNAME=enp1s0f0np0
export OMPI_MCA_btl_tcp_if_include=enp1s0f0np0

mpirun -np 2 -H 192.168.200.1:1,192.168.200.2:1 \
  --mca plm_rsh_agent "ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no" \
  -x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
  $HOME/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

Result: MPI does not distribute processes across nodes

The nccl-tests binary fails with:

Invalid number of GPUs: 2 requested but only 1 were found.

We isolated the issue to OpenMPI itself. Running mpirun -np 2 -H 192.168.200.1:1,192.168.200.2:1 hostname returns:

N1
N1

Both MPI ranks execute on N1 despite SSH to 192.168.200.2 completing successfully (ssh 192.168.200.2 hostname returns “N2” correctly from the same shell). We tested with:

  • -H flag and --host flag — same result
  • --hostfile with explicit slots=1 — same result
  • --map-by :OVERSUBSCRIBE — same result
  • With and without --mca plm_rsh_agent — same result
  • Both GPUs free (Ollama stopped on both nodes) — same result

OpenMPI version: 5.0.8 (shipping with DGX OS).
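
The placement check described above can be automated before running any NCCL test. A minimal Python sketch: it builds the same mpirun ... hostname command used in this post and verifies that every rank reports a different host. The function names are hypothetical, and the check assumes one slot per host.

```python
import subprocess
from collections import Counter


def hosts_per_rank(mpirun_output):
    """Count how many ranks reported each hostname."""
    return Counter(line.strip() for line in mpirun_output.splitlines() if line.strip())


def ranks_are_spread(mpirun_output, expected_nodes):
    """True only if exactly expected_nodes hosts appear, each with one rank."""
    counts = hosts_per_rank(mpirun_output)
    return len(counts) == expected_nodes and all(n == 1 for n in counts.values())


def run_hostname_check(hosts):
    """Run `mpirun -np N -H host:1,... hostname` and check rank placement."""
    host_arg = ",".join(f"{h}:1" for h in hosts)
    result = subprocess.run(
        ["mpirun", "-np", str(len(hosts)), "-H", host_arg, "hostname"],
        capture_output=True, text=True, check=True, timeout=60,
    )
    return ranks_are_spread(result.stdout, len(hosts))


# Example call with the addresses from this post:
#   run_hostname_check(["192.168.200.1", "192.168.200.2"])
```

With the output shown above (N1 printed twice), ranks_are_spread returns False, which is the condition that makes all_reduce_perf report only one GPU.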

Two issues for NVIDIA engineering

Issue 1: NCCL build fails on shipping DGX OS. The CUDA 13.0 math_functions.h header is incompatible with Ubuntu 25.10’s glibc math headers. The playbook’s build command fails out of the box. A manual header patch is required.

Issue 2: OpenMPI 5.0.8 cannot distribute MPI ranks across DGX Spark nodes via CX7 IP addresses. This blocks the NCCL playbook’s cross-node test entirely, and is likely the root cause of the TP deadlocks we reported in the original post — if MPI can’t distribute processes correctly, the NCCL all-reduce will deadlock because both ranks try to use the same single GPU.

Environment (unchanged from original post)

  • 2x DGX Spark (GB10), driver 580.126.09, CUDA 13.0
  • DGX OS: Ubuntu 25.10, kernel 6.17.0-1014-nvidia
  • ConnectX-7 DAC: Amphenol (NVIDIA-shipped with the Sparks)
  • CX7 interface: enp1s0f0np0, MTU 9000

Questions

  1. Is there a known fix for OpenMPI 5.0.8 cross-node process distribution on DGX Spark?
  2. Is the CUDA header incompatibility with Ubuntu 25.10’s glibc a known issue?
  3. Can NVIDIA provide a confirmed working software configuration (OpenMPI version, DGX OS version, CUDA version) for the stacked-sparks NCCL playbook?

Thank you!

Ah, I didn’t notice in your original post that you are running Ubuntu 25.10. This is not a supported OS for GB10 systems. Please reimage your Spark back to the provided DGX OS, following the System Recovery section of the DGX Spark User Guide.

Thanks for the tips!

See the important update in my message after this one. Some issues in this message are resolved; the focus of the problem is now much tighter.

Update: Reimaged to DGX OS — NCCL now works, TRT-LLM autotuner still hangs

Following your guidance, we reimaged both Sparks to DGX OS (Ubuntu 24.04.4 LTS) via USB recovery using FastOS v1.120.38. The root cause of the failures was my longstanding habit of regularly updating all software. From now on I will leave the Ubuntu update utility that ships with the Spark alone. If I understand correctly, I should apply only the updates presented on the DGX Dashboard?

What the reimage fixed

NCCL builds cleanly on 24.04 — no header patches needed. The rsqrt exception specification error we reported previously was caused by Ubuntu 25.10’s newer glibc being incompatible with CUDA 13.0’s math_functions.h. On 24.04, make -j src.build NVCC_GENCODE="-gencode=arch=compute_121,code=sm_121" succeeds without modification.

MPI distributes processes correctly on 24.04. mpirun -np 2 -H 192.168.200.1:1,192.168.200.2:1 hostname now returns N1 and N2 (two different nodes). On Ubuntu 25.10 with OpenMPI 5.0.8, both ranks ran on the same node.

NCCL cross-node all_reduce passes:

# nccl-tests v2.18.2, NCCL v2.28.9-1
#  Rank  0 Group  0 Pid  42602 on  N1 device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid  39702 on  N2 device  0 [000f:01:00] NVIDIA GB10

       size      busbw (GB/s)
     8388608       14.44
    16777216       14.79
    33554432       14.23
    67108864       16.76
   134217728       17.32

# Avg bus bandwidth: 3.39 GB/s
# Out of bounds values: 0 OK

Peak 17.3 GB/s at 128MB buffer. Zero errors. NCCL cross-node communication is fully functional.
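
A note on reading these numbers: nccl-tests derives all-reduce bus bandwidth from algorithm bandwidth as busbw = algbw * 2(n-1)/n, which for n = 2 ranks equals algbw. The reported average appears to cover every message size in the sweep (here starting at 8 bytes, where latency dominates), which would explain why 3.39 GB/s sits far below the large-message rows. A small Python sketch of that arithmetic, using only the rows shown above; the function names are illustrative.

```python
def allreduce_busbw_gbs(size_bytes, time_s, n_ranks):
    """Bus bandwidth in GB/s as nccl-tests defines it for all-reduce."""
    algbw = size_bytes / time_s / 1e9           # algorithm bandwidth, GB/s
    return algbw * 2 * (n_ranks - 1) / n_ranks  # factor is 1.0 when n_ranks == 2


# busbw values for the five largest message sizes, copied from the table above.
LARGE_MESSAGE_BUSBW = [14.44, 14.79, 14.23, 16.76, 17.32]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


print(f"large-message mean: {mean(LARGE_MESSAGE_BUSBW):.2f} GB/s")
print(f"large-message peak: {max(LARGE_MESSAGE_BUSBW):.2f} GB/s")
```

For the rows above this prints a mean of 15.51 GB/s and a peak of 17.32 GB/s, which is the range to compare against the link rate rather than the sweep-wide average.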

What still does not work

TRT-LLM TP=2 serving hangs during autotuner/CUDA graph warmup. We followed the TRT-LLM dual-Spark playbook exactly with nvidia/Qwen3-235B-A22B-FP4:

  • Docker Swarm: both nodes visible, replicas running
  • Inter-container SSH on port 2233: verified
  • Model weight loading: succeeds — 2171 shards loaded across both GPUs, 65 GB allocated per node
  • KV cache allocated: 0.19 GiB for 4128 tokens
  • Then: [Autotuner] Autotuning process starts ... — and the process hangs indefinitely

GPU utilization drops to 0%, CPU idle at 99%, all 49 Python processes sleeping. Log stops growing. Health endpoint never becomes available.

This is not an NCCL issue — the standalone all_reduce_perf test passes with 17 GB/s bandwidth. The hang occurs specifically inside TRT-LLM’s autotuner during its internal profiling sequence. We also observed the same hang at the CUDA graph warmup for batch size=4 step when the autotuner was skipped via config.

Environment (post-reimage)

  • DGX OS: Ubuntu 24.04.4 LTS, kernel 6.17.0-1014-nvidia
  • Driver: 580.126.09, CUDA: 13.0
  • TRT-LLM: 1.1.0rc3 (nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev)
  • NCCL: 2.28.9-1 (built from source, also tested with container’s bundled version)
  • ConnectX-7: Amphenol DAC (shipped with Sparks), enp1s0f0np0, MTU 9000

Summary

The reimage resolved all NCCL and MPI issues. Cross-node GPU communication works at 17 GB/s. The remaining blocker is TRT-LLM’s autotuner/CUDA graph capture hanging during dual-Spark TP=2 serving. Single-Spark TRT-LLM serving works perfectly; separately, we run Qwen3.5-122B at 51 tok/s on a single Spark via vLLM.

Questions

  1. Is the TRT-LLM autotuner hang on dual GB10 a known issue with spark-single-gpu-dev image v1.1.0rc3?
  2. Is there a flag to fully disable the autotuner AND CUDA graph capture for multi-node TP? We tried enable_autotuner: false in the extra config but it’s not a recognized key.
  3. Has NVIDIA successfully served nvidia/Qwen3-235B-A22B-FP4 with TP=2 on shipping DGX Spark hardware using the documented playbook? If so, what exact container image tag and config were used?

Update: Autotuner eliminated — hang isolated to CUDA graph capture on inter-node TP

Following the reimage to DGX OS (reported in our previous update), we ran a systematic test to isolate the TP=2 hang. Here are the findings.

What we tested

Using --extra_llm_api_options with a YAML config:

enable_autotuner: false

Launched via Docker Swarm (both replicas healthy, inter-container SSH on port 2233 verified bidirectional):

mpirun --allow-run-as-root \
  -np 2 -H 192.168.200.1:1,192.168.200.2:1 \
  -x NCCL_SOCKET_IFNAME=enp1s0f0np0 \
  -x UCX_NET_DEVICES=enp1s0f0np0 \
  --mca oob_tcp_if_include enp1s0f0np0 \
  --mca btl_tcp_if_include enp1s0f0np0 \
  trtllm-llmapi-launch trtllm-serve nvidia/Qwen3-235B-A22B-FP4 \
  --tp_size 2 \
  --backend pytorch \
  --max_num_tokens 4096 \
  --max_batch_size 1 \
  --extra_llm_api_options /tmp/extra-llm-api-config.yml \
  --trust_remote_code \
  --port 8355

What works

  • MPI distributes correctly — both ranks launch on separate nodes (N1 and N2)
  • enable_autotuner: false is accepted — confirmed in the PyTorchConfig dump: enable_autotuner=False
  • Autotuner uses fallback tactics instead of profiling — no hang at the autotuner stage:
    [AutoTunner] Using the fallback tactic, due to cache miss on input shapes=...
    
  • Inter-node TP detected: Detect inter-node TP between rank 0 and rank 1
  • Weight loading completes successfully — 2171 shards, 62.66 GiB per node (inside torch) + ~21 GiB outside torch
  • NCCL communication works — standalone all_reduce_perf still passes at 17.3 GB/s (unchanged from previous update)

Where it hangs

The process reaches this exact point and freezes indefinitely:

[TRT-LLM] [RANK 0] [I] Creating CUDA graph instances for 1 batch sizes.
[TRT-LLM] [RANK 0] [I] Run generation only CUDA graph warmup for batch size=1, draft_len=0

GPU utilization drops to 0%, all processes sleep, log stops growing. This is the same symptom as previously reported, but now precisely isolated: the hang is in CUDA graph capture during cross-node TP, not the autotuner.

What we tried to disable CUDA graphs

  1. cuda_graph_config: null in the YAML — ignored. CudaGraphConfig has default_factory=CudaGraphConfig, so null is overridden by the default constructor. PyTorchConfig dump still shows use_cuda_graph=True with 34 batch sizes.

  2. cuda_graph_config.enable: false — rejected by Pydantic:

    cuda_graph_config.enable
      Extra inputs are not permitted [type=extra_forbidden, ...]
    
  3. cuda_graph_config.max_batch_size: 0 and cuda_graph_config.batch_sizes: [] — ignored. The default factory still produces the full batch size list (1–32, 64, 128).

  4. --enforce-eager — not a TRT-LLM flag (No such option: --enforce-eager). This is a vLLM option.

Bottom line: there appears to be no supported way to disable CUDA graph capture in TRT-LLM 1.1.0rc3 via extra_llm_api_options. The cuda_graph_config field’s default_factory always constructs a CudaGraphConfig with the full batch size list regardless of what is passed in the YAML.

Diagnosis

The autotuner was a red herring. The actual blocker is CUDA graph capture (torch.cuda.CUDAGraph.capture() or equivalent) deadlocking when the model graph spans two GB10 nodes via NCCL. This works fine on a single node — single-Spark TRT-LLM with CUDA graphs runs without issue.

Single-node performance (for reference)

Qwen3.5-122B-A10B on a single DGX Spark via vLLM with MTP-2 speculative decoding: 43–52 tok/s. Single-node inference is fully operational on both our Sparks and our Ascent GX10.

Questions

  1. Is there a supported way to fully disable CUDA graph capture in TRT-LLM 1.1.0rc3 for the PyTorch backend? The cuda_graph_config default factory appears to bypass any YAML override.
  2. Is CUDA graph capture on inter-node TP a known issue on GB10? The capture works on single-node — the deadlock is specific to cross-node NCCL inside the graph.
  3. Is there a newer spark-single-gpu-dev container image that addresses this? We are on nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev (TRT-LLM 1.1.0rc3).

Environment

  • DGX OS: Ubuntu 24.04.4 LTS, kernel 6.17.0-1014-nvidia
  • Driver: 580.126.09, CUDA: 13.0
  • TRT-LLM: 1.1.0rc3 (nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev)
  • NCCL: 2.28.9-1
  • ConnectX-7: Amphenol DAC, MTU 9000
  • Model: nvidia/Qwen3-235B-A22B-FP4

Update: Multi-node inference working — SGLang PP=2 bypasses TRT-LLM CUDA graph hang

We have working multi-node inference on dual DGX Spark. Qwen3-235B-A22B-FP4 is serving across both nodes at 12 tok/s via SGLang with pipeline parallelism. The TRT-LLM CUDA graph capture issue is no longer a blocker — we switched runtimes entirely.

What works

SGLang v0.5.9 with pipeline parallelism (PP=2) across dual DGX Spark:

Container: scitrera/dgx-spark-sglang:0.5.9-t5
Model: nvidia/Qwen3-235B-A22B-FP4 (NVFP4, 125GB total)
Parallelism: PP=2 (pipeline, NOT tensor)
NCCL: 2.29.3 via CX7 400G DAC
CUDA graphs: Disabled (--disable-cuda-graph)
Memory: 69.6 GB model weights per node, 44 GB KV cache headroom

Launch configuration (bare-metal Docker, no Swarm required):

# N1 (head, node-rank 0)
docker run -d --name sglang-head \
  --runtime=nvidia --gpus all --network host --ipc=host \
  -v /home/dashbana/.cache/huggingface:/root/.cache/huggingface \
  -e NCCL_SOCKET_IFNAME=enp1s0f0np0 \
  -e GLOO_SOCKET_IFNAME=enp1s0f0np0 \
  scitrera/dgx-spark-sglang:0.5.9-t5 \
  python3 -m sglang.launch_server \
    --model-path nvidia/Qwen3-235B-A22B-FP4 \
    --tp-size 1 --pp-size 2 --nnodes 2 --node-rank 0 \
    --dist-init-addr <CX7-IP-node0>:50000 \
    --host 0.0.0.0 --port 8355 \
    --mem-fraction-static 0.75 \
    --trust-remote-code --disable-cuda-graph

# N2 (worker, node-rank 1) — same command with --node-rank 1

Performance:

  • Generation throughput: 12.0–12.3 tok/s (steady, with very little variance)
  • Model load time: ~8 minutes (444s N1, similar N2)
  • KV cache: FP8 (torch.float8_e4m3fn), 44 GB available per node
  • Prompt prefill: ~12 tok/s (no CUDA graphs)
  • Complex reasoning prompts: 2,000+ token responses, fully coherent

Why pipeline parallelism works where tensor parallelism doesn’t

We tested TP=2 with SGLang (v0.5.4, v0.5.9, v0.5.10); all three versions hung at the same point: after NCCL init, before weight loading. TRT-LLM got further (it loaded weights and then hung at CUDA graph warmup), but in both runtimes the hang follows NCCL init. The issue appears to be in the NCCL collective initialization during model sharding for tensor parallelism on dual GB10.

Pipeline parallelism avoids this entirely. PP splits the model by layers (N1 gets layers 0–47, N2 gets layers 48–95). Each node loads its half independently with minimal cross-node coordination during init. NCCL is only used for activation passing between pipeline stages, not for the all-reduce operations that TP requires during model initialization.

The GPU utilization pattern on Grafana confirms PP behavior: N1 and N2 spike alternately (sequential layer processing) rather than simultaneously (which TP would show).
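
The layer split described above can be sketched as a contiguous, even partition. This is only an illustration in Python, assuming the 96-layer count implied by the ranges quoted above; SGLang’s real partitioning logic may differ, and split_layers is a hypothetical name.

```python
def split_layers(num_layers, num_stages):
    """Return inclusive (first, last) layer indices for each pipeline stage."""
    if num_stages < 1 or num_layers < num_stages:
        raise ValueError("need at least one layer per stage")
    base, extra = divmod(num_layers, num_stages)
    ranges = []
    start = 0
    for stage in range(num_stages):
        # Earlier stages absorb the remainder when the split is uneven.
        count = base + (1 if stage < extra else 0)
        ranges.append((start, start + count - 1))
        start += count
    return ranges


for node, (first, last) in enumerate(split_layers(96, 2), start=1):
    print(f"N{node}: layers {first}-{last}")
```

Each stage only needs the activations from the stage before it, which is why the cross-node traffic is point-to-point rather than an all-reduce.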

What we tried that didn’t work

Runtime   Mode   Version                  Result
TRT-LLM   TP=2   1.1.0rc3                 Hangs at CUDA graph warmup
SGLang    TP=2   0.5.4 (spark tag)        Hangs after NCCL init
SGLang    TP=2   0.5.10.post1 (latest)    Hangs after NCCL init
SGLang    TP=2   0.5.9 (scitrera/spark)   Hangs after NCCL init
SGLang    PP=2   0.5.9 (scitrera/spark)   WORKS (12 tok/s)

Key findings for other DGX Spark users

  1. Pipeline parallelism is the path for 2-node Spark clusters over CX7. TP=2 hangs across multiple runtimes. PP=2 works immediately.
  2. GLOO_SOCKET_IFNAME must be set alongside NCCL_SOCKET_IFNAME when using the CX7 interface. Without it, Gloo tries the LAN interface and times out.
  3. --mem-fraction-static 0.75 is safer than 0.85 on Spark unified memory. At 0.85, complex prompts can OOM the worker node.
  4. CUDA graphs disabled — we haven’t tested enabling them with PP yet. The 12 tok/s baseline is without CUDA graph acceleration. Enabling them may improve throughput.
  5. Docker Swarm is not needed. Bare-metal Docker containers with --network host work perfectly. Simpler to deploy and debug.
  6. The scitrera/dgx-spark-sglang:0.5.9-t5 community image worked where the official lmsysorg/sglang:spark (v0.5.4) did not. The Spark-specific community images are better tested on this hardware.
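
Finding 2 is easy to check before launching. A minimal Python preflight sketch, assuming both variables should name the CX7 interface exactly; NCCL also accepts prefix and exclusion forms for NCCL_SOCKET_IFNAME, which this simple equality test does not handle. The function name is hypothetical.

```python
import os

REQUIRED_IFACE_VARS = ("NCCL_SOCKET_IFNAME", "GLOO_SOCKET_IFNAME")


def iface_env_problems(env, iface):
    """Return human-readable problems; an empty list means both variables match iface."""
    problems = []
    for name in REQUIRED_IFACE_VARS:
        value = env.get(name)
        if value is None:
            problems.append(f"{name} is not set")
        elif value != iface:
            problems.append(f"{name}={value}, expected {iface}")
    return problems


# Check the current process environment against the CX7 interface from this thread.
for problem in iface_env_problems(os.environ, "enp1s0f0np0"):
    print("WARNING:", problem)
```

Run it inside the container on each node, since the variables that matter are the ones the server process actually sees.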

Environment

  • DGX OS: Ubuntu 24.04.4 LTS, kernel 6.17.0-1014-nvidia
  • Driver: 580.126.09, CUDA: 13.0 (host) / 13.1.1 (container)
  • SGLang: 0.5.9 (scitrera/dgx-spark-sglang:0.5.9-t5)
  • NCCL: 2.29.3
  • ConnectX-7: Amphenol DAC, MTU 9000
  • Model: nvidia/Qwen3-235B-A22B-FP4 (NVFP4, 125GB)

Summary

The TP=2 hang remains an open issue across TRT-LLM and SGLang. But pipeline parallelism completely sidesteps it. For dual-Spark users with CX7: use SGLang PP=2. It works today, right now, with the weights already on your nodes.

I suggest you also look at our community Docker setup, eugr/spark-vllm-docker on GitHub (Docker configuration for running vLLM on dual DGX Sparks), and at https://sparkrun.dev/, an orchestration tool for vLLM (uses our community build) and SGLang.

You should also check the Networking Guide. I noticed that you assign IPs from the same subnet to both CX7 interfaces; this is not good networking practice and is known to cause slowdowns.

Thank you, eugr — even before your reply your work was directly instrumental in getting our dual-Spark cluster operational.

We’ve been using your spark-vllm-docker SM121 build since day one. Our Qwen3.5-122B-A10B deployment (hybrid INT4+FP8 checkpoint, MTP-2 speculative decoding, FlashInfer backend) runs at 51 tok/s on a single Spark — built on top of your vLLM base image. It’s been rock-solid across both our DGX Sparks and an Ascent GX10.

On the CX7 subnet issue — good catch, and thank you for flagging it. That fix is underway as I type. We currently have both nodes on 192.168.200.x/24 using a single DAC cable. We’ll re-subnet per your networking guide recommendations (177.x / 178.x split) during our next maintenance window.

On the multi-node front: we hit the same TP=2 CUDA graph hang that others have reported (TRT-LLM and SGLang both deadlock after NCCL init). The breakthrough for us was switching to pipeline parallelism (PP=2) via SGLang. Qwen3-235B-A22B-FP4 is now serving at 12 tok/s across dual Sparks with scitrera/dgx-spark-sglang:0.5.9-t5, CUDA graphs disabled. Details in our latest forum update.

We’ll definitely look at sparkrun.dev — the YAML recipe approach would simplify our deployment significantly. Does it support pipeline parallelism configurations, or is it primarily TP-focused?

Thanks again for the community Docker images and the networking guide!

I am having the exact same issues you had, and I am going to try SGLang as you suggested. vLLM hangs at the final profiling stage, and TRT-LLM hangs at the final autotuner and CUDA graph stage.

I really believe DGX Spark needs some major work from NVIDIA.

I expect (hope, really) we’ll have all this resolved in the next few hours. Between my previous posts and the next, hopefully you’ll have a recipe that works.

It supports tp/pp/dp.

Spark-vllm-docker also supports recipes via run-recipe.sh, but only v1 of the recipe format. Sparkrun is compatible with v1 and also supports v2, which allows extra parameters such as min/max nodes, engine, etc.

Pipeline parallelism doesn’t make much sense on Sparks unless you have more than two nodes, the node count is not a power of two (2, 4, 8), and the desired model doesn’t fit on 2 nodes. TP lets you run bigger models and run them faster at the same time. PP will be slower than a single node for the same model.

Just use spark-vllm-docker (directly or via sparkrun) - most of the NVFP4 issues are now solved, and while it doesn’t fully utilize the hardware potential yet, it is pretty solid now. AWQ and Int4-Autoround quants are still faster though.

Update: Tensor parallelism resolved — 22.6 tok/s on Qwen3-235B across dual Sparks

TP=2 is working. Qwen3-235B-A22B-FP4 is serving at 22.6 tok/s across dual DGX Sparks with CUDA graphs enabled. The fix was eugr’s spark-vllm-docker with Ray as the distributed backend.

What fixed it

Ray. That’s the entire answer.

Every attempt we made with manual multi-node launches — --nnodes, --dist-init-addr, --master-addr — deadlocked after NCCL init. This was consistent across:

Runtime   Version                  Backend             Result
TRT-LLM   1.1.0rc3                 MPI                 Hangs at CUDA graph warmup
SGLang    0.5.4 / 0.5.9 / 0.5.10   PyTorch multiproc   Hangs after NCCL init
vLLM      0.19.1rc1                PyTorch multiproc   Hangs after NCCL init
vLLM      0.19.1rc1                Ray                 WORKS (22.6 tok/s)

The only variable that changed was --distributed-executor-backend ray, launched through eugr’s launch-cluster.sh which starts Ray head and worker nodes before vLLM. The default PyTorch multiprocessing backend cannot coordinate distributed initialization on the single-GPU-per-node topology of dual DGX Spark. Ray handles the placement groups and worker coordination correctly.

Working configuration

# Clone eugr's repo on both nodes
git clone https://github.com/eugr/spark-vllm-docker.git

# Pull the community nightly image on both nodes
docker pull ghcr.io/spark-arena/dgx-vllm-eugr-nightly:latest
docker tag ghcr.io/spark-arena/dgx-vllm-eugr-nightly:latest vllm-node:latest

# Create .env on the head node
cat > .env << EOF
CLUSTER_NODES=192.168.177.1,192.168.177.2
LOCAL_IP=192.168.177.1
ETH_IF=enp1s0f0np0
IB_IF=rocep1s0f0
EOF

# Launch
bash run-recipe.sh recipes/qwen3-235b-nvfp4.yaml -d

Recipe used (added to recipes/):

recipe_version: "1"
name: Qwen3-235B-A22B-NVFP4
description: vLLM serving Qwen3-235B-A22B using NVFP4

model: nvidia/Qwen3-235B-A22B-FP4
container: vllm-node
cluster_only: true

env:
  VLLM_FLASHINFER_ALLREDUCE_BACKEND: trtllm
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: 1

defaults:
  port: 8355
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.75
  max_model_len: 32768
  max_num_seqs: 4

command: |
  vllm serve nvidia/Qwen3-235B-A22B-FP4 \
  --kv-cache-dtype fp8 \
  --trust-remote-code \
  --gpu-memory-utilization {gpu_memory_utilization} \
  --max-model-len {max_model_len} \
  --max-num-seqs {max_num_seqs} \
  --enable-prefix-caching \
  --host {host} \
  --port {port} \
  --tensor-parallel-size {tensor_parallel} \
  --attention-backend TRITON_ATTN \
  --distributed-executor-backend ray

Performance

  • Generation throughput: 22.6 tok/s (peak), ~22 tok/s sustained
  • CUDA graphs: FULL + PIECEWISE — captured and working on cross-node TP
  • Memory: 69 GB (N1) + 70 GB (N2) — tensor-parallel split, 25.4 GB KV cache per node
  • KV cache: FP8 (fp8_e4m3)
  • Boot to serving: ~15 minutes (weight loading + CUDA graph compilation)

CX7 networking update

Per eugr’s recommendation, we re-subnetted the CX7 interfaces from a shared 192.168.200.x/24 to 192.168.177.x/24 with proper dhcp4: false, dhcp6: false, and link-local: [] settings. While this alone didn’t unblock TP (we tested — same hang without Ray), it’s the correct configuration per the networking guide and should be done regardless.

What we tried along the way

For the community’s benefit, here’s the full sequence of what didn’t work and what did:

  1. Ubuntu 25.10 — broke everything (NCCL, MPI, TP). Reimaged to DGX OS 24.04 LTS. Big lesson here: the software update app that comes installed on the Spark – don’t touch it.
  2. NCCL standalone: all_reduce_perf works at 17.3 GB/s. Cross-node GPU communication was never the problem.
  3. TRT-LLM TP=2 — hangs at CUDA graph capture. enable_autotuner: false works but cuda_graph_config cannot be disabled. Dead end.
  4. SGLang TP=2 — three versions tested, all hang after NCCL init with PyTorch multiprocessing backend.
  5. SGLang PP=2 — works at 12 tok/s but PP is slower than single-node for the same model (eugr confirmed).
  6. vLLM TP=2 (manual) — hangs after NCCL init with --nnodes / --master-addr.
  7. vLLM TP=2 (Ray via spark-vllm-docker): WORKS at 22.6 tok/s with CUDA graphs.

Acknowledgments

eugr’s spark-vllm-docker repository, community Docker images, networking guide, and direct forum guidance were the critical path to getting TP working. The entire DGX Spark community benefits from his work. Thank you.

Environment

  • DGX OS: Ubuntu 24.04.4 LTS, kernel 6.17.0-1014-nvidia
  • Driver: 580.126.09, CUDA: 13.2 (container)
  • vLLM: 0.19.1rc1 (ghcr.io/spark-arena/dgx-vllm-eugr-nightly:latest)
  • NCCL: 2.29.7
  • Ray: distributed backend for multi-node coordination
  • ConnectX-7: Amphenol DAC, MTU 9000, 192.168.177.x/24
  • Model: nvidia/Qwen3-235B-A22B-FP4 (NVFP4, base weights)

Thank you, I got mine working using your method.

Can you try the same launch command but with the --no-ray parameter and see if you still get NCCL deadlocks?

./run-recipe.sh recipes/qwen3-235b-nvfp4.yaml -d --no-ray

@eugr I have had no success with spark-vllm-docker in a cluster, although I can follow NVIDIA’s instructions to set up Ray with vLLM. My main question is how to pass my NCCL settings to the Docker container. The .env settings generated by ./run-recipe.sh --autodiscover are not enough for me. Below are the NCCL-related settings I use with SGLang:


    --env "CUDA_VISIBLE_DEVICES=0" \
    --env "NCCL_SOCKET_IFNAME=enp1s0f1np1" \
    --env "NCCL_DEBUG=INFO" \
    --env "NCCL_IB_DISABLE=0" \
    --env "NCCL_IB_GID_INDEX=3" \
    --env "MASTER_ADDR=192.168.100.11" \
    --env "MASTER_PORT=50000" \
    --env "WORLD_SIZE=2" \
    --env "NCCL_IB_TIMEOUT=22" \
    --env "NCCL_IB_RETRY_CNT=7" \
    --env "NCCL_ASYNC_ERROR_HANDLING=1" \
    --env "NCCL_BLOCKING_WAIT=1" \
    --env "TORCH_DISTRIBUTED_TIMEOUT=1800"

I don’t experience any NCCL issues as long as the version is 2.29.3. I am using the acab24a7 build:
{
  "dev.scitrera.cuda_version": "13.1.1",
  "dev.scitrera.flashinfer_version": "0.6.5",
  "dev.scitrera.nccl_version": "2.29.3-1",
  "dev.scitrera.sglang_version": "0.5.9+git-acab24a7",
  "dev.scitrera.torch_audio_version": "2.10.0",
  "dev.scitrera.torch_version": "2.10.0",
  "dev.scitrera.torch_vision_version": "0.25.0",
  "dev.scitrera.transformers_version": "5.3.0",
  "dev.scitrera.triton_version": "3.6.0",
  "maintainer": "scitrera.ai <open-source-team@scitrera.com>",
  "org.opencontainers.image.ref.name": "ubuntu",
  "org.opencontainers.image.version": "24.04"
}

Do you have an unusual NCCL setup? The default settings in spark-vllm-docker (assuming you followed the networking guide to set up the addresses, etc) are battle tested and work well. If you need to pass more variables to the container, you can use -e parameter, e.g. -e NCCL_DEBUG=INFO (it has an alias --nccl-debug). This parameter is accepted by both ./run-recipe.sh and ./launch-cluster.sh.

You can also specify additional NCCL settings in the .env file generated by autodiscover, e.g.:

# Auto-generated by autodiscover.sh
CLUSTER_NODES=192.168.24.115,192.168.24.104,192.168.24.119
COPY_HOSTS=192.168.177.11,192.168.197.13
LOCAL_IP=192.168.24.115
ETH_IF=enP7s7
IB_IF=rocep1s0f0,roceP2p1s0f0,rocep1s0f1,roceP2p1s0f1
# Mesh mode NCCL settings
CONTAINER_NCCL_NET_PLUGIN=none
CONTAINER_NCCL_IB_SUBNET_AWARE_ROUTING=1
CONTAINER_NCCL_IB_MERGE_NICS=0
CONTAINER_NCCL_DEBUG=INFO

Everything starting with CONTAINER_ will be passed to the docker without the CONTAINER_ prefix, so CONTAINER_NCCL_DEBUG=INFO will become NCCL_DEBUG=INFO.
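
To make the prefix rule concrete, here is a small Python sketch of the mapping. It is an illustration of the behavior described above, not the actual logic inside run-recipe.sh or launch-cluster.sh; parse_env_file and docker_env_args are hypothetical names.

```python
PREFIX = "CONTAINER_"


def parse_env_file(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def docker_env_args(env):
    """Turn CONTAINER_FOO=bar entries into ['-e', 'FOO=bar'] argument pairs."""
    args = []
    for key, value in env.items():
        if key.startswith(PREFIX) and len(key) > len(PREFIX):
            args += ["-e", f"{key[len(PREFIX):]}={value}"]
    return args


sample = """
# Mesh mode NCCL settings
ETH_IF=enP7s7
CONTAINER_NCCL_NET_PLUGIN=none
CONTAINER_NCCL_DEBUG=INFO
"""
print(docker_env_args(parse_env_file(sample)))
```

Entries without the prefix, such as ETH_IF, are left for the launcher itself and never reach the container in this sketch.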

Tested. No deadlock with --no-ray. It works perfectly.

./run-recipe.sh recipes/qwen3-235b-nvfp4.yaml -d --no-ray
  • Weights loaded across both nodes (27 shards, 8m53s)
  • CUDA graphs captured (FULL + PIECEWISE) — no hang
  • 23.1 tok/s — slightly faster than Ray mode (22.6 tok/s)
  • Application startup complete, inference verified

This means the NCCL deadlocks we reported earlier were not caused by Ray vs no-Ray, but by our manual launch approach. We were using raw --nnodes / --master-addr / --dist-init-addr flags without your launch-cluster.sh orchestration. Your script handles the distributed environment setup correctly in both modes — our bare launches bypassed whatever setup your script does, which is why they hung.

For reference, here’s what deadlocked (all without launch-cluster.sh):

Launch method                                 Result
vLLM --nnodes 2 --master-addr (manual)        Hangs after NCCL init
SGLang --nnodes 2 --dist-init-addr (manual)   Hangs after NCCL init
TRT-LLM via Docker Swarm + MPI                Hangs at CUDA graph warmup
launch-cluster.sh with Ray                    Works (22.6 tok/s)
launch-cluster.sh with --no-ray               Works (23.1 tok/s)

The common variable in the working cases is launch-cluster.sh, not the distributed backend. Your orchestration is doing something during container/network setup that bare launches don’t — likely the SSH environment, NCCL interface pinning, or process coordination that your script handles automatically.
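
One way to narrow down which setting matters is to dump the environment from a working container and a hanging one and compare only the communication-related variables. A minimal Python sketch, assuming each dump is plain KEY=VALUE text such as the output of running env inside the container; the prefix list and function names are assumptions chosen for illustration.

```python
COMM_PREFIXES = ("NCCL_", "GLOO_", "UCX_", "OMPI_MCA_", "MASTER_")


def parse_env_output(text):
    """Parse KEY=VALUE lines from an `env` dump into a dict."""
    env = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            env[key] = value
    return env


def comm_env(env, prefixes=COMM_PREFIXES):
    """Keep only variables whose names start with one of the prefixes."""
    return {k: v for k, v in env.items() if k.startswith(prefixes)}


def diff_comm_env(working, hanging, prefixes=COMM_PREFIXES):
    """Map each differing variable to a (working_value, hanging_value) pair."""
    a, b = comm_env(working, prefixes), comm_env(hanging, prefixes)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Example with made-up dumps; None means the variable is absent on that side.
working = parse_env_output("NCCL_SOCKET_IFNAME=enp1s0f0np0\nGLOO_SOCKET_IFNAME=enp1s0f0np0\nHOME=/root")
hanging = parse_env_output("NCCL_SOCKET_IFNAME=enp1s0f0np0\nHOME=/root")
print(diff_comm_env(working, hanging))
```

The two dumps in the example are invented to show the output shape; they are not measurements from the launches in the table.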

We’ve switched our production deployment to --no-ray since it’s marginally faster. Thank you for suggesting the test — it simplified our stack and confirmed that your launch infrastructure is the key ingredient.

Yes, the key is to correctly configure the environment variables on all nodes.