I am currently configuring multi-node inference using stacked DGX Spark systems following the official documentation:
https://build.nvidia.com/spark/vllm/stacked-sparks
The distributed cluster setup has been completed successfully and verified with:

`docker exec $VLLM_CONTAINER ray status`
Cluster Status
The cluster reports:
- 2 active nodes
- 0 pending / 0 failed nodes
Available resources:
- CPU: 40
- GPU: 2
- Memory: 218.87 GiB
This indicates that Ray-level node registration and cluster connectivity are functioning correctly.
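Since Ray registration only confirms node-level connectivity and does not exercise the NCCL data path itself, I also intend to test cross-node NCCL directly. A sketch using nccl-tests is below; note that nccl-tests is not shipped in the container (it would need to be built from the NVIDIA/nccl-tests repository), and the hostnames are placeholders for the two stacked nodes:

```shell
# Hypothetical NCCL data-path check with nccl-tests (built separately from
# https://github.com/NVIDIA/nccl-tests). spark-1/spark-2 are placeholder
# hostnames for the two stacked Spark nodes; -g 1 uses one GPU per rank.
mpirun -np 2 -H spark-1,spark-2 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```

If this all-reduce succeeds across both nodes, the problem is more likely in the vLLM/Ray layer than in the NCCL transport.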
Environment Configuration
System: DGX Spark (2 stacked nodes)
CUDA Version: 13.0
NVIDIA Driver Version: 580.126.09
NCCL Version: 2.28.8
PyTorch Version: 2.10.0a0+b558c986e8.nv25.11
vLLM Version: 0.11.0+582e4e37.nv25.11
Container Base Image: nvcr.io/nvidia/vllm:25.11-py3
No custom NCCL environment variables (e.g. NCCL_SOCKET_IFNAME, NCCL_IB_DISABLE) have been configured.
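If explicit NCCL configuration turns out to be required, I assume it would look roughly like the following. The variable names are standard NCCL environment variables, but the interface name is a placeholder I would need to replace with the actual inter-node NIC on the Spark systems:

```shell
# Diagnostic NCCL settings -- variable names are standard NCCL env vars,
# but the interface name is a placeholder and must match the actual
# inter-node link on the stacked Spark systems.
export NCCL_DEBUG=INFO              # verbose NCCL logging during engine init
export NCCL_SOCKET_IFNAME=enp1s0f0  # placeholder: pin NCCL to the inter-node NIC
```

These would be set identically inside the vLLM container on both nodes before launching the engine.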
Observed Issue
VLM Model – Engine Initialization Failure
Model: Qwen/Qwen3-VL-30B-A3B-Instruct
Test Results
- Tensor Parallelism = 1: the model loads and runs inference successfully.
- Tensor Parallelism = 2: the engine fails during initialization with the following error:
```
RuntimeError: Engine core initialization failed. See root cause above.
Failed core proc(s): {}
```
The failure occurs during engine startup before inference begins.
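For reference, the TP=2 run is launched with a command along these lines (the flags are standard vLLM CLI options; I am assuming `--distributed-executor-backend ray` is the correct way to span the two Ray nodes):

```shell
# Launch sketch for the failing TP=2 case, using the standard vLLM CLI
# inside the container on the Ray head node.
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray
```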
Question
Given that:
- the Ray cluster reports 2 GPUs across 2 nodes
- single-GPU inference works
- the failure occurs only when tensor_parallel_size=2
Is there any additional configuration required for cross-node tensor parallelism when using stacked DGX Spark systems with vLLM?
Specifically:
- Are additional NCCL environment variables required for multi-node tensor parallelism?
- Does Qwen3-VL-30B-A3B currently support tensor parallelism across nodes in vLLM?
- Are there extra Ray or vLLM launch flags needed for multi-node VLM models?
Any guidance would be appreciated.