Two-Spark Cluster: tensor-parallel-size=2 causing Engine Initialization Failure with Qwen3-VL-30B (Ray + vLLM)

I am currently configuring multi-node inference using stacked DGX Spark systems following the official documentation:

https://build.nvidia.com/spark/vllm/stacked-sparks

The distributed cluster setup has been completed successfully and verified using:

docker exec $VLLM_CONTAINER ray status

Cluster Status

The cluster reports:

  • 2 active nodes

  • 0 pending / 0 failed nodes

Available resources:

CPU: 40
GPU: 2
Memory: 218.87 GiB

This indicates that Ray-level node registration and cluster connectivity are functioning correctly.


Environment Configuration

System: DGX Spark (2 stacked nodes)

CUDA Version: 13.0

NVIDIA Driver Version: 580.126.09

NCCL Version: 2.28.8

PyTorch Version: 2.10.0a0+b558c986e8.nv25.11

vLLM Version: 0.11.0+582e4e37.nv25.11

Container Base Image: nvcr.io/nvidia/vllm:25.11-py3

No custom NCCL environment variables (e.g. NCCL_SOCKET_IFNAME, NCCL_IB_DISABLE) have been configured.
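For reference, a minimal sketch of the NCCL-related variables that are commonly set when debugging multi-node runs — none of these are currently set in my environment, and the interface name below is a placeholder:

```shell
# Hypothetical NCCL debugging setup (not currently configured).
# NCCL_DEBUG=INFO makes NCCL print topology and transport selection at startup.
export NCCL_DEBUG=INFO

# Pin NCCL to the NIC that carries cross-node traffic; "enp1s0f0" is a
# placeholder -- check `ip addr` on each node for the real interface name.
export NCCL_SOCKET_IFNAME=enp1s0f0

# If no InfiniBand fabric is present, forcing socket transport can avoid
# long timeouts while NCCL probes for IB devices.
export NCCL_IB_DISABLE=1
```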


Observed Issue

VLM Model – Engine Initialization Failure

Model: Qwen/Qwen3-VL-30B-A3B-Instruct

Test Results

Tensor Parallelism = 1

Model loads and runs inference successfully.


Tensor Parallelism = 2

Engine fails during initialization with the following error:

RuntimeError: Engine core initialization failed. See root cause above.
Failed core proc(s): {}

The failure occurs during engine startup before inference begins.
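For context, a 2-way tensor-parallel launch across a Ray cluster typically looks like the sketch below (illustrative, not the verbatim command from my setup; `--distributed-executor-backend ray` is vLLM's flag for using the Ray cluster as the executor):

```shell
# Illustrative multi-node launch, run on the Ray head node inside the
# vLLM container. Fails during engine core initialization in my setup.
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray \
    --port 8000
```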


Question

Given that:

  • Ray cluster reports 2 GPUs across 2 nodes

  • Single-GPU inference works

  • Failure occurs only when tensor_parallel_size=2

Is there any additional configuration required for cross-node tensor parallelism when using stacked DGX Spark systems with vLLM?

Specifically:

  1. Are additional NCCL environment variables required for multi-node tensor parallelism?

  2. Does Qwen3-VL-30B-A3B currently support distributed tensor parallel across nodes in vLLM?

  3. Are there extra Ray or vLLM launch flags needed for multi-node VLM models?

Any guidance would be appreciated.


I highly recommend using a newer container image instead.

The nvcr.io/nvidia/vllm:25.11-py3 image is quite old.

As for the vLLM arguments: it depends on the model. If you look at a model card on Hugging Face, you will usually find the minimum required version of vLLM and/or SGLang along with the recommended launch options.

The Qwen/Qwen3.5-35B-A3B model card on Hugging Face, for example:

The following will create API endpoints at http://localhost:8000/v1:

  • Standard Version: The following command can be used to create an API endpoint with maximum context length 262,144 tokens using tensor parallel on 8 GPUs.
vllm serve Qwen/Qwen3.5-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 
  • Tool Call: To support tool use, you can use the following command.
vllm serve Qwen/Qwen3.5-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder 
  • Multi-Token Prediction (MTP): The following command is recommended for MTP:
vllm serve Qwen/Qwen3.5-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
  • Text-Only: The following command skips the vision encoder and multimodal profiling to free up memory for additional KV cache:
vllm serve Qwen/Qwen3.5-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 

Also recommended:

There you will find recipes for the toolkit.

And another option, if you prefer to get the playbook example running:

RuntimeError: Engine core initialization failed. See root cause above.

The full vLLM log would help here, since the actual root cause appears above this RuntimeError message.
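To capture everything, something like the following can be used (`VLLM_LOGGING_LEVEL` and `NCCL_DEBUG` are standard vLLM/NCCL environment variables; `$VLLM_CONTAINER` is the container name from the guide):

```shell
# Dump the complete container log, including the traceback that appears
# above the RuntimeError, into a file that can be attached to the thread.
docker logs "$VLLM_CONTAINER" > vllm_full.log 2>&1

# For a more verbose rerun, raise the log levels before launching:
#   VLLM_LOGGING_LEVEL=DEBUG  -- verbose vLLM engine logging
#   NCCL_DEBUG=INFO           -- NCCL transport/topology details
```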


Qwen3-VL-30B-A3B should support tensor parallelism.
We recently released a new vLLM container, 26.02-py3. I recommend trying it. If you still run into the same issue, please share the full vLLM log with me for further debugging.


@saptarshi2 I would also recommend reducing the allowed GPU memory utilization to 0.5.
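That setting corresponds to vLLM's `--gpu-memory-utilization` flag (default 0.9); a sketch of the adjusted launch, assuming the same model and parallelism as above:

```shell
# Leave half of each GPU's memory free during engine initialization;
# --gpu-memory-utilization defaults to 0.9 in vLLM.
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.5
```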

I will try this and let you know.