Two-Spark Cluster: tensor-parallel-size=2 causing Engine Initialization Failure with Qwen3-VL-30B (Ray + vLLM)

I am currently configuring multi-node inference using stacked DGX Spark systems following the official documentation:

https://build.nvidia.com/spark/vllm/stacked-sparks

The distributed cluster setup has been completed successfully and verified using:

docker exec $VLLM_CONTAINER ray status

Cluster Status

The cluster reports:

  • 2 active nodes

  • 0 pending / 0 failed nodes

Available resources:

CPU: 40
GPU: 2
Memory: 218.87 GiB

This indicates that Ray-level node registration and cluster connectivity are functioning correctly.


Environment Configuration

System: DGX Spark (2 stacked nodes)

CUDA Version: 13.0

NVIDIA Driver Version: 580.126.09

NCCL Version: 2.28.8

PyTorch Version: 2.10.0a0+b558c986e8.nv25.11

vLLM Version: 0.11.0+582e4e37.nv25.11

Container Base Image: nvcr.io/nvidia/vllm:25.11-py3

No custom NCCL environment variables (e.g. NCCL_SOCKET_IFNAME, NCCL_IB_DISABLE) have been configured.
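For reference, a minimal sketch of the NCCL-related variables that are commonly set when debugging multi-node runs — none of these are currently set in my environment, and the interface name below is a placeholder:

```shell
# Hypothetical NCCL debugging setup (not currently configured).
# NCCL_DEBUG=INFO makes NCCL print topology and transport selection at startup.
export NCCL_DEBUG=INFO

# Pin NCCL to the NIC that carries cross-node traffic; "enp1s0f0" is a
# placeholder -- check `ip addr` on each node for the real interface name.
export NCCL_SOCKET_IFNAME=enp1s0f0

# If no InfiniBand fabric is present, forcing socket transport can avoid
# long timeouts while NCCL probes for IB devices.
export NCCL_IB_DISABLE=1
```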


Observed Issue

VLM Model – Engine Initialization Failure

Model: Qwen/Qwen3-VL-30B-A3B-Instruct

Test Results

Tensor Parallelism = 1

Model loads and runs inference successfully.


Tensor Parallelism = 2

Engine fails during initialization with the following error:

RuntimeError: Engine core initialization failed. See root cause above.
Failed core proc(s): {}

The failure occurs during engine startup before inference begins.
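For context, a 2-way tensor-parallel launch across a Ray cluster typically looks like the sketch below (illustrative, not the verbatim command from my setup; `--distributed-executor-backend ray` is vLLM's flag for using the Ray cluster as the executor):

```shell
# Illustrative multi-node launch, run on the Ray head node inside the
# vLLM container. Fails during engine core initialization in my setup.
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct \
    --tensor-parallel-size 2 \
    --distributed-executor-backend ray \
    --port 8000
```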


Question

Given that:

  • Ray cluster reports 2 GPUs across 2 nodes

  • Single-GPU inference works

  • Failure occurs only when tensor_parallel_size=2

Is there any additional configuration required for cross-node tensor parallelism when using stacked DGX Spark systems with vLLM?

Specifically:

  1. Are additional NCCL environment variables required for multi-node tensor parallelism?

  2. Does Qwen3-VL-30B-A3B currently support distributed tensor parallel across nodes in vLLM?

  3. Are there extra Ray or vLLM launch flags needed for multi-node VLM models?

Any guidance would be appreciated.


I highly recommend using a newer container image instead.

The nvcr.io/nvidia/vllm:25.11-py3 image is quite old.

As for the vLLM arguments: it depends on the model. If you look at a model card on Hugging Face, you will usually find the minimum required version of vLLM and/or SGLang along with the recommended launch options.

The Qwen/Qwen3.5-35B-A3B model card on Hugging Face, for example:

The following will create API endpoints at http://localhost:8000/v1:

  • Standard Version: The following command can be used to create an API endpoint with maximum context length 262,144 tokens using tensor parallel on 8 GPUs.
vllm serve Qwen/Qwen3.5-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 
  • Tool Call: To support tool use, you can use the following command.
vllm serve Qwen/Qwen3.5-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder 
  • Multi-Token Prediction (MTP): The following command is recommended for MTP:
vllm serve Qwen/Qwen3.5-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
  • Text-Only: The following command skips the vision encoder and multimodal profiling to free up memory for additional KV cache:
vllm serve Qwen/Qwen3.5-35B-A3B --port 8000 --tensor-parallel-size 8 --max-model-len 

Also recommended:

There you will find recipes for the toolkit.

And another option, if you prefer to get the playbook example running:

RuntimeError: Engine core initialization failed. See root cause above.

The full vLLM log would help here, since the actual root cause appears above this RuntimeError message.
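To capture everything, something like the following can be used (`VLLM_LOGGING_LEVEL` and `NCCL_DEBUG` are standard vLLM/NCCL environment variables; `$VLLM_CONTAINER` is the container name from the guide):

```shell
# Dump the complete container log, including the traceback that appears
# above the RuntimeError, into a file that can be attached to the thread.
docker logs "$VLLM_CONTAINER" > vllm_full.log 2>&1

# For a more verbose rerun, raise the log levels before launching:
#   VLLM_LOGGING_LEVEL=DEBUG  -- verbose vLLM engine logging
#   NCCL_DEBUG=INFO           -- NCCL transport/topology details
```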


Qwen3-VL-30B-A3B should support tensor parallelism.
We recently released a new vLLM container, 26.02-py3. I recommend trying it. If you still run into the same issue, please share the full vLLM log with me for further debugging.


@saptarshi2 I would also recommend reducing the allowed GPU memory utilization to 0.5.
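That setting corresponds to vLLM's `--gpu-memory-utilization` flag (default 0.9); a sketch of the adjusted launch, assuming the same model and parallelism as above:

```shell
# Leave half of each GPU's memory free during engine initialization;
# --gpu-memory-utilization defaults to 0.9 in vLLM.
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.5
```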

I will try this and let you know.