Qwen3.5-397B-A17B-int4-AutoRound - 4 x db10 node - updated results 37 - 94 tok/s

mars.fatih · April 5, 2026, 12:31pm

Qwen3.5-397B Loading Deadlock on 4-Node Spark Cluster — 23 Attempts, All Failed (+ GLM-5 SGLang Findings)

Hi everyone,

I’ve been trying to get Qwen3.5-397B-A17B-int4-AutoRound running on my 4-node DGX Spark cluster. After 23 systematic attempts across two full sessions, I’m stuck on a deterministic loading deadlock and would appreciate any guidance from those who have this working.

I’m also sharing some GLM-5 findings via SGLang that might be useful to the community.

Hardware

Component	Details
Nodes	4x Acer Veriton GN100 (GB10, SM121)
RAM	128 GB per node (512 GB total)
Driver	Node 1: 580.95.05, Nodes 2-4: 580.142
Interconnect	MikroTik CRS812-8DS-2DQ-2DDQ, 200G RoCEv2 (QSFP)
MTU	9000
NCCL busbw	22.44 GB/s peak (4-node all_reduce)
Swap	24 GB per node, swappiness=1

What Works

The cluster itself is healthy — NCCL benchmarks at 22.44 GB/s, RDMA verified at 108 Gb/s
The ReplicatedLinear Marlin TP=4 patch applies successfully (Python-based, not the stale recipe mod)
MiniMax M2.5 AWQ serves fine at 24.4 tok/s on 2 nodes (vLLM v0.19.1rc1)
Qwen3.5-35B-A3B FP8 runs at 30.57 tok/s on 4 nodes

The Problem: Qwen3.5-397B Loading Deadlock

The model loads weights up to approximately layer 57 (of 96), then all worker threads enter futex_wait and never recover. The API never starts. No error message, no OOM, no timeout — just a permanent hang.
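A minimal sketch of how the "all threads in futex_wait" observation can be checked from the host, assuming a Linux kernel that exposes /proc/<pid>/task/<tid>/wchan. The function name and the proc_root parameter are hypothetical helpers, not part of vLLM. The exact symbol names reported (for example futex_wait_queue) vary by kernel version, and reading wchan may require root.

```python
from collections import Counter
from pathlib import Path


def summarize_wait_channels(pid: int, proc_root: str = "/proc") -> Counter:
    """Count the threads of `pid` by the kernel function they are sleeping in."""
    counts: Counter = Counter()
    task_dir = Path(proc_root) / str(pid) / "task"
    for thread_dir in task_dir.iterdir():
        try:
            wchan = (thread_dir / "wchan").read_text().strip()
        except OSError:
            # The thread exited while we were iterating, or access was denied.
            continue
        # An empty string or "0" means the thread is not blocked in the kernel.
        counts[wchan or "0"] += 1
    return counts
```

Calling summarize_wait_channels(worker_pid).most_common() on each node a few minutes apart shows whether every worker thread stays parked on a futex symbol, which is what distinguishes a deadlock from slow loading.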

This happens identically across:

vLLM v0.19.1rc1 (spark-vllm-docker build)
spark-arena nightly (v0.18.2rc1)
Both with and without Ray
Both TP=4 and PP=3 configurations
With and without --enforce-eager

All 23 Attempts (Summarized)

Session 1 (2026-04-03) — 10 attempts:

#	Config	Result
1	TP=4, vanilla vllm serve	Marlin partition error (output_size=32 < min_thread_n=64)
2	TP=4 + `--quantization gptq`	Same Marlin error
3	TP=4 + recipe mod fix-qwen35-tp4-marlin	Patch stale, doesn’t apply to current vLLM
4	TP=2	OOM (exit 137): the 226 GB of weights doesn’t fit in 2 x 128 GB = 256 GB once overhead is included
5	TP=4 + ReplicatedLinear Python patch	NCCL/PyTorch timeout after ~34 min
6	TP=4 + ReplicatedLinear + NCCL timeout 7200s	Extended to 47 min, then another timeout
7	TP=4 + ReplicatedLinear + fastsafetensors	OOM (exit 137)
8	TP=4 + fastsafetensors + gpu-mem 0.65	Still OOM or timeout
9	PP=3 + Ray + fastsafetensors (recipe)	Mods stale, container died
10	PP=3 + Ray + fastsafetensors (manual)	Container died during loading
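The TP=2 OOM in attempt 4 follows from simple arithmetic, sketched below. This assumes the 226 GB checkpoint shards evenly across nodes and that the gpu_memory_utilization of 0.78 from the cross-check table applies; it ignores KV cache, activations, and any layers that the ReplicatedLinear patch keeps whole on every node, so real usage is higher than these figures.

```python
MODEL_GB = 226        # checkpoint size reported in the post
NODE_RAM_GB = 128     # unified memory per node
GPU_MEM_UTIL = 0.78   # gpu_memory_utilization from the post (assumed for all runs)


def per_node_budget(num_nodes: int) -> dict:
    """Return the weight share, the vLLM memory budget, and what is left, in GB."""
    weights_gb = MODEL_GB / num_nodes
    budget_gb = NODE_RAM_GB * GPU_MEM_UTIL
    return {
        "weights_gb": weights_gb,
        "budget_gb": budget_gb,
        "left_gb": budget_gb - weights_gb,
    }


for nodes in (2, 4):
    b = per_node_budget(nodes)
    print(f"{nodes} nodes: weights {b['weights_gb']:.1f} GB, "
          f"budget {b['budget_gb']:.1f} GB, left {b['left_gb']:.1f} GB")
```

With two nodes the weight share alone exceeds the budget, while four nodes leave room on paper, which is consistent with TP=4 being the only configuration that gets as far as loading.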

Session 2 (2026-04-04) — 13 attempts:

#	Config	Image	Result
1	PP=3 + fastsafetensors	vllm-node-tf5 v0.19.1rc1	NCCL timeout at 600s during tensor broadcast
2	PP=3 + no fastsafetensors	vllm-node-tf5 v0.19.1rc1	Deadlock — all threads futex_wait after 30 min
3	PP=3 + enforce-eager	spark-arena nightly	Same deadlock
4	PP=3 + --no-ray	spark-arena nightly v0.18.2rc1	Same deadlock
5	TP=4 + Marlin patch (blog config)	vllm-node-tf5 v0.19.1rc1	Loaded to layer 57, then deadlock
6	TP=4 + swap fix (24GB, swappiness=1)	vllm-node-tf5 v0.19.1rc1	Same layer 57 deadlock
7	TP=4 + cache clearing	vllm-node-tf5 v0.19.1rc1	Same layer 57 deadlock
8	TP=4 + sonusflow fork mods	vllm-node-tf5 v0.19.1rc1	Cluster startup timeout
9	TP=4 + Python-based Marlin only	vllm-node-tf5 v0.19.1rc1	Cluster startup timeout (stale sglang containers)
10	TP=4 + eugr/main + Python mod	vllm-node-tf5 v0.19.1rc1	Same layer 57 deadlock
11	Build vLLM v0.17.0 (blog author’s era)	sonusflow Dockerfile	gdn_linear_attn.py not found — Qwen3.5 unsupported pre-v0.18.2
12	SGLang spark image	lmsysorg/sglang:spark v0.5.4	transformers 4.57.1 too old for qwen3_5_moe
13	SGLang + transformers 5.5.0	lmsysorg/sglang:spark	Breaks SGLang DeepseekVL2Config dataclass

What I’ve Ruled Out

Swap — increased to 24 GB, swappiness=1 on all nodes. Didn’t fix.
NCCL bandwidth — 22.44 GB/s (vs blog’s 23.89 GB/s). Healthy.
Stale caches — cleared torch_compile_cache, flashinfer, triton on all nodes.
Marlin TP=4 — ReplicatedLinear patch applied and accepted by Marlin. Not the blocker.
Ray vs no-Ray — both deadlock identically.
PP=3 vs TP=4 — both deadlock, just at different phases.
Model distribution — verified 226 GB on all 4 nodes.
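The Marlin partition error from attempt 1 and the effect of the ReplicatedLinear patch can be illustrated with a small check. This is a sketch, not vLLM's actual code: the full output width of 128 is an assumption inferred from the reported per-rank size of 32 at TP=4, and the divisibility test is a stand-in for whatever Marlin really validates.

```python
MIN_THREAD_N = 64  # value quoted in the error message in the post


def shard_output_size(full_output_size: int, tp_size: int) -> int:
    """Per-rank output width when a layer is split column-wise across tp_size ranks."""
    if full_output_size % tp_size != 0:
        raise ValueError("output size must divide evenly across ranks")
    return full_output_size // tp_size


def marlin_shard_ok(full_output_size: int, tp_size: int,
                    min_thread_n: int = MIN_THREAD_N) -> bool:
    """True if each rank's slice is a non-zero multiple of min_thread_n."""
    per_rank = shard_output_size(full_output_size, tp_size)
    return per_rank >= min_thread_n and per_rank % min_thread_n == 0


# Hypothetical layer with a full output width of 128:
print(marlin_shard_ok(128, 4))  # sharded four ways, each rank gets 32
print(marlin_shard_ok(128, 1))  # replicated, every rank keeps all 128
```

Replicating the layer is equivalent to a shard count of 1 for that layer, which is why the patch gets past the Marlin check at the cost of storing the full weight on every node.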

Cross-Check with Working Setup

I compared my setup against the confirmed 37 tok/s blog post in detail:

Parameter	Blog (Working)	My Cluster
Hardware	4x Asus Ascent (GB10)	4x Acer Veriton GN100 (GB10)
Driver	580	580.95.05 / 580.142
NCCL busbw	23.89 GB/s	22.44 GB/s
Swap	~23 GB, swappiness=1	24 GB, swappiness=1
vLLM version	around v0.16.1rc1 (March 9 era)	v0.19.1rc1
gpu_memory_utilization	0.78	0.78
max_model_len	32768	32768

The biggest difference is the vLLM version. The blog setup was built around March 9, in the v0.16.1rc1 era; my build is v0.19.1rc1. I suspect a loading regression in the newer versions, but I can’t fall back to an older one: my v0.17.0 build failed because the Qwen3.5 architecture support (gdn_linear_attn.py) only exists in v0.18.2+. So far I haven’t found a version that both supports Qwen3.5 and avoids the deadlock.
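The version constraint can be written as a quick pre-flight check before spending time on a build. This sketch takes the v0.18.2 threshold from the failed v0.17.0 attempt above; the function names are hypothetical, and release-candidate suffixes are ignored because the v0.18.2rc1 nightly did get as far as loading the model.

```python
import re

MIN_QWEN35_RELEASE = (0, 18, 2)  # threshold reported in the post


def release_tuple(version: str) -> tuple:
    """Parse 'v0.19.1rc1' or '0.18.2' into (major, minor, patch), dropping any suffix."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def supports_qwen35(version: str) -> bool:
    """True if the release number is at or above the reported threshold."""
    return release_tuple(version) >= MIN_QWEN35_RELEASE


for tag in ("v0.16.1rc1", "v0.17.0", "v0.18.2rc1", "v0.19.1rc1"):
    print(tag, supports_qwen35(tag))
```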

My Questions

For those who have 397B working on 4-node Spark: What exact vLLM version/image are you using? Can you share the Docker image tag or commit hash?
Has anyone seen the layer 57 deadlock? Is this a known vLLM regression between v0.16 and v0.19?
Would the spark-arena nightly (the dgx-vllm-eugr-nightly-tf5 package on GitHub) work? I tried it but got the same deadlock.
Has anyone tried the “Heretic” int4 quant on 4-node Spark? Reportedly faster than Intel AutoRound.

Bonus: GLM-5 SGLang Findings (Community Reference)

While waiting for 397B answers, I also attempted GLM-5 on the same cluster. Sharing these findings since I haven’t seen them documented anywhere:

vLLM is dead for GLM-5 on SM121 — use_sparse=True + use_mla=True has no working attention backend (v0.19.0 qk_nope fix is insufficient)
SGLang via scitrera/dgx-spark-sglang:0.5.8-t5 partially works:
- glm_moe_dsa architecture recognized
- AWQ Marlin kernel selected
- NSA attention backend found — sparse attention IS supported in this image
- NCCL distributed init succeeded across 4 nodes
- Must use --model-impl transformers (default dispatch misroutes to DeepSeek V2 loader — head_size mismatch 576 vs 2048)
- Must set GLOO_SOCKET_IFNAME=enp1s0f1np1 (hostname resolves to 127.0.0.1 in /etc/hosts)
- BLOCKED: TransformersForCausalLM generic wrapper OOMs during weight loading (loads full partition into CPU RAM before GPU sharding — 98 GB/node with only ~11 GB headroom)
- Needs either a dedicated SGLang glm_moe_dsa module (streaming weights) or more nodes
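The GLOO_SOCKET_IFNAME requirement above comes from the hostname resolving to a loopback address. A minimal sketch for checking this on each node before launching, using only the Python standard library; the function names are hypothetical:

```python
import ipaddress
import socket


def is_loopback(addr: str) -> bool:
    """True for 127.0.0.0/8 (or ::1), the range /etc/hosts often maps the hostname to."""
    return ipaddress.ip_address(addr).is_loopback


def hostname_resolves_to_loopback(hostname=None) -> bool:
    """Resolve this node's hostname the way a naive rendezvous would and test the result."""
    name = hostname or socket.gethostname()
    return is_loopback(socket.gethostbyname(name))
```

If hostname_resolves_to_loopback() returns True on a node, other nodes cannot reach the address it advertises, so pinning the interface (enp1s0f1np1 on this cluster) or fixing /etc/hosts is needed.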

Happy to share detailed logs, configs, or the ReplicatedLinear patch script if useful.

Thanks in advance for any pointers.
