Qwen3.5-397B Loading Deadlock on 4-Node Spark Cluster — 23 Attempts, All Failed (+ GLM-5 SGLang Findings)
Hi everyone,
I’ve been trying to get Qwen3.5-397B-A17B-int4-AutoRound running on my 4-node DGX Spark cluster. After 23 systematic attempts across two full sessions, I’m stuck on a deterministic loading deadlock and would appreciate any guidance from those who have this working.
I’m also sharing some GLM-5 findings via SGLang that might be useful to the community.
Hardware
| Component | Details |
|---|---|
| Nodes | 4x Acer Veriton GN100 (GB10, SM121) |
| RAM | 128 GB per node (512 GB total) |
| Driver | Node 1: 580.95.05, Nodes 2-4: 580.142 |
| Interconnect | MikroTik CRS812-8DS-2DQ-2DDQ, 200G RoCEv2 (QSFP) |
| MTU | 9000 |
| NCCL busbw | 22.44 GB/s peak (4-node all_reduce) |
| Swap | 24 GB per node, swappiness=1 |
What Works
- The cluster itself is healthy: NCCL benchmarks at 22.44 GB/s, RDMA verified at 108 Gb/s
- The ReplicatedLinear Marlin TP=4 patch applies successfully (Python-based, not the stale recipe mod)
- MiniMax M2.5 AWQ serves fine at 24.4 tok/s on 2 nodes (vLLM v0.19.1rc1)
- Qwen3.5-35B-A3B FP8 runs at 30.57 tok/s on 4 nodes
The Problem: Qwen3.5-397B Loading Deadlock
The model loads weights up to approximately layer 57 (of 96), then all worker threads enter futex_wait and never recover. The API never starts. No error message, no OOM, no timeout — just a permanent hang.
This happens identically across:
- vLLM v0.19.1rc1 (spark-vllm-docker build)
- spark-arena nightly (v0.18.2rc1)
- Both with and without Ray
- Both TP=4 and PP=3 configurations
- With and without `--enforce-eager`
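For anyone who wants to check whether their own hang matches the futex_wait state described above, here is a minimal sketch that counts the kernel wait channel of every thread in a process by reading `/proc/<pid>/task/<tid>/wchan`. The function names and the `proc_root` parameter are my own, not part of vLLM. The exact symbol name (for example `futex_wait_queue`) varies by kernel version, and reading `wchan` may require root or the same user as the target process.

```python
from collections import Counter
from pathlib import Path


def thread_wait_channels(pid: int, proc_root: str = "/proc") -> Counter:
    """Count the kernel wait channel (wchan) of every thread in a process."""
    counts: Counter = Counter()
    task_dir = Path(proc_root) / str(pid) / "task"
    for tid_dir in task_dir.iterdir():
        try:
            wchan = (tid_dir / "wchan").read_text().strip()
        except OSError:
            # Thread exited between listing and reading, or access was denied.
            continue
        # An empty value is treated like "0" (thread is not blocked in the kernel).
        counts[wchan or "0"] += 1
    return counts


def looks_deadlocked(counts: Counter, threshold: float = 1.0) -> bool:
    """True if at least `threshold` (a fraction) of the threads wait on a futex."""
    total = sum(counts.values())
    futex = sum(n for name, n in counts.items() if "futex" in name)
    return total > 0 and futex / total >= threshold
```

Running `thread_wait_channels(worker_pid)` twice a few minutes apart on each node should show whether the picture is static. A futex wait by itself is normal for idle threads, so it only suggests a deadlock when every thread stays parked and no loading progress is logged.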
All 23 Attempts (Summarized)
Session 1 (2026-04-03) — 10 attempts:
| # | Config | Result |
|---|---|---|
| 1 | TP=4, vanilla vllm serve | Marlin partition error (output_size=32 < min_thread_n=64) |
| 2 | TP=4 + `--quantization gptq` | Same Marlin error |
| 3 | TP=4 + recipe mod fix-qwen35-tp4-marlin | Patch stale, doesn’t apply to current vLLM |
| 4 | TP=2 | OOM (exit 137) — 226 GB doesn’t fit 256 GB with overhead |
| 5 | TP=4 + ReplicatedLinear Python patch | NCCL/PyTorch timeout after ~34 min |
| 6 | TP=4 + ReplicatedLinear + NCCL timeout 7200s | Extended to 47 min, then another timeout |
| 7 | TP=4 + ReplicatedLinear + fastsafetensors | OOM (exit 137) |
| 8 | TP=4 + fastsafetensors + gpu-mem 0.65 | Still OOM or timeout |
| 9 | PP=3 + Ray + fastsafetensors (recipe) | Mods stale, container died |
| 10 | PP=3 + Ray + fastsafetensors (manual) | Container died during loading |
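The TP=2 OOM in attempt 4 is consistent with simple arithmetic. The sketch below uses only figures from this post (226 GB of weights, 128 GB per node) and assumes perfectly even sharding plus a `gpu_memory_utilization` of 0.78. Both are idealizations: attempt 4 may have run with a different setting, and replicated layers, KV cache, and OS overhead are ignored.

```python
MODEL_GB = 226.0      # checkpoint size reported in this post
NODE_RAM_GB = 128.0   # RAM per node
GPU_MEM_UTIL = 0.78   # assumed gpu_memory_utilization


def per_node_weights_gb(model_gb: float, nodes: int) -> float:
    """Weight share per node, assuming perfectly even sharding."""
    return model_gb / nodes


def fits(model_gb: float, nodes: int,
         node_ram_gb: float = NODE_RAM_GB, util: float = GPU_MEM_UTIL):
    """Return (fits, share_gb, budget_gb) for a given node count."""
    budget = node_ram_gb * util
    share = per_node_weights_gb(model_gb, nodes)
    return share <= budget, share, budget


for n in (2, 4):
    ok, share, budget = fits(MODEL_GB, n)
    verdict = "fits" if ok else "does not fit"
    print(f"{n} nodes: {share:.1f} GB weights vs {budget:.1f} GB budget -> {verdict}")
# 2 nodes: 113.0 GB weights vs 99.8 GB budget -> does not fit
# 4 nodes: 56.5 GB weights vs 99.8 GB budget -> fits
```

Even with the utilization cap ignored, 113 GB of weights on a 128 GB node leaves about 15 GB for everything else, so only the 4-node layouts are realistic.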
Session 2 (2026-04-04) — 13 attempts:
| # | Config | Image | Result |
|---|---|---|---|
| 1 | PP=3 + fastsafetensors | vllm-node-tf5 v0.19.1rc1 | NCCL timeout at 600s during tensor broadcast |
| 2 | PP=3 + no fastsafetensors | vllm-node-tf5 v0.19.1rc1 | Deadlock — all threads futex_wait after 30 min |
| 3 | PP=3 + enforce-eager | spark-arena nightly | Same deadlock |
| 4 | PP=3 + --no-ray | spark-arena nightly v0.18.2rc1 | Same deadlock |
| 5 | TP=4 + Marlin patch (blog config) | vllm-node-tf5 v0.19.1rc1 | Loaded to layer 57, then deadlock |
| 6 | TP=4 + swap fix (24GB, swappiness=1) | vllm-node-tf5 v0.19.1rc1 | Same layer 57 deadlock |
| 7 | TP=4 + cache clearing | vllm-node-tf5 v0.19.1rc1 | Same layer 57 deadlock |
| 8 | TP=4 + sonusflow fork mods | vllm-node-tf5 v0.19.1rc1 | Cluster startup timeout |
| 9 | TP=4 + Python-based Marlin only | vllm-node-tf5 v0.19.1rc1 | Cluster startup timeout (stale sglang containers) |
| 10 | TP=4 + eugr/main + Python mod | vllm-node-tf5 v0.19.1rc1 | Same layer 57 deadlock |
| 11 | Build vLLM v0.17.0 (blog author’s era) | sonusflow Dockerfile | gdn_linear_attn.py not found — Qwen3.5 unsupported pre-v0.18.2 |
| 12 | SGLang spark image | lmsysorg/sglang:spark v0.5.4 | transformers 4.57.1 too old for qwen3_5_moe |
| 13 | SGLang + transformers 5.5.0 | lmsysorg/sglang:spark | Breaks SGLang DeepseekVL2Config dataclass |
What I’ve Ruled Out
- Swap: increased to 24 GB, swappiness=1 on all nodes. Didn’t fix.
- NCCL bandwidth: 22.44 GB/s (vs blog’s 23.89 GB/s). Healthy.
- Stale caches: cleared torch_compile_cache, flashinfer, triton on all nodes.
- Marlin TP=4: ReplicatedLinear patch applied and accepted by Marlin. Not the blocker.
- Ray vs no-Ray: both deadlock identically.
- PP=3 vs TP=4: both deadlock, just at different phases.
- Model distribution: verified 226 GB on all 4 nodes.
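For the model distribution check, matching total size alone can miss a truncated or missing shard. A minimal sketch of a slightly stricter comparison follows; the function name is hypothetical. It hashes the sorted list of relative paths and file sizes, not file contents, so it is quick even on 226 GB but will not catch same-size corruption.

```python
import hashlib
from pathlib import Path


def model_manifest(model_dir: str) -> tuple[int, str]:
    """Return (total_bytes, digest) over the sorted (relative path, size) pairs."""
    root = Path(model_dir)
    total = 0
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        size = path.stat().st_size  # follows symlinks, e.g. in a Hugging Face cache
        total += size
        rel = path.relative_to(root).as_posix()
        digest.update(f"{rel}\0{size}\n".encode())
    return total, digest.hexdigest()
```

Running this on every node and comparing the resulting tuples shows whether all four copies have the same file list and sizes.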
Cross-Check with Working Setup
I compared my setup against the confirmed 37 tok/s blog post in detail:
| Parameter | Blog (Working) | My Cluster |
|---|---|---|
| Hardware | 4x Asus Ascent (GB10) | 4x Acer Veriton GN100 (GB10) |
| Driver | 580 | 580.95.05 / 580.142 |
| NCCL busbw | 23.89 GB/s | 22.44 GB/s |
| Swap | ~23 GB, swappiness=1 | 24 GB, swappiness=1 |
| vLLM version | v0.16.1rc1 context | v0.19.1rc1 |
| gpu_memory_utilization | 0.78 | 0.78 |
| max_model_len | 32768 | 32768 |
The biggest difference is the vLLM version. The blog setup was built on an older vLLM (around March 9, v0.16.1rc1 era); mine is v0.19.1rc1. I suspect a loading regression in the newer versions, but I can’t fall back to the old one because Qwen3.5 architecture support (gdn_linear_attn.py) only exists in v0.18.2+. So far I haven’t found a version that both supports Qwen3.5 and avoids the deadlock.
My Questions
- For those who have 397B working on 4-node Spark: What exact vLLM version/image are you using? Can you share the Docker image tag or commit hash?
- Has anyone seen the layer 57 deadlock? Is this a known vLLM regression between v0.16 and v0.19?
- Would the spark-arena nightly (GitHub package dgx-vllm-eugr-nightly-tf5) work? I tried it but got the same deadlock.
- Has anyone tried the “Heretic” int4 quant on 4-node Spark? Reportedly faster than Intel AutoRound.
Bonus: GLM-5 SGLang Findings (Community Reference)
While waiting for 397B answers, I also attempted GLM-5 on the same cluster. Sharing these findings since I haven’t seen them documented anywhere:
- vLLM is dead for GLM-5 on SM121: `use_sparse=True` + `use_mla=True` has no working attention backend (the v0.19.0 qk_nope fix is insufficient)
- SGLang via `scitrera/dgx-spark-sglang:0.5.8-t5` partially works:
  - `glm_moe_dsa` architecture recognized
  - AWQ Marlin kernel selected
  - NSA attention backend found; sparse attention IS supported in this image
  - NCCL distributed init succeeded across 4 nodes
  - Must use `--model-impl transformers` (default dispatch misroutes to DeepSeek V2 loader, causing a head_size mismatch of 576 vs 2048)
  - Must set `GLOO_SOCKET_IFNAME=enp1s0f1np1` (hostname resolves to 127.0.0.1 in /etc/hosts)
  - BLOCKED: TransformersForCausalLM generic wrapper OOMs during weight loading (it loads the full partition into CPU RAM before GPU sharding: 98 GB/node with only ~11 GB headroom)
  - Needs either a dedicated SGLang `glm_moe_dsa` module (streaming weights) or more nodes
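The `GLOO_SOCKET_IFNAME` requirement comes from the hostname resolving to 127.0.0.1. A minimal sketch for checking whether a node has that problem follows; the helper names are hypothetical and it uses only the Python standard library.

```python
import ipaddress
import socket


def resolved_addresses(hostname: str) -> set[str]:
    """All IP addresses the system resolver returns for hostname."""
    infos = socket.getaddrinfo(hostname, None)
    return {info[4][0] for info in infos}


def resolves_to_loopback(hostname: str) -> bool:
    """True if any resolved address is a loopback address (127.0.0.0/8 or ::1)."""
    for addr in resolved_addresses(hostname):
        # Drop an IPv6 zone suffix such as "%eth0" before parsing.
        if ipaddress.ip_address(addr.split("%")[0]).is_loopback:
            return True
    return False
```

Calling `resolves_to_loopback(socket.gethostname())` on each node returns `True` where /etc/hosts maps the hostname to loopback, which is the case where pinning the interface explicitly was necessary here.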
Happy to share detailed logs, configs, or the ReplicatedLinear patch script if useful.
Thanks in advance for any pointers.