Qwen3.5-397B-A17B-int4-AutoRound - 4 x GB10 nodes - updated results: 37 - 94 tok/s

Qwen3.5-397B Loading Deadlock on 4-Node Spark Cluster — 23 Attempts, All Failed (+ GLM-5 SGLang Findings)

Hi everyone,

I’ve been trying to get Qwen3.5-397B-A17B-int4-AutoRound running on my 4-node DGX Spark cluster. After 23 systematic attempts across two full sessions, I’m stuck on a deterministic loading deadlock and would appreciate any guidance from those who have this working.

I’m also sharing some GLM-5 findings via SGLang that might be useful to the community.

Hardware

| Component | Details |
|---|---|
| Nodes | 4x Acer Veriton GN100 (GB10, SM121) |
| RAM | 128 GB per node (512 GB total) |
| Driver | Node 1: 580.95.05, Nodes 2-4: 580.142 |
| Interconnect | MikroTik CRS812-8DS-2DQ-2DDQ, 200G RoCEv2 (QSFP) |
| MTU | 9000 |
| NCCL busbw | 22.44 GB/s peak (4-node all_reduce) |
| Swap | 24 GB per node, swappiness=1 |

What Works

  • The cluster itself is healthy — NCCL benchmarks at 22.44 GB/s, RDMA verified at 108 Gb/s

  • The ReplicatedLinear Marlin TP=4 patch applies successfully (Python-based, not the stale recipe mod)

  • MiniMax M2.5 AWQ serves fine at 24.4 tok/s on 2 nodes (vLLM v0.19.1rc1)

  • Qwen3.5-35B-A3B FP8 runs at 30.57 tok/s on 4 nodes

The Problem: Qwen3.5-397B Loading Deadlock

The model loads weights up to approximately layer 57 (of 96), then all worker threads enter futex_wait and never recover. The API never starts. No error message, no OOM, no timeout — just a permanent hang.

This happens identically across:

  • vLLM v0.19.1rc1 (spark-vllm-docker build)

  • spark-arena nightly (v0.18.2rc1)

  • Both with and without Ray

  • Both TP=4 and PP=3 configurations

  • With and without --enforce-eager
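For anyone who wants to check for the same symptom on their own cluster, here is a minimal sketch that tallies what each thread of a process is blocked on, using the Linux `/proc/<pid>/task/<tid>/wchan` files. The function names and the `proc_root` parameter are illustrative assumptions, not part of vLLM, and reading another user's process may need root. Idle threads in a healthy process also park in futex waits, so a positive result is a hint, not proof of a deadlock.

```python
from collections import Counter
from pathlib import Path


def thread_wait_channels(pid: int, proc_root: str = "/proc") -> Counter:
    """Count the kernel wait channel (wchan) of every thread in a process."""
    counts = Counter()
    for task in Path(proc_root, str(pid), "task").iterdir():
        try:
            name = (task / "wchan").read_text().strip()
        except OSError:
            # The thread exited while we were scanning, or access was denied.
            continue
        # An empty value or "0" means the thread is not blocked in the kernel.
        counts[name if name not in ("", "0") else "running"] += 1
    return counts


def all_threads_in_futex(counts: Counter) -> bool:
    """True if at least one thread was seen and every one sits in a futex wait."""
    total = sum(counts.values())
    futex = sum(n for name, n in counts.items() if "futex" in name)
    return total > 0 and futex == total
```

Calling `thread_wait_channels(pid)` on a worker PID a few times, a minute apart, and getting the same all-futex tally each time would match the hang described above.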

All 23 Attempts (Summarized)

Session 1 (2026-04-03) — 10 attempts:

| # | Config | Result |
|---|---|---|
| 1 | TP=4, vanilla vllm serve | Marlin partition error (output_size=32 < min_thread_n=64) |
| 2 | TP=4 + --quantization gptq | Same Marlin error |
| 3 | TP=4 + recipe mod fix-qwen35-tp4-marlin | Patch stale, doesn’t apply to current vLLM |
| 4 | TP=2 | OOM (exit 137): 226 GB doesn’t fit in 256 GB with overhead |
| 5 | TP=4 + ReplicatedLinear Python patch | NCCL/PyTorch timeout after ~34 min |
| 6 | TP=4 + ReplicatedLinear + NCCL timeout 7200s | Extended to 47 min, then another timeout |
| 7 | TP=4 + ReplicatedLinear + fastsafetensors | OOM (exit 137) |
| 8 | TP=4 + fastsafetensors + gpu-mem 0.65 | Still OOM or timeout |
| 9 | PP=3 + Ray + fastsafetensors (recipe) | Mods stale, container died |
| 10 | PP=3 + Ray + fastsafetensors (manual) | Container died during loading |
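The Marlin error in attempts 1 and 2 is a shape check: after tensor-parallel sharding, the per-rank output width has to reach the kernel's minimum tile width. Below is a minimal sketch of that arithmetic, assuming even column sharding and taking `min_thread_n=64` from the error message. The 128-wide layer is a hypothetical value inferred from 32 x 4, and `marlin_partition_ok` is an illustrative name, not vLLM's actual function.

```python
MIN_THREAD_N = 64  # value reported in the error message


def marlin_partition_ok(output_size: int, tp_size: int,
                        min_thread_n: int = MIN_THREAD_N) -> bool:
    """Check whether an evenly sharded layer still satisfies the tile width."""
    if output_size % tp_size != 0:
        return False  # cannot be split evenly across ranks
    per_rank = output_size // tp_size
    # Each rank's slice must be a non-zero multiple of the minimum tile width.
    return per_rank >= min_thread_n and per_rank % min_thread_n == 0


# Hypothetical 128-wide layer: sharded 4 ways it drops to 32 columns per rank.
print(marlin_partition_ok(128, tp_size=4))  # False: 32 < 64
# Kept whole on every rank (the idea behind replicating the layer).
print(marlin_partition_ok(128, tp_size=1))  # True
```

Under these assumptions, keeping the narrow layer unsharded on every rank sidesteps the check, which is presumably why the ReplicatedLinear patch gets past the Marlin error.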

Session 2 (2026-04-04) — 13 attempts:

| # | Config | Image | Result |
|---|---|---|---|
| 1 | PP=3 + fastsafetensors | vllm-node-tf5 v0.19.1rc1 | NCCL timeout at 600s during tensor broadcast |
| 2 | PP=3 + no fastsafetensors | vllm-node-tf5 v0.19.1rc1 | Deadlock: all threads in futex_wait after 30 min |
| 3 | PP=3 + enforce-eager | spark-arena nightly | Same deadlock |
| 4 | PP=3 + --no-ray | spark-arena nightly v0.18.2rc1 | Same deadlock |
| 5 | TP=4 + Marlin patch (blog config) | vllm-node-tf5 v0.19.1rc1 | Loaded to layer 57, then deadlock |
| 6 | TP=4 + swap fix (24GB, swappiness=1) | vllm-node-tf5 v0.19.1rc1 | Same layer 57 deadlock |
| 7 | TP=4 + cache clearing | vllm-node-tf5 v0.19.1rc1 | Same layer 57 deadlock |
| 8 | TP=4 + sonusflow fork mods | vllm-node-tf5 v0.19.1rc1 | Cluster startup timeout |
| 9 | TP=4 + Python-based Marlin only | vllm-node-tf5 v0.19.1rc1 | Cluster startup timeout (stale sglang containers) |
| 10 | TP=4 + eugr/main + Python mod | vllm-node-tf5 v0.19.1rc1 | Same layer 57 deadlock |
| 11 | Build vLLM v0.17.0 (blog author’s era) | sonusflow Dockerfile | gdn_linear_attn.py not found; Qwen3.5 unsupported pre-v0.18.2 |
| 12 | SGLang spark image | lmsysorg/sglang:spark v0.5.4 | transformers 4.57.1 too old for qwen3_5_moe |
| 13 | SGLang + transformers 5.5.0 | lmsysorg/sglang:spark | Breaks SGLang DeepseekVL2Config dataclass |

What I’ve Ruled Out

  • Swap — increased to 24 GB, swappiness=1 on all nodes. Didn’t fix.

  • NCCL bandwidth — 22.44 GB/s (vs blog’s 23.89 GB/s). Healthy.

  • Stale caches — cleared torch_compile_cache, flashinfer, triton on all nodes.

  • Marlin TP=4 — ReplicatedLinear patch applied and accepted by Marlin. Not the blocker.

  • Ray vs no-Ray — both deadlock identically.

  • PP=3 vs TP=4 — both deadlock, just at different phases.

  • Model distribution — verified 226 GB on all 4 nodes.
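A back-of-the-envelope check of the 226 GB figure against per-node memory is sketched below. It assumes the weights split evenly across nodes (replicated layers would push the real per-node share higher) and that the usable budget is simply RAM times the 0.78 gpu_memory_utilization from my config. KV cache, activations, and loader overhead are ignored, so treat it as a rough lower bound; `per_node_budget` is an illustrative helper, not a vLLM function.

```python
def per_node_budget(model_gb: float, nodes: int,
                    ram_per_node_gb: float = 128.0,
                    gpu_memory_utilization: float = 0.78):
    """Return (weight shard, memory budget, headroom) per node, all in GB."""
    shard = model_gb / nodes
    budget = ram_per_node_gb * gpu_memory_utilization
    return shard, budget, budget - shard


for nodes in (2, 4):
    shard, budget, headroom = per_node_budget(226.0, nodes)
    print(f"{nodes} nodes: shard {shard:.2f} GB, "
          f"budget {budget:.2f} GB, headroom {headroom:+.2f} GB")
# 2 nodes: 113 GB of weights against a ~99.84 GB budget, so the shard does not fit.
# 4 nodes: 56.5 GB of weights against the same budget, leaving ~43 GB on paper.
```

By this crude estimate the 2-node OOM is expected, while the 4-node hang is not simply a case of the weights failing to fit.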

Cross-Check with Working Setup

I compared my setup against the confirmed 37 tok/s blog post in detail:

| Parameter | Blog (Working) | My Cluster |
|---|---|---|
| Hardware | 4x Asus Ascent (GB10) | 4x Acer Veriton GN100 (GB10) |
| Driver | 580 | 580.95.05 / 580.142 |
| NCCL busbw | 23.89 GB/s | 22.44 GB/s |
| Swap | ~23 GB, swappiness=1 | 24 GB, swappiness=1 |
| vLLM version | v0.16.1rc1 era (approximate) | v0.19.1rc1 |
| gpu_memory_utilization | 0.78 | 0.78 |
| max_model_len | 32768 | 32768 |

The biggest difference is the vLLM version. The blog’s build dates from around March 9 (the v0.16.1rc1 era); mine is v0.19.1rc1. I suspect a loading regression in the newer version, but I can’t build the old one: Qwen3.5 architecture support (gdn_linear_attn.py) only exists in v0.18.2+, which is why my v0.17.0 build failed. So far I have found no version that both supports Qwen3.5 and avoids the deadlock.

My Questions

  1. For those who have 397B working on 4-node Spark: What exact vLLM version/image are you using? Can you share the Docker image tag or commit hash?

  2. Has anyone seen the layer 57 deadlock? Is this a known vLLM regression between v0.16 and v0.19?

  3. Should the spark-arena nightly (the dgx-vllm-eugr-nightly-tf5 package on GitHub) work for this model? I tried it and got the same deadlock.

  4. Has anyone tried the “Heretic” int4 quant on 4-node Spark? Reportedly faster than Intel AutoRound.

Bonus: GLM-5 SGLang Findings (Community Reference)

While waiting for 397B answers, I also attempted GLM-5 on the same cluster. Sharing these findings since I haven’t seen them documented anywhere:

  • vLLM is dead for GLM-5 on SM121: use_sparse=True + use_mla=True has no working attention backend (the v0.19.0 qk_nope fix is insufficient)

  • SGLang via scitrera/dgx-spark-sglang:0.5.8-t5 partially works:

    • glm_moe_dsa architecture recognized

    • AWQ Marlin kernel selected

    • NSA attention backend found — sparse attention IS supported in this image

    • NCCL distributed init succeeded across 4 nodes

    • Must use --model-impl transformers (default dispatch misroutes to DeepSeek V2 loader — head_size mismatch 576 vs 2048)

    • Must set GLOO_SOCKET_IFNAME=enp1s0f1np1 (hostname resolves to 127.0.0.1 in /etc/hosts)

    • BLOCKED: TransformersForCausalLM generic wrapper OOMs during weight loading (loads full partition into CPU RAM before GPU sharding — 98 GB/node with only ~11 GB headroom)

    • Needs either a dedicated SGLang glm_moe_dsa module (streaming weights) or more nodes
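The GLOO_SOCKET_IFNAME item above comes down to the node's hostname resolving to a loopback address, which other nodes cannot reach. A minimal sketch for checking a node before launch, using only the Python standard library; the function names are illustrative, and the interface to pin (enp1s0f1np1 in my case) will differ per system.

```python
import ipaddress
import socket
from typing import Optional


def is_loopback(address: str) -> bool:
    """True for any address in 127.0.0.0/8 or for ::1."""
    return ipaddress.ip_address(address).is_loopback


def hostname_resolves_to_loopback(hostname: Optional[str] = None) -> bool:
    """Resolve this node's hostname (or the one given) and test the result."""
    name = hostname or socket.gethostname()
    return is_loopback(socket.gethostbyname(name))
```

If `hostname_resolves_to_loopback()` returns True on a node, pinning the interface through GLOO_SOCKET_IFNAME (or fixing the /etc/hosts entry) is worth trying before launching.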

Happy to share detailed logs, configs, or the ReplicatedLinear patch script if useful.

Thanks in advance for any pointers.