Qwen3.5-397B-A17B-int4-AutoRound - 4 x db10 node - updated results 37 - 94 tok/s

Test done on node of 4x DB10 (Ascent)

model test t/s peak t/s ttfr (ms) est_ppt (ms) e2e_ttft (ms)
Intel/Qwen3.5-397B-A17B-int4-AutoRound pp512 1400.85 ± 38.57 367.95 ± 10.29 366.49 ± 10.29 367.99 ± 10.29
Intel/Qwen3.5-397B-A17B-int4-AutoRound tg32 20.95 ± 0.02 21.67 ± 0.47
Intel/Qwen3.5-397B-A17B-int4-AutoRound pp512 1404.28 ± 5.02 366.77 ± 1.30 365.32 ± 1.30 366.82 ± 1.30
Intel/Qwen3.5-397B-A17B-int4-AutoRound tg128 20.92 ± 0.04 22.00 ± 0.00
Intel/Qwen3.5-397B-A17B-int4-AutoRound pp1024 1809.12 ± 101.38 569.88 ± 33.10 568.43 ± 33.10 569.93 ± 33.11
Intel/Qwen3.5-397B-A17B-int4-AutoRound tg32 20.96 ± 0.02 21.67 ± 0.47
Intel/Qwen3.5-397B-A17B-int4-AutoRound pp1024 1898.55 ± 25.32 541.62 ± 7.48 540.16 ± 7.48 541.66 ± 7.48
Intel/Qwen3.5-397B-A17B-int4-AutoRound tg128 20.85 ± 0.21 21.67 ± 0.47
Intel/Qwen3.5-397B-A17B-int4-AutoRound pp2048 2263.06 ± 11.47 906.89 ± 4.59 905.44 ± 4.59 906.93 ± 4.59
Intel/Qwen3.5-397B-A17B-int4-AutoRound tg32 20.90 ± 0.05 21.33 ± 0.47
Intel/Qwen3.5-397B-A17B-int4-AutoRound pp2048 2206.67 ± 36.20 930.40 ± 15.21 928.95 ± 15.21 930.44 ± 15.21
Intel/Qwen3.5-397B-A17B-int4-AutoRound tg128 20.88 ± 0.03 21.00 ± 0.00
Intel/Qwen3.5-397B-A17B-int4-AutoRound pp4096 2222.44 ± 53.09 1845.84 ± 44.90 1844.39 ± 44.90 1845.89 ± 44.90
Intel/Qwen3.5-397B-A17B-int4-AutoRound tg32 20.88 ± 0.02 21.00 ± 0.00
Intel/Qwen3.5-397B-A17B-int4-AutoRound pp4096 2246.69 ± 13.07 1825.24 ± 10.60 1823.78 ± 10.60 1825.30 ± 10.60
Intel/Qwen3.5-397B-A17B-int4-AutoRound tg128 20.87 ± 0.06 21.33 ± 0.47
Intel/Qwen3.5-397B-A17B-int4-AutoRound pp8192 2348.15 ± 2.43 3490.73 ± 3.49 3489.28 ± 3.49 3490.80 ± 3.49
Intel/Qwen3.5-397B-A17B-int4-AutoRound tg32 20.87 ± 0.04 21.00 ± 0.00
Intel/Qwen3.5-397B-A17B-int4-AutoRound pp8192 2110.23 ± 238.95 3937.57 ± 474.11 3936.11 ± 474.11 3937.61 ± 474.11
Intel/Qwen3.5-397B-A17B-int4-AutoRound tg128 20.81 ± 0.02 21.00 ± 0.00

Don’t tempt me to buy 2 more of these things…

Ran this on a dual setup but it was just a hair to slow and just a bit too tight on memory for my use case. Was really impressed with the model quality though.

I have additional 4 of these things waiting in a box (8x cluster) waiting for cables 🙃

I will be working now to see how much can 4x be pushed as to the performance for larger models. Also 20 t/sec seems not huge number but I have to say that is actually ok to work with. I’ll be running some agents on it for next few days to see how performs.

pictures!

:)

and soon to be connected

thats a rig right there. thank you enjoy!

Updated results
Raw data attached: qwen35-397b-tp4-bench.txt

Qwen3.5-397B-A17B INT4 on 4x DB10 — Full Benchmark with Concurrency Scaling

Setup

  • Hardware: 4x Asus Ascent (GB10, 128GB unified memory each, 512GB total)
  • Interconnect: MikroTik CRS812 QSFP-DD switch, 100G RoCEv2 fabric (MTU 9000)
  • Model: Intel/Qwen3.5-397B-A17B-int4-AutoRound (GPTQ INT4, ~199GB)
  • Runtime: vLLM v0.16.1rc1 (from nvcr.io/nvidia/pytorch:26.01-py3)
  • Tensor Parallel: TP=4 across all 4 nodes via Ray
  • KV Cache: fp8, 53.8 GiB per node (215 GiB total)
  • Context: 32K max, 8192 max batched tokens
  • Compilation: torch.compile + CUDAGraphs (64s one-time warmup)
  • Prefix Caching: Enabled
  • NCCL: v2.29.2, RoCEv2, FlashInfer attention backend
  • Benchmark tool: llama-benchy v0.3.4

Marlin TP=4 Fix

TP=4 requires a patch for the Marlin kernel — in_proj_ba layers in the linear attention (GDN) blocks have output_size=128, which becomes 32 when split across 4 GPUs, violating Marlin’s MIN_THREAD_N=64. We replace these with ReplicatedLinear (each GPU keeps the full weight) and manually slice the output. Patch available at github.com/sonusflow/spark-vllm-docker under mods/fix-qwen35-tp4-marlin.


Generation Speed — Single User (c1)

Rock-solid 37 tok/s regardless of prompt or generation length. Peak 39 tok/s.

Prompt tg32 (tok/s) tg128 (tok/s) tg512 (tok/s) Peak tok/s
pp512 35.96 36.07 36.98 39.00
pp1024 35.97 36.72 36.95 38.60
pp2048 35.76 36.76 37.10 38.40
pp4096 37.12 36.88 37.14 38.32
pp8192 35.61 36.36 36.35 38.00
pp16384 37.01 35.86 36.16 38.21

Generation speed does not degrade with longer prompts or longer outputs. The model sustains 36-37 tok/s even at 16K prompt + 512 token generation.


Concurrency Scaling — Total Throughput

Total cluster throughput scales well with concurrent users:

Prompt c1 total c2 total c4 total c4 peak
tg32 37 63 87-90 117
tg128 37 59-61 74-90 112
tg512 37 56-60 80-94 121

At 4 concurrent users, the cluster delivers up to 94 tok/s total throughput (2.5x single-user), with peak bursts hitting 121 tok/s.


Concurrency Scaling — Per-User Experience

Per-request speed degrades gracefully under load:

Concurrency tg128 avg (tok/s) tg512 avg (tok/s) Relative to c1
c1 36.4 37.0 100%
c2 29.9 29.4 ~80%
c4 21.0 21.3 ~57%

Even at 4 concurrent users, each gets 21+ tok/s — still faster than GPT-4o streaming.


Prefill Throughput

Prompt processing scales with length up to ~2048 tokens, then plateaus around 2,200-2,500 tok/s:

Prompt Length c1 (tok/s) c2 total (tok/s) c4 total (tok/s)
pp512 1,750 1,670 1,830
pp1024 2,120 2,160 2,085
pp2048 2,350 2,250 2,270
pp4096 2,220 2,190 2,190
pp8192 2,370 2,300 2,120
pp16384 2,190 2,260 2,270

Prefill throughput stays remarkably consistent even at 16K tokens with 4 concurrent users.


Time to First Token (TTFT)

This is where concurrency + long prompts hit hardest:

Prompt c1 c2 c4
pp512 0.4s 0.6s 0.9s
pp1024 0.6s 0.9s 1.7s
pp2048 1.0s 1.7s 2.8s
pp4096 1.9s 3.3s 6.3s
pp8192 3.6s 6.2s 12.0s
pp16384 7.5s 13.1s 20.5s

Single-user TTFT is excellent — under 1 second for prompts up to 1K tokens, under 4 seconds at 8K. At 4 concurrent users with 16K prompts, TTFT reaches 20 seconds as prefill requests queue up.


Thermal Profile Under Load

All 4 nodes monitored during the full benchmark run (90+ minutes of sustained inference):

Node GPU Avg GPU Range Power Avg CPU Peak Max Status
Spark 1 (head) 73°C 73-75°C 34.1W 90°C OK
Spark 2 72°C 71-76°C 35.0W 95°C WARM
Spark 3 72°C 69-76°C 33.4W 87°C OK
Spark 4 68°C 67-69°C 31.0W 89°C COOL
  • Total cluster power: ~134W (all 4 GPUs combined)
  • Spark 2 hit 95°C CPU peak once — brief, near throttle but recovered
  • Spark 4 consistently coolest — better airflow/positioning
  • All GPUs stable at 67-76°C — well within safe operating range

Before/After: enforce-eager vs torch.compile (same hardware, same TP=4)

enforce-eager torch.compile Improvement
Generation (tg128, c1) 20.9 tok/s 36.7 tok/s +76%
Peak throughput (c1) 22.0 tok/s 39.0 tok/s +77%
Peak throughput (c4) 121 tok/s
Prefill (pp2048, c1) 2,263 tok/s 2,463 tok/s +9%
Available KV cache 38.67 GiB/node 53.8 GiB/node +39%
Startup overhead None +64s one-time Cached after first run

Key Findings

  1. torch.compile is essential on DB10 — 77% generation speedup, 39% more KV cache. The 64-second one-time compile cost pays for itself on the first request.

  2. Single-user performance is remarkably consistent — 37 tok/s at pp512 and pp16384. Prompt length does not affect generation speed.

  3. Concurrency sweet spot is 2 users — 80% of single-user speed per request, nearly double the total throughput. Beyond 2, TTFT at long prompts becomes the bottleneck.

  4. 4-user total throughput peaks at 121 tok/s — the cluster handles burst load well, but per-user latency suffers at long contexts (20s TTFT at pp16384/c4).

  5. Power efficiency is exceptional — 134W total for a 397B parameter model serving 37 tok/s. That’s ~3.6W per tok/s.

  6. Thermals are not a concern — 90+ minutes of sustained benchmarking, all GPUs under 76°C, total power under 140W.

TLDR

4x DGX Spark running Qwen3.5-397B-A17B INT4 with torch.compile: 37 tok/s single-user, 94 tok/s at 4 concurrent users, 134W total power. Drop --enforce-eager — the 64-second compile time is worth every second.


Benchmark: llama-benchy v0.3.4 | pp: 512-16384 | tg: 32, 128, 512 | concurrency: 1, 2, 4 | 3 runs per test | prefix caching enabled
Raw data attached: qwen35-397b-tp4-bench

qwen35-397b-tp4-bench.txt (21.3 KB)

can you open a PR upstream to GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks · GitHub?

I was happy to find someone with a similar setup, so I’m sharing the results of what I tried in my own environment.
I can’t enable CUDAGraphs in my environment, but is it possible to enable it with sonusflow/spark-vllm-docker?

Hardware: DGXSpark, 3x ThinkStationPGX (GB10, 128GB unified memory each, 512GB total)
Interconnect: MikroTik CRS812 QSFP-DD switch, 100G RoCEv2 fabric (Version 7.21.1, MTU 9000)
Model: Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 (GPTQ INT4, 236GB)
Runtime: vLLM 0.16.0rc2.dev376+gf4af642a6.cu130 (from vllm/vllm-openai:qwen3_5-cu130)
Tensor Parallel: TP=4 across all 4 nodes via Ray
Context: 32K max, 8192 max batched tokens
Compilation: torch.compile (None CUDAGraphs)
Prefix Caching: Enabled
NCCL: v2.28.9, RoCEv2, FlashInfer attention backend
Benchmark tool: llama-benchy v0.3.4

| model                            |   test |              t/s |     peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |

|:---------------------------------|-------:|-----------------:|-------------:|------------------:|------------------:|------------------:|

| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  pp512 | 1192.88 ± 168.29 |              |   1593.46 ± 84.03 |    441.67 ± 84.03 |   1593.49 ± 84.03 |

| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |   tg32 |     10.78 ± 1.73 | 12.20 ± 0.40 |                   |                   |                   |

| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  pp512 | 1545.52 ± 344.10 |              |   1497.26 ± 62.76 |    345.47 ± 62.76 |   1497.29 ± 62.76 |

| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  tg128 |      9.80 ± 1.54 | 12.30 ± 0.46 |                   |                   |                   |

| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp1024 | 2899.55 ± 194.12 |              |   1506.90 ± 23.94 |    355.10 ± 23.94 |   1506.93 ± 23.94 |

| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |   tg32 |     10.74 ± 2.17 | 12.30 ± 0.46 |                   |                   |                   |

| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp1024 | 2922.06 ± 352.63 |              |   1509.91 ± 60.87 |    358.12 ± 60.87 |   1509.94 ± 60.87 |

| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  tg128 |      9.92 ± 1.49 | 12.30 ± 0.46 |                   |                   |                   |

| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp4096 | 1142.19 ± 123.79 |              |  4794.34 ± 515.87 |  3642.55 ± 515.87 |  4794.38 ± 515.87 |

| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |   tg32 |      9.15 ± 2.08 | 11.50 ± 0.50 |                   |                   |                   |

| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp4096 |  1176.59 ± 56.95 |              |  4642.33 ± 176.33 |  3490.54 ± 176.33 |  4642.38 ± 176.33 |

| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  tg128 |      9.20 ± 1.33 | 12.10 ± 0.30 |                   |                   |                   |

| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp8192 |  991.55 ± 108.10 |              | 9532.18 ± 1088.53 | 8380.39 ± 1088.53 | 9532.22 ± 1088.53 |

| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |   tg32 |     10.21 ± 1.01 | 11.50 ± 0.50 |                   |                   |                   |

| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp8192 |  1054.45 ± 35.75 |              |  8931.00 ± 275.43 |  7779.20 ± 275.43 |  8931.04 ± 275.43 |

| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  tg128 |     10.08 ± 0.87 | 12.20 ± 0.40 |                   |                   |                   |

If you’re measuring the power usage by just taking the nvidia-smi output, you’re not getting the full picture.

Do you have power monitoring at the wall for all 4 during load? I’d imagine it’s closer to 400-500w

@trystan1 - yes, this was a smi data only. We do have monitoring on them, and will report proper consumption some time in future. Now we are focusing on getting some optimised performance out of them which is not easy :)

We will be connecting 4 more units in coming days.

Nice Work! Do we have a rig pictures thread?

I have sent you DM. Due to a lot of releases recently we need to work out best solution to optimise the performance of 4x or 8x units. Once we are happy with results we will share.

Not sure :) You can set it up - will be nice to see how community make use of those boxes :)

I wanted to share what I’ve learned over the past week running Qwen3.5-397B-A17B (INT4 AutoRound, ~199GB) at TP=4 across 4 DGX Sparks, since some of these findings are pretty specific to the GB10 and might be useful for the community.

What’s working:

  • 37 tok/s single-user decode (peak 39) on Qwen3.5-397B at TP=4 with torch.compile + CUDAGraphs
  • Marlin INT4 GEMM kernels with a custom TP=4 fix for Qwen3.5’s GDN attention layers (upstream PR filed: vllm-project/vllm#35924)
  • FlashInfer attention backend on SM121
  • 200GbE RoCE fabric at 96% line rate (23.89 GB/s busbw on 4-node all_reduce)
  • vLLM v0.16.1rc1 on the eugr/spark-vllm-docker fork with a recipe system we built on top

Critical GB10-specific gotchas we discovered:

  1. Driver 580 ONLY. Driver 590 introduces a UMA memory leak (80-96 GiB not released after CUDA exit) and a CUDAGraph capture deadlock. Both are GB10/UMA-specific. NVIDIA forum reps confirmed 580 is the officially supported driver. The container’s CUDA 13.1 forward-compat layer on host driver 580 works perfectly — no need to match versions.

  2. gpu_memory_utilization is broken on unified memory. It works as a gate (crashes at 0.85 if exceeding profiled free) but NOT as a cap — values below the threshold all produce the same KV cache allocation because vLLM profiles the entire shared CPU/GPU pool. Docker cgroup memory limits also don’t work (CUDA UMA bypasses cgroups). Workaround: --num-gpu-blocks-override to directly control KV cache. This affects all Grace Blackwell platforms, not just Spark.

  3. NCCL auto-negotiate beats manual tuning inside vLLM. We did extensive nccl-tests benchmarking (Simple proto, 6 channels = optimal for CX7 over RoCE), but applying those settings to vLLM caused -8 to -15.7% regression. The NCCL autotuner makes better per-operation decisions when interleaved with compute kernels.

  4. torch.compile + CUDAGraphs = 77% speedup over enforce-eager on MoE. The GB10’s Grace ARM CPU is slower at Python/CUDA dispatch than x86, making the kernel launch overhead elimination even more impactful. But CUDAGraph capture needs swap headroom (~23GB swap configured, swappiness=1).

  5. VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8 breaks torch.compile on INT4 GPTQ models. It halves the compiled subgraphs. Only use with actual NVFP4/MXFP4 weights.

Qwen3.5-397B Loading Deadlock on 4-Node Spark Cluster — 23 Attempts, All Failed (+ GLM-5 SGLang Findings)

Hi everyone,

I’ve been trying to get Qwen3.5-397B-A17B-int4-AutoRound running on my 4-node DGX Spark cluster. After 23 systematic attempts across two full sessions, I’m stuck on a deterministic loading deadlock and would appreciate any guidance from those who have this working.

I’m also sharing some GLM-5 findings via SGLang that might be useful to the community.

Hardware

Component Details
Nodes 4x Acer Veriton GN100 (GB10, SM121)
RAM 128 GB per node (512 GB total)
Driver Node 1: 580.95.05, Nodes 2-4: 580.142
Interconnect MikroTik CRS812-8DS-2DQ-2DDQ, 200G RoCEv2 (QSFP)
MTU 9000
NCCL busbw 22.44 GB/s peak (4-node all_reduce)
Swap 24 GB per node, swappiness=1

What Works

  • The cluster itself is healthy — NCCL benchmarks at 22.44 GB/s, RDMA verified at 108 Gb/s

  • The ReplicatedLinear Marlin TP=4 patch applies successfully (Python-based, not the stale recipe mod)

  • MiniMax M2.5 AWQ serves fine at 24.4 tok/s on 2 nodes (vLLM v0.19.1rc1)

  • Qwen3.5-35B-A3B FP8 runs at 30.57 tok/s on 4 nodes

The Problem: Qwen3.5-397B Loading Deadlock

The model loads weights up to approximately layer 57 (of 96), then all worker threads enter futex_wait and never recover. The API never starts. No error message, no OOM, no timeout — just a permanent hang.

This happens identically across:

  • vLLM v0.19.1rc1 (spark-vllm-docker build)

  • spark-arena nightly (v0.18.2rc1)

  • Both with and without Ray

  • Both TP=4 and PP=3 configurations

  • With and without --enforce-eager

All 23 Attempts (Summarized)

Session 1 (2026-04-03) — 10 attempts:

# Config Result
1 TP=4, vanilla vllm serve Marlin partition error (output_size=32 < min_thread_n=64)
2 TP=4 + --quantization gptq Same Marlin error
3 TP=4 + recipe mod fix-qwen35-tp4-marlin Patch stale, doesn’t apply to current vLLM
4 TP=2 OOM (exit 137) — 226 GB doesn’t fit 256 GB with overhead
5 TP=4 + ReplicatedLinear Python patch NCCL/PyTorch timeout after ~34 min
6 TP=4 + ReplicatedLinear + NCCL timeout 7200s Extended to 47 min, then another timeout
7 TP=4 + ReplicatedLinear + fastsafetensors OOM (exit 137)
8 TP=4 + fastsafetensors + gpu-mem 0.65 Still OOM or timeout
9 PP=3 + Ray + fastsafetensors (recipe) Mods stale, container died
10 PP=3 + Ray + fastsafetensors (manual) Container died during loading

Session 2 (2026-04-04) — 13 attempts:

# Config Image Result
1 PP=3 + fastsafetensors vllm-node-tf5 v0.19.1rc1 NCCL timeout at 600s during tensor broadcast
2 PP=3 + no fastsafetensors vllm-node-tf5 v0.19.1rc1 Deadlock — all threads futex_wait after 30 min
3 PP=3 + enforce-eager spark-arena nightly Same deadlock
4 PP=3 + --no-ray spark-arena nightly v0.18.2rc1 Same deadlock
5 TP=4 + Marlin patch (blog config) vllm-node-tf5 v0.19.1rc1 Loaded to layer 57, then deadlock
6 TP=4 + swap fix (24GB, swappiness=1) vllm-node-tf5 v0.19.1rc1 Same layer 57 deadlock
7 TP=4 + cache clearing vllm-node-tf5 v0.19.1rc1 Same layer 57 deadlock
8 TP=4 + sonusflow fork mods vllm-node-tf5 v0.19.1rc1 Cluster startup timeout
9 TP=4 + Python-based Marlin only vllm-node-tf5 v0.19.1rc1 Cluster startup timeout (stale sglang containers)
10 TP=4 + eugr/main + Python mod vllm-node-tf5 v0.19.1rc1 Same layer 57 deadlock
11 Build vLLM v0.17.0 (blog author’s era) sonusflow Dockerfile gdn_linear_attn.py not found — Qwen3.5 unsupported pre-v0.18.2
12 SGLang spark image lmsysorg/sglang:spark v0.5.4 transformers 4.57.1 too old for qwen3_5_moe
13 SGLang + transformers 5.5.0 lmsysorg/sglang:spark Breaks SGLang DeepseekVL2Config dataclass

What I’ve Ruled Out

  • Swap — increased to 24 GB, swappiness=1 on all nodes. Didn’t fix.

  • NCCL bandwidth — 22.44 GB/s (vs blog’s 23.89 GB/s). Healthy.

  • Stale caches — cleared torch_compile_cache, flashinfer, triton on all nodes.

  • Marlin TP=4 — ReplicatedLinear patch applied and accepted by Marlin. Not the blocker.

  • Ray vs no-Ray — both deadlock identically.

  • PP=3 vs TP=4 — both deadlock, just at different phases.

  • Model distribution — verified 226 GB on all 4 nodes.

Cross-Check with Working Setup

I compared my setup against the confirmed 37 tok/s blog post in detail:

Parameter Blog (Working) My Cluster
Hardware 4x Asus Ascent (GB10) 4x Acer VGN100 (GB10)
Driver 580 580.95.05 / 580.142
NCCL busbw 23.89 GB/s 22.44 GB/s
Swap ~23 GB, swappiness=1 24 GB, swappiness=1
vLLM version v0.16.1rc1 context v0.19.1rc1
gpu_memory_utilization 0.78 0.78
max_model_len 32768 32768

The biggest difference is the vLLM version. The blog was built with an older vLLM (around March 9 / v0.16.1rc1 era). My build is v0.19.1rc1. I suspect a loading regression in the newer version, but I can’t build the old version because Qwen3.5 architecture support (gdn_linear_attn.py) only exists in v0.18.2+. There’s no version that both supports Qwen3.5 AND avoids the deadlock.

My Questions

  1. For those who have 397B working on 4-node Spark: What exact vLLM version/image are you using? Can you share the Docker image tag or commit hash?

  2. Has anyone seen the layer 57 deadlock? Is this a known vLLM regression between v0.16 and v0.19?

  3. Would the spark-arena nightly ( Package dgx-vllm-eugr-nightly-tf5 · GitHub ) work? I tried it but got the same deadlock.

  4. Has anyone tried the “Heretic” int4 quant on 4-node Spark? Reportedly faster than Intel AutoRound.

Bonus: GLM-5 SGLang Findings (Community Reference)

While waiting for 397B answers, I also attempted GLM-5 on the same cluster. Sharing these findings since I haven’t seen them documented anywhere:

  • vLLM is dead for GLM-5 on SM121use_sparse=True + use_mla=True has no working attention backend (v0.19.0 qk_nope fix is insufficient)

  • SGLang via scitrera/dgx-spark-sglang:0.5.8-t5 partially works:

    • glm_moe_dsa architecture recognized

    • AWQ Marlin kernel selected

    • NSA attention backend found — sparse attention IS supported in this image

    • NCCL distributed init succeeded across 4 nodes

    • Must use --model-impl transformers (default dispatch misroutes to DeepSeek V2 loader — head_size mismatch 576 vs 2048)

    • Must set GLOO_SOCKET_IFNAME=enp1s0f1np1 (hostname resolves to 127.0.0.1 in /etc/hosts)

    • BLOCKED: TransformersForCausalLM generic wrapper OOMs during weight loading (loads full partition into CPU RAM before GPU sharding — 98 GB/node with only ~11 GB headroom)

    • Needs either a dedicated SGLang glm_moe_dsa module (streaming weights) or more nodes

Happy to share detailed logs, configs, or the ReplicatedLinear patch script if useful.

Thanks in advance for any pointers.

UPDATE (2026-04-05): Got it working! 23 tok/s with enforce-eager, TP=4 on 4 nodes.

The fix was NOT the recipe or Marlin patch, it was 7 networking/config issues
specific to running vLLM natively (not in Docker):

  1. /etc/hosts maps hostname to 127.0.0.1 — Gloo breaks. Fix: map to QSFP IP
  2. vLLM get_ip() connects to 8.8.8.8 — workers without internet get wrong IP.
    Fix: patch network_utils.py to use ray.util.get_node_ip_address()
  3. VLLM_HOST_IP propagates head’s IP to ALL workers via Ray env copy.
    Fix: ~/.config/vllm/ray_non_carry_over_env_vars.json
  4. Workers try HuggingFace checks without internet. Fix: HF_HUB_OFFLINE=1
  5. FlashInfer autotune hangs 1+ hour on ARM. Fix: --no-enable-flashinfer-autotune
  6. Engine startup timeout too short for multi-node. Fix: VLLM_ENGINE_READY_TIMEOUT_S=1800
  7. FlashInfer JIT needs ninja on workers. Fix: sudo ln -sf .vllm/bin/ninja /usr/local/bin/ninja

Built vLLM 0.19.1+cu130 from johnnynunez/vllm fork with TORCH_CUDA_ARCH_LIST=12.1a.
FlashInfer 0.6.7 from johnnynunez/flashinfer.

Happy to share the full recipe, patched files, and pip freeze if anyone wants to replicate.

Hardware: 4x Acer VGN100, MikroTik CRS812 DDQ, 200G RoCEv2, Driver 580.x

23 t/s, seems quite low for a 4 spark cluster, you can get 30 t/s with dual sparks alone (see this thread: Qwen3.5-397B-A17B run in dual spark! but I have a concern - #156 by stefan132)

Thanks for the reference! I have actually improved to 25.2 tok/s since that initial number, but there’s still a gap to the 30 tok/s dual-Spark results.

The 30 tok/s numbers use torch.compile + CUDAGraphs (via spark-vllm-docker). I’m running a from-source build (johnny_nv’s vLLM fork, v0.19.1+cu130) with --enforce-eager because torch.compile is broken on ARM with Triton 3.6.0:

  • FlashInfer autotune hangs indefinitely. a single kernel compiles for 2+ hours without completing
  • torch.compile without autotune exceeds Ray’s 300s compiled DAG timeout (RayChannelTimeoutError), crashing the server
  • MTP speculative decoding OOMs during kernel compilation. cuda cicc processes exhaust 128 GB/node

With VLLM_MARLIN_USE_ATOMIC_ADD=1 and --language-model-only, 25.2 tok/s appears to be the enforce-eager ceiling. The ~77% gap between enforce-eager and torch.compile from the community blog post aligns with this.

Waiting on Triton 3.7.0 to unblock torch.compile on SM121/ARM.

Has anyone found a workaround?

Guys,

this is my config on which im getting 34/37 tok/s - this config was created beginning of March and since then Im using it.

Qwen3.5-397B-A17B INT4 — 4× DGX Spark Setup

Hardware

  • 4× NVIDIA DGX Spark (GB10, 128 GiB UMA each)
  • RDMA fabric: 2× ConnectX-7 per node, f1 ports active, RoCEv2 over 200G Ethernet
  • Switch: MikroTik CRS804, 200G breakout per node
  • Kernel: 6.17.0-1014-nvidia

Software

vLLM: 0.16.1rc1.dev255+g792cbd64c.d20260305
PyTorch: 2.10.0a0+a36e1d39eb (NVIDIA build nv26.01)
CUDA: 13.1
Ray: 2.54.0
Driver: 580.142

Model

Intel/Qwen3.5-397B-A17B-int4-AutoRound (~199 GiB on disk)
Quantization: INT4 AutoRound + Marlin kernel (VLLM_MARLIN_USE_ATOMIC_ADD=1)

vLLM serve command

vllm serve Intel/Qwen3.5-397B-A17B-int4-AutoRound
–tool-call-parser qwen3_coder
–reasoning-parser qwen3
–enable-auto-tool-choice
–tensor-parallel-size 4
–distributed-executor-backend ray
–kv-cache-dtype fp8
–gpu-memory-utilization 0.78
–max-model-len 131072
–max-num-batched-tokens 32768
–enable-prefix-caching
–trust-remote-code
–host 0.0.0.0
–port 8000

NCCL / RDMA environment

VLLM_MARLIN_USE_ATOMIC_ADD=1
NCCL_IB_HCA=rocep1s0f1,roceP2p1s0f1 # dual-rail f1 ports
NCCL_IB_GID_INDEX=3 # IPv4-mapped RoCEv2, ipv6.method=disabled on NM
NCCL_CROSS_NIC=2
NCCL_ALGO=Ring
NCCL_PROTO=Simple
NCCL_MIN_NCHANNELS=32

Runtime patches required (mods applied before launch)

  1. fix-qwen3-coder-next — tool/reasoning parser for Qwen3 Coder
  2. fix-qwen35-tp4-marlin — Marlin TP=4 multi-node fix

Performance

│ Metric │ Value │
│ Throughput (sequential) │ ~34–37 tok/s │
│ Context length │ 128K (131072 tokens) │
│ KV cache pool │ 1,293,232 tokens (FP8) │
│ Max concurrent 128K sessions │ ~35 │
│ KV block size │ 2,096 tokens │
│ Model load time │ ~260s │
│ GPU memory per rank │ ~49.3 GiB weights + ~44 GiB KV cache │

Key gotchas

  • NCCL_IB_GID_INDEX=3 is critical — requires ipv6.method=disabled on RDMA NetworkManager connections, otherwise IPv4-mapped GID ends
    up at index 4/5 and NCCL fails silently
  • Add --device /dev/infiniband --ulimit memlock=-1 to docker run or NCCL falls back to TCP (~1 GB/s)
  • Suppress CoT with chat_template_kwargs: {“enable_thinking”: false} in requests — model outputs thinking tokens by default
  • First request after start is slow (~30s) due to Triton kernel JIT; subsequent requests are normal speed
  • Driver must be 580.x — 590+ has a UMA memory leak on GB10