Updated results
Raw data attached: qwen35-397b-tp4-bench.txt
Qwen3.5-397B-A17B INT4 on 4x GB10 — Full Benchmark with Concurrency Scaling
Setup
- Hardware: 4x Asus Ascent (GB10, 128GB unified memory each, 512GB total)
- Interconnect: MikroTik CRS812 QSFP-DD switch, 100G RoCEv2 fabric (MTU 9000)
- Model: Intel/Qwen3.5-397B-A17B-int4-AutoRound (GPTQ INT4, ~199GB)
- Runtime: vLLM v0.16.1rc1 (from nvcr.io/nvidia/pytorch:26.01-py3)
- Tensor Parallel: TP=4 across all 4 nodes via Ray
- KV Cache: fp8, 53.8 GiB per node (215 GiB total)
- Context: 32K max, 8192 max batched tokens
- Compilation: torch.compile + CUDAGraphs (64s one-time warmup)
- Prefix Caching: Enabled
- NCCL: v2.29.2, RoCEv2, FlashInfer attention backend
- Benchmark tool: llama-benchy v0.3.4
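A launch command matching the setup above would look roughly like this (a sketch, not the exact command used for this run; all flags are standard vLLM engine options, and the Ray cluster is assumed to already span all four nodes):

```shell
# Run on the head node once `ray start` has joined all 4 Sparks.
# Note the absence of --enforce-eager: torch.compile + CUDAGraphs stay enabled.
vllm serve Intel/Qwen3.5-397B-A17B-int4-AutoRound \
  --tensor-parallel-size 4 \
  --distributed-executor-backend ray \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-batched-tokens 8192 \
  --enable-prefix-caching
```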
Marlin TP=4 Fix
TP=4 requires a patch for the Marlin kernel — in_proj_ba layers in the linear attention (GDN) blocks have output_size=128, which becomes 32 when split across 4 GPUs, violating Marlin’s MIN_THREAD_N=64. We replace these with ReplicatedLinear (each GPU keeps the full weight) and manually slice the output. Patch available at github.com/sonusflow/spark-vllm-docker under mods/fix-qwen35-tp4-marlin.
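The idea behind the patch can be shown with a toy numpy sketch (this is an illustration only, not vLLM's actual `ReplicatedLinear` class; the hidden size of 256 is made up, while output_size=128 and TP=4 match the real layer):

```python
import numpy as np

# in_proj_ba has output_size=128; column-parallel TP=4 would hand each
# rank a 32-wide shard, below Marlin's MIN_THREAD_N=64.
MIN_THREAD_N = 64
output_size, tp = 128, 4
assert output_size // tp < MIN_THREAD_N  # why the Marlin kernel rejects the split

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 256))            # hidden size 256 is a made-up example
w = rng.standard_normal((256, output_size))

# Column-parallel: each rank owns only its 32-column shard of w.
col_parallel = [x @ s for s in np.split(w, tp, axis=1)]

# ReplicatedLinear workaround: every rank keeps the full weight, runs the
# full Marlin-legal 128-wide GEMM, then slices out its 32 output columns.
replicated = [(x @ w)[:, r * 32:(r + 1) * 32] for r in range(tp)]

for a, b in zip(col_parallel, replicated):
    assert np.allclose(a, b)  # same result, at the cost of 4x weight memory for this layer
```

Since the layer is only 128 columns wide, replicating its weight on all four GPUs costs almost nothing compared to the ~199GB model.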
Generation Speed — Single User (c1)
Rock-solid 37 tok/s regardless of prompt or generation length. Peak 39 tok/s.
| Prompt | tg32 (tok/s) | tg128 (tok/s) | tg512 (tok/s) | Peak tok/s |
| --- | --- | --- | --- | --- |
| pp512 | 35.96 | 36.07 | 36.98 | 39.00 |
| pp1024 | 35.97 | 36.72 | 36.95 | 38.60 |
| pp2048 | 35.76 | 36.76 | 37.10 | 38.40 |
| pp4096 | 37.12 | 36.88 | 37.14 | 38.32 |
| pp8192 | 35.61 | 36.36 | 36.35 | 38.00 |
| pp16384 | 37.01 | 35.86 | 36.16 | 38.21 |
Generation speed does not degrade with longer prompts or longer outputs. The model sustains 36-37 tok/s even at 16K prompt + 512 token generation.
Concurrency Scaling — Total Throughput
Total cluster throughput scales well with concurrent users:
| Gen length | c1 total (tok/s) | c2 total (tok/s) | c4 total (tok/s) | c4 peak (tok/s) |
| --- | --- | --- | --- | --- |
| tg32 | 37 | 63 | 87-90 | 117 |
| tg128 | 37 | 59-61 | 74-90 | 112 |
| tg512 | 37 | 56-60 | 80-94 | 121 |
At 4 concurrent users, the cluster delivers up to 94 tok/s total throughput (2.5x single-user), with peak bursts hitting 121 tok/s.
Concurrency Scaling — Per-User Experience
Per-request speed degrades gracefully under load:
| Concurrency | tg128 avg (tok/s) | tg512 avg (tok/s) | Relative to c1 |
| --- | --- | --- | --- |
| c1 | 36.4 | 37.0 | 100% |
| c2 | 29.9 | 29.4 | ~80% |
| c4 | 21.0 | 21.3 | ~57% |
Even at 4 concurrent users, each gets 21+ tok/s — still faster than GPT-4o streaming.
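The "Relative to c1" column follows directly from the averages above (a quick sanity check using only values quoted in the table):

```python
# Per-user generation averages (tok/s) from the table above.
c1 = {"tg128": 36.4, "tg512": 37.0}
c2 = {"tg128": 29.9, "tg512": 29.4}
c4 = {"tg128": 21.0, "tg512": 21.3}

for label, load in (("c2", c2), ("c4", c4)):
    rel = [load[k] / c1[k] for k in c1]
    print(label, [f"{r:.0%}" for r in rel])
# c2 lands near 79-82% of single-user speed, c4 near 57-58%.
```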
Prefill Throughput
Prompt processing scales with length up to ~2048 tokens, then plateaus around 2,100-2,400 tok/s:
| Prompt Length | c1 (tok/s) | c2 total (tok/s) | c4 total (tok/s) |
| --- | --- | --- | --- |
| pp512 | 1,750 | 1,670 | 1,830 |
| pp1024 | 2,120 | 2,160 | 2,085 |
| pp2048 | 2,350 | 2,250 | 2,270 |
| pp4096 | 2,220 | 2,190 | 2,190 |
| pp8192 | 2,370 | 2,300 | 2,120 |
| pp16384 | 2,190 | 2,260 | 2,270 |
Prefill throughput stays remarkably consistent even at 16K tokens with 4 concurrent users.
Time to First Token (TTFT)
This is where concurrency + long prompts hit hardest:
| Prompt | c1 | c2 | c4 |
| --- | --- | --- | --- |
| pp512 | 0.4s | 0.6s | 0.9s |
| pp1024 | 0.6s | 0.9s | 1.7s |
| pp2048 | 1.0s | 1.7s | 2.8s |
| pp4096 | 1.9s | 3.3s | 6.3s |
| pp8192 | 3.6s | 6.2s | 12.0s |
| pp16384 | 7.5s | 13.1s | 20.5s |
Single-user TTFT is excellent — under 1 second for prompts up to 1K tokens, under 4 seconds at 8K. At 4 concurrent users with 16K prompts, TTFT reaches 20 seconds as prefill requests queue up.
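The single-user column follows a simple model: TTFT is roughly prompt length divided by prefill throughput (all values below are taken from the two tables above):

```python
# Single-user prefill rates (tok/s) and measured TTFT (s) from the tables above.
prefill_c1 = {4096: 2220, 8192: 2370, 16384: 2190}
ttft_c1 = {4096: 1.9, 8192: 3.6, 16384: 7.5}

for n, rate in prefill_c1.items():
    predicted = n / rate  # TTFT ~ prompt_len / prefill throughput
    print(f"pp{n}: predicted {predicted:.1f}s, measured {ttft_c1[n]}s")
# Predictions land within ~0.2s of the measured values. At c2/c4 the gap
# widens because concurrent prefills queue behind each other.
```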
Thermal Profile Under Load
All 4 nodes monitored during the full benchmark run (90+ minutes of sustained inference):
| Node | GPU Avg (°C) | GPU Range (°C) | Power Avg (W) | CPU Peak (°C) | Status |
| --- | --- | --- | --- | --- | --- |
| Spark 1 (head) | 73 | 73-75 | 34.1 | 90 | OK |
| Spark 2 | 72 | 71-76 | 35.0 | 95 | WARM |
| Spark 3 | 72 | 69-76 | 33.4 | 87 | OK |
| Spark 4 | 68 | 67-69 | 31.0 | 89 | COOL |
- Total cluster power: ~134W (all 4 GPUs combined)
- Spark 2 hit a 95°C CPU peak once — a brief excursion near the throttle limit that recovered immediately
- Spark 4 consistently coolest — better airflow/positioning
- All GPUs stable at 67-76°C — well within safe operating range
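The headline power figure is just the sum of the per-node averages (values from the table above):

```python
# Average GPU power per node (W), from the thermal table above.
node_power_w = {"spark1": 34.1, "spark2": 35.0, "spark3": 33.4, "spark4": 31.0}
total_w = sum(node_power_w.values())

print(f"cluster power: {total_w:.1f} W")                          # ~133.5 W, the ~134W quoted
print(f"efficiency: {total_w / 37:.1f} W per tok/s (37 tok/s, c1)")  # the ~3.6 figure below
```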
Before/After: enforce-eager vs torch.compile (same hardware, same TP=4)
| Metric | enforce-eager | torch.compile | Improvement |
| --- | --- | --- | --- |
| Generation (tg128, c1) | 20.9 tok/s | 36.7 tok/s | +76% |
| Peak throughput (c1) | 22.0 tok/s | 39.0 tok/s | +77% |
| Peak throughput (c4) | — | 121 tok/s | — |
| Prefill (pp2048, c1) | 2,263 tok/s | 2,463 tok/s | +9% |
| Available KV cache | 38.67 GiB/node | 53.8 GiB/node | +39% |
| Startup overhead | None | +64s one-time | Cached after first run |
Key Findings
- torch.compile is essential on GB10 — 77% generation speedup, 39% more KV cache. The 64-second one-time compile cost pays for itself on the first request.
- Single-user performance is remarkably consistent — 37 tok/s at pp512 and pp16384 alike. Prompt length does not affect generation speed.
- The concurrency sweet spot is 2 users — 80% of single-user speed per request, nearly double the total throughput. Beyond 2, TTFT at long prompts becomes the bottleneck.
- 4-user total throughput peaks at 121 tok/s — the cluster handles burst load well, but per-user latency suffers at long contexts (20s TTFT at pp16384/c4).
- Power efficiency is exceptional — 134W total for a 397B-parameter model serving 37 tok/s. That's ~3.6W per tok/s.
- Thermals are not a concern — 90+ minutes of sustained benchmarking, all GPUs under 76°C, total power under 140W.
TLDR
4x DGX Spark running Qwen3.5-397B-A17B INT4 with torch.compile: 37 tok/s single-user, 94 tok/s at 4 concurrent users, 134W total power. Drop --enforce-eager — the 64-second compile time is worth every second.
Benchmark: llama-benchy v0.3.4 | pp: 512-16384 | tg: 32, 128, 512 | concurrency: 1, 2, 4 | 3 runs per test | prefix caching enabled