Qwen3.5-397B-A17B-int4-AutoRound - 4x DB10 node - updated results: 37-94 tok/s

Test done on a node of 4x DB10 (Asus Ascent).

| model                                  |   test |              t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:---------------------------------------|-------:|-----------------:|-------------:|----------------:|----------------:|----------------:|
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |  pp512 |  1400.85 ± 38.57 |              |  367.95 ± 10.29 |  366.49 ± 10.29 |  367.99 ± 10.29 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |   tg32 |     20.95 ± 0.02 | 21.67 ± 0.47 |                 |                 |                 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |  pp512 |   1404.28 ± 5.02 |              |   366.77 ± 1.30 |   365.32 ± 1.30 |   366.82 ± 1.30 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |  tg128 |     20.92 ± 0.04 | 22.00 ± 0.00 |                 |                 |                 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp1024 | 1809.12 ± 101.38 |              |  569.88 ± 33.10 |  568.43 ± 33.10 |  569.93 ± 33.11 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |   tg32 |     20.96 ± 0.02 | 21.67 ± 0.47 |                 |                 |                 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp1024 |  1898.55 ± 25.32 |              |   541.62 ± 7.48 |   540.16 ± 7.48 |   541.66 ± 7.48 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |  tg128 |     20.85 ± 0.21 | 21.67 ± 0.47 |                 |                 |                 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp2048 |  2263.06 ± 11.47 |              |   906.89 ± 4.59 |   905.44 ± 4.59 |   906.93 ± 4.59 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |   tg32 |     20.90 ± 0.05 | 21.33 ± 0.47 |                 |                 |                 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp2048 |  2206.67 ± 36.20 |              |  930.40 ± 15.21 |  928.95 ± 15.21 |  930.44 ± 15.21 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |  tg128 |     20.88 ± 0.03 | 21.00 ± 0.00 |                 |                 |                 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp4096 |  2222.44 ± 53.09 |              | 1845.84 ± 44.90 | 1844.39 ± 44.90 | 1845.89 ± 44.90 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |   tg32 |     20.88 ± 0.02 | 21.00 ± 0.00 |                 |                 |                 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp4096 |  2246.69 ± 13.07 |              | 1825.24 ± 10.60 | 1823.78 ± 10.60 | 1825.30 ± 10.60 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |  tg128 |     20.87 ± 0.06 | 21.33 ± 0.47 |                 |                 |                 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp8192 |   2348.15 ± 2.43 |              |  3490.73 ± 3.49 |  3489.28 ± 3.49 |  3490.80 ± 3.49 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |   tg32 |     20.87 ± 0.04 | 21.00 ± 0.00 |                 |                 |                 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp8192 | 2110.23 ± 238.95 |              | 3937.57 ± 474.11 | 3936.11 ± 474.11 | 3937.61 ± 474.11 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |  tg128 |     20.81 ± 0.02 | 21.00 ± 0.00 |                 |                 |                 |

Don’t tempt me to buy 2 more of these things…

Ran this on a dual setup, but it was just a hair too slow and a bit too tight on memory for my use case. I was really impressed with the model quality though.

I have an additional 4 of these things in a box (for an 8x cluster), waiting for cables 🙃

I'll now be working on seeing how far 4x can be pushed performance-wise for larger models. Also, 20 t/s doesn't sound like a huge number, but it's actually fine to work with. I'll be running some agents on it over the next few days to see how it performs.


pictures!

:)

and soon to be connected


That's a rig right there. Thank you, enjoy!

Updated results
Raw data attached: qwen35-397b-tp4-bench.txt

Qwen3.5-397B-A17B INT4 on 4x DB10 — Full Benchmark with Concurrency Scaling

Setup

  • Hardware: 4x Asus Ascent (GB10, 128GB unified memory each, 512GB total)
  • Interconnect: MikroTik CRS812 QSFP-DD switch, 100G RoCEv2 fabric (MTU 9000)
  • Model: Intel/Qwen3.5-397B-A17B-int4-AutoRound (GPTQ INT4, ~199GB)
  • Runtime: vLLM v0.16.1rc1 (from nvcr.io/nvidia/pytorch:26.01-py3)
  • Tensor Parallel: TP=4 across all 4 nodes via Ray
  • KV Cache: fp8, 53.8 GiB per node (215 GiB total)
  • Context: 32K max, 8192 max batched tokens
  • Compilation: torch.compile + CUDAGraphs (64s one-time warmup)
  • Prefix Caching: Enabled
  • NCCL: v2.29.2, RoCEv2, FlashInfer attention backend
  • Benchmark tool: llama-benchy v0.3.4

Marlin TP=4 Fix

TP=4 requires a patch for the Marlin kernel — in_proj_ba layers in the linear attention (GDN) blocks have output_size=128, which becomes 32 when split across 4 GPUs, violating Marlin’s MIN_THREAD_N=64. We replace these with ReplicatedLinear (each GPU keeps the full weight) and manually slice the output. Patch available at github.com/sonusflow/spark-vllm-docker under mods/fix-qwen35-tp4-marlin.
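
The shard-size constraint and the replicate-then-slice workaround are easy to sketch in plain Python (a toy illustration of the reasoning described above, not the actual vLLM patch; `MIN_THREAD_N` and the `output_size=128` figure come from the post):

```python
MIN_THREAD_N = 64  # Marlin's minimum output columns per rank (from the post)

def marlin_shard_ok(output_size: int, tp: int) -> bool:
    # A column-parallel linear gives each rank output_size // tp columns.
    return output_size // tp >= MIN_THREAD_N

def replicated_slice(full_output: list, rank: int, tp: int) -> list:
    # Workaround: every rank computes the full projection (ReplicatedLinear),
    # then keeps only its own shard of the output.
    n = len(full_output) // tp
    return full_output[rank * n:(rank + 1) * n]

# in_proj_ba in Qwen3.5's GDN blocks has output_size=128:
assert marlin_shard_ok(128, 2)       # TP=2: 64 cols/rank, fine
assert not marlin_shard_ok(128, 4)   # TP=4: 32 cols/rank, violates MIN_THREAD_N
assert replicated_slice(list(range(128)), 1, 4) == list(range(32, 64))
```

This trades a small amount of redundant compute and memory (each rank holds the full 128-wide weight) for compatibility with the Marlin kernel's tiling constraint.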


Generation Speed — Single User (c1)

Rock-solid 37 tok/s regardless of prompt or generation length. Peak 39 tok/s.

| Prompt  | tg32 (tok/s) | tg128 (tok/s) | tg512 (tok/s) | Peak (tok/s) |
|:--------|-------------:|--------------:|--------------:|-------------:|
| pp512   |        35.96 |         36.07 |         36.98 |        39.00 |
| pp1024  |        35.97 |         36.72 |         36.95 |        38.60 |
| pp2048  |        35.76 |         36.76 |         37.10 |        38.40 |
| pp4096  |        37.12 |         36.88 |         37.14 |        38.32 |
| pp8192  |        35.61 |         36.36 |         36.35 |        38.00 |
| pp16384 |        37.01 |         35.86 |         36.16 |        38.21 |

Generation speed does not degrade with longer prompts or longer outputs. The model sustains 36-37 tok/s even at 16K prompt + 512 token generation.


Concurrency Scaling — Total Throughput

Total cluster throughput scales well with concurrent users:

| Test  | c1 total | c2 total | c4 total | c4 peak |
|:------|---------:|---------:|---------:|--------:|
| tg32  |       37 |       63 |    87-90 |     117 |
| tg128 |       37 |    59-61 |    74-90 |     112 |
| tg512 |       37 |    56-60 |    80-94 |     121 |

At 4 concurrent users, the cluster delivers up to 94 tok/s total throughput (2.5x single-user), with peak bursts hitting 121 tok/s.
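
The scaling claim is simple arithmetic on the figures above:

```python
c1_total = 37.0   # single-user decode throughput (tok/s)
c4_total = 94.0   # best c4 total from the table (tg512)

assert round(c4_total / c1_total, 1) == 2.5   # ~2.5x total throughput at 4 users
```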


Concurrency Scaling — Per-User Experience

Per-request speed degrades gracefully under load:

| Concurrency | tg128 avg (tok/s) | tg512 avg (tok/s) | Relative to c1 |
|:------------|------------------:|------------------:|---------------:|
| c1          |              36.4 |              37.0 |           100% |
| c2          |              29.9 |              29.4 |           ~80% |
| c4          |              21.0 |              21.3 |           ~57% |

Even at 4 concurrent users, each gets 21+ tok/s — still faster than GPT-4o streaming.


Prefill Throughput

Prompt processing scales with length up to ~2048 tokens, then plateaus around 2,200-2,500 tok/s:

| Prompt Length | c1 (tok/s) | c2 total (tok/s) | c4 total (tok/s) |
|:--------------|-----------:|-----------------:|-----------------:|
| pp512         |      1,750 |            1,670 |            1,830 |
| pp1024        |      2,120 |            2,160 |            2,085 |
| pp2048        |      2,350 |            2,250 |            2,270 |
| pp4096        |      2,220 |            2,190 |            2,190 |
| pp8192        |      2,370 |            2,300 |            2,120 |
| pp16384       |      2,190 |            2,260 |            2,270 |

Prefill throughput stays remarkably consistent even at 16K tokens with 4 concurrent users.


Time to First Token (TTFT)

This is where concurrency + long prompts hit hardest:

| Prompt  |   c1 |    c2 |    c4 |
|:--------|-----:|------:|------:|
| pp512   | 0.4s |  0.6s |  0.9s |
| pp1024  | 0.6s |  0.9s |  1.7s |
| pp2048  | 1.0s |  1.7s |  2.8s |
| pp4096  | 1.9s |  3.3s |  6.3s |
| pp8192  | 3.6s |  6.2s | 12.0s |
| pp16384 | 7.5s | 13.1s | 20.5s |

Single-user TTFT is excellent — under 1 second for prompts up to 1K tokens, under 4 seconds at 8K. At 4 concurrent users with 16K prompts, TTFT reaches 20 seconds as prefill requests queue up.
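
The single-user numbers line up with the simplest possible estimate, TTFT ≈ prompt_tokens / prefill_throughput, using the prefill figures reported earlier:

```python
prompt_tokens = 16384
prefill_tok_s = 2190.0  # c1 prefill throughput at pp16384 (from the prefill table)

ttft_est_s = prompt_tokens / prefill_tok_s
assert round(ttft_est_s, 1) == 7.5  # matches the measured 7.5s at pp16384/c1
```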


Thermal Profile Under Load

All 4 nodes monitored during the full benchmark run (90+ minutes of sustained inference):

| Node           | GPU Avg | GPU Range | Power Avg | CPU Peak | Status |
|:---------------|--------:|----------:|----------:|---------:|:-------|
| Spark 1 (head) |    73°C |   73-75°C |     34.1W |     90°C | OK     |
| Spark 2        |    72°C |   71-76°C |     35.0W |     95°C | WARM   |
| Spark 3        |    72°C |   69-76°C |     33.4W |     87°C | OK     |
| Spark 4        |    68°C |   67-69°C |     31.0W |     89°C | COOL   |

  • Total cluster power: ~134W (all 4 GPUs combined)
  • Spark 2 hit 95°C CPU peak once — brief, near throttle but recovered
  • Spark 4 consistently coolest — better airflow/positioning
  • All GPUs stable at 67-76°C — well within safe operating range

Before/After: enforce-eager vs torch.compile (same hardware, same TP=4)

|                          | enforce-eager  | torch.compile | Improvement            |
|:-------------------------|---------------:|--------------:|:-----------------------|
| Generation (tg128, c1)   |     20.9 tok/s |    36.7 tok/s | +76%                   |
| Peak throughput (c1)     |     22.0 tok/s |    39.0 tok/s | +77%                   |
| Peak throughput (c4)     |                |     121 tok/s |                        |
| Prefill (pp2048, c1)     |    2,263 tok/s |   2,463 tok/s | +9%                    |
| Available KV cache       | 38.67 GiB/node | 53.8 GiB/node | +39%                   |
| Startup overhead         |           None | +64s one-time | Cached after first run |

Key Findings

  1. torch.compile is essential on DB10 — 77% generation speedup, 39% more KV cache. The 64-second one-time compile cost pays for itself on the first request.

  2. Single-user performance is remarkably consistent — 37 tok/s at pp512 and pp16384. Prompt length does not affect generation speed.

  3. Concurrency sweet spot is 2 users — 80% of single-user speed per request, nearly double the total throughput. Beyond 2, TTFT at long prompts becomes the bottleneck.

  4. 4-user total throughput peaks at 121 tok/s — the cluster handles burst load well, but per-user latency suffers at long contexts (20s TTFT at pp16384/c4).

  5. Power efficiency is exceptional — 134W total for a 397B parameter model serving 37 tok/s. That’s ~3.6W per tok/s.

  6. Thermals are not a concern — 90+ minutes of sustained benchmarking, all GPUs under 76°C, total power under 140W.
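
The efficiency figure in point 5 is just the ratio of the two reported numbers (GPU power from nvidia-smi only, so power at the wall will read higher):

```python
gpu_power_w = 134.0   # total cluster GPU power via nvidia-smi (from the post)
decode_tok_s = 37.0   # single-user generation speed

assert round(gpu_power_w / decode_tok_s, 1) == 3.6  # ~3.6 W per tok/s
```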

TLDR

4x DGX Spark running Qwen3.5-397B-A17B INT4 with torch.compile: 37 tok/s single-user, 94 tok/s at 4 concurrent users, 134W total power. Drop --enforce-eager — the 64-second compile time is worth every second.


Benchmark: llama-benchy v0.3.4 | pp: 512-16384 | tg: 32, 128, 512 | concurrency: 1, 2, 4 | 3 runs per test | prefix caching enabled

qwen35-397b-tp4-bench.txt (21.3 KB)


Can you open a PR upstream to github.com/eugr/spark-vllm-docker (Docker configuration for running vLLM on dual DGX Sparks)?

I was happy to find someone with a similar setup, so I’m sharing the results of what I tried in my own environment.
I can’t enable CUDAGraphs in my environment, but is it possible to enable it with sonusflow/spark-vllm-docker?

  • Hardware: 1x DGX Spark + 3x ThinkStation PGX (GB10, 128GB unified memory each, 512GB total)
  • Interconnect: MikroTik CRS812 QSFP-DD switch, 100G RoCEv2 fabric (version 7.21.1, MTU 9000)
  • Model: Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 (GPTQ INT4, 236GB)
  • Runtime: vLLM 0.16.0rc2.dev376+gf4af642a6.cu130 (from vllm/vllm-openai:qwen3_5-cu130)
  • Tensor Parallel: TP=4 across all 4 nodes via Ray
  • Context: 32K max, 8192 max batched tokens
  • Compilation: torch.compile (no CUDAGraphs)
  • Prefix Caching: Enabled
  • NCCL: v2.28.9, RoCEv2, FlashInfer attention backend
  • Benchmark tool: llama-benchy v0.3.4

| model                            |   test |              t/s |     peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:---------------------------------|-------:|-----------------:|-------------:|------------------:|------------------:|------------------:|
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  pp512 | 1192.88 ± 168.29 |              |   1593.46 ± 84.03 |    441.67 ± 84.03 |   1593.49 ± 84.03 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |   tg32 |     10.78 ± 1.73 | 12.20 ± 0.40 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  pp512 | 1545.52 ± 344.10 |              |   1497.26 ± 62.76 |    345.47 ± 62.76 |   1497.29 ± 62.76 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  tg128 |      9.80 ± 1.54 | 12.30 ± 0.46 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp1024 | 2899.55 ± 194.12 |              |   1506.90 ± 23.94 |    355.10 ± 23.94 |   1506.93 ± 23.94 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |   tg32 |     10.74 ± 2.17 | 12.30 ± 0.46 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp1024 | 2922.06 ± 352.63 |              |   1509.91 ± 60.87 |    358.12 ± 60.87 |   1509.94 ± 60.87 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  tg128 |      9.92 ± 1.49 | 12.30 ± 0.46 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp4096 | 1142.19 ± 123.79 |              |  4794.34 ± 515.87 |  3642.55 ± 515.87 |  4794.38 ± 515.87 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |   tg32 |      9.15 ± 2.08 | 11.50 ± 0.50 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp4096 |  1176.59 ± 56.95 |              |  4642.33 ± 176.33 |  3490.54 ± 176.33 |  4642.38 ± 176.33 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  tg128 |      9.20 ± 1.33 | 12.10 ± 0.30 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp8192 |  991.55 ± 108.10 |              | 9532.18 ± 1088.53 | 8380.39 ± 1088.53 | 9532.22 ± 1088.53 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |   tg32 |     10.21 ± 1.01 | 11.50 ± 0.50 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp8192 |  1054.45 ± 35.75 |              |  8931.00 ± 275.43 |  7779.20 ± 275.43 |  8931.04 ± 275.43 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  tg128 |     10.08 ± 0.87 | 12.20 ± 0.40 |                   |                   |                   |

If you’re measuring the power usage by just taking the nvidia-smi output, you’re not getting the full picture.

Do you have power monitoring at the wall for all 4 during load? I’d imagine it’s closer to 400-500w

@trystan1 - yes, this was nvidia-smi data only. We do have monitoring on them and will report proper consumption some time in the future. Right now we're focusing on getting optimised performance out of them, which is not easy :)

We will be connecting 4 more units in the coming days.

Nice Work! Do we have a rig pictures thread?

I have sent you a DM. Due to a lot of recent releases, we need to work out the best way to optimise the performance of 4x or 8x units. Once we are happy with the results, we will share.

Not sure :) You can set it up - it will be nice to see how the community makes use of these boxes :)


I wanted to share what I’ve learned over the past week running Qwen3.5-397B-A17B (INT4 AutoRound, ~199GB) at TP=4 across 4 DGX Sparks, since some of these findings are pretty specific to the GB10 and might be useful for the community.

What’s working:

  • 37 tok/s single-user decode (peak 39) on Qwen3.5-397B at TP=4 with torch.compile + CUDAGraphs
  • Marlin INT4 GEMM kernels with a custom TP=4 fix for Qwen3.5’s GDN attention layers (upstream PR filed: vllm-project/vllm#35924)
  • FlashInfer attention backend on SM121
  • 200GbE RoCE fabric at 96% line rate (23.89 GB/s busbw on 4-node all_reduce)
  • vLLM v0.16.1rc1 on the eugr/spark-vllm-docker fork with a recipe system we built on top
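
The "96% line rate" figure follows directly from the quoted busbw (200GbE has a theoretical 25 GB/s; both numbers from the list above):

```python
line_rate_gbyte_s = 200 / 8   # 200 Gb/s link = 25 GB/s theoretical
busbw_gbyte_s = 23.89         # measured 4-node all_reduce busbw

utilization = busbw_gbyte_s / line_rate_gbyte_s
assert round(utilization * 100) == 96  # ~96% of line rate
```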

Critical GB10-specific gotchas we discovered:

  1. Driver 580 ONLY. Driver 590 introduces a UMA memory leak (80-96 GiB not released after CUDA exit) and a CUDAGraph capture deadlock. Both are GB10/UMA-specific. NVIDIA forum reps confirmed 580 is the officially supported driver. The container’s CUDA 13.1 forward-compat layer on host driver 580 works perfectly — no need to match versions.

  2. gpu_memory_utilization is broken on unified memory. It works as a gate (crashes at 0.85 if exceeding profiled free) but NOT as a cap — values below the threshold all produce the same KV cache allocation because vLLM profiles the entire shared CPU/GPU pool. Docker cgroup memory limits also don’t work (CUDA UMA bypasses cgroups). Workaround: --num-gpu-blocks-override to directly control KV cache. This affects all Grace Blackwell platforms, not just Spark.

  3. NCCL auto-negotiate beats manual tuning inside vLLM. We did extensive nccl-tests benchmarking (Simple proto, 6 channels = optimal for CX7 over RoCE), but applying those settings to vLLM caused -8 to -15.7% regression. The NCCL autotuner makes better per-operation decisions when interleaved with compute kernels.

  4. torch.compile + CUDAGraphs = 77% speedup over enforce-eager on MoE. The GB10’s Grace ARM CPU is slower at Python/CUDA dispatch than x86, making the kernel launch overhead elimination even more impactful. But CUDAGraph capture needs swap headroom (~23GB swap configured, swappiness=1).

  5. VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8 breaks torch.compile on INT4 GPTQ models. It halves the compiled subgraphs. Only use with actual NVFP4/MXFP4 weights.
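
For the --num-gpu-blocks-override workaround in point 2, you have to translate a byte budget into a block count yourself. A minimal sketch of that arithmetic (the layer/head dimensions here are illustrative placeholders, not Qwen3.5's actual config; vLLM's default KV block size is 16 tokens):

```python
def kv_blocks_for_budget(budget_gib: float, *, block_size: int = 16,
                         num_layers: int = 32, num_kv_heads: int = 8,
                         head_dim: int = 128, dtype_bytes: int = 1,
                         tp: int = 4) -> int:
    # Per token per rank: K and V, for every layer, over this rank's KV-head shard.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes // tp
    bytes_per_block = block_size * bytes_per_token
    return int(budget_gib * 2**30) // bytes_per_block

# With these toy dims, 1 GiB of fp8 KV cache per rank = 4096 blocks:
assert kv_blocks_for_budget(1.0) == 4096
```

Pass the result as --num-gpu-blocks-override on the vLLM command line to pin the KV cache allocation directly, sidestepping the unified-memory profiling issue described above.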
