Qwen3.5-397B-A17B-int4-AutoRound - 4x DB10 node - updated results: 37-94 tok/s

Test done on a node of 4x DB10 (Asus Ascent).

| model                                  |   test |              t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:---------------------------------------|-------:|-----------------:|-------------:|----------------:|----------------:|----------------:|
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |  pp512 |  1400.85 ± 38.57 |              |  367.95 ± 10.29 |  366.49 ± 10.29 |  367.99 ± 10.29 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |   tg32 |     20.95 ± 0.02 | 21.67 ± 0.47 |                 |                 |                 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |  pp512 |   1404.28 ± 5.02 |              |   366.77 ± 1.30 |   365.32 ± 1.30 |   366.82 ± 1.30 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |  tg128 |     20.92 ± 0.04 | 22.00 ± 0.00 |                 |                 |                 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp1024 | 1809.12 ± 101.38 |              |  569.88 ± 33.10 |  568.43 ± 33.10 |  569.93 ± 33.11 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |   tg32 |     20.96 ± 0.02 | 21.67 ± 0.47 |                 |                 |                 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp1024 |  1898.55 ± 25.32 |              |   541.62 ± 7.48 |   540.16 ± 7.48 |   541.66 ± 7.48 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |  tg128 |     20.85 ± 0.21 | 21.67 ± 0.47 |                 |                 |                 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp2048 |  2263.06 ± 11.47 |              |   906.89 ± 4.59 |   905.44 ± 4.59 |   906.93 ± 4.59 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |   tg32 |     20.90 ± 0.05 | 21.33 ± 0.47 |                 |                 |                 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp2048 |  2206.67 ± 36.20 |              |  930.40 ± 15.21 |  928.95 ± 15.21 |  930.44 ± 15.21 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |  tg128 |     20.88 ± 0.03 | 21.00 ± 0.00 |                 |                 |                 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp4096 |  2222.44 ± 53.09 |              | 1845.84 ± 44.90 | 1844.39 ± 44.90 | 1845.89 ± 44.90 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |   tg32 |     20.88 ± 0.02 | 21.00 ± 0.00 |                 |                 |                 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp4096 |  2246.69 ± 13.07 |              | 1825.24 ± 10.60 | 1823.78 ± 10.60 | 1825.30 ± 10.60 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |  tg128 |     20.87 ± 0.06 | 21.33 ± 0.47 |                 |                 |                 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp8192 |   2348.15 ± 2.43 |              |  3490.73 ± 3.49 |  3489.28 ± 3.49 |  3490.80 ± 3.49 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |   tg32 |     20.87 ± 0.04 | 21.00 ± 0.00 |                 |                 |                 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound | pp8192 | 2110.23 ± 238.95 |              | 3937.57 ± 474.11 | 3936.11 ± 474.11 | 3937.61 ± 474.11 |
| Intel/Qwen3.5-397B-A17B-int4-AutoRound |  tg128 |     20.81 ± 0.02 | 21.00 ± 0.00 |                 |                 |                 |

Don’t tempt me to buy 2 more of these things…

Ran this on a dual setup, but it was just a hair too slow and a bit too tight on memory for my use case. I was really impressed with the model quality though.

I have an additional 4 of these things in a box (for an 8x cluster), waiting for cables 🙃

I'll now be working on seeing how far 4x can be pushed performance-wise for larger models. Also, 20 t/s doesn't sound like a huge number, but it's actually fine to work with. I'll be running some agents on it over the next few days to see how it performs.


pictures!

:)

and soon to be connected


That's a rig right there. Thank you, enjoy!

Updated results
Raw data attached: qwen35-397b-tp4-bench.txt

Qwen3.5-397B-A17B INT4 on 4x DB10 — Full Benchmark with Concurrency Scaling

Setup

  • Hardware: 4x Asus Ascent (GB10, 128GB unified memory each, 512GB total)
  • Interconnect: MikroTik CRS812 QSFP-DD switch, 100G RoCEv2 fabric (MTU 9000)
  • Model: Intel/Qwen3.5-397B-A17B-int4-AutoRound (GPTQ INT4, ~199GB)
  • Runtime: vLLM v0.16.1rc1 (from nvcr.io/nvidia/pytorch:26.01-py3)
  • Tensor Parallel: TP=4 across all 4 nodes via Ray
  • KV Cache: fp8, 53.8 GiB per node (215 GiB total)
  • Context: 32K max, 8192 max batched tokens
  • Compilation: torch.compile + CUDAGraphs (64s one-time warmup)
  • Prefix Caching: Enabled
  • NCCL: v2.29.2, RoCEv2, FlashInfer attention backend
  • Benchmark tool: llama-benchy v0.3.4

Marlin TP=4 Fix

TP=4 requires a patch for the Marlin kernel — in_proj_ba layers in the linear attention (GDN) blocks have output_size=128, which becomes 32 when split across 4 GPUs, violating Marlin’s MIN_THREAD_N=64. We replace these with ReplicatedLinear (each GPU keeps the full weight) and manually slice the output. Patch available at github.com/sonusflow/spark-vllm-docker under mods/fix-qwen35-tp4-marlin.
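
The shard-size constraint and the replicate-then-slice workaround are easy to sketch in plain Python (a toy illustration of the reasoning described above, not the actual vLLM patch; `MIN_THREAD_N` and the `output_size=128` figure come from the post):

```python
MIN_THREAD_N = 64  # Marlin's minimum output columns per rank (from the post)

def marlin_shard_ok(output_size: int, tp: int) -> bool:
    # A column-parallel linear gives each rank output_size // tp columns.
    return output_size // tp >= MIN_THREAD_N

def replicated_slice(full_output: list, rank: int, tp: int) -> list:
    # Workaround: every rank computes the full projection (ReplicatedLinear),
    # then keeps only its own shard of the output.
    n = len(full_output) // tp
    return full_output[rank * n:(rank + 1) * n]

# in_proj_ba in Qwen3.5's GDN blocks has output_size=128:
assert marlin_shard_ok(128, 2)       # TP=2: 64 cols/rank, fine
assert not marlin_shard_ok(128, 4)   # TP=4: 32 cols/rank, violates MIN_THREAD_N
assert replicated_slice(list(range(128)), 1, 4) == list(range(32, 64))
```

This trades a small amount of redundant compute and memory (each rank holds the full 128-wide weight) for compatibility with the Marlin kernel's tiling constraint.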


Generation Speed — Single User (c1)

Rock-solid 37 tok/s regardless of prompt or generation length. Peak 39 tok/s.

| Prompt  | tg32 (tok/s) | tg128 (tok/s) | tg512 (tok/s) | Peak (tok/s) |
|:--------|-------------:|--------------:|--------------:|-------------:|
| pp512   |        35.96 |         36.07 |         36.98 |        39.00 |
| pp1024  |        35.97 |         36.72 |         36.95 |        38.60 |
| pp2048  |        35.76 |         36.76 |         37.10 |        38.40 |
| pp4096  |        37.12 |         36.88 |         37.14 |        38.32 |
| pp8192  |        35.61 |         36.36 |         36.35 |        38.00 |
| pp16384 |        37.01 |         35.86 |         36.16 |        38.21 |

Generation speed does not degrade with longer prompts or longer outputs. The model sustains 36-37 tok/s even at 16K prompt + 512 token generation.


Concurrency Scaling — Total Throughput

Total cluster throughput scales well with concurrent users:

| Test  | c1 total | c2 total | c4 total | c4 peak |
|:------|---------:|---------:|---------:|--------:|
| tg32  |       37 |       63 |    87-90 |     117 |
| tg128 |       37 |    59-61 |    74-90 |     112 |
| tg512 |       37 |    56-60 |    80-94 |     121 |

At 4 concurrent users, the cluster delivers up to 94 tok/s total throughput (2.5x single-user), with peak bursts hitting 121 tok/s.
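
The scaling claim is simple arithmetic on the figures above:

```python
c1_total = 37.0   # single-user decode throughput (tok/s)
c4_total = 94.0   # best c4 total from the table (tg512)

assert round(c4_total / c1_total, 1) == 2.5   # ~2.5x total throughput at 4 users
```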


Concurrency Scaling — Per-User Experience

Per-request speed degrades gracefully under load:

| Concurrency | tg128 avg (tok/s) | tg512 avg (tok/s) | Relative to c1 |
|:------------|------------------:|------------------:|---------------:|
| c1          |              36.4 |              37.0 |           100% |
| c2          |              29.9 |              29.4 |           ~80% |
| c4          |              21.0 |              21.3 |           ~57% |

Even at 4 concurrent users, each gets 21+ tok/s — still faster than GPT-4o streaming.


Prefill Throughput

Prompt processing scales with length up to ~2048 tokens, then plateaus around 2,200-2,500 tok/s:

| Prompt Length | c1 (tok/s) | c2 total (tok/s) | c4 total (tok/s) |
|:--------------|-----------:|-----------------:|-----------------:|
| pp512         |      1,750 |            1,670 |            1,830 |
| pp1024        |      2,120 |            2,160 |            2,085 |
| pp2048        |      2,350 |            2,250 |            2,270 |
| pp4096        |      2,220 |            2,190 |            2,190 |
| pp8192        |      2,370 |            2,300 |            2,120 |
| pp16384       |      2,190 |            2,260 |            2,270 |

Prefill throughput stays remarkably consistent even at 16K tokens with 4 concurrent users.


Time to First Token (TTFT)

This is where concurrency + long prompts hit hardest:

| Prompt  |   c1 |    c2 |    c4 |
|:--------|-----:|------:|------:|
| pp512   | 0.4s |  0.6s |  0.9s |
| pp1024  | 0.6s |  0.9s |  1.7s |
| pp2048  | 1.0s |  1.7s |  2.8s |
| pp4096  | 1.9s |  3.3s |  6.3s |
| pp8192  | 3.6s |  6.2s | 12.0s |
| pp16384 | 7.5s | 13.1s | 20.5s |

Single-user TTFT is excellent — under 1 second for prompts up to 1K tokens, under 4 seconds at 8K. At 4 concurrent users with 16K prompts, TTFT reaches 20 seconds as prefill requests queue up.
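
The single-user numbers line up with the simplest possible estimate, TTFT ≈ prompt_tokens / prefill_throughput, using the prefill figures reported earlier:

```python
prompt_tokens = 16384
prefill_tok_s = 2190.0  # c1 prefill throughput at pp16384 (from the prefill table)

ttft_est_s = prompt_tokens / prefill_tok_s
assert round(ttft_est_s, 1) == 7.5  # matches the measured 7.5s at pp16384/c1
```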


Thermal Profile Under Load

All 4 nodes monitored during the full benchmark run (90+ minutes of sustained inference):

| Node           | GPU Avg | GPU Range | Power Avg | CPU Peak | Status |
|:---------------|--------:|----------:|----------:|---------:|:-------|
| Spark 1 (head) |    73°C |   73-75°C |     34.1W |     90°C | OK     |
| Spark 2        |    72°C |   71-76°C |     35.0W |     95°C | WARM   |
| Spark 3        |    72°C |   69-76°C |     33.4W |     87°C | OK     |
| Spark 4        |    68°C |   67-69°C |     31.0W |     89°C | COOL   |

  • Total cluster power: ~134W (all 4 GPUs combined)
  • Spark 2 hit 95°C CPU peak once — brief, near throttle but recovered
  • Spark 4 consistently coolest — better airflow/positioning
  • All GPUs stable at 67-76°C — well within safe operating range

Before/After: enforce-eager vs torch.compile (same hardware, same TP=4)

|                          | enforce-eager  | torch.compile | Improvement            |
|:-------------------------|---------------:|--------------:|:-----------------------|
| Generation (tg128, c1)   |     20.9 tok/s |    36.7 tok/s | +76%                   |
| Peak throughput (c1)     |     22.0 tok/s |    39.0 tok/s | +77%                   |
| Peak throughput (c4)     |                |     121 tok/s |                        |
| Prefill (pp2048, c1)     |    2,263 tok/s |   2,463 tok/s | +9%                    |
| Available KV cache       | 38.67 GiB/node | 53.8 GiB/node | +39%                   |
| Startup overhead         |           None | +64s one-time | Cached after first run |

Key Findings

  1. torch.compile is essential on DB10 — 77% generation speedup, 39% more KV cache. The 64-second one-time compile cost pays for itself on the first request.

  2. Single-user performance is remarkably consistent — 37 tok/s at pp512 and pp16384. Prompt length does not affect generation speed.

  3. Concurrency sweet spot is 2 users — 80% of single-user speed per request, nearly double the total throughput. Beyond 2, TTFT at long prompts becomes the bottleneck.

  4. 4-user total throughput peaks at 121 tok/s — the cluster handles burst load well, but per-user latency suffers at long contexts (20s TTFT at pp16384/c4).

  5. Power efficiency is exceptional — 134W total for a 397B parameter model serving 37 tok/s. That’s ~3.6W per tok/s.

  6. Thermals are not a concern — 90+ minutes of sustained benchmarking, all GPUs under 76°C, total power under 140W.
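
The efficiency figure in point 5 is just the ratio of the two reported numbers (GPU power from nvidia-smi only, so power at the wall will read higher):

```python
gpu_power_w = 134.0   # total cluster GPU power via nvidia-smi (from the post)
decode_tok_s = 37.0   # single-user generation speed

assert round(gpu_power_w / decode_tok_s, 1) == 3.6  # ~3.6 W per tok/s
```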

TLDR

4x DGX Spark running Qwen3.5-397B-A17B INT4 with torch.compile: 37 tok/s single-user, 94 tok/s at 4 concurrent users, 134W total power. Drop --enforce-eager — the 64-second compile time is worth every second.


Benchmark: llama-benchy v0.3.4 | pp: 512-16384 | tg: 32, 128, 512 | concurrency: 1, 2, 4 | 3 runs per test | prefix caching enabled

qwen35-397b-tp4-bench.txt (21.3 KB)


Can you open a PR upstream to github.com/eugr/spark-vllm-docker (Docker configuration for running vLLM on dual DGX Sparks)?

I was happy to find someone with a similar setup, so I’m sharing the results of what I tried in my own environment.
I can’t enable CUDAGraphs in my environment, but is it possible to enable it with sonusflow/spark-vllm-docker?

  • Hardware: 1x DGX Spark + 3x ThinkStation PGX (GB10, 128GB unified memory each, 512GB total)
  • Interconnect: MikroTik CRS812 QSFP-DD switch, 100G RoCEv2 fabric (version 7.21.1, MTU 9000)
  • Model: Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 (GPTQ INT4, 236GB)
  • Runtime: vLLM 0.16.0rc2.dev376+gf4af642a6.cu130 (from vllm/vllm-openai:qwen3_5-cu130)
  • Tensor Parallel: TP=4 across all 4 nodes via Ray
  • Context: 32K max, 8192 max batched tokens
  • Compilation: torch.compile (no CUDAGraphs)
  • Prefix Caching: Enabled
  • NCCL: v2.28.9, RoCEv2, FlashInfer attention backend
  • Benchmark tool: llama-benchy v0.3.4

| model                            |   test |              t/s |     peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:---------------------------------|-------:|-----------------:|-------------:|------------------:|------------------:|------------------:|
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  pp512 | 1192.88 ± 168.29 |              |   1593.46 ± 84.03 |    441.67 ± 84.03 |   1593.49 ± 84.03 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |   tg32 |     10.78 ± 1.73 | 12.20 ± 0.40 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  pp512 | 1545.52 ± 344.10 |              |   1497.26 ± 62.76 |    345.47 ± 62.76 |   1497.29 ± 62.76 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  tg128 |      9.80 ± 1.54 | 12.30 ± 0.46 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp1024 | 2899.55 ± 194.12 |              |   1506.90 ± 23.94 |    355.10 ± 23.94 |   1506.93 ± 23.94 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |   tg32 |     10.74 ± 2.17 | 12.30 ± 0.46 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp1024 | 2922.06 ± 352.63 |              |   1509.91 ± 60.87 |    358.12 ± 60.87 |   1509.94 ± 60.87 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  tg128 |      9.92 ± 1.49 | 12.30 ± 0.46 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp4096 | 1142.19 ± 123.79 |              |  4794.34 ± 515.87 |  3642.55 ± 515.87 |  4794.38 ± 515.87 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |   tg32 |      9.15 ± 2.08 | 11.50 ± 0.50 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp4096 |  1176.59 ± 56.95 |              |  4642.33 ± 176.33 |  3490.54 ± 176.33 |  4642.38 ± 176.33 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  tg128 |      9.20 ± 1.33 | 12.10 ± 0.30 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp8192 |  991.55 ± 108.10 |              | 9532.18 ± 1088.53 | 8380.39 ± 1088.53 | 9532.22 ± 1088.53 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |   tg32 |     10.21 ± 1.01 | 11.50 ± 0.50 |                   |                   |                   |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 | pp8192 |  1054.45 ± 35.75 |              |  8931.00 ± 275.43 |  7779.20 ± 275.43 |  8931.04 ± 275.43 |
| Qwen/Qwen3.5-397B-A17B-GPTQ-Int4 |  tg128 |     10.08 ± 0.87 | 12.20 ± 0.40 |                   |                   |                   |

If you’re measuring the power usage by just taking the nvidia-smi output, you’re not getting the full picture.

Do you have power monitoring at the wall for all 4 during load? I’d imagine it’s closer to 400-500w

@trystan1 - yes, this was nvidia-smi data only. We do have monitoring on them and will report proper consumption some time in the future. Right now we're focusing on getting optimised performance out of them, which is not easy :)

We will be connecting 4 more units in the coming days.

Nice Work! Do we have a rig pictures thread?

I have sent you a DM. Due to a lot of recent releases, we need to work out the best way to optimise the performance of 4x or 8x units. Once we are happy with the results, we will share.

Not sure :) You can set it up - it will be nice to see how the community makes use of these boxes :)


I wanted to share what I’ve learned over the past week running Qwen3.5-397B-A17B (INT4 AutoRound, ~199GB) at TP=4 across 4 DGX Sparks, since some of these findings are pretty specific to the GB10 and might be useful for the community.

What’s working:

  • 37 tok/s single-user decode (peak 39) on Qwen3.5-397B at TP=4 with torch.compile + CUDAGraphs
  • Marlin INT4 GEMM kernels with a custom TP=4 fix for Qwen3.5’s GDN attention layers (upstream PR filed: vllm-project/vllm#35924)
  • FlashInfer attention backend on SM121
  • 200GbE RoCE fabric at 96% line rate (23.89 GB/s busbw on 4-node all_reduce)
  • vLLM v0.16.1rc1 on the eugr/spark-vllm-docker fork with a recipe system we built on top
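
The "96% line rate" figure follows directly from the quoted busbw (200GbE has a theoretical 25 GB/s; both numbers from the list above):

```python
line_rate_gbyte_s = 200 / 8   # 200 Gb/s link = 25 GB/s theoretical
busbw_gbyte_s = 23.89         # measured 4-node all_reduce busbw

utilization = busbw_gbyte_s / line_rate_gbyte_s
assert round(utilization * 100) == 96  # ~96% of line rate
```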

Critical GB10-specific gotchas we discovered:

  1. Driver 580 ONLY. Driver 590 introduces a UMA memory leak (80-96 GiB not released after CUDA exit) and a CUDAGraph capture deadlock. Both are GB10/UMA-specific. NVIDIA forum reps confirmed 580 is the officially supported driver. The container’s CUDA 13.1 forward-compat layer on host driver 580 works perfectly — no need to match versions.

  2. gpu_memory_utilization is broken on unified memory. It works as a gate (crashes at 0.85 if exceeding profiled free) but NOT as a cap — values below the threshold all produce the same KV cache allocation because vLLM profiles the entire shared CPU/GPU pool. Docker cgroup memory limits also don’t work (CUDA UMA bypasses cgroups). Workaround: --num-gpu-blocks-override to directly control KV cache. This affects all Grace Blackwell platforms, not just Spark.

  3. NCCL auto-negotiate beats manual tuning inside vLLM. We did extensive nccl-tests benchmarking (Simple proto, 6 channels = optimal for CX7 over RoCE), but applying those settings to vLLM caused -8 to -15.7% regression. The NCCL autotuner makes better per-operation decisions when interleaved with compute kernels.

  4. torch.compile + CUDAGraphs = 77% speedup over enforce-eager on MoE. The GB10’s Grace ARM CPU is slower at Python/CUDA dispatch than x86, making the kernel launch overhead elimination even more impactful. But CUDAGraph capture needs swap headroom (~23GB swap configured, swappiness=1).

  5. VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8 breaks torch.compile on INT4 GPTQ models. It halves the compiled subgraphs. Only use with actual NVFP4/MXFP4 weights.
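
For the --num-gpu-blocks-override workaround in point 2, you have to translate a byte budget into a block count yourself. A minimal sketch of that arithmetic (the layer/head dimensions here are illustrative placeholders, not Qwen3.5's actual config; vLLM's default KV block size is 16 tokens):

```python
def kv_blocks_for_budget(budget_gib: float, *, block_size: int = 16,
                         num_layers: int = 32, num_kv_heads: int = 8,
                         head_dim: int = 128, dtype_bytes: int = 1,
                         tp: int = 4) -> int:
    # Per token per rank: K and V, for every layer, over this rank's KV-head shard.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes // tp
    bytes_per_block = block_size * bytes_per_token
    return int(budget_gib * 2**30) // bytes_per_block

# With these toy dims, 1 GiB of fp8 KV cache per rank = 4096 blocks:
assert kv_blocks_for_budget(1.0) == 4096
```

Pass the result as --num-gpu-blocks-override on the vLLM command line to pin the KV cache allocation directly, sidestepping the unified-memory profiling issue described above.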
