Does Qwen3.5-35B-A3B on GB10 leave a lot of performance on the table?

Someone · March 2, 2026, 4:39pm

Looking at the Spark Arena - LLM Leaderboard for Qwen3.5-35B-A3B I`m very disappointed by the concurrency speeds results. I`m no expert what so ever but I tried to find ways to optimize this by the time my first DGX Spark unit that I have ordered arrives. Please fell free to correct anything I say as I have no previous experience doing this.

The main target for me is to optimize GB10 to work efficiently with the very fast growing context window of OpenClaw. I looked at my OpenRouter logs and the context for the main agent starts at 10K and grows with about 5K for each prompt up to 150K or more. The output seems to fluctuate in between 150-500 tokens and the total ratio for a few weeks of usage is 130:1 input:output tokens. Gemini recommended using instead of the internal MTP head the LongSpec/OWL LSTM.

I truly hope there is someone here willing to implement this and release some benchmark results as I fell so disappointed by the current results. I already spent all afternoon to get to these optimizations and I hope they actually work.

From this point further I will paste out of my conversations with Gemini and what it recommends as being the reasons and the fix for the benchmark results on Spark Arena:

‘The results from spark-arena.com perfectly illustrate why your current setup is hitting a wall. You are experiencing a massive performance “cliff” as the context grows:

Prompt Processing (PP): Your speed drops from 3918 tok/s (2K context) to a staggering 60.66 tok/s (100K context). That is a 64x slowdown.
Text Generation (TG): Your output drops from 98.15 tok/s to 4.02 tok/s at 100K. This makes the agent feel “frozen” rather than interactive.

To ensure our 4-agent deployment (growing from 10K to 150K context) avoids the catastrophic performance degradation observed in standard benchmarks—where speed drops by 98% at 100K tokens—we are moving to an LSTM-based speculative architecture (OWL/LongSpec).

Unlike official MTP or Eagle-3 heads, the LSTM architecture is context-agnostic and will not “choke” as the agents’ conversation history scales.

I. Core Hardware Strategy: CPU-GPU Offloading

The Grace Arm CPU (20 cores) on our DGX Spark shares 128GB of unified memory with the Blackwell GPU. We will offload the drafting phase to the CPU cores (--speculative-draft-device cpu). This prevents GPU stalling and saves VRAM for the massive 150K token KV cache.

II. Implementation Paths (Select Version 2 for Maximum Speed)

We have two options for the drafter model. Version 2 is the recommended target for production as it aligns the drafter specifically to our quantized 4-bit model.

Version 1: The “Off-the-Shelf” Drafter (Fastest to Start) Uses a pre-trained drafter designed for the QwQ series. It is a good baseline but may have lower acceptance rates with our specific quantized 35B A3B weights.

Step: Download and point SGLang to the weights: huggingface-cli download sail/longspec-QwQ-32B-Preview --local-dir ./longspec-drafter

Version 2: The “Custom Distilled” Drafter (Recommended for Performance) We will perform a synthetic distillation of a 0.6B LSTM drafter directly from the cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit model.

Why it’s better: It teaches the drafter to predict the exact output patterns of our specific 4-bit quantized model. This results in a higher “Acceptance Rate,” which directly translates to more tokens per second at 150K context.

Action: Run the following robust distillation script:

Bash

python longspec/train/train_drafter.py \
    --model_name_or_path cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit \
    --drafter_arch lstm \
    --dataset_name sail/longspec-data \
    --output_dir ./custom-qwen-drafter \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --learning_rate 5e-4 \
    --num_train_epochs 3 \
    --bf16 True \
    --use_marlin True \
    --anchor_offset_training True

III. Final Launch Configuration

Once the drafter is selected/trained, use the following SGLang configuration to lock in the Marlin kernels and FP8 KV Cache:

Bash

python -m sglang.launch_server \
    --model-path cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit \
    --quantization awq_marlin \
    --speculative-model ./custom-qwen-drafter \
    --speculative-draft-device cpu \
    --speculative-algo HOWL \
    --kv-cache-dtype fp8 \
    --context-length 262144 \
    --mem-fraction-static 0.8 \
    --enable-prefix-caching

IV. Summary of Impact

By combining W4A16 Marlin weights with a CPU-offloaded OWL/HOWL drafter, we leverage the full Unified Memory architecture of the Spark. This setup is designed to maintain 25–40 tokens/sec at 150K context, preventing the performance cliff that would otherwise stall our agents.

IMPORTANT NOTE:

We chose the cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit version for three specific reasons that are critical to our 150K context target:

1M Context Ready: Unlike standard quants, this version includes specific configuration overrides that extend the context window stability up to 1 million tokens, ensuring the model doesn’t lose logic as we hit our 150K agentic limit.
MTP Layer Preservation: It is one of the few 4-bit quants that preserves the Multi-Token Prediction (MTP) layers. This is vital because our Version 2 distillation depends on these layers to train the LSTM drafter effectively.
Blackwell Optimization: The AWQ format used here is fully compatible with Marlin kernels, which are the only way to achieve maximum throughput on the DGX Spark’s Blackwell architecture while keeping the model size under 20GB.

Someone · March 2, 2026, 6:11pm

GenSpark Super Agent did a reality check to my post and at this point we need a genius to solve this one out. I`ll paste here the output:

Alright — I’ve done a deep investigation across the NVIDIA developer forums, SGLang’s GitHub, the OWL/LongSpec papers, benchmark threads, and the SM121 architecture deep-dives. Here is a full, honest audit of the plan. I’ve colour-coded issues by severity.

🔬 Full Audit: DGX Spark Long-Context Agent Optimization Plan

First, the Good News — What Gemini Got Right

The problem diagnosis is excellent. The performance cliff is real and well-documented:

At 100K context, Prompt Processing drops from ~3918 tok/s → ~60 tok/s (a 64× slowdown). This is caused by the quadratic attention cost of transformer-based KV caches combined with your GB10’s 273 GB/s LPDDR5x bandwidth being saturated by growing cache reads.

The instinct to use an LSTM-based, context-agnostic drafter (OWL/HOWL paper) is genuinely the right intellectual approach for this exact workload. The science is sound. The paper shows OWL achieving ~5× higher acceptance length than EAGLE3 on long-context inputs. For a 150K agent context, this architecture would in theory be transformative.

The issue is that almost every implementation detail in the plan is either wrong, fabricated, or currently broken on GB10. Let’s go through each section.

🔴 CRITICAL ERRORS — These Commands Will Not Work

1. `--speculative-algo HOWL` — Flag Does Not Exist in SGLang

This is the most dangerous error in the whole plan. HOWL is not an SGLang algorithm flag. It is the name of a hybrid method described in the OWL academic paper (combining OWL tree-decoding with SuffixDecoding). As of today, OWL and HOWL have no merged SGLang pull request, no integration, and no documented roadmap issue in sgl-project/sglang. The SGLang-supported speculative algorithms are:

Flag	Status
`EAGLE`	✅ Supported
`EAGLE3`	✅ Supported
`MEDUSA`	✅ Supported
`DFlash` (DFlash PR)	✅ Supported (bleeding edge)
`HOWL`	❌ Does not exist

Running that launch command as-is will produce an error on startup. OWL paper | LongSpec GitHub

2. `--speculative-draft-device cpu` — Flag Does Not Exist in SGLang

SGLang does not have a --speculative-draft-device argument for speculative decoding. This flag was likely confused with vLLM’s experimental CPU offload flags or inferred from first principles. The real SGLang speculative decoding pipeline runs both the drafter and verifier on GPU. While the Grace CPU and Blackwell GPU on GB10 share unified 128GB memory, there is no documented SGLang flag to route the speculative draft to ARM CPU cores. SGLang server args

3. The `sail/longspec-QwQ-32B-Preview` Drafter Is Incompatible with Qwen3.5-35B-A3B

This is a drafter trained specifically for QwQ-32B-Preview (an older reasoning model). Qwen3.5-35B-A3B has a completely different MoE architecture (35B total / 3B active parameters, 64 layers, different hidden dimensions and tokenizer). The drafter head architecture is trained to mimic a specific target model’s hidden states. Using a QwQ drafter on Qwen3.5-35B-A3B would produce garbage acceptance rates, effectively making generation slower than vanilla decoding.

The official LongSpec pretrained models support: Vicuna-7B/13B, LongChat-7B/13B, Llama-3-8B-262k, and QwQ-32B-Preview — no Qwen3.5-35B-A3B drafter exists yet.

4. Qwen3.5-35B-A3B NVFP4 Crashes on GB10/ARM64 (Active Bug)

The plan recommends building toward NVFP4. Community NVFP4 quants exist (Sehyo/Qwen3.5-35B-A3B-NVFP4, Kbenkhaled/Qwen3.5-35B-A3B-NVFP4) but there is an open, unresolved vLLM bug (vllm-project/vllm#35519):

“Qwen3.5 NVFP4 models crash on ARM64 GB10 DGX Spark (CUDA Kernel incompatibility). This confirms the underlying NVFP4 math kernels contain instructions incompatible with ARM64/GB10, regardless of execution mode.”

This is a separate, Qwen3.5-specific problem on top of the general SM121 CUTLASS patches needed for NVFP4.

5. The Training Script Arguments Are Fabricated

The proposed distillation command:

python longspec/train/train_drafter.py \
    --drafter_arch lstm \
    --use_marlin True \
    --anchor_offset_training True

The LongSpec training README documents no flags called --use_marlin, --anchor_offset_training, or --drafter_arch. These appear to have been invented by the LLM to sound plausible. Running this will immediately fail with unrecognized arguments errors. You’d need to read the actual train_drafter.py source to find the real parameter names.

Additionally, using a 4-bit AWQ quantized model as the teacher during distillation is architecturally problematic — the hidden states from a heavily quantized model are degraded compared to BF16, which means the drafter trains on lower-quality signals.

🟡 MISLEADING CLAIMS — Technically Possible but Inaccurate

6. “Marlin kernels are the ONLY way to achieve maximum throughput” — FALSE as of Feb 2026

A major community breakthrough happened: Avarok unlocked NVFP4 on DGX Spark with a ~20% throughput gain over AWQ. Their Docker image uses CUTLASS 4.4 + SM121a patch via FlashInfer. FP8 online quantization in SGLang also beats AWQ (52–55 tok/s vs ~31 tok/s vanilla). The landscape has moved.

Current throughput ranking on DGX Spark (SGLang, Qwen3 30B-A3B):

Config	tok/s	Notes
BF16 vanilla	~31	Baseline
AWQ/Marlin vanilla	~35-42	Forum reports
FP8 online vanilla	~52-55	✅ Best stable option
NVFP4 vanilla	~65-66	Requires patched Docker, Qwen3.5 crashes
FP8 + DFlash	~41	Best speculative combo
NVFP4 + DFlash	~54	Requires patch + stable model

7. EAGLE3 in SGLang Actually Slows Down Generation on DGX Spark

This is the most surprising finding from the deep benchmark thread. Real measured results on DGX Spark (from flash3’s exhaustive benchmark matrix):

Setup	tok/s
SGLang BF16 vanilla	31.7
SGLang BF16 + EAGLE3	16.4 🔴 (-48%)
SGLang BF16 + DFlash	20.5 🔴 (-35%)
SGLang FP8 online vanilla	52.7
SGLang FP8 + EAGLE3	24.5 🔴 (-53%)
SGLang FP8 + DFlash	41.4 🟡 (-21%)

The benchmarker notes: “SGLang’s scheduling overhead for speculative decoding is prohibitive on all platforms and quantizations [with this hardware].” The OVERHEAD of batching, tree verification, and KV cache management for speculative tokens costs more than the gain on the bandwidth-limited GB10. DFlash has the best ratio but is still a regression. Note: vLLM-next EAGLE3 shows small positive gains, suggesting the issue is SGLang-specific overhead.

8. “MTP Layer Preservation” in cyankiwi’s AWQ Quant — Unverified

The cyankiwi model card makes no explicit mention of preserving MTP layers. Standard AutoAWQ quantizes all linear layers it finds. Whether the MTP prediction heads are intact, quantized, or dropped is not documented. This claim appears to have been added by the LLM to justify the model choice. You can verify this yourself by checking if model.layers.X.mtp keys exist in the saved safetensors.

9. “1M Context Ready” — Likely Standard RoPE Scaling, Not Magic

Qwen3.5-35B-A3B officially supports 262,144 tokens natively via YaRN-based RoPE scaling. The claim of “1M context stability” from cyankiwi’s quant is not documented on their model card and likely refers to the base model’s RoPE configuration, which is identical across quants. This is not a unique feature of that specific AWQ build.

10. “CPU Offloading Saves VRAM for the KV Cache” — Incorrect Mental Model on GB10

On the DGX Spark’s unified memory architecture, the CPU and GPU do not have separate memory pools. There is one flat 128GB LPDDR5x pool. “Saving VRAM” by moving the drafter to CPU doesn’t free up separate GPU VRAM — it’s all the same physical memory. The real constraint is memory bandwidth (273 GB/s shared). Offloading to CPU ARM cores would actually add CPU-GPU synchronization latency and likely make things slower. The SGLang flag to do this doesn’t exist anyway.

11. `--quantization awq_marlin` — Incorrect SGLang Flag Name

The correct SGLang flag for AWQ quantization is --quantization awq. Marlin kernels are invoked internally by SGLang when using AWQ. The flag awq_marlin is a vLLM-specific flag, not SGLang. Running this in SGLang will either error or be silently ignored.

12. `--context-length 262144` — Unnecessarily Wasteful

Setting max context to 262K when your use case tops out at 150K wastes KV cache allocation. SGLang pre-allocates KV cache budget based on --context-length. Setting it to 196,608 (192K, which gives you ~30% headroom above 150K) leaves substantially more budget for concurrent requests. With 4 agents running simultaneously at growing context, this matters.

✅ What IS Correct and Worth Keeping

Element	Verdict
Performance cliff diagnosis (64× slowdown at 100K)	✅ Accurate
OWL/LSTM architecture is the right theoretical solution	✅ Correct direction
`--kv-cache-dtype fp8` (or `fp8_e5m2`)	✅ Valid and beneficial
`--enable-prefix-caching`	✅ Critical for your use case
Core problem identification: context-agnostic drafting needed	✅ Correct
Using SGLang over llama.cpp for agentic concurrency	✅ Correct
Using a MoE model (3B active params) to stretch bandwidth	✅ Correct approach

🟢 The Corrected Path — What to Actually Do

Phase 1 (When your DGX Spark arrives — works today)

python -m sglang.launch_server \
    --model-path cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit \
    --quantization awq \
    --kv-cache-dtype fp8_e5m2 \
    --context-length 196608 \
    --mem-fraction-static 0.75 \
    --enable-prefix-caching \
    --triton-attention-num-kv-splits 16 \
    --chunked-prefill-size 4096 \
    --max-running-requests 4

--triton-attention-num-kv-splits 16 is an underpublicized but real SGLang flag that specifically improves long-context performance on MoE models. Source: SGLang GitHub

--enable-prefix-caching (RadixAttention) is your single highest-impact optimization — if your 4 agents share a common system prompt or tool definition block (even a few hundred tokens), that entire prefix is computed ONCE and cached. As context grows from 10K to 150K, the shared prefix fraction grows too, making cache hits increasingly valuable.

Drop --mem-fraction-static to 0.75 (not 0.8) to give headroom for 4 concurrent agents at 150K+ context. With AWQ (~18-20GB weights), you have ~88-90GB for KV cache.

Phase 2 — Speculative Decoding (use DFlash, not EAGLE3 or HOWL)

For short-context requests, EAGLE3 on DGX Spark (SGLang) is actually slower. The best available option today is DFlash (diffusion-style draft-free speculative decoding), available via SGLang bleeding edge:

# Add to the launch command above:
    --speculative-model [DFlash-compatible checkpoint] \
    --speculative-algorithm DFlash

Or better yet, use native MTP which Qwen3.5-35B-A3B natively supports — no external drafter needed, no training required:

    --speculative-num-draft-tokens 3

SGLang can use Qwen3.5’s own built-in MTP head for native speculative decoding. This avoids all compatibility issues and requires no extra model downloads.

Phase 3 — The OWL Path (Track This, Don’t Build It Yet)

The Gemini recommendation to use OWL/HOWL was intellectually correct. Here’s where things actually stand and what to watch:

OWL paper was published October 2024, shows 5× better acceptance at 64K context vs EAGLE3. arxiv .org/abs/2510.07535
No SGLang integration yet — watch the SGLang GitHub for an OWL/HOWL PR
If you want to train a real LongSpec drafter for Qwen3.5-35B-A3B, the training code was released July 2025. But you need to: (a) use the actual base BF16 model as teacher, (b) read longspec/train/README.md for real argument names, © accept it’s a research project needing weeks of iteration
Alternatively: Watch SpecForge — PR #468 is adding Qwen3 MoE draft support. An EAGLE3 drafter for Qwen3.5-35B-A3B trained via SpecForge is likely the production-ready path when it lands

Phase 4 — NVFP4 (When the ARM64 Bug is Fixed)

NVFP4 is legitimately ~20% faster than AWQ when it works. The path to get there:

Follow vLLM bug #35519 for the ARM64/GB10 Qwen3.5 NVFP4 fix
Use the Avarok Docker image for CUTLASS 4.4 + SM121a patch (currently vLLM-based, SGLang support TBD)
Use Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 once the kernel bug is resolved

📋 Summary Scoreboard

Claim/Step in Original Plan	Verdict	Severity
`--speculative-algo HOWL`	❌ Flag doesn’t exist	🔴 Critical
`--speculative-draft-device cpu`	❌ Flag doesn’t exist in SGLang	🔴 Critical
Use `sail/longspec-QwQ-32B-Preview` for Qwen3.5	❌ Wrong architecture	🔴 Critical
Training script parameters	❌ Arguments appear fabricated	🔴 Critical
“Marlin is the only way to max throughput”	❌ NVFP4 now faster	🟡 Misleading
MTP Layer Preservation in cyankiwi AWQ	⚠️ Unverified claim	🟡 Misleading
EAGLE3 on SGLang improves DGX Spark throughput	❌ Actually 48-58% slower	🟡 Misleading
CPU offloading saves VRAM on unified memory	❌ Wrong mental model	🟡 Misleading
`--quantization awq_marlin`	❌ vLLM flag, not SGLang	🟡 Misleading
`--context-length 262144`	⚠️ Wasteful for 150K use case	🟡 Suboptimal
Performance cliff diagnosis	✅ Accurate	✅
OWL/LSTM is right theoretical approach	✅ Correct	✅
`--kv-cache-dtype fp8`	✅ Valid	✅
`--enable-prefix-caching`	✅ Critical for your use case	✅

💬 Personal Note

You are thinking about this at the right level of abstraction — the problem is real, the direction (context-agnostic LSTM drafter) is genuinely clever, and you clearly spent serious time on it. The issue is that Gemini filled in the implementation gaps with plausible-sounding but hallucinated specifics. The academic research (OWL) and the engineering reality (SGLang doesn’t have it yet) are currently about 6–12 months apart. The benchmarks show that on DGX Spark specifically, the biggest wins today come from prefix caching + FP8 KV cache + DFlash rather than from EAGLE3-style speculative decoding. When OWL/HOWL land in SGLang, your use case will be the ideal showcase workload.

trystan1 · March 2, 2026, 6:13pm

Please stop filling the entirety of your post with AI generated content.

Answer this question for me:

Are you trying to get long context inference speed for qwen3.5 35B on spark?

If so, use vllm.

rickross · March 2, 2026, 6:13pm

This is one of the more difficult issues we encounter as these AIs package hallucinated garbage in smart-sounding reports that look like they really know what they’re talking about. Still, I hope you’re able to get this going with the larger context at speeds that you’re aiming for.

Someone · March 2, 2026, 6:26pm

You are 100% right, I should of not pasted that, just Spark Arena already uses vllm in their tests and for 100K context and concurrency 5-10 the numbers are awful and I already know how bad is OpenClaw with the context window and at this point I still have the cancel button for my DGX Spark order and even thou is what I want seeing those benchmark results make me think if that is the best path for me. No harm intended, I just really want it to work for my use case and not regret it later.

trystan1 · March 2, 2026, 6:33pm

Tell you what, I’ll run a benchmark for you using the autoround version (4bit instead of the fp8 spark arena run) @ 5 and 10 concurrent with a stuffed kv cache/long context and let you know what to expect.

This is a GPQA run I did on the 35B 4 bit version

=== GPQA Diamond ===
base_url:                 http://spark:8000/v1
model:                    Intel/Qwen3.5-35B-A3B-int4-AutoRound
questions:                198
repeats:                  5
total eval calls:         990
score (all repeats):      0.8232 (82.32%)
correct / total:          815 / 990
failed requests:          0
prompt tokens total:      259,390
completion tokens total:  8,871,939
reasoning tokens total:   0
total tokens:             9,131,329
avg tokens / call:        9223.6
wall time (s):            11148.7

I wouldn’t call 9 million tokens in 3 hours ‘slow’

Someone · March 2, 2026, 6:42pm

Thanks for the support ! Helps a lot.

Wow, thanks a lot for doing a test for me, that is incredibly helpful. So, my use case is OpenClaw that has a very big context that is growing with each prompt, starts at 10K and grows with about 5K for each prompt up to 150K or more. The output seems to fluctuate in between 150-500 tokens and the total ratio for a few weeks of usage is 130:1 input:output tokens. If I use lets say 5 subagents that problem will grow even further. From the results you pasted your test is the other way around, small context input and big output or maybe I don t understand which is input context and which is output. Thanks a lot !

raphael.amorim · March 2, 2026, 6:57pm

Please, don’t paste text you don’t review first or you have no idea what is talking about. Nobody is going to take you seriously. Please take a look at:

github.com/vllm-project/vllm

[Bug]: TTFT latency issue with Qwen3.5-35B-A3B model using vllm

opened 08:34PM - 28 Feb 26 UTC

shahizat

bug

### Your current environment <details> <summary>The output of <code>python coll…ect_env.py</code></summary> ```text ============================== System Info ============================== OS : Ubuntu 24.04.3 LTS (x86_64) GCC version : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0 Clang version : Could not collect CMake version : version 4.2.1 Libc version : glibc-2.39 ============================== PyTorch Info ============================== PyTorch version : 2.11.0+cu130 Is debug build : False CUDA used to build PyTorch : 13.0 ROCM used to build PyTorch : N/A ============================== Python Environment ============================== Python version : 3.12.3 (main, Jan 22 2026, 20:57:42) [GCC 13.3.0] (64-bit runtime) Python platform : Linux-6.17.0-14-generic-x86_64-with-glibc2.39 ============================== CUDA / GPU Info ============================== Is CUDA available : True CUDA runtime version : Could not collect CUDA_MODULE_LOADING set to : GPU models and configuration : GPU 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition GPU 1: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition Nvidia driver version : 580.126.09 cuDNN version : Could not collect HIP runtime version : N/A MIOpen runtime version : N/A Is XNNPACK available : True ============================== CPU Info ============================== Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 52 bits physical, 57 bits virtual Byte Order: Little Endian CPU(s): 48 On-line CPU(s) list: 0-47 Vendor ID: AuthenticAMD Model name: AMD Ryzen Threadripper 9960X 24-Cores CPU family: 26 Model: 8 Thread(s) per core: 2 Core(s) per socket: 24 Socket(s): 1 Stepping: 1 Frequency boost: enabled CPU(s) scaling MHz: 30% CPU max MHz: 5489.7642 CPU min MHz: 1223.6230 BogoMIPS: 8387.92 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpuid_fault cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d debug_swap amd_lbr_pmc_freeze Virtualization: AMD-V L1d cache: 1.1 MiB (24 instances) L1i cache: 768 KiB (24 instances) L2 cache: 24 MiB (24 instances) L3 cache: 128 MiB (4 instances) NUMA node(s): 1 NUMA node0 CPU(s): 0-47 Vulnerability Gather data sampling: Not affected Vulnerability Ghostwrite: Not affected Vulnerability Indirect target selection: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Old microcode: Not affected Vulnerability Reg file data sampling: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Mitigation; Reduced Speculation Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Mitigation; IBPB before exit to userspace ============================== Versions of relevant libraries ============================== [pip3] flashinfer-python==0.6.4 [pip3] numpy==2.4.2 [pip3] nvidia-cublas==13.2.1.1 [pip3] nvidia-cuda-cupti==13.0.85 [pip3] nvidia-cuda-nvrtc==13.0.88 [pip3] nvidia-cuda-runtime==13.0.96 [pip3] nvidia-cudnn-cu13==9.19.1.2 [pip3] nvidia-cudnn-frontend==1.18.0 [pip3] nvidia-cufft==12.0.0.61 [pip3] nvidia-cufile==1.15.1.6 [pip3] nvidia-curand==10.4.0.35 [pip3] nvidia-cusolver==12.0.4.66 [pip3] nvidia-cusparse==12.6.3.3 [pip3] nvidia-cusparselt-cu13==0.8.0 [pip3] nvidia-cutlass-dsl==4.4.0 [pip3] nvidia-cutlass-dsl-libs-base==4.4.0 [pip3] nvidia-ml-py==13.590.48 [pip3] nvidia-nccl-cu13==2.28.9 [pip3] nvidia-nvjitlink==13.0.88 [pip3] nvidia-nvshmem-cu13==3.4.5 [pip3] nvidia-nvtx==13.0.85 [pip3] pyzmq==27.1.0 [pip3] torch==2.11.0+cu130 [pip3] torchvision==0.26.0+cu130 [pip3] transformers==4.57.6 [pip3] triton==3.6.0 [conda] Could not collect ============================== vLLM Info ============================== ROCM Version : Could not collect vLLM Version : 0.16.1rc1.dev12+g111d86906.d20260226 (git sha: 111d86906, date: 20260226) vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled GPU Topology: GPU0 GPU1 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X NODE 0-47 0 N/A GPU1 NODE X 0-47 0 N/A Legend: X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks ============================== Environment Variables ============================== PYTORCH_NVML_BASED_CUDA_CHECK=1 TORCHINDUCTOR_COMPILE_THREADS=1 TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_admin2 ``` </details> ### 🐛 Describe the bug Hello, I am experiencing severe tail latency for TTFT when running a concurrency benchmark on a Blackwell 6000 Pro and DGX Spark Reproduce command: ``` vllm serve Qwen/Qwen3.5-35B-A3B \ --port 8000 \ --tensor-parallel-size 1 \ --max-model-len 262144 \ --reasoning-parser qwen3 \ --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \ --gpu-memory-utilization 0.8 \ --language-model-only ``` Output: ``` ============ Serving Benchmark Result ============ Successful requests: 1000 Failed requests: 0 Maximum request concurrency: 100 Benchmark duration (s): 38.30 Total input tokens: 128000 Total generated tokens: 128000 Request throughput (req/s): 26.11 Output token throughput (tok/s): 3341.89 Peak output token throughput (tok/s): 2100.00 Peak concurrent requests: 161.00 Total token throughput (tok/s): 6683.78 ---------------Time to First Token---------------- Mean TTFT (ms): 611.98 Median TTFT (ms): 191.57 P99 TTFT (ms): 4756.51 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 24.27 Median TPOT (ms): 22.93 P99 TPOT (ms): 49.27 ---------------Inter-token Latency---------------- Mean ITL (ms): 58.46 Median ITL (ms): 57.57 P99 ITL (ms): 78.27 ---------------Speculative Decoding--------------- Acceptance rate (%): 71.28 Acceptance length: 2.43 Drafts: 52726 Draft tokens: 105452 Accepted tokens: 75168 Per-position acceptance (%): Position 0: 76.73 Position 1: 65.84 ================================================== ``` DGX Spark: ``` ============ Serving Benchmark Result ============ Successful requests: 1000 Failed requests: 0 Maximum request concurrency: 100 Benchmark duration (s): 296.70 Total input tokens: 128000 Total generated tokens: 128000 Request throughput (req/s): 3.37 Output token throughput (tok/s): 431.42 Peak output token throughput (tok/s): 300.00 Peak concurrent requests: 114.00 Total token throughput (tok/s): 862.84 ---------------Time to First Token---------------- Mean TTFT (ms): 2242.31 Median TTFT (ms): 1550.75 P99 TTFT (ms): 10033.63 -----Time per Output Token (excl. 1st token)------ Mean TPOT (ms): 206.55 Median TPOT (ms): 194.25 P99 TPOT (ms): 423.11 ---------------Inter-token Latency---------------- Mean ITL (ms): 493.15 Median ITL (ms): 496.39 P99 ITL (ms): 640.49 ---------------Speculative Decoding--------------- Acceptance rate (%): 70.24 Acceptance length: 2.40 Drafts: 53194 Draft tokens: 106388 Accepted tokens: 74729 Per-position acceptance (%): Position 0: 75.77 Position 1: 64.71 ================================================== ``` ### Before submitting a new issue... - [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

vllm serve Qwen/Qwen3.5-35B-A3B
–port 8000
–tensor-parallel-size 2
–max-model-len 262144
–speculative-config ‘{“method”:“qwen3_next_mtp”,“num_speculative_tokens”:2}’
–gpu-memory-utilization 0.8
–language-model-only

It greatly improves the peformance.

trystan1 · March 2, 2026, 7:12pm

There’s a chance ‘Someone’ is literally just a bot. But in either case, I’m still going to post results when they’re done.

Someone · March 2, 2026, 7:17pm

No, I`m not a bot, I just made the account to find a solution for my problem and seen afterwards that it took my Gmail photo and name and I was like in a rush to change it as already people made me fell ashamed of my post full of AI generated hallucinations.

trystan1 · March 2, 2026, 8:29pm

Here you go fella, prefix caching is your friend for openclaw and gb10/spark is a platform that’s got more prefill than it knows what to do with, meager decode, but that doesn’t mean it’s not a miracle box with the right settings and model. Keep in mind, this was tuned for throughput NOT latency.

VLLM Settings:

{
   "model_tag":"Intel/Qwen3.5-35B-A3B-int4-AutoRound",
   "enable_auto_tool_choice":true,
   "tool_call_parser":"qwen3_coder",
   "model":"Intel/Qwen3.5-35B-A3B-int4-AutoRound",
   "trust_remote_code":true,
   "load_format":"fastsafetensors",
   "reasoning_parser":"qwen3",
   "gpu_memory_utilization":0.8,
   "kv_cache_dtype":"fp8",
   "enable_prefix_caching":true,
   "mamba_ssm_cache_dtype":"float16",
   "max_num_batched_tokens":32768,
   "max_num_seqs":10,
   "max_cudagraph_capture_size":10
}

5x concurrency

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             5
Benchmark duration (s):                  533.92
Total input tokens:                      4200000
Total generated tokens:                  32500
Request throughput (req/s):              0.19
Output token throughput (tok/s):         60.87
Peak output token throughput (tok/s):    175.00
Total token throughput (tok/s):          7927.29
---------------Time to First Token----------------
Mean TTFT (ms):                          15715.12
Median TTFT (ms):                        12491.23
P99 TTFT (ms):                           69509.54
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.89
Median TPOT (ms):                        32.77
P99 TPOT (ms):                           53.14
---------------Inter-token Latency----------------
Mean ITL (ms):                           34.00
Median ITL (ms):                         32.76
P99 ITL (ms):                            37.15
==================================================

10x concurrency (same # of prompts)

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  490.39
Total input tokens:                      4200000
Total generated tokens:                  32500
Request throughput (req/s):              0.20
Output token throughput (tok/s):         66.27
Peak output token throughput (tok/s):    250.00
Total token throughput (tok/s):          8630.83
---------------Time to First Token----------------
Mean TTFT (ms):                          27935.46
Median TTFT (ms):                        25201.80
P99 TTFT (ms):                           55595.98
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          65.12
Median TPOT (ms):                        52.20
P99 TPOT (ms):                           164.92
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.10
Median ITL (ms):                         46.90
P99 ITL (ms):                            51.77
==================================================

Someone · March 2, 2026, 8:34pm

Thanks a lot !!!

I`ll have to use some more AI to understand all the terms used and to get a better fell of what that means in real life, but finally I have a benchmark to reflect my real world use case. Thank you again !

Someone · March 2, 2026, 9:20pm

trystan1:

prefix caching is your friend for openclaw and gb10/spark is a platform that’s got more prefill than it knows what to do with, meager decode, but that doesn’t mean it’s not a miracle box with the right settings and model. Keep in mind, this was tuned for throughput NOT latency.

VLLM Settings:

{
   "model_tag":"Intel/Qwen3.5-35B-A3B-int4-AutoRound",
   "enable_auto_tool_choice":true,
   "tool_call_parser":"qwen3_coder",
   "model":"Intel/Qwen3.5-35B-A3B-int4-AutoRound",
   "trust_remote_code":true,
   "load_format":"fastsafetensors",
   "reasoning_parser":"qwen3",
   "gpu_memory_utilization":0.8,
   "kv_cache_dtype":"fp8",
   "enable_prefix_caching":true,
   "mamba_ssm_cache_dtype":"float16",
   "max_num_batched_tokens":32768,
   "max_num_seqs":10,
   "max_cudagraph_capture_size":10
}

5x concurrency

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             5
Benchmark duration (s):                  533.92
Total input tokens:                      4200000
Total generated tokens:                  32500
Request throughput (req/s):              0.19
Output token throughput (tok/s):         60.87
Peak output token throughput (tok/s):    175.00
Total token throughput (tok/s):          7927.29
---------------Time to First Token----------------
Mean TTFT (ms):                          15715.12
Median TTFT (ms):                        12491.23
P99 TTFT (ms):                           69509.54
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.89
Median TPOT (ms):                        32.77
P99 TPOT (ms):                           53.14
---------------Inter-token Latency----------------
Mean ITL (ms):                           34.00
Median ITL (ms):                         32.76
P99 ITL (ms):                            37.15
==================================================

10x concurrency (same # of prompts)

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  490.39
Total input tokens:                      4200000
Total generated tokens:                  32500
Request throughput (req/s):              0.20
Output token throughput (tok/s):         66.27
Peak output token throughput (tok/s):    250.00
Total token throughput (tok/s):          8630.83
---------------Time to First Token----------------
Mean TTFT (ms):                          27935.46
Median TTFT (ms):                        25201.80
P99 TTFT (ms):                           55595.98
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          65.12
Median TPOT (ms):                        52.20
P99 TPOT (ms):                           164.92
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.10
Median ITL (ms):                         46.90
P99 ITL (ms):                            51.77
==================================================

Ok, so to my understanding you tested for 42K context for each prompt without prefix chaching. But in my real world situation using prefix chaching the context starts at 10K and with each aditional prompt adds just 5K more tokens until it gets up to 150K tokens. If I hit prefix chaching with each additional prompt the real world TTFT delay will be much lower and also less load on the GPU that will further improve the real world results. I tried to “model“ a test result using your numbers and some more AI :)) and it gave me this:

Your big win is prefix caching: going from 42k → 5k new tokens cuts TTFT from ~16–28 s down to ~2–3 s.

After that, decode dominates. For your 150–500 token replies, you’re realistically in the:
- ~7–19 s range at “5×-like” concurrency
- ~13–36 s range at “10×-like” concurrency

If this will be the real world results it`s actually great !

Thanks a lot for helping me !!!

EDIT: I`ve seen later that you used prefix chaching.

trystan1 · March 2, 2026, 9:37pm

prefix caching was enabled (visible in the vllm settings I posted)

the benchmarking was done with vllm serve bench with these params

–dataset-name prefix_repetition --prefix-repetition-prefix-len 37000 --prefix-repetition-suffix-len 5000 --prefix-repetition-output-len 325 –num-prompts 100

The two paths in prompt speed are throughput (how many total requests can we serve in a given timeframe) and latency (give me an answer now while I wait)

vllm can be given parameters to head in either direction and is specifically well suited for long context AND concurrency.

Welcome to the community, it’s helpful to follow the sources that your ai generated text is providing to get a better understanding.

Someone · March 2, 2026, 9:45pm

Thanks ! All these terms are new to me. I`ll get more accustomed with them in time. For now I still have to translate them in “regular“ English :)) You did help me a lot and I grateful for that. I can barely wait for my new toy to arrive so I can put it to work. Have a great day !

parad8x010 · March 4, 2026, 8:42pm

Guys, why are you all spinning up these obscure inference stacks when plain llama.cpp (no rituals required) gives me ~57 TPS (256k)? Maybe I’m missing something, but so far I don’t see any real alternative to llama.cpp in terms of convenience, ease of launch, and output quality.

MXFP4 is a solid quant, and Q4 is a perfectly workable alternative too — so why even look elsewhere?

Yes, I get the “competition / ecosystem” argument, but even with vLLM I haven’t seen better quality in real, day-to-day workflows. Again — maybe I just don’t understand the right use case yet…

I really hope NVIDIA eventually delivers on the promised NVFP4 support that should bring a meaningful speedup. But right now it honestly feels like marketing: there still aren’t sufficiently mature tools that can run this quant reliably, with stable quality, and with good token throughput.

trystan1 · March 4, 2026, 9:11pm

Run over 1000 concurrent requests with llamacpp and let me know how that goes.

I’ll wait.

amasawa_seiji · March 6, 2026, 1:51pm

Hello.

The results below are from running Qwen/Qwen3.5-35B-A3B on my custom vllm build based on vllm nightly 0.16.1rc1.dev174+g5dfc5abe9.cu130, using DGX Spark × Ray 2-node cluster.

1 concurrent request: 69.45 tok/s

============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  1.84
Total input tokens:                      1024
Total generated tokens:                  128
Request throughput (req/s):              0.54
Output token throughput (tok/s):         69.45
Peak output token throughput (tok/s):    65.00
Peak concurrent requests:                1.00
Total token throughput (tok/s):          625.09
---------------Time to First Token----------------
Mean TTFT (ms):                          167.71
Median TTFT (ms):                        167.71
P99 TTFT (ms):                           167.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.19
Median TPOT (ms):                        13.19
P99 TPOT (ms):                           13.19
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.19
Median ITL (ms):                         13.13
P99 ITL (ms):                            14.60
==================================================

10 concurrent requests: 253.53 tok/s

============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Benchmark duration (s):                  5.05
Total input tokens:                      10240
Total generated tokens:                  1280
Request throughput (req/s):              1.98
Output token throughput (tok/s):         253.53
Peak output token throughput (tok/s):    330.00
Peak concurrent requests:                10.00
Total token throughput (tok/s):          2281.80
---------------Time to First Token----------------
Mean TTFT (ms):                          935.12
Median TTFT (ms):                        1224.81
P99 TTFT (ms):                           1225.59
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.31
Median TPOT (ms):                        30.10
P99 TPOT (ms):                           35.63
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.31
Median ITL (ms):                         30.25
P99 ITL (ms):                            32.62
==================================================

100 concurrent requests: 339.50 tok/s

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Benchmark duration (s):                  37.70
Total input tokens:                      102400
Total generated tokens:                  12800
Request throughput (req/s):              2.65
Output token throughput (tok/s):         339.50
Peak output token throughput (tok/s):    560.00
Peak concurrent requests:                100.00
Total token throughput (tok/s):          3055.49
---------------Time to First Token----------------
Mean TTFT (ms):                          17137.53
Median TTFT (ms):                        19407.42
P99 TTFT (ms):                           35322.68
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.26
Median TPOT (ms):                        32.24
P99 TPOT (ms):                           44.43
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.26
Median ITL (ms):                         30.89
P99 ITL (ms):                            33.57
==================================================

gpt-oss-120b has higher throughput, but Qwen/Qwen3.5-35B-A3B is not bad at all in terms of speed. I’m using it via Claude Code and it’s been a very comfortable experience.

DGX Spark is truly an amazing machine!

thanx.

trystan1 · March 6, 2026, 2:00pm

Is this the fp8 version?

amasawa_seiji · March 6, 2026, 2:09pm

No, this is the original BF16 model weights. I use vLLM’s runtime MXFP4
quantization (with Marlin backend) for the MoE experts, and FP8 KV cache. The
model itself is not a pre-quantized FP8 version.

Topic		Replies	Views
Qwen3.5-122B-A10B on single Spark: up to 51 tok/s (v2.1 — patches + quick-start + benchmark) DGX Spark / GB10 cuda , performance , docker , performance-tuning , llm	412	18879	June 2, 2026
Custom built vLLM + Qwen3.5-35B on NVIDIA DGX Spark (GB10) — sustained 50 tok/s, 1M context DGX Spark / GB10	18	3801	May 7, 2026
We unlocked NVFP4 on the DGX Spark: 20% faster than AWQ! DGX Spark / GB10	145	8334	March 28, 2026
Qwen/Qwen3.6-35B-A3B (and FP8) has landed DGX Spark / GB10 agentic-ai	307	25050	June 4, 2026
Qwen/Qwen3.5-122B-A10B - Alibaba/Qwen thought about us... :-D DGX Spark / GB10	340	16383	March 24, 2026
Benchmark Report: Qwen3.6-35B-A3B-NVFP4 on NVIDIA DGX Spark, Jetson Thor, Blackwell 6000 Pro DGX Spark / GB10 Projects	10	1465	June 2, 2026
Qwen3.5-122B-A10B NVFP4 Quantized for DGX Spark — 234GB → 75GB, Runs on 128GB DGX Spark / GB10 Projects	44	10782	April 9, 2026
HOW-TO: Run Qwen3-Coder-Next on Spark DGX Spark / GB10 llama	92	9858	March 24, 2026
Qwen3.5-397B-A17B-int4-AutoRound - 4 x db10 node - updated results 37 - 94 tok/s DGX Spark / GB10 clustering , spark	26	1889	April 28, 2026
Qwen3.6-27B AWQ INT4 on DGX Spark (GB10) — only 1.8-4.9 tok/s decode with 285k token prompt, how to improve? DGX Spark / GB10	6	608	May 29, 2026