Does Qwen3.5-35B-A3B on GB10 leave a lot of performance on the table?

Looking at the Spark Arena - LLM Leaderboard for Qwen3.5-35B-A3B I`m very disappointed by the concurrency speeds results. I`m no expert what so ever but I tried to find ways to optimize this by the time my first DGX Spark unit that I have ordered arrives. Please fell free to correct anything I say as I have no previous experience doing this.

The main target for me is to optimize GB10 to work efficiently with the very fast growing context window of OpenClaw. I looked at my OpenRouter logs and the context for the main agent starts at 10K and grows with about 5K for each prompt up to 150K or more. The output seems to fluctuate in between 150-500 tokens and the total ratio for a few weeks of usage is 130:1 input:output tokens. Gemini recommended using instead of the internal MTP head the LongSpec/OWL LSTM.

I truly hope there is someone here willing to implement this and release some benchmark results as I fell so disappointed by the current results. I already spent all afternoon to get to these optimizations and I hope they actually work.

From this point further I will paste out of my conversations with Gemini and what it recommends as being the reasons and the fix for the benchmark results on Spark Arena:


‘The results from spark-arena.com perfectly illustrate why your current setup is hitting a wall. You are experiencing a massive performance “cliff” as the context grows:

  • Prompt Processing (PP): Your speed drops from 3918 tok/s (2K context) to a staggering 60.66 tok/s (100K context). That is a 64x slowdown.

  • Text Generation (TG): Your output drops from 98.15 tok/s to 4.02 tok/s at 100K. This makes the agent feel “frozen” rather than interactive.


To ensure our 4-agent deployment (growing from 10K to 150K context) avoids the catastrophic performance degradation observed in standard benchmarks—where speed drops by 98% at 100K tokens—we are moving to an LSTM-based speculative architecture (OWL/LongSpec).

Unlike official MTP or Eagle-3 heads, the LSTM architecture is context-agnostic and will not “choke” as the agents’ conversation history scales.

I. Core Hardware Strategy: CPU-GPU Offloading

The Grace Arm CPU (20 cores) on our DGX Spark shares 128GB of unified memory with the Blackwell GPU. We will offload the drafting phase to the CPU cores (--speculative-draft-device cpu). This prevents GPU stalling and saves VRAM for the massive 150K token KV cache.

II. Implementation Paths (Select Version 2 for Maximum Speed)

We have two options for the drafter model. Version 2 is the recommended target for production as it aligns the drafter specifically to our quantized 4-bit model.

Version 1: The “Off-the-Shelf” Drafter (Fastest to Start) Uses a pre-trained drafter designed for the QwQ series. It is a good baseline but may have lower acceptance rates with our specific quantized 35B A3B weights.

  • Step: Download and point SGLang to the weights: huggingface-cli download sail/longspec-QwQ-32B-Preview --local-dir ./longspec-drafter

Version 2: The “Custom Distilled” Drafter (Recommended for Performance) We will perform a synthetic distillation of a 0.6B LSTM drafter directly from the cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit model.

  • Why it’s better: It teaches the drafter to predict the exact output patterns of our specific 4-bit quantized model. This results in a higher “Acceptance Rate,” which directly translates to more tokens per second at 150K context.

  • Action: Run the following robust distillation script:

    • Bash

      python longspec/train/train_drafter.py \
          --model_name_or_path cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit \
          --drafter_arch lstm \
          --dataset_name sail/longspec-data \
          --output_dir ./custom-qwen-drafter \
          --per_device_train_batch_size 4 \
          --gradient_accumulation_steps 8 \
          --learning_rate 5e-4 \
          --num_train_epochs 3 \
          --bf16 True \
          --use_marlin True \
          --anchor_offset_training True
      
      

    III. Final Launch Configuration

Once the drafter is selected/trained, use the following SGLang configuration to lock in the Marlin kernels and FP8 KV Cache:

Bash

python -m sglang.launch_server \
    --model-path cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit \
    --quantization awq_marlin \
    --speculative-model ./custom-qwen-drafter \
    --speculative-draft-device cpu \
    --speculative-algo HOWL \
    --kv-cache-dtype fp8 \
    --context-length 262144 \
    --mem-fraction-static 0.8 \
    --enable-prefix-caching

IV. Summary of Impact

By combining W4A16 Marlin weights with a CPU-offloaded OWL/HOWL drafter, we leverage the full Unified Memory architecture of the Spark. This setup is designed to maintain 25–40 tokens/sec at 150K context, preventing the performance cliff that would otherwise stall our agents.


IMPORTANT NOTE:

We chose the cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit version for three specific reasons that are critical to our 150K context target:

  • 1M Context Ready: Unlike standard quants, this version includes specific configuration overrides that extend the context window stability up to 1 million tokens, ensuring the model doesn’t lose logic as we hit our 150K agentic limit.

  • MTP Layer Preservation: It is one of the few 4-bit quants that preserves the Multi-Token Prediction (MTP) layers. This is vital because our Version 2 distillation depends on these layers to train the LSTM drafter effectively.

  • Blackwell Optimization: The AWQ format used here is fully compatible with Marlin kernels, which are the only way to achieve maximum throughput on the DGX Spark’s Blackwell architecture while keeping the model size under 20GB.

2 Likes

GenSpark Super Agent did a reality check to my post and at this point we need a genius to solve this one out. I`ll paste here the output:

Alright — I’ve done a deep investigation across the NVIDIA developer forums, SGLang’s GitHub, the OWL/LongSpec papers, benchmark threads, and the SM121 architecture deep-dives. Here is a full, honest audit of the plan. I’ve colour-coded issues by severity.


🔬 Full Audit: DGX Spark Long-Context Agent Optimization Plan

First, the Good News — What Gemini Got Right

The problem diagnosis is excellent. The performance cliff is real and well-documented:

At 100K context, Prompt Processing drops from ~3918 tok/s → ~60 tok/s (a 64× slowdown). This is caused by the quadratic attention cost of transformer-based KV caches combined with your GB10’s 273 GB/s LPDDR5x bandwidth being saturated by growing cache reads.

The instinct to use an LSTM-based, context-agnostic drafter (OWL/HOWL paper) is genuinely the right intellectual approach for this exact workload. The science is sound. The paper shows OWL achieving ~5× higher acceptance length than EAGLE3 on long-context inputs. For a 150K agent context, this architecture would in theory be transformative.

The issue is that almost every implementation detail in the plan is either wrong, fabricated, or currently broken on GB10. Let’s go through each section.


🔴 CRITICAL ERRORS — These Commands Will Not Work

1. --speculative-algo HOWLFlag Does Not Exist in SGLang

This is the most dangerous error in the whole plan. HOWL is not an SGLang algorithm flag. It is the name of a hybrid method described in the OWL academic paper (combining OWL tree-decoding with SuffixDecoding). As of today, OWL and HOWL have no merged SGLang pull request, no integration, and no documented roadmap issue in sgl-project/sglang. The SGLang-supported speculative algorithms are:

Flag Status
EAGLE ✅ Supported
EAGLE3 ✅ Supported
MEDUSA ✅ Supported
DFlash (DFlash PR) ✅ Supported (bleeding edge)
HOWL Does not exist

Running that launch command as-is will produce an error on startup. OWL paper | LongSpec GitHub


2. --speculative-draft-device cpuFlag Does Not Exist in SGLang

SGLang does not have a --speculative-draft-device argument for speculative decoding. This flag was likely confused with vLLM’s experimental CPU offload flags or inferred from first principles. The real SGLang speculative decoding pipeline runs both the drafter and verifier on GPU. While the Grace CPU and Blackwell GPU on GB10 share unified 128GB memory, there is no documented SGLang flag to route the speculative draft to ARM CPU cores. SGLang server args


3. The sail/longspec-QwQ-32B-Preview Drafter Is Incompatible with Qwen3.5-35B-A3B

This is a drafter trained specifically for QwQ-32B-Preview (an older reasoning model). Qwen3.5-35B-A3B has a completely different MoE architecture (35B total / 3B active parameters, 64 layers, different hidden dimensions and tokenizer). The drafter head architecture is trained to mimic a specific target model’s hidden states. Using a QwQ drafter on Qwen3.5-35B-A3B would produce garbage acceptance rates, effectively making generation slower than vanilla decoding.

The official LongSpec pretrained models support: Vicuna-7B/13B, LongChat-7B/13B, Llama-3-8B-262k, and QwQ-32B-Preview — no Qwen3.5-35B-A3B drafter exists yet.


4. Qwen3.5-35B-A3B NVFP4 Crashes on GB10/ARM64 (Active Bug)

The plan recommends building toward NVFP4. Community NVFP4 quants exist (Sehyo/Qwen3.5-35B-A3B-NVFP4, Kbenkhaled/Qwen3.5-35B-A3B-NVFP4) but there is an open, unresolved vLLM bug (vllm-project/vllm#35519):

“Qwen3.5 NVFP4 models crash on ARM64 GB10 DGX Spark (CUDA Kernel incompatibility). This confirms the underlying NVFP4 math kernels contain instructions incompatible with ARM64/GB10, regardless of execution mode.”

This is a separate, Qwen3.5-specific problem on top of the general SM121 CUTLASS patches needed for NVFP4.


5. The Training Script Arguments Are Fabricated

The proposed distillation command:

python longspec/train/train_drafter.py \
    --drafter_arch lstm \
    --use_marlin True \
    --anchor_offset_training True

The LongSpec training README documents no flags called --use_marlin, --anchor_offset_training, or --drafter_arch. These appear to have been invented by the LLM to sound plausible. Running this will immediately fail with unrecognized arguments errors. You’d need to read the actual train_drafter.py source to find the real parameter names.

Additionally, using a 4-bit AWQ quantized model as the teacher during distillation is architecturally problematic — the hidden states from a heavily quantized model are degraded compared to BF16, which means the drafter trains on lower-quality signals.


🟡 MISLEADING CLAIMS — Technically Possible but Inaccurate

6. “Marlin kernels are the ONLY way to achieve maximum throughput” — FALSE as of Feb 2026

A major community breakthrough happened: Avarok unlocked NVFP4 on DGX Spark with a ~20% throughput gain over AWQ. Their Docker image uses CUTLASS 4.4 + SM121a patch via FlashInfer. FP8 online quantization in SGLang also beats AWQ (52–55 tok/s vs ~31 tok/s vanilla). The landscape has moved.

Current throughput ranking on DGX Spark (SGLang, Qwen3 30B-A3B):

Config tok/s Notes
BF16 vanilla ~31 Baseline
AWQ/Marlin vanilla ~35-42 Forum reports
FP8 online vanilla ~52-55 ✅ Best stable option
NVFP4 vanilla ~65-66 Requires patched Docker, Qwen3.5 crashes
FP8 + DFlash ~41 Best speculative combo
NVFP4 + DFlash ~54 Requires patch + stable model

7. EAGLE3 in SGLang Actually Slows Down Generation on DGX Spark

This is the most surprising finding from the deep benchmark thread. Real measured results on DGX Spark (from flash3’s exhaustive benchmark matrix):

Setup tok/s
SGLang BF16 vanilla 31.7
SGLang BF16 + EAGLE3 16.4 🔴 (-48%)
SGLang BF16 + DFlash 20.5 🔴 (-35%)
SGLang FP8 online vanilla 52.7
SGLang FP8 + EAGLE3 24.5 🔴 (-53%)
SGLang FP8 + DFlash 41.4 🟡 (-21%)

The benchmarker notes: “SGLang’s scheduling overhead for speculative decoding is prohibitive on all platforms and quantizations [with this hardware].” The OVERHEAD of batching, tree verification, and KV cache management for speculative tokens costs more than the gain on the bandwidth-limited GB10. DFlash has the best ratio but is still a regression. Note: vLLM-next EAGLE3 shows small positive gains, suggesting the issue is SGLang-specific overhead.


8. “MTP Layer Preservation” in cyankiwi’s AWQ Quant — Unverified

The cyankiwi model card makes no explicit mention of preserving MTP layers. Standard AutoAWQ quantizes all linear layers it finds. Whether the MTP prediction heads are intact, quantized, or dropped is not documented. This claim appears to have been added by the LLM to justify the model choice. You can verify this yourself by checking if model.layers.X.mtp keys exist in the saved safetensors.


9. “1M Context Ready” — Likely Standard RoPE Scaling, Not Magic

Qwen3.5-35B-A3B officially supports 262,144 tokens natively via YaRN-based RoPE scaling. The claim of “1M context stability” from cyankiwi’s quant is not documented on their model card and likely refers to the base model’s RoPE configuration, which is identical across quants. This is not a unique feature of that specific AWQ build.


10. “CPU Offloading Saves VRAM for the KV Cache” — Incorrect Mental Model on GB10

On the DGX Spark’s unified memory architecture, the CPU and GPU do not have separate memory pools. There is one flat 128GB LPDDR5x pool. “Saving VRAM” by moving the drafter to CPU doesn’t free up separate GPU VRAM — it’s all the same physical memory. The real constraint is memory bandwidth (273 GB/s shared). Offloading to CPU ARM cores would actually add CPU-GPU synchronization latency and likely make things slower. The SGLang flag to do this doesn’t exist anyway.


11. --quantization awq_marlin — Incorrect SGLang Flag Name

The correct SGLang flag for AWQ quantization is --quantization awq. Marlin kernels are invoked internally by SGLang when using AWQ. The flag awq_marlin is a vLLM-specific flag, not SGLang. Running this in SGLang will either error or be silently ignored.


12. --context-length 262144 — Unnecessarily Wasteful

Setting max context to 262K when your use case tops out at 150K wastes KV cache allocation. SGLang pre-allocates KV cache budget based on --context-length. Setting it to 196,608 (192K, which gives you ~30% headroom above 150K) leaves substantially more budget for concurrent requests. With 4 agents running simultaneously at growing context, this matters.


✅ What IS Correct and Worth Keeping

Element Verdict
Performance cliff diagnosis (64× slowdown at 100K) ✅ Accurate
OWL/LSTM architecture is the right theoretical solution ✅ Correct direction
--kv-cache-dtype fp8 (or fp8_e5m2) ✅ Valid and beneficial
--enable-prefix-caching Critical for your use case
Core problem identification: context-agnostic drafting needed ✅ Correct
Using SGLang over llama.cpp for agentic concurrency ✅ Correct
Using a MoE model (3B active params) to stretch bandwidth ✅ Correct approach

🟢 The Corrected Path — What to Actually Do

Phase 1 (When your DGX Spark arrives — works today)

python -m sglang.launch_server \
    --model-path cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit \
    --quantization awq \
    --kv-cache-dtype fp8_e5m2 \
    --context-length 196608 \
    --mem-fraction-static 0.75 \
    --enable-prefix-caching \
    --triton-attention-num-kv-splits 16 \
    --chunked-prefill-size 4096 \
    --max-running-requests 4

--triton-attention-num-kv-splits 16 is an underpublicized but real SGLang flag that specifically improves long-context performance on MoE models. Source: SGLang GitHub

--enable-prefix-caching (RadixAttention) is your single highest-impact optimization — if your 4 agents share a common system prompt or tool definition block (even a few hundred tokens), that entire prefix is computed ONCE and cached. As context grows from 10K to 150K, the shared prefix fraction grows too, making cache hits increasingly valuable.

Drop --mem-fraction-static to 0.75 (not 0.8) to give headroom for 4 concurrent agents at 150K+ context. With AWQ (~18-20GB weights), you have ~88-90GB for KV cache.


Phase 2 — Speculative Decoding (use DFlash, not EAGLE3 or HOWL)

For short-context requests, EAGLE3 on DGX Spark (SGLang) is actually slower. The best available option today is DFlash (diffusion-style draft-free speculative decoding), available via SGLang bleeding edge:

# Add to the launch command above:
    --speculative-model [DFlash-compatible checkpoint] \
    --speculative-algorithm DFlash

Or better yet, use native MTP which Qwen3.5-35B-A3B natively supports — no external drafter needed, no training required:

    --speculative-num-draft-tokens 3

SGLang can use Qwen3.5’s own built-in MTP head for native speculative decoding. This avoids all compatibility issues and requires no extra model downloads.


Phase 3 — The OWL Path (Track This, Don’t Build It Yet)

The Gemini recommendation to use OWL/HOWL was intellectually correct. Here’s where things actually stand and what to watch:

  1. OWL paper was published October 2024, shows 5× better acceptance at 64K context vs EAGLE3. arxiv .org/abs/2510.07535

  2. No SGLang integration yet — watch the SGLang GitHub for an OWL/HOWL PR

  3. If you want to train a real LongSpec drafter for Qwen3.5-35B-A3B, the training code was released July 2025. But you need to: (a) use the actual base BF16 model as teacher, (b) read longspec/train/README.md for real argument names, © accept it’s a research project needing weeks of iteration

  4. Alternatively: Watch SpecForge — PR #468 is adding Qwen3 MoE draft support. An EAGLE3 drafter for Qwen3.5-35B-A3B trained via SpecForge is likely the production-ready path when it lands


Phase 4 — NVFP4 (When the ARM64 Bug is Fixed)

NVFP4 is legitimately ~20% faster than AWQ when it works. The path to get there:

  1. Follow vLLM bug #35519 for the ARM64/GB10 Qwen3.5 NVFP4 fix

  2. Use the Avarok Docker image for CUTLASS 4.4 + SM121a patch (currently vLLM-based, SGLang support TBD)

  3. Use Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 once the kernel bug is resolved


📋 Summary Scoreboard

Claim/Step in Original Plan Verdict Severity
--speculative-algo HOWL ❌ Flag doesn’t exist 🔴 Critical
--speculative-draft-device cpu ❌ Flag doesn’t exist in SGLang 🔴 Critical
Use sail/longspec-QwQ-32B-Preview for Qwen3.5 ❌ Wrong architecture 🔴 Critical
Training script parameters ❌ Arguments appear fabricated 🔴 Critical
“Marlin is the only way to max throughput” ❌ NVFP4 now faster 🟡 Misleading
MTP Layer Preservation in cyankiwi AWQ ⚠️ Unverified claim 🟡 Misleading
EAGLE3 on SGLang improves DGX Spark throughput ❌ Actually 48-58% slower 🟡 Misleading
CPU offloading saves VRAM on unified memory ❌ Wrong mental model 🟡 Misleading
--quantization awq_marlin ❌ vLLM flag, not SGLang 🟡 Misleading
--context-length 262144 ⚠️ Wasteful for 150K use case 🟡 Suboptimal
Performance cliff diagnosis ✅ Accurate
OWL/LSTM is right theoretical approach ✅ Correct
--kv-cache-dtype fp8 ✅ Valid
--enable-prefix-caching ✅ Critical for your use case

💬 Personal Note

You are thinking about this at the right level of abstraction — the problem is real, the direction (context-agnostic LSTM drafter) is genuinely clever, and you clearly spent serious time on it. The issue is that Gemini filled in the implementation gaps with plausible-sounding but hallucinated specifics. The academic research (OWL) and the engineering reality (SGLang doesn’t have it yet) are currently about 6–12 months apart. The benchmarks show that on DGX Spark specifically, the biggest wins today come from prefix caching + FP8 KV cache + DFlash rather than from EAGLE3-style speculative decoding. When OWL/HOWL land in SGLang, your use case will be the ideal showcase workload.

1 Like

Please stop filling the entirety of your post with AI generated content.

Answer this question for me:

Are you trying to get long context inference speed for qwen3.5 35B on spark?

If so, use vllm.

4 Likes

This is one of the more difficult issues we encounter as these AIs package hallucinated garbage in smart-sounding reports that look like they really know what they’re talking about. Still, I hope you’re able to get this going with the larger context at speeds that you’re aiming for.

6 Likes

You are 100% right, I should of not pasted that, just Spark Arena already uses vllm in their tests and for 100K context and concurrency 5-10 the numbers are awful and I already know how bad is OpenClaw with the context window and at this point I still have the cancel button for my DGX Spark order and even thou is what I want seeing those benchmark results make me think if that is the best path for me. No harm intended, I just really want it to work for my use case and not regret it later.

Tell you what, I’ll run a benchmark for you using the autoround version (4bit instead of the fp8 spark arena run) @ 5 and 10 concurrent with a stuffed kv cache/long context and let you know what to expect.

This is a GPQA run I did on the 35B 4 bit version

=== GPQA Diamond ===
base_url:                 http://spark:8000/v1
model:                    Intel/Qwen3.5-35B-A3B-int4-AutoRound
questions:                198
repeats:                  5
total eval calls:         990
score (all repeats):      0.8232 (82.32%)
correct / total:          815 / 990
failed requests:          0
prompt tokens total:      259,390
completion tokens total:  8,871,939
reasoning tokens total:   0
total tokens:             9,131,329
avg tokens / call:        9223.6
wall time (s):            11148.7

I wouldn’t call 9 million tokens in 3 hours ‘slow’

1 Like

Thanks for the support ! Helps a lot.

Wow, thanks a lot for doing a test for me, that is incredibly helpful. So, my use case is OpenClaw that has a very big context that is growing with each prompt, starts at 10K and grows with about 5K for each prompt up to 150K or more. The output seems to fluctuate in between 150-500 tokens and the total ratio for a few weeks of usage is 130:1 input:output tokens. If I use lets say 5 subagents that problem will grow even further. From the results you pasted your test is the other way around, small context input and big output or maybe I don t understand which is input context and which is output. Thanks a lot !

Please, don’t paste text you don’t review first or you have no idea what is talking about. Nobody is going to take you seriously. Please take a look at:

vllm serve Qwen/Qwen3.5-35B-A3B
–port 8000
–tensor-parallel-size 2
–max-model-len 262144
–speculative-config ‘{“method”:“qwen3_next_mtp”,“num_speculative_tokens”:2}’
–gpu-memory-utilization 0.8
–language-model-only

It greatly improves the peformance.

1 Like

There’s a chance ‘Someone’ is literally just a bot. But in either case, I’m still going to post results when they’re done.

1 Like

No, I`m not a bot, I just made the account to find a solution for my problem and seen afterwards that it took my Gmail photo and name and I was like in a rush to change it as already people made me fell ashamed of my post full of AI generated hallucinations.

Here you go fella, prefix caching is your friend for openclaw and gb10/spark is a platform that’s got more prefill than it knows what to do with, meager decode, but that doesn’t mean it’s not a miracle box with the right settings and model. Keep in mind, this was tuned for throughput NOT latency.

VLLM Settings:

{
   "model_tag":"Intel/Qwen3.5-35B-A3B-int4-AutoRound",
   "enable_auto_tool_choice":true,
   "tool_call_parser":"qwen3_coder",
   "model":"Intel/Qwen3.5-35B-A3B-int4-AutoRound",
   "trust_remote_code":true,
   "load_format":"fastsafetensors",
   "reasoning_parser":"qwen3",
   "gpu_memory_utilization":0.8,
   "kv_cache_dtype":"fp8",
   "enable_prefix_caching":true,
   "mamba_ssm_cache_dtype":"float16",
   "max_num_batched_tokens":32768,
   "max_num_seqs":10,
   "max_cudagraph_capture_size":10
}

5x concurrency

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             5
Benchmark duration (s):                  533.92
Total input tokens:                      4200000
Total generated tokens:                  32500
Request throughput (req/s):              0.19
Output token throughput (tok/s):         60.87
Peak output token throughput (tok/s):    175.00
Total token throughput (tok/s):          7927.29
---------------Time to First Token----------------
Mean TTFT (ms):                          15715.12
Median TTFT (ms):                        12491.23
P99 TTFT (ms):                           69509.54
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.89
Median TPOT (ms):                        32.77
P99 TPOT (ms):                           53.14
---------------Inter-token Latency----------------
Mean ITL (ms):                           34.00
Median ITL (ms):                         32.76
P99 ITL (ms):                            37.15
==================================================

10x concurrency (same # of prompts)

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             10
Benchmark duration (s):                  490.39
Total input tokens:                      4200000
Total generated tokens:                  32500
Request throughput (req/s):              0.20
Output token throughput (tok/s):         66.27
Peak output token throughput (tok/s):    250.00
Total token throughput (tok/s):          8630.83
---------------Time to First Token----------------
Mean TTFT (ms):                          27935.46
Median TTFT (ms):                        25201.80
P99 TTFT (ms):                           55595.98
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          65.12
Median TPOT (ms):                        52.20
P99 TPOT (ms):                           164.92
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.10
Median ITL (ms):                         46.90
P99 ITL (ms):                            51.77
==================================================


3 Likes

Thanks a lot !!!

I`ll have to use some more AI to understand all the terms used and to get a better fell of what that means in real life, but finally I have a benchmark to reflect my real world use case. Thank you again !

Ok, so to my understanding you tested for 42K context for each prompt without prefix chaching. But in my real world situation using prefix chaching the context starts at 10K and with each aditional prompt adds just 5K more tokens until it gets up to 150K tokens. If I hit prefix chaching with each additional prompt the real world TTFT delay will be much lower and also less load on the GPU that will further improve the real world results. I tried to “model“ a test result using your numbers and some more AI :)) and it gave me this:

Your big win is prefix caching: going from 42k → 5k new tokens cuts TTFT from ~16–28 s down to ~2–3 s.

  1. After that, decode dominates. For your 150–500 token replies, you’re realistically in the:
    • ~7–19 s range at “5×-like” concurrency

    • ~13–36 s range at “10×-like” concurrency

If this will be the real world results it`s actually great !

Thanks a lot for helping me !!!

EDIT: I`ve seen later that you used prefix chaching.

prefix caching was enabled (visible in the vllm settings I posted)

the benchmarking was done with vllm serve bench with these params

–dataset-name prefix_repetition --prefix-repetition-prefix-len 37000 --prefix-repetition-suffix-len 5000 --prefix-repetition-output-len 325 –num-prompts 100

The two paths in prompt speed are throughput (how many total requests can we serve in a given timeframe) and latency (give me an answer now while I wait)

vllm can be given parameters to head in either direction and is specifically well suited for long context AND concurrency.

Welcome to the community, it’s helpful to follow the sources that your ai generated text is providing to get a better understanding.

1 Like

Thanks ! All these terms are new to me. I`ll get more accustomed with them in time. For now I still have to translate them in “regular“ English :)) You did help me a lot and I grateful for that. I can barely wait for my new toy to arrive so I can put it to work. Have a great day !

Guys, why are you all spinning up these obscure inference stacks when plain llama.cpp (no rituals required) gives me ~57 TPS (256k)? Maybe I’m missing something, but so far I don’t see any real alternative to llama.cpp in terms of convenience, ease of launch, and output quality.

MXFP4 is a solid quant, and Q4 is a perfectly workable alternative too — so why even look elsewhere?

Yes, I get the “competition / ecosystem” argument, but even with vLLM I haven’t seen better quality in real, day-to-day workflows. Again — maybe I just don’t understand the right use case yet…

I really hope NVIDIA eventually delivers on the promised NVFP4 support that should bring a meaningful speedup. But right now it honestly feels like marketing: there still aren’t sufficiently mature tools that can run this quant reliably, with stable quality, and with good token throughput.

1 Like

Run over 1000 concurrent requests with llamacpp and let me know how that goes.

I’ll wait.

2 Likes

Hello.

The results below are from running Qwen/Qwen3.5-35B-A3B on my custom vllm build based on vllm nightly 0.16.1rc1.dev174+g5dfc5abe9.cu130, using DGX Spark × Ray 2-node cluster.

1 concurrent request: 69.45 tok/s

============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  1.84
Total input tokens:                      1024
Total generated tokens:                  128
Request throughput (req/s):              0.54
Output token throughput (tok/s):         69.45
Peak output token throughput (tok/s):    65.00
Peak concurrent requests:                1.00
Total token throughput (tok/s):          625.09
---------------Time to First Token----------------
Mean TTFT (ms):                          167.71
Median TTFT (ms):                        167.71
P99 TTFT (ms):                           167.71
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.19
Median TPOT (ms):                        13.19
P99 TPOT (ms):                           13.19
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.19
Median ITL (ms):                         13.13
P99 ITL (ms):                            14.60
==================================================

10 concurrent requests: 253.53 tok/s

============ Serving Benchmark Result ============
Successful requests:                     10
Failed requests:                         0
Benchmark duration (s):                  5.05
Total input tokens:                      10240
Total generated tokens:                  1280
Request throughput (req/s):              1.98
Output token throughput (tok/s):         253.53
Peak output token throughput (tok/s):    330.00
Peak concurrent requests:                10.00
Total token throughput (tok/s):          2281.80
---------------Time to First Token----------------
Mean TTFT (ms):                          935.12
Median TTFT (ms):                        1224.81
P99 TTFT (ms):                           1225.59
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.31
Median TPOT (ms):                        30.10
P99 TPOT (ms):                           35.63
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.31
Median ITL (ms):                         30.25
P99 ITL (ms):                            32.62
==================================================

100 concurrent requests: 339.50 tok/s

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Benchmark duration (s):                  37.70
Total input tokens:                      102400
Total generated tokens:                  12800
Request throughput (req/s):              2.65
Output token throughput (tok/s):         339.50
Peak output token throughput (tok/s):    560.00
Peak concurrent requests:                100.00
Total token throughput (tok/s):          3055.49
---------------Time to First Token----------------
Mean TTFT (ms):                          17137.53
Median TTFT (ms):                        19407.42
P99 TTFT (ms):                           35322.68
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.26
Median TPOT (ms):                        32.24
P99 TPOT (ms):                           44.43
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.26
Median ITL (ms):                         30.89
P99 ITL (ms):                            33.57
==================================================

gpt-oss-120b has higher throughput, but Qwen/Qwen3.5-35B-A3B is not bad at all in terms of speed. I’m using it via Claude Code and it’s been a very comfortable experience.

DGX Spark is truly an amazing machine!

thanx.

1 Like

Is this the fp8 version?

No, this is the original BF16 model weights. I use vLLM’s runtime MXFP4
quantization (with Marlin backend) for the MoE experts, and FP8 KV cache. The
model itself is not a pre-quantized FP8 version.