GenSpark Super Agent ran a reality check on my post, and at this point we need a genius to solve this one. I'll paste the output here:
Alright — I’ve done a deep investigation across the NVIDIA developer forums, SGLang’s GitHub, the OWL/LongSpec papers, benchmark threads, and the SM121 architecture deep-dives. Here is a full, honest audit of the plan. I’ve colour-coded issues by severity.
🔬 Full Audit: DGX Spark Long-Context Agent Optimization Plan
First, the Good News — What Gemini Got Right
The problem diagnosis is excellent. The performance cliff is real and well-documented:
At 100K context, Prompt Processing drops from ~3918 tok/s → ~60 tok/s (a 64× slowdown). This is caused by the quadratic attention cost of transformer-based KV caches combined with your GB10’s 273 GB/s LPDDR5x bandwidth being saturated by growing cache reads.
The instinct to use an LSTM-based, context-agnostic drafter (OWL/HOWL paper) is genuinely the right intellectual approach for this exact workload. The science is sound. The paper shows OWL achieving ~5× higher acceptance length than EAGLE3 on long-context inputs. For a 150K agent context, this architecture would in theory be transformative.
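As a back-of-envelope illustration of why the cliff appears (all numbers below are assumptions for illustration, not measurements): every decoded token must stream the active weights plus the entire KV cache through the 273 GB/s memory bus, so the throughput ceiling falls as the cache grows.

```python
def decode_tokens_per_s(active_weight_gb: float, kv_cache_gb: float,
                        bandwidth_gb_s: float = 273.0) -> float:
    """Bandwidth ceiling: each decoded token streams the active
    weights plus the whole KV cache from LPDDR5x."""
    return bandwidth_gb_s / (active_weight_gb + kv_cache_gb)

# Assumed: ~3B active params at 4-bit -> roughly 1.5 GB read per token.
print(f"small KV cache: ~{decode_tokens_per_s(1.5, 0.5):.0f} tok/s ceiling")
print(f"large KV cache: ~{decode_tokens_per_s(1.5, 8.0):.0f} tok/s ceiling")
```

The exact sizes are hypothetical, but the shape of the curve is the point: once the KV cache read dwarfs the active-weight read, bandwidth, not compute, sets the rate.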
The issue is that almost every implementation detail in the plan is either wrong, fabricated, or currently broken on GB10. Let’s go through each section.
🔴 CRITICAL ERRORS — These Commands Will Not Work
1. --speculative-algo HOWL — Flag Does Not Exist in SGLang
This is the most dangerous error in the whole plan. HOWL is not an SGLang algorithm flag. It is the name of a hybrid method described in the OWL academic paper (combining OWL tree-decoding with SuffixDecoding). As of today, OWL and HOWL have no merged SGLang pull request, no integration, and no documented roadmap issue in sgl-project/sglang. The SGLang-supported speculative algorithms are:
| Flag | Status |
|---|---|
| EAGLE | ✅ Supported |
| EAGLE3 | ✅ Supported |
| MEDUSA | ✅ Supported |
| DFlash (DFlash PR) | ✅ Supported (bleeding edge) |
| HOWL | ❌ Does not exist |
Running that launch command as-is will produce an error on startup. OWL paper | LongSpec GitHub
2. --speculative-draft-device cpu — Flag Does Not Exist in SGLang
SGLang does not have a --speculative-draft-device argument for speculative decoding. This flag was likely confused with vLLM’s experimental CPU offload flags or inferred from first principles. The real SGLang speculative decoding pipeline runs both the drafter and verifier on GPU. While the Grace CPU and Blackwell GPU on GB10 share unified 128GB memory, there is no documented SGLang flag to route the speculative draft to ARM CPU cores. SGLang server args
3. The sail/longspec-QwQ-32B-Preview Drafter Is Incompatible with Qwen3.5-35B-A3B
This is a drafter trained specifically for QwQ-32B-Preview (an older reasoning model). Qwen3.5-35B-A3B has a completely different MoE architecture (35B total / 3B active parameters, 64 layers, different hidden dimensions and tokenizer). The drafter head architecture is trained to mimic a specific target model’s hidden states. Using a QwQ drafter on Qwen3.5-35B-A3B would produce garbage acceptance rates, effectively making generation slower than vanilla decoding.
The official LongSpec pretrained models support: Vicuna-7B/13B, LongChat-7B/13B, Llama-3-8B-262k, and QwQ-32B-Preview — no Qwen3.5-35B-A3B drafter exists yet.
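A cheap first filter for drafter/target compatibility is to compare the two checkpoints' config.json files. This is a stdlib-only sketch under assumptions: "hidden_size" and "vocab_size" are the standard Hugging Face config keys, but verify them against the actual files.

```python
import json

def load_config(path: str) -> dict:
    """Load a Hugging Face-style config.json."""
    with open(path) as f:
        return json.load(f)

def drafter_matches_target(drafter: dict, target: dict) -> bool:
    """A drafter head is trained against a specific target's hidden
    states, so at minimum the hidden dimension and vocabulary must
    line up; a mismatch guarantees garbage acceptance rates."""
    return all(drafter.get(k) == target.get(k)
               for k in ("hidden_size", "vocab_size"))
```

Usage would look like `drafter_matches_target(load_config("drafter/config.json"), load_config("target/config.json"))`. Matching these two fields is necessary, not sufficient: the drafter still has to have been distilled from that target.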
4. Qwen3.5-35B-A3B NVFP4 Crashes on GB10/ARM64 (Active Bug)
The plan recommends building toward NVFP4. Community NVFP4 quants exist (Sehyo/Qwen3.5-35B-A3B-NVFP4, Kbenkhaled/Qwen3.5-35B-A3B-NVFP4) but there is an open, unresolved vLLM bug (vllm-project/vllm#35519):
“Qwen3.5 NVFP4 models crash on ARM64 GB10 DGX Spark (CUDA Kernel incompatibility). This confirms the underlying NVFP4 math kernels contain instructions incompatible with ARM64/GB10, regardless of execution mode.”
This is a separate, Qwen3.5-specific problem on top of the general SM121 CUTLASS patches needed for NVFP4.
5. The Training Script Arguments Are Fabricated
The proposed distillation command:
```bash
python longspec/train/train_drafter.py \
  --drafter_arch lstm \
  --use_marlin True \
  --anchor_offset_training True
```
The LongSpec training README documents no flags called --use_marlin, --anchor_offset_training, or --drafter_arch. These appear to have been invented by the LLM to sound plausible. Running this will immediately fail with unrecognized arguments errors. You’d need to read the actual train_drafter.py source to find the real parameter names.
Additionally, using a 4-bit AWQ quantized model as the teacher during distillation is architecturally problematic — the hidden states from a heavily quantized model are degraded compared to BF16, which means the drafter trains on lower-quality signals.
🟡 MISLEADING CLAIMS — Technically Possible but Inaccurate
6. “Marlin kernels are the ONLY way to achieve maximum throughput” — FALSE as of Feb 2026
A major community breakthrough happened: Avarok unlocked NVFP4 on DGX Spark with a ~20% throughput gain over AWQ. Their Docker image uses CUTLASS 4.4 + SM121a patch via FlashInfer. FP8 online quantization in SGLang also beats AWQ (52–55 tok/s vs ~31 tok/s vanilla). The landscape has moved.
Current throughput ranking on DGX Spark (SGLang, Qwen3 30B-A3B):
| Config | tok/s | Notes |
|---|---|---|
| BF16 vanilla | ~31 | Baseline |
| AWQ/Marlin vanilla | ~35-42 | Forum reports |
| FP8 online vanilla | ~52-55 | ✅ Best stable option |
| NVFP4 vanilla | ~65-66 | Requires patched Docker, Qwen3.5 crashes |
| FP8 + DFlash | ~41 | Best speculative combo |
| NVFP4 + DFlash | ~54 | Requires patch + stable model |
7. EAGLE3 in SGLang Actually Slows Down Generation on DGX Spark
This is the most surprising finding from the deep benchmark thread. Real measured results on DGX Spark (from flash3’s exhaustive benchmark matrix):
| Setup | tok/s |
|---|---|
| SGLang BF16 vanilla | 31.7 |
| SGLang BF16 + EAGLE3 | 16.4 🔴 (-48%) |
| SGLang BF16 + DFlash | 20.5 🔴 (-35%) |
| SGLang FP8 online vanilla | 52.7 |
| SGLang FP8 + EAGLE3 | 24.5 🔴 (-53%) |
| SGLang FP8 + DFlash | 41.4 🟡 (-21%) |
The benchmarker notes: “SGLang’s scheduling overhead for speculative decoding is prohibitive on all platforms and quantizations [with this hardware].” The OVERHEAD of batching, tree verification, and KV cache management for speculative tokens costs more than the gain on the bandwidth-limited GB10. DFlash has the best ratio but is still a regression. Note: vLLM-next EAGLE3 shows small positive gains, suggesting the issue is SGLang-specific overhead.
8. “MTP Layer Preservation” in cyankiwi’s AWQ Quant — Unverified
The cyankiwi model card makes no explicit mention of preserving MTP layers. Standard AutoAWQ quantizes all linear layers it finds. Whether the MTP prediction heads are intact, quantized, or dropped is not documented. This claim appears to have been added by the LLM to justify the model choice. You can verify this yourself by checking if model.layers.X.mtp keys exist in the saved safetensors.
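One way to run that check without loading the model: parse the safetensors header directly. The format is an 8-byte little-endian header length followed by a JSON header keyed by tensor name, so the stdlib is enough. The "mtp" substring match is an assumption about how MTP head weights would be named; adjust it to the checkpoint's actual key pattern.

```python
import json
import struct

def tensor_names(path: str) -> list[str]:
    """List tensor names in a .safetensors shard without loading
    weights: read the 8-byte little-endian header length, then the
    JSON header (which may also hold an __metadata__ entry)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return [name for name in header if name != "__metadata__"]

def has_mtp_tensors(path: str) -> bool:
    # Assumed naming convention: MTP head keys contain "mtp".
    return any("mtp" in name.lower() for name in tensor_names(path))
```

Run `has_mtp_tensors(...)` over every shard of the downloaded quant; if no key matches, the MTP heads were dropped or renamed during quantization.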
9. “1M Context Ready” — Likely Standard RoPE Scaling, Not Magic
Qwen3.5-35B-A3B officially supports 262,144 tokens natively via YaRN-based RoPE scaling. The claim of “1M context stability” from cyankiwi’s quant is not documented on their model card and likely refers to the base model’s RoPE configuration, which is identical across quants. This is not a unique feature of that specific AWQ build.
10. “CPU Offloading Saves VRAM for the KV Cache” — Incorrect Mental Model on GB10
On the DGX Spark’s unified memory architecture, the CPU and GPU do not have separate memory pools. There is one flat 128GB LPDDR5x pool. “Saving VRAM” by moving the drafter to CPU doesn’t free up separate GPU VRAM — it’s all the same physical memory. The real constraint is memory bandwidth (273 GB/s shared). Offloading to CPU ARM cores would actually add CPU-GPU synchronization latency and likely make things slower. The SGLang flag to do this doesn’t exist anyway.
11. --quantization awq_marlin — Incorrect SGLang Flag Name
The correct SGLang flag for AWQ quantization is --quantization awq. Marlin kernels are invoked internally by SGLang when using AWQ. The flag awq_marlin is a vLLM-specific flag, not SGLang. Running this in SGLang will either error or be silently ignored.
12. --context-length 262144 — Unnecessarily Wasteful
Setting max context to 262K when your use case tops out at 150K wastes KV cache allocation. SGLang pre-allocates KV cache budget based on --context-length. Setting it to 196,608 (192K, which gives you ~30% headroom above 150K) leaves substantially more budget for concurrent requests. With 4 agents running simultaneously at growing context, this matters.
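The arithmetic behind the 196,608 suggestion, assuming the stated 150K peak use case:

```python
# Assumed peak context of 150K tokens for the 4-agent workload.
PEAK_CONTEXT = 150_000
PROPOSED = 196_608   # 192K = 3 * 65_536
DEFAULT = 262_144    # 256K native maximum

headroom = PROPOSED / PEAK_CONTEXT - 1    # spare room above 150K
reclaimed = 1 - PROPOSED / DEFAULT        # per-request budget freed

print(f"headroom above 150K: {headroom:.0%}")           # ~31%
print(f"KV budget reclaimed vs 262K: {reclaimed:.0%}")  # 25%
```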
✅ What IS Correct and Worth Keeping
| Element | Verdict |
|---|---|
| Performance cliff diagnosis (64× slowdown at 100K) | ✅ Accurate |
| OWL/LSTM architecture is the right theoretical solution | ✅ Correct direction |
| --kv-cache-dtype fp8 (or fp8_e5m2) | ✅ Valid and beneficial |
| --enable-prefix-caching | ✅ Critical for your use case |
| Core problem identification: context-agnostic drafting needed | ✅ Correct |
| Using SGLang over llama.cpp for agentic concurrency | ✅ Correct |
| Using a MoE model (3B active params) to stretch bandwidth | ✅ Correct approach |
🟢 The Corrected Path — What to Actually Do
Phase 1 (When your DGX Spark arrives — works today)
```bash
python -m sglang.launch_server \
  --model-path cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit \
  --quantization awq \
  --kv-cache-dtype fp8_e5m2 \
  --context-length 196608 \
  --mem-fraction-static 0.75 \
  --enable-prefix-caching \
  --triton-attention-num-kv-splits 16 \
  --chunked-prefill-size 4096 \
  --max-running-requests 4
```
--triton-attention-num-kv-splits 16 is an underpublicized but real SGLang flag that specifically improves long-context performance on MoE models. Source: SGLang GitHub
--enable-prefix-caching (RadixAttention) is your single highest-impact optimization — if your 4 agents share a common system prompt or tool-definition block (even a few hundred tokens), that entire prefix is computed ONCE and cached. Each agent's own growing conversation is also reused as a prefix on every subsequent turn, so as context grows from 10K to 150K the cached fraction of each request grows with it, making cache hits increasingly valuable.
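A small sketch of that turn-over-turn effect (the per-turn token count is an assumption for illustration):

```python
def prefill_this_turn(total_ctx: int, new_tokens: int) -> tuple[int, float]:
    """With prefix caching, only tokens not already in the radix tree
    are prefilled; returns (tokens prefilled, cached fraction)."""
    cached = total_ctx - new_tokens
    return new_tokens, cached / total_ctx

# Assumed ~2K fresh tokens per agent turn, illustrative only.
for ctx in (10_000, 50_000, 150_000):
    work, hit = prefill_this_turn(ctx, 2_000)
    print(f"{ctx:>7,} ctx: prefill {work:,} tokens ({hit:.0%} cache hit)")
```

The prefill work per turn stays roughly flat while the cache-hit fraction climbs toward 99%, which is exactly the regime where the 100K+ prompt-processing cliff would otherwise bite.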
Drop --mem-fraction-static to 0.75 (not 0.8) to give headroom for 4 concurrent agents at 150K+ context. With AWQ (~18-20GB weights), you have ~88-90GB for KV cache.
Phase 2 — Speculative Decoding (use DFlash, not EAGLE3 or HOWL)
For short-context requests, EAGLE3 on DGX Spark (SGLang) is actually slower. The best available option today is DFlash (diffusion-style draft-free speculative decoding), available via SGLang bleeding edge:
```bash
# Add to the launch command above:
--speculative-model [DFlash-compatible checkpoint] \
--speculative-algorithm DFlash
```
Or better yet, use the MTP head that Qwen3.5-35B-A3B natively supports — no external drafter needed, no training required:
```bash
--speculative-num-draft-tokens 3
```
SGLang can use Qwen3.5’s own built-in MTP head for native speculative decoding. This avoids all compatibility issues and requires no extra model downloads.
Phase 3 — The OWL Path (Track This, Don’t Build It Yet)
The Gemini recommendation to use OWL/HOWL was intellectually correct. Here’s where things actually stand and what to watch:
- OWL paper was published October 2024, shows 5× better acceptance at 64K context vs EAGLE3. arxiv.org/abs/2510.07535
- No SGLang integration yet — watch the SGLang GitHub for an OWL/HOWL PR
- If you want to train a real LongSpec drafter for Qwen3.5-35B-A3B, the training code was released July 2025. But you need to: (a) use the actual base BF16 model as teacher, (b) read longspec/train/README.md for real argument names, (c) accept it's a research project needing weeks of iteration
- Alternatively: watch SpecForge — PR #468 is adding Qwen3 MoE draft support. An EAGLE3 drafter for Qwen3.5-35B-A3B trained via SpecForge is likely the production-ready path when it lands
Phase 4 — NVFP4 (When the ARM64 Bug is Fixed)
NVFP4 is legitimately ~20% faster than AWQ when it works. The path to get there:
- Follow vLLM bug #35519 for the ARM64/GB10 Qwen3.5 NVFP4 fix
- Use the Avarok Docker image for CUTLASS 4.4 + SM121a patch (currently vLLM-based, SGLang support TBD)
- Use Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 once the kernel bug is resolved
📋 Summary Scoreboard
| Claim/Step in Original Plan | Verdict | Severity |
|---|---|---|
| --speculative-algo HOWL | ❌ Flag doesn't exist | 🔴 Critical |
| --speculative-draft-device cpu | ❌ Flag doesn't exist in SGLang | 🔴 Critical |
| Use sail/longspec-QwQ-32B-Preview for Qwen3.5 | ❌ Wrong architecture | 🔴 Critical |
| Training script parameters | ❌ Arguments appear fabricated | 🔴 Critical |
| "Marlin is the only way to max throughput" | ❌ NVFP4 now faster | 🟡 Misleading |
| MTP Layer Preservation in cyankiwi AWQ | ⚠️ Unverified claim | 🟡 Misleading |
| EAGLE3 on SGLang improves DGX Spark throughput | ❌ Actually 48-53% slower | 🟡 Misleading |
| CPU offloading saves VRAM on unified memory | ❌ Wrong mental model | 🟡 Misleading |
| --quantization awq_marlin | ❌ vLLM flag, not SGLang | 🟡 Misleading |
| --context-length 262144 | ⚠️ Wasteful for 150K use case | 🟡 Suboptimal |
| Performance cliff diagnosis | ✅ Accurate | ✅ |
| OWL/LSTM is right theoretical approach | ✅ Correct | ✅ |
| --kv-cache-dtype fp8 | ✅ Valid | ✅ |
| --enable-prefix-caching | ✅ Critical for your use case | ✅ |
💬 Personal Note
You are thinking about this at the right level of abstraction — the problem is real, the direction (context-agnostic LSTM drafter) is genuinely clever, and you clearly spent serious time on it. The issue is that Gemini filled in the implementation gaps with plausible-sounding but hallucinated specifics. The academic research (OWL) and the engineering reality (SGLang doesn’t have it yet) are currently about 6–12 months apart. The benchmarks show that on DGX Spark specifically, the biggest wins today come from prefix caching + FP8 KV cache + DFlash rather than from EAGLE3-style speculative decoding. When OWL/HOWL land in SGLang, your use case will be the ideal showcase workload.