Slow Qwen 3.6 27B NVFP4 - recipe feedback

Hey everyone, I’m trying to set up a single DGX Spark for with Nemoclaw for offline LLM workloads. Based on another forum entry I found that people have success with using DFlash for fast inference. However, I’m having no luck. I use eugr’s repo for Spark-specific builds as detailed in the forum post, but I’m not seeing the benchmarked ~30-40 tokens/sec in real life usage. This is the recipe I’m using right now, but I’ve had a couple of variations with more SpecDecoding tokens and higher batch sizes, but no dice:

name: qwen36-27b-prismascout-dflash-agentic-tps
description: Qwen3.6-27B PrismaSCOUT DFlash Spark optimized for 6-7 concurrent agents

model: rdtand/Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm
solo_only: true

container: vllm-node-dflash

build_args:
–apply-vllm-pr
“40898”
–tf5

mods:
mods/fix-qwen3.6-chat-template
mods/flashqla
defaults:
port: 8000
host: 0.0.0.0
tensor_parallel: 1
gpu_memory_utilization: 0.85
max_model_len: 262144
max_num_batched_tokens: 49152
max_num_seqs: 16
served_model_name: qwen/qwen3.6-27b
speculative_dflash: ‘{“method”:“dflash”,“model”:“z-lab/Qwen3.6-27B-DFlash”,“num_speculative_tokens”:7}’

env:
VLLM_ALLOW_LONG_MAX_MODEL_LEN: 1
PYTORCH_CUDA_ALLOC_CONF: expandable_segments:True
TORCH_MATMUL_PRECISION: high
NVIDIA_FORWARD_COMPAT: 1
NVIDIA_DISABLE_REQUIRE: 1
VLLM_MARLIN_USE_ATOMIC_ADD: 1

command: |
vllm serve rdtand/Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm 
–served-model-name {served_model_name} 
–host {host} 
–port {port} 
-tp {tensor_parallel} 
–max-model-len {max_model_len} 
–max-num-batched-tokens {max_num_batched_tokens} 
–max-num-seqs {max_num_seqs} 
–gpu-memory-utilization {gpu_memory_utilization} 
–dtype bfloat16 
–kv-cache-dtype fp8 
–load-format fastsafetensors 
–enable-prefix-caching 
–enable-auto-tool-choice 
–reasoning-parser qwen3 
–tool-call-parser qwen3_coder 
–chat-template fixed_chat_template.jinja 
–speculative-config ‘{speculative_dflash}’ 
–attention-backend flash_attn

Can you give me some feedback? The results quality-wise are fine, but I’m getting around 10 tokens per sec, sometimes around 30 but mostly when running parallel calls/subagents. Am I misconfiguring something, or is this a Spark limitation? I’d be happy to get any kind of feedback.

I tried it today, no luck either. Could be a new image pushed just yesterday. My regular 27B nvfp4 on 0.20 release is essentially the same but with working q8 kv cache, this one has Turboquant integrated but not wired into the backend, at least to FlashAttention this model relies on.