Hey everyone, I’m trying to set up a single DGX Spark with Nemoclaw for offline LLM workloads. From another forum thread I gathered that people have had success using DFlash for fast inference, but I’m having no luck. I’m using eugr’s repo for Spark-specific builds, as described in that thread, yet I’m not seeing the benchmarked ~30-40 tokens/sec in real-world use. I’ve tried a couple of variations with more speculative-decoding tokens and larger batch sizes, with no improvement. This is the recipe I’m running right now:
name: qwen36-27b-prismascout-dflash-agentic-tps
description: Qwen3.6-27B PrismaSCOUT DFlash Spark optimized for 6-7 concurrent agents
model: rdtand/Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm
solo_only: true
container: vllm-node-dflash
build_args:
  - --apply-vllm-pr
  - "40898"
  - --tf5
mods:
  - mods/fix-qwen3.6-chat-template
  - mods/flashqla
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.85
  max_model_len: 262144
  max_num_batched_tokens: 49152
  max_num_seqs: 16
  served_model_name: qwen/qwen3.6-27b
  speculative_dflash: '{"method":"dflash","model":"z-lab/Qwen3.6-27B-DFlash","num_speculative_tokens":7}'
env:
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: 1
  PYTORCH_CUDA_ALLOC_CONF: expandable_segments:True
  TORCH_MATMUL_PRECISION: high
  NVIDIA_FORWARD_COMPAT: 1
  NVIDIA_DISABLE_REQUIRE: 1
  VLLM_MARLIN_USE_ATOMIC_ADD: 1
command: |
  vllm serve rdtand/Qwen3.6-27B-PrismaSCOUT-Blackwell-NVFP4-BF16-vllm \
    --served-model-name {served_model_name} \
    --host {host} \
    --port {port} \
    -tp {tensor_parallel} \
    --max-model-len {max_model_len} \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --max-num-seqs {max_num_seqs} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --dtype bfloat16 \
    --kv-cache-dtype fp8 \
    --load-format fastsafetensors \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --chat-template fixed_chat_template.jinja \
    --speculative-config '{speculative_dflash}' \
    --attention-backend flash_attn
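
One thing that can be ruled out quickly is a malformed speculative-config string, since that value is JSON embedded in YAML and is easy to break when the recipe is copied through a rich-text editor. The following is a minimal sketch, not part of the recipe tooling: `check_speculative_config` is a hypothetical helper, and the three keys it checks are simply the ones this recipe sets, not a statement about what vLLM requires.

```python
import json

# Keys taken from the recipe's speculative_dflash value (assumption: these
# are the ones worth checking; vLLM itself may accept more or fewer).
EXPECTED_KEYS = {"method", "model", "num_speculative_tokens"}


def check_speculative_config(raw: str) -> dict:
    """Parse the JSON string and check the fields used in this recipe."""
    # json.loads raises json.JSONDecodeError if, for example, curly quotes
    # replaced the straight double quotes during copy and paste.
    cfg = json.loads(raw)
    missing = EXPECTED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    n = cfg["num_speculative_tokens"]
    if not isinstance(n, int) or n < 1:
        raise ValueError("num_speculative_tokens must be a positive integer")
    return cfg


if __name__ == "__main__":
    raw = '{"method":"dflash","model":"z-lab/Qwen3.6-27B-DFlash","num_speculative_tokens":7}'
    print(check_speculative_config(raw))
```

If the string parses here but the draft model still seems inactive, the server startup log is the next place to look.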
Can you give me some feedback? Output quality is fine, but I’m getting around 10 tokens/sec. Sometimes I see around 30, but mostly when running parallel calls/subagents. Am I misconfiguring something, or is this a Spark limitation? Any feedback is welcome.
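
To compare these numbers with a benchmark figure, it helps to separate per-request decode speed from aggregate throughput across concurrent requests, since a benchmark may report either one. Below is a minimal sketch of that arithmetic. The `RequestStat` type and both functions are hypothetical helpers, and the sample values are made up for illustration rather than measured on a Spark.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestStat:
    completion_tokens: int  # tokens generated for one request
    start: float            # wall-clock start time in seconds
    end: float              # wall-clock end time in seconds


def per_request_tps(stats: List[RequestStat]) -> List[float]:
    """Tokens per second as seen by each individual request."""
    return [s.completion_tokens / (s.end - s.start) for s in stats]


def aggregate_tps(stats: List[RequestStat]) -> float:
    """Total tokens divided by the wall-clock window covering all requests."""
    total = sum(s.completion_tokens for s in stats)
    window = max(s.end for s in stats) - min(s.start for s in stats)
    return total / window


if __name__ == "__main__":
    # Illustrative only: three requests running fully in parallel,
    # each producing 300 tokens in 30 seconds.
    stats = [RequestStat(300, 0.0, 30.0) for _ in range(3)]
    print(per_request_tps(stats))
    print(aggregate_tps(stats))
```

With these made-up inputs each request sees 10 tokens/sec while the server delivers 30 tokens/sec in total, so it is worth checking which of the two a given benchmark reports before comparing.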