Qwen3.5-122B-A10B NVFP4 Quantized for DGX Spark — 234GB → 75GB, Runs on 128GB

alper.tor · March 10, 2026, 10:06am

Thanks to @relc for sharing the config and @eugr for the spark-vllm-docker tooling that made this possible. Inspired by the AutoRound results people were getting, I ran a controlled comparison on my single DGX Spark (128 GB).

TL;DR: AutoRound INT4 is ~1.9x faster than NVFP4 with identical output quality. fastsafetensors works at 0.85 utilization, cutting startup from 9 min to 2 min.

Setup

Hardware: Single DGX Spark (GB10), 128 GB unified memory, SM121

	AutoRound INT4	NVFP4
Model	`Intel/Qwen3.5-122B-A10B-int4-AutoRound`	`txn545/Qwen3.5-122B-A10B-NVFP4`
Quantization	Intel AutoRound (GPTQ/Marlin)	NVIDIA ModelOpt v0.42.0
Size on disk	67 GB (14 shards)	78 GB (2 shards)
GPU memory	62.65 GiB	~63 GiB
Docker image	`vllm-node-tf5` (eugr’s, vLLM 0.17.0rc1, transformers v5.3.0)	`dgx-vllm-qwen35:v1-gate-fix` (Avarok’s, vLLM 0.16.0rc2)
Quantization kernel	MarlinLinearKernel	ModelOpt NVFP4
Context	262K	262K
gpu_memory_utilization	0.85	0.75
KV cache dtype	bf16	fp8

AutoRound mods applied: fix-qwen3.5-autoround (rope validation fix for transformers v5) + fix-qwen3.5-chat-template (unsloth.jinja).

AutoRound env: VLLM_MARLIN_USE_ATOMIC_ADD=1

Launch command (single Spark, TP=1):

vllm serve /models/Intel-Qwen3.5-122B-A10B-int4-AutoRound \
  --max-model-len 262144 --gpu-memory-utilization 0.85 \
  --port 8080 --host 0.0.0.0 \
  --load-format fastsafetensors \
  --enable-prefix-caching --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml --reasoning-parser qwen3 \
  --max-num-batched-tokens 8192 --trust-remote-code \
  --chat-template unsloth.jinja

Speed Comparison

All tests: single request, sequential, temperature=0.3, warmup excluded.

Test	AutoRound INT4	NVFP4	Speedup
Think mode (400 tok)	14.1s = 28.4 tok/s	26.5s = 15.1 tok/s	1.88x
Text generation (500 tok)	17.5s = 28.6 tok/s	33.1s = 15.1 tok/s	1.89x
Turkish language (200 tok)	7.2s = 27.7 tok/s	12.8s = 14.8 tok/s	1.87x
Vision / OCR (524 tok)	30.5s = 17.2 tok/s	46.7s = 9.8 tok/s	1.76x
Tool calling (72 tok)	12.6s = 5.7 tok/s	N/A	—
Server-reported peak	28.7 tok/s	15.4 tok/s	1.86x

The MarlinLinearKernel makes a huge difference on SM121.

Quality Comparison

I tested both models with identical prompts. Results:

Factual Q&A: Identical answers (“The capital of Turkey is Ankara.”)
Turkish language: Both produced correct TÜİK 2023 data with proper diacritics (ü, ö, ç, ş, ı, İ)
Vision (signature circular OCR): Both extracted the same 2 signatories, notary information, and authority types from a scanned Turkish legal document (İmza Sirküleri)
Think/nothink separation: AutoRound correctly separates reasoning and content fields via --reasoning-parser qwen3
Tool calling: AutoRound generates valid structured JSON for function calls via --tool-call-parser qwen3_xml

No observable quality degradation switching from NVFP4 to AutoRound INT4.

fastsafetensors — Works at 0.85 Utilization!

fastsafetensors previously caused a system freeze with NVFP4 at 0.84 util (the 78 GB model + temp buffer exceeded 128 GB during GPU-direct loading). AutoRound’s smaller footprint (67 GB) leaves enough headroom.

Phase	Standard loading	fastsafetensors
Weight loading	430s (7.2 min)	60s (1 min)
torch.compile	15.9s	0.9s (cached)
CUDA graph capture	13s	13s
Total startup	~9 min	~2 min
Generation speed	28.6 tok/s	28.7 tok/s (identical)

Memory Breakdown

Total GPU memory:        128.0 GiB
gpu_memory_utilization:  0.85 → 108.8 GiB budget
Model weights:            62.65 GiB
CUDA graph pool:           1.29 GiB
Available for KV cache:  ~44.9 GiB (bf16)
Max concurrency @ 262K:    5.57x

Bottom Line

On a single DGX Spark, switching from NVFP4 to AutoRound INT4:

1.85x faster generation (28 vs 15 tok/s)
1.76x faster vision/OCR (17 vs 10 tok/s)
7x faster startup with fastsafetensors (2 min vs 9-11 min)
No quality loss in text, Turkish, vision, tool calling, or reasoning
11 GB smaller on disk (67 vs 78 GB)

We’ve switched our production contract management system to AutoRound + fastsafetensors. Running 262K context at 0.85 util on a single Spark with vision, tool calling, and think/nothink mode — all working.

Thanks again to everyone in this thread for the configs and tooling. This community is making the Spark ecosystem much more accessible.

Topic		Replies	Views
Qwen/Qwen3.5-122B-A10B - Alibaba/Qwen thought about us... :-D DGX Spark / GB10	340	17052	March 24, 2026
RedHatAI/Qwen3.5-122B-A10B-NVFP4 seems to be the best option for a single Spark DGX Spark / GB10 Projects llm	75	6434	May 4, 2026
Qwen3.5-397B-A17B + DGX Spark (duo) DGX Spark / GB10 Projects	62	6248	June 14, 2026
Qwen3.5-397B-A17B run in dual spark! but I have a concern DGX Spark / GB10	236	9454	June 6, 2026
Qwen3.5-122B-A10B on single Spark: up to 51 tok/s (v2.1 — patches + quick-start + benchmark) DGX Spark / GB10 cuda , performance , docker , performance-tuning , llm	434	22465	June 24, 2026
What's the best speed we can get with Qwen 3.6 27B without quantizing? DGX Spark / GB10	55	18705	June 27, 2026
DGX Spark + Qwen3-Next-80B: Proven Performance, But Missing Clear Path to NIM, TensorRT-LLM & Web UIs DGX Spark / GB10 cuda , nim , llama	16	4767	March 6, 2026
Introducing Spark Auto Round /w OpenCode Instruct dataset DGX Spark / GB10 cuda , spark , agentic-ai	87	2450	June 27, 2026
Can someone please just help me set the DGX Spark up for optimal LLM use? DGX Spark / GB10 llama	11	1048	June 20, 2026
HOW-TO: Run Qwen3-Coder-Next on Spark DGX Spark / GB10 llama	92	10340	March 24, 2026