Qwen3.5-122B-A10B NVFP4 Quantized for DGX Spark — 234GB → 75GB, Runs on 128GB

Thanks to @relc for sharing the config and @eugr for the spark-vllm-docker tooling that made this possible. Inspired by the AutoRound results people were getting, I ran a controlled comparison on my single DGX Spark (128 GB).

TL;DR: AutoRound INT4 is ~1.9x faster than NVFP4 with identical output quality. fastsafetensors works at 0.85 utilization, cutting startup from 9 min to 2 min.


Setup

Hardware: Single DGX Spark (GB10), 128 GB unified memory, SM121

AutoRound INT4 NVFP4
Model Intel/Qwen3.5-122B-A10B-int4-AutoRound txn545/Qwen3.5-122B-A10B-NVFP4
Quantization Intel AutoRound (GPTQ/Marlin) NVIDIA ModelOpt v0.42.0
Size on disk 67 GB (14 shards) 78 GB (2 shards)
GPU memory 62.65 GiB ~63 GiB
Docker image vllm-node-tf5 (eugr’s, vLLM 0.17.0rc1, transformers v5.3.0) dgx-vllm-qwen35:v1-gate-fix (Avarok’s, vLLM 0.16.0rc2)
Quantization kernel MarlinLinearKernel ModelOpt NVFP4
Context 262K 262K
gpu_memory_utilization 0.85 0.75
KV cache dtype bf16 fp8

AutoRound mods applied: fix-qwen3.5-autoround (rope validation fix for transformers v5) + fix-qwen3.5-chat-template (unsloth.jinja).

AutoRound env: VLLM_MARLIN_USE_ATOMIC_ADD=1

Launch command (single Spark, TP=1):

vllm serve /models/Intel-Qwen3.5-122B-A10B-int4-AutoRound \
  --max-model-len 262144 --gpu-memory-utilization 0.85 \
  --port 8080 --host 0.0.0.0 \
  --load-format fastsafetensors \
  --enable-prefix-caching --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml --reasoning-parser qwen3 \
  --max-num-batched-tokens 8192 --trust-remote-code \
  --chat-template unsloth.jinja

Speed Comparison

All tests: single request, sequential, temperature=0.3, warmup excluded.

Test AutoRound INT4 NVFP4 Speedup
Think mode (400 tok) 14.1s = 28.4 tok/s 26.5s = 15.1 tok/s 1.88x
Text generation (500 tok) 17.5s = 28.6 tok/s 33.1s = 15.1 tok/s 1.89x
Turkish language (200 tok) 7.2s = 27.7 tok/s 12.8s = 14.8 tok/s 1.87x
Vision / OCR (524 tok) 30.5s = 17.2 tok/s 46.7s = 9.8 tok/s 1.76x
Tool calling (72 tok) 12.6s = 5.7 tok/s N/A
Server-reported peak 28.7 tok/s 15.4 tok/s 1.86x

The MarlinLinearKernel makes a huge difference on SM121.


Quality Comparison

I tested both models with identical prompts. Results:

  • Factual Q&A: Identical answers (“The capital of Turkey is Ankara.”)
  • Turkish language: Both produced correct TÜİK 2023 data with proper diacritics (ü, ö, ç, ş, ı, İ)
  • Vision (signature circular OCR): Both extracted the same 2 signatories, notary information, and authority types from a scanned Turkish legal document (İmza Sirküleri)
  • Think/nothink separation: AutoRound correctly separates reasoning and content fields via --reasoning-parser qwen3
  • Tool calling: AutoRound generates valid structured JSON for function calls via --tool-call-parser qwen3_xml

No observable quality degradation switching from NVFP4 to AutoRound INT4.


fastsafetensors — Works at 0.85 Utilization!

fastsafetensors previously caused a system freeze with NVFP4 at 0.84 util (the 78 GB model + temp buffer exceeded 128 GB during GPU-direct loading). AutoRound’s smaller footprint (67 GB) leaves enough headroom.

Phase Standard loading fastsafetensors
Weight loading 430s (7.2 min) 60s (1 min)
torch.compile 15.9s 0.9s (cached)
CUDA graph capture 13s 13s
Total startup ~9 min ~2 min
Generation speed 28.6 tok/s 28.7 tok/s (identical)

Memory Breakdown

Total GPU memory:        128.0 GiB
gpu_memory_utilization:  0.85 → 108.8 GiB budget
Model weights:            62.65 GiB
CUDA graph pool:           1.29 GiB
Available for KV cache:  ~44.9 GiB (bf16)
Max concurrency @ 262K:    5.57x

Bottom Line

On a single DGX Spark, switching from NVFP4 to AutoRound INT4:

  • 1.85x faster generation (28 vs 15 tok/s)
  • 1.76x faster vision/OCR (17 vs 10 tok/s)
  • 7x faster startup with fastsafetensors (2 min vs 9-11 min)
  • No quality loss in text, Turkish, vision, tool calling, or reasoning
  • 11 GB smaller on disk (67 vs 78 GB)

We’ve switched our production contract management system to AutoRound + fastsafetensors. Running 262K context at 0.85 util on a single Spark with vision, tool calling, and think/nothink mode — all working.

Thanks again to everyone in this thread for the configs and tooling. This community is making the Spark ecosystem much more accessible.