Thanks to @relc for sharing the config and @eugr for the spark-vllm-docker tooling that made this possible. Inspired by the AutoRound results people were getting, I ran a controlled comparison on my single DGX Spark (128 GB).
TL;DR: AutoRound INT4 is ~1.9x faster than NVFP4 with identical output quality. fastsafetensors works at 0.85 utilization, cutting startup from 9 min to 2 min.
Setup
Hardware: Single DGX Spark (GB10), 128 GB unified memory, SM121
| AutoRound INT4 | NVFP4 | |
|---|---|---|
| Model | Intel/Qwen3.5-122B-A10B-int4-AutoRound |
txn545/Qwen3.5-122B-A10B-NVFP4 |
| Quantization | Intel AutoRound (GPTQ/Marlin) | NVIDIA ModelOpt v0.42.0 |
| Size on disk | 67 GB (14 shards) | 78 GB (2 shards) |
| GPU memory | 62.65 GiB | ~63 GiB |
| Docker image | vllm-node-tf5 (eugr’s, vLLM 0.17.0rc1, transformers v5.3.0) |
dgx-vllm-qwen35:v1-gate-fix (Avarok’s, vLLM 0.16.0rc2) |
| Quantization kernel | MarlinLinearKernel | ModelOpt NVFP4 |
| Context | 262K | 262K |
| gpu_memory_utilization | 0.85 | 0.75 |
| KV cache dtype | bf16 | fp8 |
AutoRound mods applied: fix-qwen3.5-autoround (rope validation fix for transformers v5) + fix-qwen3.5-chat-template (unsloth.jinja).
AutoRound env: VLLM_MARLIN_USE_ATOMIC_ADD=1
Launch command (single Spark, TP=1):
vllm serve /models/Intel-Qwen3.5-122B-A10B-int4-AutoRound \
--max-model-len 262144 --gpu-memory-utilization 0.85 \
--port 8080 --host 0.0.0.0 \
--load-format fastsafetensors \
--enable-prefix-caching --enable-auto-tool-choice \
--tool-call-parser qwen3_xml --reasoning-parser qwen3 \
--max-num-batched-tokens 8192 --trust-remote-code \
--chat-template unsloth.jinja
Speed Comparison
All tests: single request, sequential, temperature=0.3, warmup excluded.
| Test | AutoRound INT4 | NVFP4 | Speedup |
|---|---|---|---|
| Think mode (400 tok) | 14.1s = 28.4 tok/s | 26.5s = 15.1 tok/s | 1.88x |
| Text generation (500 tok) | 17.5s = 28.6 tok/s | 33.1s = 15.1 tok/s | 1.89x |
| Turkish language (200 tok) | 7.2s = 27.7 tok/s | 12.8s = 14.8 tok/s | 1.87x |
| Vision / OCR (524 tok) | 30.5s = 17.2 tok/s | 46.7s = 9.8 tok/s | 1.76x |
| Tool calling (72 tok) | 12.6s = 5.7 tok/s | N/A | — |
| Server-reported peak | 28.7 tok/s | 15.4 tok/s | 1.86x |
The MarlinLinearKernel makes a huge difference on SM121.
Quality Comparison
I tested both models with identical prompts. Results:
- Factual Q&A: Identical answers (“The capital of Turkey is Ankara.”)
- Turkish language: Both produced correct TÜİK 2023 data with proper diacritics (ü, ö, ç, ş, ı, İ)
- Vision (signature circular OCR): Both extracted the same 2 signatories, notary information, and authority types from a scanned Turkish legal document (İmza Sirküleri)
- Think/nothink separation: AutoRound correctly separates
reasoningandcontentfields via--reasoning-parser qwen3 - Tool calling: AutoRound generates valid structured JSON for function calls via
--tool-call-parser qwen3_xml
No observable quality degradation switching from NVFP4 to AutoRound INT4.
fastsafetensors — Works at 0.85 Utilization!
fastsafetensors previously caused a system freeze with NVFP4 at 0.84 util (the 78 GB model + temp buffer exceeded 128 GB during GPU-direct loading). AutoRound’s smaller footprint (67 GB) leaves enough headroom.
| Phase | Standard loading | fastsafetensors |
|---|---|---|
| Weight loading | 430s (7.2 min) | 60s (1 min) |
| torch.compile | 15.9s | 0.9s (cached) |
| CUDA graph capture | 13s | 13s |
| Total startup | ~9 min | ~2 min |
| Generation speed | 28.6 tok/s | 28.7 tok/s (identical) |
Memory Breakdown
Total GPU memory: 128.0 GiB
gpu_memory_utilization: 0.85 → 108.8 GiB budget
Model weights: 62.65 GiB
CUDA graph pool: 1.29 GiB
Available for KV cache: ~44.9 GiB (bf16)
Max concurrency @ 262K: 5.57x
Bottom Line
On a single DGX Spark, switching from NVFP4 to AutoRound INT4:
- 1.85x faster generation (28 vs 15 tok/s)
- 1.76x faster vision/OCR (17 vs 10 tok/s)
- 7x faster startup with fastsafetensors (2 min vs 9-11 min)
- No quality loss in text, Turkish, vision, tool calling, or reasoning
- 11 GB smaller on disk (67 vs 78 GB)
We’ve switched our production contract management system to AutoRound + fastsafetensors. Running 262K context at 0.85 util on a single Spark with vision, tool calling, and think/nothink mode — all working.
Thanks again to everyone in this thread for the configs and tooling. This community is making the Spark ecosystem much more accessible.