You need to run a recent version of vLLM with the latest fastsafetensors for that, e.g. our community Docker image.
I haven’t tested it at high gpu-memory-utilization yet, though; I just saw some related PRs merged.
Thank you for the tip. As far as I saw today, you released a new version. I would like to start testing with that one.
By the way, does Intel AutoRound support vision? I am using both text and vision.
As eugr already noted, this is incorrect.
I use 0.85 gpu-memory-utilization with fastsafetensors, no problem. Here is my startup script. (I have locally cached models, NOT using the HF cache, so I need to map my directory in; and I run this startup script from my home directory, so I have to prefix the mod paths. Otherwise it is effectively the recipe.)
VLLM_SPARK_EXTRA_DOCKER_ARGS="-v $HOME/models:/models" \
~/containers/spark-vllm-docker/launch-cluster.sh -t vllm-node-tf5 --solo \
--apply-mod ~/containers/spark-vllm-docker/mods/fix-qwen3.5-autoround \
--apply-mod ~/containers/spark-vllm-docker/mods/fix-qwen3.5-chat-template \
-e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
exec vllm serve /models/Qwen3.5-122B-A10B-int4-AutoRound \
--max-model-len 262144 \
--gpu-memory-utilization 0.85 \
--port 8000 \
--host 0.0.0.0 \
--load-format fastsafetensors \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--max-num-batched-tokens 8192 \
--trust-remote-code \
--chat-template unsloth.jinja
No memory issues on startup. Under 70 sec for the shards, under 5 minutes cold start to server available. Total KV cache available with this setup is 381k tokens.
It does!
It may be the model then. I will try tomorrow and report back. Thanks.
FYI, new “stable” versions are released nightly if they pass regression testing on multiple models in both solo and cluster configurations. You can always compile the most recent commit by using the --rebuild-vllm flag, but for most people it makes sense to use the precompiled wheels that are downloaded by default.
I think I like it. Playing with it right now. Vision is on. My first two-node text and vision run, let’s go!
recipe_version: "1"
name: Qwen3.5-122B-INT4-Autoround
description: vLLM serving Qwen3.5-122B-INT4-Autoround
model: Intel/Qwen3.5-122B-A10B-int4-AutoRound
container: vllm-node-tf5
build_args:
  - --tf5
mods:
  - mods/fix-qwen3.5-autoround
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.88
  max_model_len: 262144
  max_num_batched_tokens: 8192
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1
command: |
  vllm serve Intel/Qwen3.5-122B-A10B-int4-AutoRound \
    --max-model-len {max_model_len} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --port {port} \
    --host {host} \
    --load-format fastsafetensors \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --trust-remote-code \
    --distributed-executor-backend ray \
    --reasoning-parser qwen3 \
    -tp {tensor_parallel}

(.venv) ping@spark-5c99:~/llama-benchy$ curl -s http://169.254.7.199:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Intel/Qwen3.5-122B-A10B-int4-AutoRound",
    "messages": [{
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {"url": "https://http.cat/200"}
        },
        {
          "type": "text",
          "text": "What do you see in this image? Describe it in detail."
        }
      ]
    }],
    "max_tokens": 2048,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "presence_penalty": 0.0
  }' | python3 -c "
import sys, json
r = json.load(sys.stdin)
if 'error' in r:
    print('ERROR:', r['error'])
else:
    print(r['choices'][0]['message']['content'])
    print('finish_reason:', r['choices'][0].get('finish_reason'))
print('tokens:', r.get('usage'))
"

Based on the image provided, here is a detailed description:
Visual Content:
The image is formatted as a “demotivational poster,” featuring a photograph centered on a black background with a caption below it.
- The Subject: The main focus is a close-up photograph of a white cat. The cat is positioned on the right side of the frame, looking slightly off to the left. It has a distinct expression—its eyes are narrowed and yellowish-green, giving it a look of skepticism, judgment, or mild annoyance. It has a pink nose and long white whiskers.
- The Background Object: To the left of the cat is an open laptop computer. The screen is illuminated and shows a window that looks like a standard desktop interface (possibly an older version of Windows) with some text and colored bars (green and orange), though the screen is out of focus. The laptop itself is silver or light grey.
Text:
Below the photograph, on the black background, is text in white:
- “200”: Written in a large, serif font.
- “OK”: Written in a much smaller, sans-serif font directly underneath the number.
Context/Meaning:
The text “200 OK” refers to the standard HTTP status code for a successful web request. The humor of the image comes from the juxtaposition of this technical success message with the cat’s unimpressed, skeptical, or “meh” facial expression. It suggests that even though the computer task was successful, the cat (or the user) is not particularly impressed by it.
finish_reason: stop
tokens: {‘prompt_tokens’: 463, ‘total_tokens’: 1647, ‘completion_tokens’: 1184, ‘prompt_tokens_details’: None}
Thank you! I thought I was using fastsafetensors, but apparently I wasn’t. Once I re-ran the launch command I saw “Loading safetensors using Fastsafetensors” and the weights took only ~69 sec to load.
Thanks to @relc for sharing the config and @eugr for the spark-vllm-docker tooling that made this possible. Inspired by the AutoRound results people were getting, I ran a controlled comparison on my single DGX Spark (128 GB).
TL;DR: AutoRound INT4 is ~1.9x faster than NVFP4 with identical output quality. fastsafetensors works at 0.85 utilization, cutting startup from 9 min to 2 min.
Setup
Hardware: Single DGX Spark (GB10), 128 GB unified memory, SM121
| | AutoRound INT4 | NVFP4 |
|---|---|---|
| Model | Intel/Qwen3.5-122B-A10B-int4-AutoRound | txn545/Qwen3.5-122B-A10B-NVFP4 |
| Quantization | Intel AutoRound (GPTQ/Marlin) | NVIDIA ModelOpt v0.42.0 |
| Size on disk | 67 GB (14 shards) | 78 GB (2 shards) |
| GPU memory | 62.65 GiB | ~63 GiB |
| Docker image | vllm-node-tf5 (eugr’s, vLLM 0.17.0rc1, transformers v5.3.0) | dgx-vllm-qwen35:v1-gate-fix (Avarok’s, vLLM 0.16.0rc2) |
| Quantization kernel | MarlinLinearKernel | ModelOpt NVFP4 |
| Context | 262K | 262K |
| gpu_memory_utilization | 0.85 | 0.75 |
| KV cache dtype | bf16 | fp8 |
AutoRound mods applied: fix-qwen3.5-autoround (rope validation fix for transformers v5) + fix-qwen3.5-chat-template (unsloth.jinja).
AutoRound env: VLLM_MARLIN_USE_ATOMIC_ADD=1
Launch command (single Spark, TP=1):
vllm serve /models/Intel-Qwen3.5-122B-A10B-int4-AutoRound \
--max-model-len 262144 --gpu-memory-utilization 0.85 \
--port 8080 --host 0.0.0.0 \
--load-format fastsafetensors \
--enable-prefix-caching --enable-auto-tool-choice \
--tool-call-parser qwen3_xml --reasoning-parser qwen3 \
--max-num-batched-tokens 8192 --trust-remote-code \
--chat-template unsloth.jinja
Speed Comparison
All tests: single request, sequential, temperature=0.3, warmup excluded.
| Test | AutoRound INT4 | NVFP4 | Speedup |
|---|---|---|---|
| Think mode (400 tok) | 14.1s = 28.4 tok/s | 26.5s = 15.1 tok/s | 1.88x |
| Text generation (500 tok) | 17.5s = 28.6 tok/s | 33.1s = 15.1 tok/s | 1.89x |
| Turkish language (200 tok) | 7.2s = 27.7 tok/s | 12.8s = 14.8 tok/s | 1.87x |
| Vision / OCR (524 tok) | 30.5s = 17.2 tok/s | 46.7s = 9.8 tok/s | 1.76x |
| Tool calling (72 tok) | 12.6s = 5.7 tok/s | N/A | — |
| Server-reported peak | 28.7 tok/s | 15.4 tok/s | 1.86x |
The MarlinLinearKernel makes a huge difference on SM121.
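For anyone skimming, the speedup column is just the ratio of the per-test throughputs. A quick check, using only the numbers reported in the table above:

```python
# Recompute the speedup column from the reported tokens/sec figures.
results = {
    "think (400 tok)":      (28.4, 15.1),
    "text (500 tok)":       (28.6, 15.1),
    "turkish (200 tok)":    (27.7, 14.8),
    "vision/OCR (524 tok)": (17.2, 9.8),
    "server peak":          (28.7, 15.4),
}
for test, (autoround_tps, nvfp4_tps) in results.items():
    print(f"{test}: {autoround_tps / nvfp4_tps:.2f}x")
```

The ratios come out to 1.88x, 1.89x, 1.87x, 1.76x, and 1.86x, matching the table; only the vision test falls noticeably below the ~1.9x text speedup.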
Quality Comparison
I tested both models with identical prompts. Results:
- Factual Q&A: Identical answers (“The capital of Turkey is Ankara.”)
- Turkish language: Both produced correct TÜİK 2023 data with proper diacritics (ü, ö, ç, ş, ı, İ)
- Vision (signature circular OCR): Both extracted the same 2 signatories, notary information, and authority types from a scanned Turkish legal document (İmza Sirküleri)
- Think/nothink separation: AutoRound correctly separates reasoning and content fields via --reasoning-parser qwen3
- Tool calling: AutoRound generates valid structured JSON for function calls via --tool-call-parser qwen3_xml
No observable quality degradation switching from NVFP4 to AutoRound INT4.
fastsafetensors — Works at 0.85 Utilization!
fastsafetensors previously caused a system freeze with NVFP4 at 0.84 util (the 78 GB model + temp buffer exceeded 128 GB during GPU-direct loading). AutoRound’s smaller footprint (67 GB) leaves enough headroom.
| Phase | Standard loading | fastsafetensors |
|---|---|---|
| Weight loading | 430s (7.2 min) | 60s (1 min) |
| torch.compile | 15.9s | 0.9s (cached) |
| CUDA graph capture | 13s | 13s |
| Total startup | ~9 min | ~2 min |
| Generation speed | 28.6 tok/s | 28.7 tok/s (identical) |
Memory Breakdown
Total GPU memory: 128.0 GiB
gpu_memory_utilization: 0.85 → 108.8 GiB budget
Model weights: 62.65 GiB
CUDA graph pool: 1.29 GiB
Available for KV cache: ~44.9 GiB (bf16)
Max concurrency @ 262K: 5.57x
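The breakdown above is simple arithmetic; here is a sketch that reproduces it. Note the per-token KV figure at the end is only implied by the reported 5.57x concurrency — vLLM does not print it directly:

```python
GiB = 1024 ** 3

total       = 128.0   # GiB of unified memory
utilization = 0.85
weights     = 62.65   # GiB, from the vLLM startup log
graphs      = 1.29    # GiB, CUDA graph pool

budget   = total * utilization          # what vLLM is allowed to use
kv_cache = budget - weights - graphs    # what is left for KV cache
print(f"budget:   {budget:.1f} GiB")    # 108.8
print(f"KV cache: {kv_cache:.1f} GiB")  # ~44.9

# Implied KV footprint per token at 5.57x concurrency over a
# 262,144-token context (bf16 cache):
tokens = 5.57 * 262_144
print(f"~{kv_cache * GiB / tokens / 1024:.0f} KiB per token")
```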
Bottom Line
On a single DGX Spark, switching from NVFP4 to AutoRound INT4:
- 1.85x faster generation (28 vs 15 tok/s)
- 1.76x faster vision/OCR (17 vs 10 tok/s)
- 7x faster startup with fastsafetensors (2 min vs 9-11 min)
- No quality loss in text, Turkish, vision, tool calling, or reasoning
- 11 GB smaller on disk (67 vs 78 GB)
We’ve switched our production contract management system to AutoRound + fastsafetensors. Running 262K context at 0.85 util on a single Spark with vision, tool calling, and think/nothink mode — all working.
Thanks again to everyone in this thread for the configs and tooling. This community is making the Spark ecosystem much more accessible.
Wow, nice work. Thank you for sharing, alper.tor.
Good results. I will test.
Thank you; I’d been trying to get NVFP4 working for a while now, but the performance of AutoRound has just been too compelling to ignore. And @eugr, thanks for the amazing project; it has made onboarding my first Spark and running local agents surprisingly painless.
I’m running the Intel-Qwen3.5-122B-A10B-int4-AutoRound model using @eugr’s spark-vllm-docker image. Why is it using about 95 GB of memory? Shouldn’t it normally be using around 65 GB?
I ran it using the recipe provided on Github.
vLLM will take all the memory you have allowed it, up to the --gpu-memory-utilization value, which is 0.7 in the recipe. The weights may take 65 GB, but there are also CUDA graphs, and the rest goes to KV cache.
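One unit quirk that may account for part of the gap between the expected and observed numbers: vLLM budgets in binary GiB, while many monitoring tools report decimal GB. A sketch under that assumption (I don’t know which tool produced the ~95 GB reading):

```python
# A 0.7 utilization budget on a 128 GiB device, expressed in decimal GB.
# Assumption: the monitor reports GB (1e9 bytes) while vLLM uses GiB (2**30).
total_gib  = 128.0
budget_gib = total_gib * 0.70            # what vLLM will grow into
budget_gb  = budget_gib * 2**30 / 1e9    # same quantity in decimal GB
print(f"{budget_gib:.1f} GiB = {budget_gb:.1f} GB")  # 89.6 GiB = 96.2 GB
```

So a ~95 GB reading is roughly what a fully grown 0.7 budget looks like in decimal units.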
I got it. Thank you for your response.
@alper.tor alpertor version no longer exists. Is RedHatAI’s the best replacement?
I am using Intel’s AutoRound version. I get a sustained 30 tok/s and it works without issues. I will try again when a new NGC version is available.
Do you know what the difference is between qwen3_xml used for the tool-call parser and qwen3_coder used in eugr’s recipe?
That is how tool calls are communicated: it determines in what format your AI engine will emit and accept tool commands/requests.
I am eager to get some urgent work done using OpenClaw, so I haven’t played much with Claude Code + the local model we are talking about here. After a 16k-length run crashed soon after the model started an inference that involved browsing for more info, I set the model to 8k and tested it using Open WebUI. However, once this model config was integrated into OpenClaw, the short token allowance became the culprit that broke the setup.
Has anyone had any experience using Qwen3.5-122B-A10B 4-bit (Intel’s or NVFP4) locally with local OpenClaw? What’s the recipe to make it a success?
You can use my project llm-proxy to set up virtual models: all the same backend/real model, but each vmodel has its own parameters. Set or clamp them, log and inspect the params or even the whole context, use /models in OpenClaw to quickly switch between them, and hot-swap params in the config.
I’ll do a bigger post on here when I resolve a few more niggles.