You need to run a recent version of vLLM with the latest fastsafetensors for that, e.g. our community Docker image.
I haven’t tested it at high gpu-memory-utilization yet, though; I just saw some related PRs merged.
Thank you for the tip. As far as I saw today, you released a new version. I would like to start testing with that one.
By the way, does Intel AutoRound support vision? I am using both text and vision.
As eugr already noted, this is incorrect.
I use 0.85 gpu-memory-utilization with fastsafetensors, no problem. Here is my startup script. (I have locally cached models, NOT using the HF cache, so I need to map my directory in; and I run this startup script from my home directory, so I have to prefix the mod paths. Otherwise it is effectively the recipe.)
VLLM_SPARK_EXTRA_DOCKER_ARGS="-v $HOME/models:/models" \
~/containers/spark-vllm-docker/launch-cluster.sh -t vllm-node-tf5 --solo \
--apply-mod ~/containers/spark-vllm-docker/mods/fix-qwen3.5-autoround \
--apply-mod ~/containers/spark-vllm-docker/mods/fix-qwen3.5-chat-template \
-e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
exec vllm serve /models/Qwen3.5-122B-A10B-int4-AutoRound \
--max-model-len 262144 \
--gpu-memory-utilization 0.85 \
--port 8000 \
--host 0.0.0.0 \
--load-format fastsafetensors \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--max-num-batched-tokens 8192 \
--trust-remote-code \
--chat-template unsloth.jinja
No memory issues on startup. Under 70 sec for the shards, under 5 minutes cold start to server available. Total KV cache available with this setup is 381k tokens.
It does!
It may be the model then. I will try tomorrow and report back. Thanks.
FYI, new “stable” versions are released nightly if they pass regression testing on multiple models in both solo and cluster configurations. You can always compile the most recent commit by using the --rebuild-vllm flag, but for most people it makes sense to use the precompiled wheels that are downloaded by default.
I think I like it. Playing with it right now. Vision is on. My first two-node text and vision run, let’s go!
recipe_version: "1"
name: Qwen3.5-122B-INT4-Autoround
description: vLLM serving Qwen3.5-122B-INT4-Autoround
model: Intel/Qwen3.5-122B-A10B-int4-AutoRound
container: vllm-node-tf5
build_args:
  - --tf5
mods:
  - mods/fix-qwen3.5-autoround
defaults:
  port: 8000
  host: 0.0.0.0
  tensor_parallel: 2
  gpu_memory_utilization: 0.88
  max_model_len: 262144
  max_num_batched_tokens: 8192
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1
command: |
  vllm serve Intel/Qwen3.5-122B-A10B-int4-AutoRound \
    --max-model-len {max_model_len} \
    --gpu-memory-utilization {gpu_memory_utilization} \
    --port {port} \
    --host {host} \
    --load-format fastsafetensors \
    --enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --max-num-batched-tokens {max_num_batched_tokens} \
    --trust-remote-code \
    --distributed-executor-backend ray \
    --reasoning-parser qwen3 \
    -tp {tensor_parallel}

(.venv) ping@spark-5c99:~/llama-benchy$ curl -s http://169.254.7.199:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Intel/Qwen3.5-122B-A10B-int4-AutoRound",
    "messages": [{
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {"url": "https://http.cat/200"}
        },
        {
          "type": "text",
          "text": "What do you see in this image? Describe it in detail."
        }
      ]
    }],
    "max_tokens": 2048,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "presence_penalty": 0.0
  }' | python3 -c "
import sys, json
r = json.load(sys.stdin)
if 'error' in r:
    print('ERROR:', r['error'])
else:
    print(r['choices'][0]['message']['content'])
    print('finish_reason:', r['choices'][0].get('finish_reason'))
print('tokens:', r.get('usage'))
"

Based on the image provided, here is a detailed description:
Visual Content:
The image is formatted as a “demotivational poster,” featuring a photograph centered on a black background with a caption below it.
- The Subject: The main focus is a close-up photograph of a white cat. The cat is positioned on the right side of the frame, looking slightly off to the left. It has a distinct expression—its eyes are narrowed and yellowish-green, giving it a look of skepticism, judgment, or mild annoyance. It has a pink nose and long white whiskers.
- The Background Object: To the left of the cat is an open laptop computer. The screen is illuminated and shows a window that looks like a standard desktop interface (possibly an older version of Windows) with some text and colored bars (green and orange), though the screen is out of focus. The laptop itself is silver or light grey.
Text:
Below the photograph, on the black background, is text in white:
- “200”: Written in a large, serif font.
- “OK”: Written in a much smaller, sans-serif font directly underneath the number.
Context/Meaning:
The text “200 OK” refers to the standard HTTP status code for a successful web request. The humor of the image comes from the juxtaposition of this technical success message with the cat’s unimpressed, skeptical, or “meh” facial expression. It suggests that even though the computer task was successful, the cat (or the user) is not particularly impressed by it.
finish_reason: stop
tokens: {‘prompt_tokens’: 463, ‘total_tokens’: 1647, ‘completion_tokens’: 1184, ‘prompt_tokens_details’: None}
Thank you! I thought I was using fastsafetensors, but apparently I wasn’t. Once I re-ran the launch command I saw “Loading safetensors using Fastsafetensors” and the weights took only ~69 sec to load.
Thanks to @relc for sharing the config and @eugr for the spark-vllm-docker tooling that made this possible. Inspired by the AutoRound results people were getting, I ran a controlled comparison on my single DGX Spark (128 GB).
TL;DR: AutoRound INT4 is ~1.9x faster than NVFP4 with identical output quality. fastsafetensors works at 0.85 utilization, cutting startup from 9 min to 2 min.
Setup
Hardware: Single DGX Spark (GB10), 128 GB unified memory, SM121
| | AutoRound INT4 | NVFP4 |
|---|---|---|
| Model | Intel/Qwen3.5-122B-A10B-int4-AutoRound | txn545/Qwen3.5-122B-A10B-NVFP4 |
| Quantization | Intel AutoRound (GPTQ/Marlin) | NVIDIA ModelOpt v0.42.0 |
| Size on disk | 67 GB (14 shards) | 78 GB (2 shards) |
| GPU memory | 62.65 GiB | ~63 GiB |
| Docker image | vllm-node-tf5 (eugr’s, vLLM 0.17.0rc1, transformers v5.3.0) | dgx-vllm-qwen35:v1-gate-fix (Avarok’s, vLLM 0.16.0rc2) |
| Quantization kernel | MarlinLinearKernel | ModelOpt NVFP4 |
| Context | 262K | 262K |
| gpu_memory_utilization | 0.85 | 0.75 |
| KV cache dtype | bf16 | fp8 |
AutoRound mods applied: fix-qwen3.5-autoround (rope validation fix for transformers v5) + fix-qwen3.5-chat-template (unsloth.jinja).
AutoRound env: VLLM_MARLIN_USE_ATOMIC_ADD=1
Launch command (single Spark, TP=1):
vllm serve /models/Intel-Qwen3.5-122B-A10B-int4-AutoRound \
--max-model-len 262144 --gpu-memory-utilization 0.85 \
--port 8080 --host 0.0.0.0 \
--load-format fastsafetensors \
--enable-prefix-caching --enable-auto-tool-choice \
--tool-call-parser qwen3_xml --reasoning-parser qwen3 \
--max-num-batched-tokens 8192 --trust-remote-code \
--chat-template unsloth.jinja
Speed Comparison
All tests: single request, sequential, temperature=0.3, warmup excluded.
| Test | AutoRound INT4 | NVFP4 | Speedup |
|---|---|---|---|
| Think mode (400 tok) | 14.1s = 28.4 tok/s | 26.5s = 15.1 tok/s | 1.88x |
| Text generation (500 tok) | 17.5s = 28.6 tok/s | 33.1s = 15.1 tok/s | 1.89x |
| Turkish language (200 tok) | 7.2s = 27.7 tok/s | 12.8s = 14.8 tok/s | 1.87x |
| Vision / OCR (524 tok) | 30.5s = 17.2 tok/s | 46.7s = 9.8 tok/s | 1.76x |
| Tool calling (72 tok) | 12.6s = 5.7 tok/s | N/A | — |
| Server-reported peak | 28.7 tok/s | 15.4 tok/s | 1.86x |
The MarlinLinearKernel makes a huge difference on SM121.
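For anyone skimming, the speedup column is just the ratio of the per-test throughputs. A quick check, using only the numbers reported in the table above:

```python
# Recompute the speedup column from the reported tokens/sec figures.
results = {
    "think (400 tok)":      (28.4, 15.1),
    "text (500 tok)":       (28.6, 15.1),
    "turkish (200 tok)":    (27.7, 14.8),
    "vision/OCR (524 tok)": (17.2, 9.8),
    "server peak":          (28.7, 15.4),
}
for test, (autoround_tps, nvfp4_tps) in results.items():
    print(f"{test}: {autoround_tps / nvfp4_tps:.2f}x")
```

The ratios come out to 1.88x, 1.89x, 1.87x, 1.76x, and 1.86x, matching the table; only the vision test falls noticeably below the ~1.9x text speedup.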
Quality Comparison
I tested both models with identical prompts. Results:
- Factual Q&A: Identical answers (“The capital of Turkey is Ankara.”)
- Turkish language: Both produced correct TÜİK 2023 data with proper diacritics (ü, ö, ç, ş, ı, İ)
- Vision (signature circular OCR): Both extracted the same 2 signatories, notary information, and authority types from a scanned Turkish legal document (İmza Sirküleri)
- Think/nothink separation: AutoRound correctly separates reasoning and content fields via --reasoning-parser qwen3
- Tool calling: AutoRound generates valid structured JSON for function calls via --tool-call-parser qwen3_xml
No observable quality degradation switching from NVFP4 to AutoRound INT4.
fastsafetensors — Works at 0.85 Utilization!
fastsafetensors previously caused a system freeze with NVFP4 at 0.84 util (the 78 GB model + temp buffer exceeded 128 GB during GPU-direct loading). AutoRound’s smaller footprint (67 GB) leaves enough headroom.
| Phase | Standard loading | fastsafetensors |
|---|---|---|
| Weight loading | 430s (7.2 min) | 60s (1 min) |
| torch.compile | 15.9s | 0.9s (cached) |
| CUDA graph capture | 13s | 13s |
| Total startup | ~9 min | ~2 min |
| Generation speed | 28.6 tok/s | 28.7 tok/s (identical) |
Memory Breakdown
Total GPU memory: 128.0 GiB
gpu_memory_utilization: 0.85 → 108.8 GiB budget
Model weights: 62.65 GiB
CUDA graph pool: 1.29 GiB
Available for KV cache: ~44.9 GiB (bf16)
Max concurrency @ 262K: 5.57x
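The breakdown above is simple arithmetic; here is a sketch that reproduces it. Note the per-token KV figure at the end is only implied by the reported 5.57x concurrency — vLLM does not print it directly:

```python
GiB = 1024 ** 3

total       = 128.0   # GiB of unified memory
utilization = 0.85
weights     = 62.65   # GiB, from the vLLM startup log
graphs      = 1.29    # GiB, CUDA graph pool

budget   = total * utilization          # what vLLM is allowed to use
kv_cache = budget - weights - graphs    # what is left for KV cache
print(f"budget:   {budget:.1f} GiB")    # 108.8
print(f"KV cache: {kv_cache:.1f} GiB")  # ~44.9

# Implied KV footprint per token at 5.57x concurrency over a
# 262,144-token context (bf16 cache):
tokens = 5.57 * 262_144
print(f"~{kv_cache * GiB / tokens / 1024:.0f} KiB per token")
```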
Bottom Line
On a single DGX Spark, switching from NVFP4 to AutoRound INT4:
- 1.85x faster generation (28 vs 15 tok/s)
- 1.76x faster vision/OCR (17 vs 10 tok/s)
- 7x faster startup with fastsafetensors (2 min vs 9-11 min)
- No quality loss in text, Turkish, vision, tool calling, or reasoning
- 11 GB smaller on disk (67 vs 78 GB)
We’ve switched our production contract management system to AutoRound + fastsafetensors. Running 262K context at 0.85 util on a single Spark with vision, tool calling, and think/nothink mode — all working.
Thanks again to everyone in this thread for the configs and tooling. This community is making the Spark ecosystem much more accessible.
Wow, nice work. Thank you for sharing, alper.tor.
Good results. I will test.
Thank you; I’d been trying to get NVFP4 working for a while now, but the performance of AutoRound has just been too compelling to ignore. And @eugr, thanks for the amazing project; it has made onboarding my first Spark and running local agents surprisingly painless.
I’m running the Intel-Qwen3.5-122B-A10B-int4-AutoRound model using @eugr’s spark-vllm-docker image. Why is it using about 95 GB of memory? Shouldn’t it normally be using around 65 GB?
I ran it using the recipe provided on Github.
vLLM will take all the memory you have allowed it, up to the --gpu-memory-utilization value, which is 0.7 in the recipe. The weights may take 65 GB, but there are also CUDA graphs, and the rest goes to KV cache.
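One unit quirk that may account for part of the gap between the expected and observed numbers: vLLM budgets in binary GiB, while many monitoring tools report decimal GB. A sketch under that assumption (I don’t know which tool produced the ~95 GB reading):

```python
# A 0.7 utilization budget on a 128 GiB device, expressed in decimal GB.
# Assumption: the monitor reports GB (1e9 bytes) while vLLM uses GiB (2**30).
total_gib  = 128.0
budget_gib = total_gib * 0.70            # what vLLM will grow into
budget_gb  = budget_gib * 2**30 / 1e9    # same quantity in decimal GB
print(f"{budget_gib:.1f} GiB = {budget_gb:.1f} GB")  # 89.6 GiB = 96.2 GB
```

So a ~95 GB reading is roughly what a fully grown 0.7 budget looks like in decimal units.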
I got it. Thank you for your response.
@alper.tor alpertor version no longer exists. Is RedHatAI’s the best replacement?
I am using Intel’s AutoRound version. I get a sustained 30 tok/s and it works without issues. I will try again when a new NGC version is available.
Do you know what the difference is between qwen3_xml used for the tool-call parser and qwen3_coder used in eugr’s recipe?
That is how tool calls are communicated: it determines in what format your AI engine will emit and accept tool commands/requests.
I am eager to get some urgent work done using OpenClaw, so I haven’t played much with Claude Code + the local model we are talking about here. After a 16k-length run crashed soon after the model started an inference that involved browsing for more info, I set the model to 8k and tested it using Open WebUI. However, once this model config was integrated into OpenClaw, the short token allowance became the culprit that broke the setup.
Has anyone had any experience using Qwen3.5-122B-A10B 4-bit (Intel’s or NVFP4) locally with local OpenClaw? What’s the recipe to make it a success?
You can use my project llm-proxy to set up virtual models: all the same backend/real model, but each vmodel has its own parameters. Set or clamp them, log and inspect the params or even the whole context, use /models in OpenClaw to quickly switch between them, and hot-swap params in the config.
I’ll do a bigger post on here when I resolve a few more niggles.