In my opinion, the output quality feels very close to Qwen/Qwen3.5-122B-A10B.
While it’s slower than Qwen3.5-122B-A10B-int4-AutoRound, the output quality is comparable to the FP16 version. Because of that, I’ll likely be using this going forward.
I’d like to know this too - other NVFP4 quants I have tried so far have produced lower quality than tuned INT4 + Autoround quants, but that could be down to lack of proper calibration.
I think, as you mentioned, it may come down to differences in post-quantization calibration.
In particular, RedHatAI seems to apply this kind of calibration quite carefully.
If you look through the repository, you can see that the quantized model still maintains a fairly high level of reconstruction quality.
Compared with Qwen3.5-122B-A10B-int4-AutoRound, one difference I noticed in actual use: my setup calls a wide variety of functions through multiple MCPs, and with Qwen3.5-122B-A10B-int4-AutoRound I ran into function-calling failures fairly often.
However, with RedHatAI/Qwen3.5-122B-A10B-NVFP4, those issues were noticeably less frequent, and in particular, it felt much better at following prompt instructions clearly and consistently.
That said, this is not an objective measurement but a highly subjective evaluation based on my personal usage, so it shouldn’t be taken as conclusive.
More importantly, it doesn’t rule out errors on my part or code-related issues in my agent configuration.
This is really interesting, and I think we need better mechanisms to rapidly benchmark the real-world quality of quants - beyond llama.cpp’s GGUF perplexity tooling and similar infrastructure.
Edit: it’s also worth noting that Intel AutoRound can perform NVFP4 quants and use configurable calibration data, so this is not as simple as just naming the datatype. The compression engine, settings, and calibration data all matter.
Related: I’ve been experimenting with AutoRound quants, and one thing that surprised me: when I went to great lengths to create and curate a coding calibration dataset for Qwen3.5-27b to optimize an int4 quant for coding, it did worse in every way. It was worse at coding than the default calibration, and worse at general tasks too.
I was baffled about this for a while. Tried the same with Omnicoder-9B and it was also quite a lot worse. Then I looked back at Intel’s own page for the int4 quant of Qwen3-Coder-Next, the coding specific model which everyone seems to like, and they claim to have used the default calibration set - not even their registered coding sets (which are both not very good and mostly Python).
So, it’s definitely possible to get a suboptimal quant from Autoround. And some of their guidance regarding using domain specific calibrations seems wrong or not the whole story.
Maybe coding models actually need general text to keep logical structure and understanding of prompts? Maybe we should mix general and domain samples? What’s the right mix - 1:1?
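As a crude way to experiment with that question, you could assemble a mixed calibration set by interleaving general-text and domain samples at a configurable ratio. This is just a sketch of the idea raised above (the function name and the 1:1 default are my own, not anything from AutoRound’s docs):

```python
def mix_calibration(general, domain, ratio=1):
    """Interleave `ratio` general-text samples per domain sample.

    `general` and `domain` are lists of raw text samples; relative
    order within each source is preserved. Leftover general samples
    are appended at the end.
    """
    mixed = []
    gi = 0
    for d in domain:
        mixed.extend(general[gi:gi + ratio])
        gi += ratio
        mixed.append(d)
    mixed.extend(general[gi:])
    return mixed

# Example: a 1:1 mix of two tiny sample lists
gen = ["general text A", "general text B"]
code = ["def f(): pass", "int main() {}"]
print(mix_calibration(gen, code))
```

The resulting list could then be fed to whatever calibration-dataset hook your quantization tool exposes; finding the ratio that actually helps would still take empirical testing.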
My repository already builds vLLM from main, so we get all the new features before they get released officially (as long as it passes the test pipeline).
I actually tried to switch to the 26.02 PyTorch base image a few weeks ago and hit a lot of compatibility problems between vLLM and the PyTorch in that image, so I had to roll back. I’ll give 26.02 another try to see if it works now, but I’m actually considering switching to a more barebones base image, because new vLLM builds fail on 26.01 as well. As an added benefit, it will shave off ~9 GB of stuff we don’t need.
Next, I also tried your community Docker image built against the 26.02 base image (PyTorch Release 26.02 - NVIDIA Docs). The build was successful, but the container did not start, and I got this error:
```
Executing command on head node: vllm serve Intel/Qwen3.5-122B-A10B-int4-AutoRound --trust-remote-code --load-format fastsafetensors --max-model-len 262144 --kv-cache-dtype fp8 --enable-prefix-caching --enable-chunked-prefill --max-num-batched-tokens 32768 --attention-backend flashinfer --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 --chat-template unsloth.jinja --gpu-memory-utilization 0.80 --host 0.0.0.0 --port 8000
Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 4, in <module>
    from vllm.entrypoints.cli.main import main
  File "/usr/local/lib/python3.12/dist-packages/vllm/__init__.py", line 14, in <module>
    import vllm.env_override  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/env_override.py", line 90, in <module>
    from vllm.utils.torch_utils import is_torch_equal
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils/torch_utils.py", line 746, in <module>
    from torch._opaque_base import OpaqueBase
ModuleNotFoundError: No module named 'torch._opaque_base'
Stopping cluster...
```
This seems to be a compatibility issue between PyTorch and vLLM.
FYI, I also came across the NVIDIA DGX Spark vLLM playbook (vLLM for Inference | DGX Spark). It seems to work for some LLMs, but I don’t know how well it is optimized.
That playbook works, but the vLLM version they use lags a few versions behind the most recent release (and main). At least it works now - back in October there were no working vLLM versions at all, which is how the community Docker project got started.
“playbook works, but vLLM version they use is lagging a few versions behind the most recent release”
This is why we are a community and build our own (mainly you) community docker image.
Many, many thanks for all the work you have done for this community. This is usually a job for NVIDIA, who want to sell their products performing as claimed.
gpieceoffice - Are you running vllm directly on the Spark or is this command embedded in a docker somehow? Sorry, still very new to how we’re doing things on the Spark and don’t want to mess up my boxes right out of the gate. The instructions that were included with the Spark gave examples of running vLLM in a docker, not directly installing it.
Btw, my goal here with this model is to try running 4 separate copies on each of the 4 sparks, not have it spread across them. Not sure the best way of going about that right now. Open to input from eugr and others on that goal.
This is not a guide for creating a venv and running vLLM directly on Spark.
What I shared in the post is about building a vLLM Docker image for Spark from the main branch and running the serving process inside that container, along with the necessary flags.
While it is possible to run vLLM in a virtual environment, I personally find using Docker much more convenient when setting up a deployment environment.
As for running four separate instances across four Sparks: the straightforward approach is to run one Docker container on each Spark node and serve the model independently. Each instance will expose its own API endpoint, which you can then integrate into your project as needed.
That’s the approach I’m aware of for running four independent copies across four Spark machines.
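To make the one-container-per-node idea concrete, here is a small sketch that generates the per-node serve command and the resulting API endpoints. The hostnames (`spark-0` … `spark-3`) and the trimmed-down `vllm serve` invocation are placeholders - in practice you would use the full flag set from the post and run each command inside the Docker container on that node:

```python
# Hypothetical node hostnames; adjust to your cluster.
NODES = ["spark-0", "spark-1", "spark-2", "spark-3"]
MODEL = "Intel/Qwen3.5-122B-A10B-int4-AutoRound"
PORT = 8000

def serve_command(model, port):
    # Minimal serve invocation; each node listens on its own address.
    return f"vllm serve {model} --host 0.0.0.0 --port {port}"

# Each node exposes its own OpenAI-compatible endpoint.
endpoints = [f"http://{node}:{PORT}/v1" for node in NODES]

for node, url in zip(NODES, endpoints):
    print(f"{node}: run `{serve_command(MODEL, PORT)}` -> endpoint {url}")
```

Your client or agent framework can then round-robin (or otherwise load-balance) across the four endpoints, since the instances are fully independent.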
To run that model with --max-model-len 262144, please first check that you have at least 116 GiB of free memory when the system is idle (e.g., by running free -h).
If you don’t have that much available, try freeing up memory until you reach that level before running it again.
Also, set --gpu-memory-utilization to at least 0.85, but below 0.9.
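If you want to script that pre-flight check instead of eyeballing `free -h`, a small helper can parse `MemAvailable` out of `/proc/meminfo` and compare it to the 116 GiB threshold from the post (the function name is my own):

```python
def available_gib(meminfo_text):
    """Return MemAvailable from /proc/meminfo text, converted kB -> GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kb = int(line.split()[1])  # value is reported in kB
            return kb / (1024 * 1024)
    raise ValueError("MemAvailable not found")

REQUIRED_GIB = 116  # free-memory threshold from the post for --max-model-len 262144

# Usage on the Spark itself:
#   avail = available_gib(open("/proc/meminfo").read())
#   print("OK" if avail >= REQUIRED_GIB else "free up memory first")
```

Note that `MemAvailable` is the kernel's estimate of memory available without swapping, which is a closer match to "free when idle" than the raw `MemFree` field.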