RedHatAI/Qwen3.5-122B-A10B-NVFP4 seems to be the best option for a single Spark

The model RedHatAI/Qwen3.5-122B-A10B-NVFP4 has been released.

As of March 17, 2026, I personally think this is the most well-balanced model for a single Spark setup.

I built vLLM from the main branch, and you’ll likely need Transformers version 5 or higher.

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| /workspace/Model/Qwen3.5-122B-A10B-NVFP4 | pp2048 | 1849.48 ± 329.50 | | 1040.11 ± 203.70 | 1036.52 ± 203.70 | 1040.16 ± 203.71 |
| /workspace/Model/Qwen3.5-122B-A10B-NVFP4 | tg32 | 16.15 ± 0.04 | 17.00 ± 0.00 | | | |
| /workspace/Model/Qwen3.5-122B-A10B-NVFP4 | pp2048 @ d4096 | 2452.08 ± 109.18 | | 2330.78 ± 90.53 | 2327.20 ± 90.53 | 2330.85 ± 90.54 |
| /workspace/Model/Qwen3.5-122B-A10B-NVFP4 | tg32 @ d4096 | 16.00 ± 0.01 | 16.67 ± 0.47 | | | |
| /workspace/Model/Qwen3.5-122B-A10B-NVFP4 | pp2048 @ d8192 | 2587.14 ± 24.05 | | 3602.87 ± 37.24 | 3599.29 ± 37.24 | 3602.94 ± 37.24 |
| /workspace/Model/Qwen3.5-122B-A10B-NVFP4 | tg32 @ d8192 | 15.90 ± 0.02 | 16.33 ± 0.47 | | | |
| /workspace/Model/Qwen3.5-122B-A10B-NVFP4 | pp2048 @ d16384 | 2624.68 ± 7.74 | | 6333.91 ± 35.43 | 6330.33 ± 35.43 | 6333.98 ± 35.43 |
| /workspace/Model/Qwen3.5-122B-A10B-NVFP4 | tg32 @ d16384 | 15.66 ± 0.06 | 16.00 ± 0.00 | | | |

Below is the command I used:

```
vllm serve /workspace/Model/Qwen3.5-122B-A10B-NVFP4 \
  --host 0.0.0.0 --port 8000 \
  --reasoning-parser qwen3 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --max-model-len 262144 \
  --moe_backend flashinfer_cutlass \
  --max-num-batched-tokens 8192 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --chat-template /workspace/spark-vllm-docker/mods/fix-qwen3.5-chat-template/chat_template.jinja \
  --max-num-seqs 100 \
  --trust-remote-code
```
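Once served this way, vLLM exposes an OpenAI-compatible API. A minimal standard-library-only sketch of a chat request against it (the base URL matches the `--host`/`--port` flags above; the prompt and `max_tokens` value are just placeholders):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # matches --host/--port in the serve command
MODEL = "/workspace/Model/Qwen3.5-122B-A10B-NVFP4"

def build_chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """POST a chat completion request and return the assistant's reply text."""
    payload = build_chat_payload(MODEL, prompt)
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# chat("Say hello in one word.")  # requires the server above to be running
```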

In my opinion, the output quality feels very close to Qwen/Qwen3.5-122B-A10B.

While it’s slower than Qwen3.5-122B-A10B-int4-AutoRound, the output quality is comparable to the FP16 version. Because of that, I’ll likely be using this going forward.

I’d recommend others give it a try as well.


Thank you for sharing - I will move this to GB10 projects.

Does RedHatAI/Qwen3.5-122B-A10B-NVFP4 have better output quality than Qwen3.5-122B-A10B-int4-AutoRound?

Have you tried running it with the Marlin NVFP4 GEMM backend? You will likely get better performance…
You can look into the Nemotron3-Super recipe for details.

I’d like to know this too - other NVFP4 quants I have tried so far have produced lower quality than tuned INT4 + Autoround quants, but that could be down to lack of proper calibration.


I think, as you mentioned, it may come down to differences in post-quantization calibration.
In particular, RedHatAI seems to apply this kind of calibration quite carefully.
If you look through the repository, you can see that it still maintains a fairly high level of reconstruction quality even after quantization.

Compared with Qwen3.5-122B-A10B-int4-AutoRound, one difference I noticed in actual use is that my setup is designed to call a wide variety of functions through multiple MCPs. With Qwen3.5-122B-A10B-int4-AutoRound, I experienced function-calling failures fairly often.
However, with RedHatAI/Qwen3.5-122B-A10B-NVFP4, those issues were noticeably less frequent, and in particular, it felt much better at following prompt instructions clearly and consistently.

However, this is not an objective measurement but a subjective impression based on my personal usage, so it’s difficult to accept as conclusive. It also doesn’t rule out errors on my part or code-related issues in my agent configuration.

@eugr

I tried running it with the Marlin NVFP4 GEMM backend as well, following the recipe you shared.

There was some speed improvement, but the gain was less than 1 token per second, so the overall effect did not seem very significant. Here are the results:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| /workspace/Model/Qwen3.5-122B-A10B-NVFP4 | pp2048 | 1656.49 ± 206.39 | | 1181.01 ± 165.14 | 1178.22 ± 165.14 | 1181.09 ± 165.14 |
| /workspace/Model/Qwen3.5-122B-A10B-NVFP4 | tg32 | 16.83 ± 0.02 | 17.67 ± 0.47 | | | |
| /workspace/Model/Qwen3.5-122B-A10B-NVFP4 | pp2048 @ d4096 | 2141.83 ± 111.62 | | 2641.53 ± 159.40 | 2638.74 ± 159.40 | 2641.62 ± 159.40 |
| /workspace/Model/Qwen3.5-122B-A10B-NVFP4 | tg32 @ d4096 | 16.70 ± 0.02 | 17.00 ± 0.00 | | | |
| /workspace/Model/Qwen3.5-122B-A10B-NVFP4 | pp2048 @ d8192 | 2222.72 ± 20.43 | | 4186.01 ± 29.93 | 4183.22 ± 29.93 | 4186.09 ± 29.93 |
| /workspace/Model/Qwen3.5-122B-A10B-NVFP4 | tg32 @ d8192 | 16.62 ± 0.05 | 17.00 ± 0.00 | | | |
| /workspace/Model/Qwen3.5-122B-A10B-NVFP4 | pp2048 @ d16384 | 2239.47 ± 7.24 | | 7530.72 ± 47.32 | 7527.93 ± 47.32 | 7530.79 ± 47.33 |
| /workspace/Model/Qwen3.5-122B-A10B-NVFP4 | tg32 @ d16384 | 16.38 ± 0.02 | 17.00 ± 0.00 | | | |

This is really interesting, and I think we need better mechanisms to rapidly benchmark the real-world quality of quants - outside the llama.cpp GGUF perplexity and similar infrastructure.

Edit: it’s also worth noting that Intel Autoround can perform NVFP4 quants, and use configurable calibration data, so this is not as simple as just saying the datatype. The compression engine, settings, and calibration data are all important.
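As a rough sketch of what such a mechanism could look like for agent-style workloads (this is an illustration I made up, not an established benchmark): replay a fixed set of tool-calling prompts against each quant and count how often the raw argument strings actually parse as JSON objects:

```python
import json

def tool_args_parse_rate(outputs):
    """Fraction of raw model outputs whose tool-call arguments parse as a
    valid JSON object. `outputs` is a list of strings that are supposed to
    be JSON argument objects emitted by the model."""
    ok = 0
    for raw in outputs:
        try:
            parsed = json.loads(raw)
            if isinstance(parsed, dict):  # tool arguments should be an object
                ok += 1
        except json.JSONDecodeError:
            pass  # malformed output counts as a failure
    return ok / len(outputs) if outputs else 0.0
```

A quant that frequently emits malformed tool arguments would score visibly lower here, independent of perplexity.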


Related: I’ve been experimenting with AutoRound quants, and one thing that surprised me was that when I went to great lengths to create and curate a coding calibration dataset for Qwen3.5-27b to optimize an int4 quant for coding, it did worse in every way. It was worse at coding than the default calibration, and worse at general tasks too.

I was baffled about this for a while. Tried the same with Omnicoder-9B and it was also quite a lot worse. Then I looked back at Intel’s own page for the int4 quant of Qwen3-Coder-Next, the coding specific model which everyone seems to like, and they claim to have used the default calibration set - not even their registered coding sets (which are both not very good and mostly Python).

So, it’s definitely possible to get a suboptimal quant from AutoRound. And some of their guidance regarding domain-specific calibration seems wrong, or at least not the whole story.

Maybe coding models actually need general text to keep logical structure and understanding of prompts? Maybe we should mix general and domain samples? What’s the right mix - 1:1?

If anyone has more insights on this I’m all ears!
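For what it’s worth, one low-effort way to experiment with the mix question is to interleave general and domain samples at a configurable ratio before handing the list to the quantizer. A hypothetical sketch (the ratio and sample lists are placeholders, not a recommendation):

```python
def mix_calibration(general, domain, domain_ratio=0.5):
    """Interleave two calibration sample lists so that roughly
    `domain_ratio` of the output comes from the domain set.
    Stops as soon as either source runs out."""
    if not 0.0 <= domain_ratio <= 1.0:
        raise ValueError("domain_ratio must be in [0, 1]")
    mixed, acc = [], 0.0
    g, d = iter(general), iter(domain)
    try:
        while True:
            acc += domain_ratio
            if acc >= 1.0:          # time for a domain sample
                acc -= 1.0
                mixed.append(next(d))
            else:                   # otherwise take a general sample
                mixed.append(next(g))
    except StopIteration:
        return mixed
```

With `domain_ratio=0.5` this produces a strict 1:1 interleave; other ratios let you sweep the mix while keeping everything else about the quantization run fixed.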

The main benefit is stability - NVFP4 quants with flashinfer are known to crash (mostly for Nemotron models though).


@gpieceoffice I can’t get RedHatAI/Qwen3.5-122B-A10B-NVFP4 to not OOM for some reason. (single spark)

Hi @eugr,
I just saw that vLLM v0.18.0 was released a few hours ago, along with the PyTorch NGC 26.02-py3 container, which might benefit your spark-vllm Docker image :-).

My repository already builds vLLM from main, so we get all the new features before they get released officially (as long as it passes the test pipeline).

I actually tried to switch to the 26.02 PyTorch base image a few weeks ago, and had a lot of compatibility problems between vLLM and the PyTorch build in that repo, so I had to roll back. I’ll give 26.02 another try to see if it works now, but I’m actually considering switching back to a more barebones base image, because new vLLM builds fail on 26.01 as well. As an added benefit, it will shave ~9 GB of stuff that we don’t need.


Yes, I agree.

Today I had some time to rebuild the community Docker image (GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks) as is and got build errors.

Next, I also rebuilt your community Docker image against the PyTorch Release 26.02 base image. The build was successful, but the container did not start, and I got the following error:

```
Executing command on head node: vllm serve Intel/Qwen3.5-122B-A10B-int4-AutoRound --trust-remote-code --load-format fastsafetensors --max-model-len 262144 --kv-cache-dtype fp8 --enable-prefix-caching --enable-chunked-prefill --max-num-batched-tokens 32768 --attention-backend flashinfer --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 --chat-template unsloth.jinja --gpu-memory-utilization 0.80 --host 0.0.0.0 --port 8000
Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 4, in <module>
    from vllm.entrypoints.cli.main import main
  File "/usr/local/lib/python3.12/dist-packages/vllm/__init__.py", line 14, in <module>
    import vllm.env_override  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/env_override.py", line 90, in <module>
    from vllm.utils.torch_utils import is_torch_equal
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils/torch_utils.py", line 746, in <module>
    from torch._opaque_base import OpaqueBase
ModuleNotFoundError: No module named 'torch._opaque_base'

Stopping cluster…
```

This seems to be a compatibility issue between PyTorch and vLLM.

FYI, I also came across the NVIDIA DGX Spark vLLM playbook (vLLM for Inference | DGX Spark). It seems to work for some LLMs, but I do not know how well it is optimized.

That playbook works, but the vLLM version they use lags a few versions behind the most recent release (and main). But at least it works now; back in October there were no working vLLM versions at all - that’s how the community Docker project was started.


> playbook works, but vLLM version they use is lagging a few versions behind the most recent release

This is why we are a community and build our own (mainly you) community docker image.

Many, many thanks for all the work you have done for this community. This is really NVIDIA’s job; they are the ones who want to sell their products at the performance they claim.

@gpieceoffice - Are you running vLLM directly on the Spark, or is this command embedded in a Docker container somehow? Sorry, still very new to how we’re doing things on the Spark and don’t want to mess up my boxes right out of the gate. The instructions that were included with the Spark gave examples of running vLLM in a Docker container, not installing it directly.

Btw, my goal here with this model is to try running 4 separate copies on each of the 4 sparks, not have it spread across them. Not sure the best way of going about that right now. Open to input from eugr and others on that goal.

This is not a guide for creating a venv and running vLLM directly on Spark.

What I shared in the post is about building a vLLM Docker image for Spark from the main branch and running the serving process inside that container, along with the necessary flags.

While it is possible to run vLLM in a virtual environment, I personally find using Docker much more convenient when setting up a deployment environment.

As for running four separate instances across four Sparks: the straightforward approach is to run one Docker container on each Spark node and serve the model independently. Each instance will expose its own API endpoint, which you can then integrate into your project as needed.

That’s the approach I’m aware of for running four independent copies across four Spark machines.
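For what it’s worth, a minimal client-side sketch of spreading requests across four independent instances, one per node (the `spark-N` hostnames are hypothetical placeholders for your actual node addresses):

```python
import itertools

# Hypothetical endpoints: one vLLM container per Spark node,
# each serving its own copy of the model on port 8000.
ENDPOINTS = [
    "http://spark-1:8000/v1",
    "http://spark-2:8000/v1",
    "http://spark-3:8000/v1",
    "http://spark-4:8000/v1",
]

_cycle = itertools.cycle(ENDPOINTS)

def next_endpoint() -> str:
    """Pick the next instance in round-robin order."""
    return next(_cycle)
```

Each request then goes to `next_endpoint()`; since the instances are fully independent, there is no shared state to coordinate, though a real setup would also want health checks before dispatching.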


To run that model with --max-model-len 262144, please first check that you have at least 116 GiB of free memory when the system is idle (e.g., by running `free -h`).

If you don’t have that much available, try freeing up memory until you reach that level before running it again.

Also, set --gpu-memory-utilization to at least 0.85, but below 0.9.
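The memory check can also be scripted. A small sketch that parses `MemAvailable` out of `/proc/meminfo` and compares it against a threshold (the 116 GiB figure is just the number from this post, not an official requirement):

```python
def available_gib(meminfo_text: str) -> float:
    """Parse MemAvailable (reported in kB) out of /proc/meminfo text
    and convert it to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kb = float(line.split()[1])
            return kb / (1024 * 1024)
    raise ValueError("MemAvailable not found")

def enough_memory(threshold_gib: float = 116.0) -> bool:
    """Return True if the system currently has at least `threshold_gib`
    of available memory. Linux-only (reads /proc/meminfo)."""
    with open("/proc/meminfo") as f:
        return available_gib(f.read()) >= threshold_gib
```

Calling `enough_memory()` before launching the server gives an early, scriptable failure instead of an OOM partway through model load.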


On fresh boot I only have 115GB free. Confirmed across all 4 Sparks. I haven’t installed anything besides sparkrun and eugr’s repo/scripts.

What happens if you run `sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'`?

For reference, here are the numbers I get (with only a small dashboard running):

```
danny@toad:~$ sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

danny@toad:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           121Gi       2.4Gi       119Gi       1.0Mi       261Mi       119Gi
Swap:           15Gi        67Mi        15Gi

danny@toad:~$ free -h --si
               total        used        free      shared  buff/cache   available
Mem:            130G        2.6G        128G        1.0M        336M        128G
Swap:            17G         71M         17G
```

Though I do have it set to use the multi-user target (so it doesn’t load any desktop), because I only use it over SSH.
