In my opinion, the output quality feels very close to Qwen/Qwen3.5-122B-A10B.
While it’s slower than Qwen3.5-122B-A10B-int4-AutoRound, the output quality is comparable to the FP16 version. Because of that, I’ll likely be using this going forward.
I’d like to know this too - other NVFP4 quants I have tried so far have produced lower quality than tuned INT4 + Autoround quants, but that could be down to lack of proper calibration.
I think, as you mentioned, it may come down to differences in post-quantization calibration.
In particular, RedHatAI seems to apply this kind of calibration quite carefully.
If you look through the repository, you can see that the quantized model still maintains a fairly high level of reconstruction quality.
Compared with Qwen3.5-122B-A10B-int4-AutoRound, one difference I noticed in actual use: my setup calls a wide variety of functions through multiple MCPs, and with Qwen3.5-122B-A10B-int4-AutoRound I ran into function-calling failures fairly often.
However, with RedHatAI/Qwen3.5-122B-A10B-NVFP4, those issues were noticeably less frequent, and in particular, it felt much better at following prompt instructions clearly and consistently.
That said, this is not an objective measurement but a highly subjective evaluation based on my personal usage, so it shouldn’t be taken as conclusive.
More importantly, it doesn’t rule out errors on my part or code-related issues in my agent configuration.
This is really interesting, and I think we need better mechanisms to rapidly benchmark the real-world quality of quants - beyond llama.cpp’s GGUF perplexity tooling and similar infrastructure.
Edit: it’s also worth noting that Intel AutoRound can perform NVFP4 quants and use configurable calibration data, so this is not as simple as just naming the datatype. The compression engine, settings, and calibration data all matter.
Related: I’ve been experimenting with AutoRound quants, and one thing that surprised me: when I went to great lengths to create and curate a coding calibration dataset for Qwen3.5-27b to optimize an int4 quant for coding, it did worse in every way. It was worse at coding than the default calibration, and worse at general tasks too.
I was baffled about this for a while. Tried the same with Omnicoder-9B and it was also quite a lot worse. Then I looked back at Intel’s own page for the int4 quant of Qwen3-Coder-Next, the coding specific model which everyone seems to like, and they claim to have used the default calibration set - not even their registered coding sets (which are both not very good and mostly Python).
So, it’s definitely possible to get a suboptimal quant from Autoround. And some of their guidance regarding using domain specific calibrations seems wrong or not the whole story.
Maybe coding models actually need general text to keep logical structure and understanding of prompts? Maybe we should mix general and domain samples? What’s the right mix - 1:1?
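As a crude way to experiment with that question, you could assemble a mixed calibration set by interleaving general-text and domain samples at a configurable ratio. This is just a sketch of the idea raised above (the function name and the 1:1 default are my own, not anything from AutoRound’s docs):

```python
def mix_calibration(general, domain, ratio=1):
    """Interleave `ratio` general-text samples per domain sample.

    `general` and `domain` are lists of raw text samples; relative
    order within each source is preserved. Leftover general samples
    are appended at the end.
    """
    mixed = []
    gi = 0
    for d in domain:
        mixed.extend(general[gi:gi + ratio])
        gi += ratio
        mixed.append(d)
    mixed.extend(general[gi:])
    return mixed

# Example: a 1:1 mix of two tiny sample lists
gen = ["general text A", "general text B"]
code = ["def f(): pass", "int main() {}"]
print(mix_calibration(gen, code))
```

The resulting list could then be fed to whatever calibration-dataset hook your quantization tool exposes; finding the ratio that actually helps would still take empirical testing.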
My repository already builds vLLM from main, so we get all the new features before they get released officially (as long as it passes the test pipeline).
I actually tried to switch to the 26.02 PyTorch base image a few weeks ago and hit a lot of compatibility problems between vLLM and the PyTorch in that image, so I had to roll back. I’ll give 26.02 another try to see if it works now, but I’m actually considering switching to a more barebones base image, because new vLLM builds fail on 26.01 as well. As an added benefit, it will shave off ~9 GB of stuff we don’t need.
Next, I also tried your community Docker image built against the 26.02 base image (PyTorch Release 26.02 - NVIDIA Docs). The build was successful, but the container did not start, and I got this error:
```
Executing command on head node: vllm serve Intel/Qwen3.5-122B-A10B-int4-AutoRound --trust-remote-code --load-format fastsafetensors --max-model-len 262144 --kv-cache-dtype fp8 --enable-prefix-caching --enable-chunked-prefill --max-num-batched-tokens 32768 --attention-backend flashinfer --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 --chat-template unsloth.jinja --gpu-memory-utilization 0.80 --host 0.0.0.0 --port 8000
Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 4, in <module>
    from vllm.entrypoints.cli.main import main
  File "/usr/local/lib/python3.12/dist-packages/vllm/__init__.py", line 14, in <module>
    import vllm.env_override  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/env_override.py", line 90, in <module>
    from vllm.utils.torch_utils import is_torch_equal
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils/torch_utils.py", line 746, in <module>
    from torch._opaque_base import OpaqueBase
ModuleNotFoundError: No module named 'torch._opaque_base'
Stopping cluster...
```
This seems to be a compatibility issue between PyTorch and vLLM.
FYI, I also came across the NVIDIA DGX Spark vLLM playbook (vLLM for Inference | DGX Spark). It seems to work for some LLMs, but I don’t know how well it is optimized.
That playbook works, but the vLLM version they use lags a few versions behind the most recent release (and main). At least it works now - back in October there were no working vLLM versions at all, which is how the community Docker project got started.
“playbook works, but vLLM version they use is lagging a few versions behind the most recent release”
This is why we are a community and build our own (mainly you) community docker image.
Many, many thanks for all the work you have done for this community. This is usually a job for NVIDIA, who want to sell their products performing as claimed.
gpieceoffice - Are you running vllm directly on the Spark or is this command embedded in a docker somehow? Sorry, still very new to how we’re doing things on the Spark and don’t want to mess up my boxes right out of the gate. The instructions that were included with the Spark gave examples of running vLLM in a docker, not directly installing it.
Btw, my goal here with this model is to try running 4 separate copies on each of the 4 sparks, not have it spread across them. Not sure the best way of going about that right now. Open to input from eugr and others on that goal.
This is not a guide for creating a venv and running vLLM directly on Spark.
What I shared in the post is about building a vLLM Docker image for Spark from the main branch and running the serving process inside that container, along with the necessary flags.
While it is possible to run vLLM in a virtual environment, I personally find using Docker much more convenient when setting up a deployment environment.
As for running four separate instances across four Sparks: the straightforward approach is to run one Docker container on each Spark node and serve the model independently. Each instance will expose its own API endpoint, which you can then integrate into your project as needed.
That’s the approach I’m aware of for running four independent copies across four Spark machines.
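To make the one-container-per-node idea concrete, here is a small sketch that generates the per-node serve command and the resulting API endpoints. The hostnames (`spark-0` … `spark-3`) and the trimmed-down `vllm serve` invocation are placeholders - in practice you would use the full flag set from the post and run each command inside the Docker container on that node:

```python
# Hypothetical node hostnames; adjust to your cluster.
NODES = ["spark-0", "spark-1", "spark-2", "spark-3"]
MODEL = "Intel/Qwen3.5-122B-A10B-int4-AutoRound"
PORT = 8000

def serve_command(model, port):
    # Minimal serve invocation; each node listens on its own address.
    return f"vllm serve {model} --host 0.0.0.0 --port {port}"

# Each node exposes its own OpenAI-compatible endpoint.
endpoints = [f"http://{node}:{PORT}/v1" for node in NODES]

for node, url in zip(NODES, endpoints):
    print(f"{node}: run `{serve_command(MODEL, PORT)}` -> endpoint {url}")
```

Your client or agent framework can then round-robin (or otherwise load-balance) across the four endpoints, since the instances are fully independent.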
To run that model with --max-model-len 262144, please first check that you have at least 116 GiB of free memory when the system is idle (e.g., by running free -h).
If you don’t have that much available, try freeing up memory until you reach that level before running it again.
Also, set --gpu-memory-utilization to at least 0.85, but below 0.9.
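If you want to script that pre-flight check instead of eyeballing `free -h`, a small helper can parse `MemAvailable` out of `/proc/meminfo` and compare it to the 116 GiB threshold from the post (the function name is my own):

```python
def available_gib(meminfo_text):
    """Return MemAvailable from /proc/meminfo text, converted kB -> GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kb = int(line.split()[1])  # value is reported in kB
            return kb / (1024 * 1024)
    raise ValueError("MemAvailable not found")

REQUIRED_GIB = 116  # free-memory threshold from the post for --max-model-len 262144

# Usage on the Spark itself:
#   avail = available_gib(open("/proc/meminfo").read())
#   print("OK" if avail >= REQUIRED_GIB else "free up memory first")
```

Note that `MemAvailable` is the kernel's estimate of memory available without swapping, which is a closer match to "free when idle" than the raw `MemFree` field.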