What does that first command do?
Sorry, I should’ve included more info. It’s mentioned in the docs for clearing buffers/caches (presumably the “buff/cache” column shown in free). I’m not sure if it’s taking up much for you after a reboot (I had not rebooted recently, and it was several GB in size just before I ran it to get the stats above).
As a workaround for debugging purposes, you can flush the buffer cache manually with the following command:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
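For reference, the value written selects what gets dropped (1 = page cache, 2 = dentries and inodes, 3 = both), and running free around the command shows the effect on the buff/cache column:

```shell
free -h                                               # note the buff/cache column
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'  # sync dirty pages first, then drop both caches
free -h                                               # buff/cache should now be much smaller
```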
Thanks for this.
A few of my findings and an unresolved error.
If you encounter OOM on first invocation:
I had to set the environment variable MAX_JOBS to something like 4. By default it spawns as many CUDA kernel compilation jobs as there are cores, which was causing OOM with the model being loaded at the same time.
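Concretely, set the cap in the shell you launch from (4 worked here; tune for your machine):

```shell
# MAX_JOBS caps the number of parallel CUDA kernel compilation jobs.
# The default is one job per core, which together with the model load
# can exhaust memory on first invocation.
export MAX_JOBS=4
```

Then launch vLLM as usual from the same shell so the variable is inherited.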
For native installs under conda:
I needed to install libstdcxx-ng because of a “`GLIBCXX_3.4.30' not found” error:
conda install -c conda-forge libstdcxx-ng
I also needed to upgrade to transformers 5.3.0.
Unfortunately, even after all that I got an error:
CUDA warning: an illegal memory access was encountered (function destroyEvent)
(EngineCore pid=2517026) File "anaconda3/envs/vllm/lib/python3.13/site-packages/vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py", line 72, in apply
(EngineCore pid=2517026) output.copy_(fused_expert_output, non_blocking=True)
with some warnings early on that are probably not related:
(EngineCore pid=2517026) anaconda3/envs/vllm/lib/python3.13/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len(16) < num_heads (64). This may indicate the inputs were passed in head-first format [B, H, T, …] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H,…].
vLLM Version: 0.18.1rc1.dev33+gf85e479e6.cu130
DGX spark, Driver Version: 590.48.01
PyTorch 2.10.0+cu130
In my personal logic and programming tests, the nvfp4 version is much better than int4. For some hard questions, nvfp4 could deliver usable answers while the int4 was totally wrong. I am a bit surprised by the difference.
But I encounter errors when using the marlin nvfp4 GEMM backend: the outputs repeat forever from time to time, so I am stuck with the unstable flashinfer backend.
VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_TEST_FORCE_FP8_MARLIN='1'
VLLM_MARLIN_USE_ATOMIC_ADD='1'
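For anyone wanting to reproduce this, those are environment variables, so export them in the shell before launching vLLM (quoting the 1 is optional in plain sh):

```shell
# Select the marlin NVFP4 GEMM path (variables as in the post above)
export VLLM_NVFP4_GEMM_BACKEND=marlin
export VLLM_TEST_FORCE_FP8_MARLIN=1
export VLLM_MARLIN_USE_ATOMIC_ADD=1
```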
This did bring me to 117GB. :-)
Now... is there a playbook or something I should be running for this? As I had mentioned, I haven’t directly installed vllm on any of my Sparks yet. I’ve only been using sparkrun or eugr’s scripts. I’m not new to setting up python environments and running vLLM, I’m just trying to avoid making my Sparks hodgepodges of tons of different running methods.
Closing the loop for posterity: I was unable to get the massive original model shards to load.
I resharded and pushed sjug/Qwen3.5-122B-A10B-NVFP4-resharded, which is just RedHatAI/Qwen3.5-122B-A10B-NVFP4 with normal-size shards, and it worked immediately with the exact same parameters…
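For anyone curious what resharding involves: with transformers you can reload the checkpoint and re-save it with save_pretrained(max_shard_size=…) to get conventional shard sizes (assumption: that was roughly the approach here). The shard-planning logic underneath is simple greedy grouping by size, something like:

```python
def plan_shards(tensor_sizes, max_shard_bytes):
    """Group tensors (name -> size in bytes) into shards, each kept
    below max_shard_bytes, preserving insertion order."""
    shards, current, current_size = [], [], 0
    for name, size in tensor_sizes.items():
        # Start a new shard if adding this tensor would exceed the cap.
        # (A single tensor larger than the cap still gets its own shard.)
        if current and current_size + size > max_shard_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)
    return shards

# Toy example: five "tensors" split with a 10-byte shard cap
sizes = {"a": 4, "b": 4, "c": 4, "d": 9, "e": 1}
print(plan_shards(sizes, 10))  # [['a', 'b'], ['c'], ['d', 'e']]
```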
Would you mind sharing exactly how to run what you just shared? Sorry, still getting used to how I should be running things on these Sparks. Not sure whether to use sparkrun, eugr’s script, some custom vllm somehow or what lol.
Try everything, just one at a time. eugr has done very well.
- Clone eugr/spark-vllm-docker (Docker configuration for running vLLM on dual DGX Sparks) from GitHub
- Run whichever recipes you prefer
I don’t think there’s a Qwen3.5-122B recipe yet, but you can make one: just swap the model: with sjug/Qwen3.5-122B-A10B-NVFP4-resharded and it’ll work well.
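If you’d rather skip the recipe layer entirely, a direct vLLM invocation is also possible; a minimal sketch (host/port flags are illustrative, and you’ll likely need Spark-specific tuning on top):

```shell
# Serve the resharded checkpoint with vLLM's OpenAI-compatible server
vllm serve sjug/Qwen3.5-122B-A10B-NVFP4-resharded \
  --host 0.0.0.0 \
  --port 8000
```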
Yeah I have eugr’s script running, just wasn’t sure if I could just plug your model identifier in and go. Will give it a shot, thanks!
What driver and CUDA version are you using with your setup?
I’m on 595.58.03, with vllm 0.18.1rc1.dev33+gf85e479e6.cu130, transformers 5.3.0 and PyTorch 2.10.0+cu130
The flashinfer_cutlass backend errors with CUDA issues
Marlin and vll_cutlass both load fine, but the model outputs a never-ending stream of “!” (which reflects my state of mind too…)
The Intel/Qwen3.5-122B-A10B-int4-AutoRound from eugr works fine.
I’m new to vLLM. I’ve experimented with your recipe and others from eugr to no avail.
cheers!