What does that first command do?
Sorry, I should’ve included more info. It’s mentioned in the docs for clearing buffers/caches (presumably the “buff/cache” column shown in free). I’m not sure if it’s taking up much for you after a reboot (I had not rebooted recently, and it was several GB in size just before I ran it to get the stats above).
As a workaround for debugging purposes, you can flush the buffer cache manually with the following command:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
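For reference, the value written selects what gets dropped (1 = page cache, 2 = dentries and inodes, 3 = both), and running free around the command shows the effect on the buff/cache column:

```shell
free -h                                               # note the buff/cache column
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'  # sync dirty pages first, then drop both caches
free -h                                               # buff/cache should now be much smaller
```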
Thanks for this.
A few of my findings and an unresolved error.
If you encounter OOM on first invocation:
I had to set the environment variable MAX_JOBS to something like 4. By default it spawns as many CUDA kernel compilation jobs as there are cores, which was causing OOM with the model being loaded at the same time.
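Concretely, set the cap in the shell you launch from (4 worked here; tune for your machine):

```shell
# MAX_JOBS caps the number of parallel CUDA kernel compilation jobs.
# The default is one job per core, which together with the model load
# can exhaust memory on first invocation.
export MAX_JOBS=4
```

Then launch vLLM as usual from the same shell so the variable is inherited.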
For native installs under conda:
I needed to install libstdcxx-ng because of a “`GLIBCXX_3.4.30' not found” error:
conda install -c conda-forge libstdcxx-ng
I also needed to upgrade to transformers 5.3.0.
Unfortunately, even after all that I got an error:
CUDA warning: an illegal memory access was encountered (function destroyEvent)
(EngineCore pid=2517026) File "anaconda3/envs/vllm/lib/python3.13/site-packages/vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py", line 72, in apply
(EngineCore pid=2517026) output.copy_(fused_expert_output, non_blocking=True)
with some warnings early on that are probably not related:
(EngineCore pid=2517026) anaconda3/envs/vllm/lib/python3.13/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len(16) < num_heads (64). This may indicate the inputs were passed in head-first format [B, H, T, …] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H,…].
vLLM Version: 0.18.1rc1.dev33+gf85e479e6.cu130
DGX spark, Driver Version: 590.48.01
PyTorch 2.10.0+cu130
In my personal logic and programming tests, the nvfp4 version is much better than int4. For some hard questions, nvfp4 could deliver usable answers while the int4 was totally wrong. I am a bit surprised by the difference.
But I encounter errors when using the marlin nvfp4 GEMM backend: the outputs repeat forever from time to time, so I am stuck with the unstable flashinfer backend.
VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_TEST_FORCE_FP8_MARLIN='1'
VLLM_MARLIN_USE_ATOMIC_ADD='1'
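For anyone wanting to reproduce this, those are environment variables, so export them in the shell before launching vLLM (quoting the 1 is optional in plain sh):

```shell
# Select the marlin NVFP4 GEMM path (variables as in the post above)
export VLLM_NVFP4_GEMM_BACKEND=marlin
export VLLM_TEST_FORCE_FP8_MARLIN=1
export VLLM_MARLIN_USE_ATOMIC_ADD=1
```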
This did bring me to 117GB. :-)
Now... is there a playbook or something I should be running for this? As I had mentioned, I haven’t directly installed vllm on any of my Sparks yet. I’ve only been using sparkrun or eugr’s scripts. I’m not new to setting up python environments and running vLLM, I’m just trying to avoid making my Sparks hodgepodges of tons of different running methods.
Closing the loop for posterity: I was unable to get the massive original model shards to load.
I resharded and pushed sjug/Qwen3.5-122B-A10B-NVFP4-resharded, which is just RedHatAI/Qwen3.5-122B-A10B-NVFP4 with normal-size shards, and it worked immediately with the exact same parameters…
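For anyone curious what resharding involves: with transformers you can reload the checkpoint and re-save it with save_pretrained(max_shard_size=…) to get conventional shard sizes (assumption: that was roughly the approach here). The shard-planning logic underneath is simple greedy grouping by size, something like:

```python
def plan_shards(tensor_sizes, max_shard_bytes):
    """Group tensors (name -> size in bytes) into shards, each kept
    below max_shard_bytes, preserving insertion order."""
    shards, current, current_size = [], [], 0
    for name, size in tensor_sizes.items():
        # Start a new shard if adding this tensor would exceed the cap.
        # (A single tensor larger than the cap still gets its own shard.)
        if current and current_size + size > max_shard_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        shards.append(current)
    return shards

# Toy example: five "tensors" split with a 10-byte shard cap
sizes = {"a": 4, "b": 4, "c": 4, "d": 9, "e": 1}
print(plan_shards(sizes, 10))  # [['a', 'b'], ['c'], ['d', 'e']]
```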
Would you mind sharing exactly how to run what you just shared? Sorry, still getting used to how I should be running things on these Sparks. Not sure whether to use sparkrun, eugr’s script, some custom vllm somehow or what lol.
Try everything, just one at a time. eugr has done very well.
- Clone eugr/spark-vllm-docker (Docker configuration for running vLLM on dual DGX Sparks) from GitHub
- Run whichever recipes you prefer
I don’t think there’s a Qwen3.5-122B recipe yet, but you can make one: just swap the model: with sjug/Qwen3.5-122B-A10B-NVFP4-resharded and it’ll work well.
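If you’d rather skip the recipe layer entirely, a direct vLLM invocation is also possible; a minimal sketch (host/port flags are illustrative, and you’ll likely need Spark-specific tuning on top):

```shell
# Serve the resharded checkpoint with vLLM's OpenAI-compatible server
vllm serve sjug/Qwen3.5-122B-A10B-NVFP4-resharded \
  --host 0.0.0.0 \
  --port 8000
```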
Yeah I have eugr’s script running, just wasn’t sure if I could just plug your model identifier in and go. Will give it a shot, thanks!
What driver and CUDA version are you using with your setup?
I’m on 595.58.03, with vllm 0.18.1rc1.dev33+gf85e479e6.cu130, transformers 5.3.0 and PyTorch 2.10.0+cu130
The flashinfer_cutlass backend errors with CUDA issues
Marlin and vll_cutlass both load fine, but the model outputs a never-ending stream of “!” (which reflects my state of mind too…)
The Intel/Qwen3.5-122B-A10B-int4-AutoRound from eugr works fine.
I’m new to vLLM. I’ve experimented with your recipe and others from eugr to no avail.
cheers!