Has anyone already tried running Qwen3.5-397B-A17B on two DGX Sparks? In my opinion a very interesting LLM has appeared. New models now pop up like mushrooms after the rain and you clearly can't keep track of them all, but judging by the description of its capabilities, this one looks very impressive. Qwen/Qwen3.5-397B-A17B · Hugging Face
I have only one DGX Spark available, and I expect the second one in March. So I can’t answer my own question)
Right now, only a 2-bit variant (Q2_K_XL) is available. That file is around 148 GB, which already exceeds the memory capacity of a single DGX Spark.
Moreover, GGUF cannot be cleanly split across two Sparks, so even a dual-Spark setup does not really solve the problem.
So realistically, either a 1-bit GGUF release or an AWQ quantization would be required before this becomes feasible on Spark hardware.
This model running at 4-bit (NVFP4) would be a fantastic Nvidia selling point for 2x GB10s if it could swing 12-16 tok/s in a cluster of two with reasonable context. It really pushes the limits of the single-cable NCCL architecture. The benchmarks look nearly toe to toe with current frontier models.
The weights would fit; the open question is how efficient the architecture is with the KV cache.
As for running it on one device, I understand myself that for such a large LLM it's unrealistic and pointless to even try, but with two DGX Sparks it could be interesting)
Yes, I agree. With two DGX Sparks it would certainly be interesting.
Right now we simply don't have the right quantization for this model to run on two Sparks. There isn't an AWQ or NVFP4 version yet (or any other low-bit version compatible with the usual inference stacks; only some MLX conversions and the like).
So the limitation isn’t really the two-Spark setup itself, but rather the fact that there isn’t an optimized quantized version available yet. We’ll probably have to wait for such a version to appear on Hugging Face, and even then, two Sparks will still be just barely adequate.
There is an NVFP4 quant, but it is 240GB, so it's just too big for two Sparks. We'll see how large the AWQ will be, but this model is simply too big. Even if you fit the weights, there will be no memory left for the KV cache. You need 4 nodes to run it comfortably.
You can use llama.cpp in RPC mode with lower GGUF quants, but the performance would be meh.
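For anyone who hasn't used RPC mode before, a minimal sketch of what that looks like across two Sparks (the paths, IP and port 50052 are placeholders, it assumes llama.cpp was built with the RPC backend enabled, e.g. -DGGML_RPC=ON, and the rpc-server flag spellings are worth double-checking against your build):

# On the second Spark: expose its GPU to the network with the llama.cpp RPC server
~/llama.cpp/build/bin/rpc-server -H 0.0.0.0 -p 50052

# On the first Spark: point llama-server at the remote backend and split the layers
~/llama.cpp/build/bin/llama-server \
  --model /path/to/model.gguf \
  --rpc "192.168.200.16:50052" \
  --n-gpu-layers 999 \
  --tensor-split 1,1

Roughly half of the layers end up on the remote node, so generation pays the network round-trip on every token, which is why the performance is meh rather than fast.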
I’ve been playing with the Unsloth Qwen3.5-397B-A17B-UD-Q4_K_XL on my 2x MSI Sparks and getting 11t/s. Using the MXFP4_MOE version I only get about 8t/s.
Thinking mode takes a long time, but non-thinking mode is totally usable. It does good OCR and image captioning.
My run command for non-thinking mode:
GGML_CUDA_GRAPH_OPT=1 \
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
CUDA_SCALE_LAUNCH_QUEUES=4x \
~/llama.cpp/build/bin/llama-server \
--model ${MODEL_PATH} \
--mmproj ${MMPROJ_PATH} \
--alias ${MODEL_FILE} \
--rpc "192.168.200.16:50052" \
--host 0.0.0.0 \
--port 30000 \
--flash-attn on \
--no-mmap \
--n-gpu-layers 999 \
--tensor-split 1,1 \
--threads 9 \
--jinja \
--batch-size 2048 \
--ubatch-size 2048 \
--ctx-size 16384 \
--temp 0.7 \
--top-p 0.8 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 0.0 \
--repeat-penalty 1.0 \
--chat-template-kwargs "{\"enable_thinking\": false}"
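Once it's running, llama-server exposes an OpenAI-compatible API, so a quick sanity check from another machine looks roughly like this (<spark-ip> is a placeholder for the first Spark's address; the host and port match the command above):

curl http://<spark-ip>:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "In two sentences, what is a mixture-of-experts model?"}],
        "max_tokens": 256
      }'

Open WebUI or any other OpenAI-compatible client can point at the same http://<spark-ip>:30000/v1 endpoint.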
hi, has anyone tried running https://huggingface.co/nvidia/Qwen3.5-397B-A17B-NVFP4 ?
It's too big: 250GB just for the weights. That alone is more than the VRAM typically available on a Spark (~115GB, up to ~120GB if you are careful about what else is running on it). And then you still need VRAM for the KV cache and CUDA graphs (unless you want it to slow to a crawl).
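Back-of-the-envelope, using the numbers from this thread (the per-Spark figure is the usable amount mentioned above, not a spec-sheet value):

# Two Sparks, usable unified memory:  2 x ~120 GB ≈ 240 GB
# NVFP4 weights alone:                             ~250 GB
# 250 > 240, so it doesn't fit even before KV cache and CUDA graphs.
# To see what is actually free on a node:
free -g
nvidia-smi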
I thought the FP4 format should be able to reduce VRAM usage to 1/4 or even 1/8, so I'm wondering why that's not the case this time; it seems to go against the purpose of having this format.
It does. The original model is 807GB, but once you get below 8-bit you need to keep some weights unquantized, so good quants will never be exactly 1/4 of that. Maybe there will be some quants that get below 220GB.
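Rough per-parameter arithmetic for why a 4-bit quant of a ~397B-parameter model doesn't land at 807/4 ≈ 200GB (illustrative only; exact sizes depend on which tensors stay at higher precision and on scale-factor overhead):

# BF16, 2 bytes/param:         397e9 * 2   ≈ 794 GB   (~807 GB on disk with extras)
# Pure 4-bit, 0.5 byte/param:  397e9 * 0.5 ≈ 199 GB
# Real low-bit quants keep some tensors (embeddings, routers, lm_head, etc.) at
# higher precision and store per-block scales, which is how the NVFP4 release
# ends up around 240 GB instead of ~200 GB.
python3 -c 'print(397e9 * 2 / 1e9, 397e9 * 0.5 / 1e9)'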
Related: Saw this up there https://huggingface.co/Sehyo/Qwen3.5-122B-A10B-NVFP4
But wouldn’t run. Guessing it needs bleeding edge transformers and I don’t have the energy to futz with it at the moment.
This one is worth a try, only 76GB in size, just uploaded an hour ago… 397B → 122B though…
Need to rebuild the container with --rebuild-vllm - the fix for Qwen3.5 NVFP4 quants just landed in vLLM main a few hours ago (and broke some int4-autoround quants for qwen3-next models, lol, but I have a fix for that).
Perfect for our DGX Spark - Qwen3.5-122B-A10B, a celebration on our street)
Seems like this 122B-A10B version is only the vision-language part (image→text), not the complete model. Can anyone confirm this once successful?
I got Qwen3.5-122B-A10B in FP8 working on two DGX Sparks using this: Shifusen/Qwen3.5-122B-A10B-FP8 · Hugging Face
It required a minor patch to transformers 5.3.0 with my setup, but it seems to run okay now on the latest vllm pull. Vision capability works well too using Open WebUI.
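The exact launch command isn't given above, but here is a minimal sketch of the kind of two-node vLLM setup this implies (the head-node IP, port and tuning flags are placeholders; vLLM spans two machines via a Ray cluster, and this model may still need a very recent vLLM build as mentioned earlier):

# On the first Spark (head node):
ray start --head --port=6379

# On the second Spark, joining the cluster (head-node IP is a placeholder):
ray start --address=192.168.200.15:6379

# Back on the head node, serve with tensor parallelism across the two GPUs:
vllm serve Shifusen/Qwen3.5-122B-A10B-FP8 \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 --port 8000

Open WebUI can then talk to http://<head-spark-ip>:8000/v1 like any other OpenAI-compatible backend.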
I tried the NVFP4 version in that link above, but it didn’t work for me.
What is the token generation rate per second that this model produces?
Here’s what I’m seeing using llama-benchy defaults:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------------------|-------:|----------------:|-------------:|---------------:|---------------:|----------------:|
| Qwen3.5-122B-A10B | pp2048 | 2384.91 ± 11.05 | | 767.26 ± 17.87 | 766.33 ± 17.87 | 767.31 ± 17.87 |
| Qwen3.5-122B-A10B | tg32 | 21.60 ± 0.12 | 22.33 ± 0.47 | | | |