Qwen3.5-35B-A3B on NVIDIA DGX Spark

I just installed Qwen3.5-35B-A3B on an NVIDIA DGX Spark, and I am really impressed. I ran some tests, and so far it is the first LLM I can really use for internal RAG. Documentation and tests done by me here: GitHub - adadrag/qwen3.5-dgx-spark: Complete guide to running Qwen3.5-35B-A3B on NVIDIA DGX Spark (GB10) with vLLM - installation, benchmarks, vision features, and troubleshooting.

Please let me know your thoughts


I like the sheep part. Good work.

Hello,

When you get 30 t/s, is it with the configuration shown in step 3? If so, how do you measure it? Does it include the thinking phase plus the actual answer? For example, when I asked “Can you describe the sun in a few short sentences”, it spent 16 seconds thinking and then wrote a two-line answer in 3 seconds. That is basically one line per second.
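One way to see how a 30 t/s figure can coexist with a 16 s thinking pause is to count the (hidden) thinking tokens in the throughput. The token counts below are assumptions chosen for illustration, not measurements:

```shell
# Back-of-the-envelope throughput check.
# THINK_TOKENS and ANSWER_TOKENS are hypothetical values, not measured.
THINK_TOKENS=480   # tokens emitted during the 16 s thinking phase (assumed)
ANSWER_TOKENS=90   # tokens in the 3 s visible answer (assumed)
THINK_S=16
ANSWER_S=3

TOTAL_TOKENS=$(( THINK_TOKENS + ANSWER_TOKENS ))
TOTAL_S=$(( THINK_S + ANSWER_S ))

echo "overall t/s (thinking included): $(( TOTAL_TOKENS / TOTAL_S ))"
echo "answer-only t/s: $(( ANSWER_TOKENS / ANSWER_S ))"
```

With these assumed counts both rates come out to 30 t/s, so a steady 30 t/s decode rate can still feel slow if most of the tokens go into the thinking phase rather than the visible answer.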

Let me know. Thank you.

ehfortin

Thank you, super helpful! I was working on manually updating nvcr.io/nvidia/vllm:26.02-py3 but kept hitting walls, so thanks again. It takes about 4 minutes to launch on my system.

Looks like the nightly build of vllm/vllm-openai:cu130-nightly is running CUDA 13.0.1, while the NVIDIA vLLM image is running CUDA 13.1.1. Any concerns or improvements there?

Your docker run does not include --kv-cache-dtype fp8, which I’ve seen others recommend for memory savings. Any thoughts or suggestions there?

Thank you.
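For reference, here is a sketch of how that flag would be added. The image tag is the nightly one mentioned above, but the model path, port, and other flags are assumptions, not the guide's actual command:

```shell
# Hypothetical run command: image tag from the thread, everything else assumed.
# --kv-cache-dtype fp8 stores the KV cache in FP8, trading a little accuracy
# for roughly half the KV-cache memory vs. FP16/BF16.
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:cu130-nightly \
  --model Qwen/Qwen3.5-35B-A3B \
  --kv-cache-dtype fp8
```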