Nope, reverting the fastsafetensors patch didn’t help either. It looks like a bug in the custom Triton code this model uses, which only manifests when running in a Ray environment, and possibly only on DGX Spark. And that code gets executed regardless of the attention or MoE backend, too.
I’ll probably open an issue in vLLM for that if I don’t forget - can’t spend any more time on this model now…
I’ve been testing Qwen3-Coder-Next and it works really well overall. In particular, OpenClaw has been very useful: on a single node it honestly feels like it flies.
It would be very interesting to see how it performs on two nodes and how it scales compared to a single Spark setup. If anyone has already tested it in a multi-node configuration, I’d be curious to hear about the results or setup details.
Thanks for the post and the GitHub repo for the vLLM container. I got this model working on a single Spark machine. How do I measure performance in terms of tokens/s? The server logs show varying tokens/s for a task I gave it. Does anyone know what the average tokens/s is for Claude Code with Opus over the API?
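For a rough number, you can time one completion against the OpenAI-compatible endpoint and divide completion tokens by wall time. A minimal sketch, assuming the server listens on localhost:8000; the model name is a placeholder, swap in whatever you passed to vllm serve:

```python
import time
import requests

URL = "http://localhost:8000/v1/completions"
payload = {
    "model": "Qwen/Qwen3-Coder-Next",  # placeholder: use your served model name
    "prompt": "Write a Python function that reverses a linked list.",
    "max_tokens": 512,
    "temperature": 0.0,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=600)
elapsed = time.perf_counter() - start
resp.raise_for_status()

usage = resp.json()["usage"]
# completion_tokens / wall time is an end-to-end figure: it includes prefill.
# For a pure decode rate, use streaming and time only the inter-token gaps.
tps = usage["completion_tokens"] / elapsed
print(f"{usage['completion_tokens']} tokens in {elapsed:.1f}s -> {tps:.1f} tok/s")
```

Note this mixes prefill and decode into one number, which is one reason the figures in the server logs jump around between requests.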
Is it possible to add --load-format to the list of possible overrides in recipes?
I can never get fastsafetensors to work. Is there something I am missing there?
I always get this error: “UserWarning: GDS is not supported in this platform but nogds is False. use nogds=True”
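That warning means the platform has no GPUDirect Storage (GDS) support while fastsafetensors is still trying to use it (nogds defaults to False). If you call the library directly, passing nogds=True opts into the regular read path instead. A sketch based on the fastsafetensors README; the exact API may differ across versions, and the file/tensor names are placeholders:

```python
import torch
from fastsafetensors import SafeTensorsFileLoader, SingleGroup

device = torch.device("cuda:0")
# nogds=True skips GPUDirect Storage and falls back to normal reads with a
# bounce buffer, which is what the warning is asking for on this platform.
loader = SafeTensorsFileLoader(SingleGroup(), device, nogds=True)
loader.add_filenames({0: ["model-00001-of-000XX.safetensors"]})  # placeholder
try:
    fb = loader.copy_files_to_device()
    t = fb.get_tensor("model.embed_tokens.weight")  # placeholder tensor name
    print(t.shape, t.dtype)
    fb.close()
finally:
    loader.close()
```

Whether the nogds knob is exposed when vLLM drives the loader I can’t say, so treat this as a standalone check that fastsafetensors itself works on your machine.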
Also, I owe you a beer. The --eth-if and --ib-if flags saved my life. I have another subnet going between my PC and the Sparks and couldn’t get anything to load. But once I figured out I could plug those variables in, it was a huge weight off my shoulders. Appreciate it!
I’m going to try and see if I can cluster my Threadripper PC with 2x 5090s with the two Sparks. It only has a 100Gb ConnectX-5 though, so I am not sure if it has the juice.