fastsafetensors as a load format won't work for AutoRound; try auto instead. The underlying format is GPTQ.
Not yet, just the models I was starting with, so I can compare. But yes: given AutoRound's small quant footprint and the speedups from this effort, GLM 5.0 and Qwen3.5 are the next candidates.
Still no success, but @flash3's vllm_next is working, so I will stick with that.
This works for me on the latest build of spark-vllm-docker:
$ ./launch-cluster.sh \
--apply-mod mods/fix-qwen3-next-autoround \
-e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
-e HF_HUB_OFFLINE=1 \
--solo \
exec vllm serve Intel/Qwen3-Coder-Next-int4-AutoRound \
--max-model-len 262144 \
--gpu-memory-utilization 0.85 \
--port 8888 --host 0.0.0.0 \
--load-format fastsafetensors \
--dtype bfloat16 \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
There is a weird problem when loading the model though…
The tokenizer you are loading from '/root/.cache/huggingface/hub/models--Intel--Qwen3-Coder-Next-int4-AutoRound/snapshots/79c8a6bb73b7946095d7ece1f8fc68535f7c9ab8' has an incorrect regex pattern (see mistralai/Mistral-Small-3.1-24B-Instruct-2503 · regex pattern). This will lead to incorrect tokenization. You should set the fix_mistral_regex=True flag when loading this tokenizer to fix this issue.
itshappening.gif
It works just fine, at least with my builds.
Did you rebuild with the --rebuild-vllm flag? If not, it will just reuse the compiled wheels.
Can confirm, it also runs fine for me :) probably the rebuild, as eugr mentioned.
./build-and-copy.sh -t vllm-node-tf5 --tf5 --rebuild-vllm
vLLM build failed — restoring previous wheels...
Missed this message. I did a docker purge and am running it again.
Compiled and runs now. Thanks @flash3 for getting me started with AutoRound. I wanted to stick with supporting the community Docker so our efforts aren't fragmented. Nevertheless, I am still interested in following your progress with Atlas.
I must commend @eugr on spark-vllm-docker: it loads faster, runs reliably, and is super impressive at hiding all the frustrations and gotchas. Thanks so much for this.
@AoE's script worked. My recipe has a weird problem: it loads, but the tool calls don't appear to work. I have diverted a lot of time to this; I will keep chipping away at it and try to contribute some benchmarks, but I really need to get on with professional work. This is such a deep rabbit hole to get sucked into.
Is anyone interested in my efforts to quantise Qwen3.5 to INT4 AutoRound? It took 25 hours on my single Spark. The machine ran at 95-100% the whole time, with no overheating or complaints, and it completed without any problems. I tried to upload to HF, but my connection kept stalling out, and when Intel released theirs it seemed redundant to continue. I chalked it up as a good burn-in procedure for my Spark hardware and learnt a bunch along the way. I think a single Spark is better suited to RL/quant work on models in the 8B range.
I’m surprised it took that long. I’m quantizing ~30B dense models to int4 in under 4 hours on my Spark with under half of the RAM used, but I haven’t yet tried Qwen3.5 MoE.
Looking at your script from earlier in the thread, did you enable torch compile? It looks like enable_torch_compile wasn't set, and it defaults to False. Enabling it dramatically improves quantization speed.
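For reference, a minimal sketch of how that setting would be passed; this assumes the `auto_round` Python API accepts an `enable_torch_compile` keyword as recent releases do (check your installed version's signature), and the actual quantization call is shown only as a comment:

```python
# Sketch: AutoRound settings with torch compile turned on.
# Assumption: these keyword names match your installed auto_round version.
quant_config = dict(
    bits=4,
    group_size=128,
    enable_torch_compile=True,  # defaults to False; enabling it speeds up tuning
)

# Not executed here (requires a loaded model and tokenizer):
# from auto_round import AutoRound
# AutoRound(model, tokenizer, **quant_config).quantize_and_save("output-dir")
```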
I made a minimal Docker container for AutoRound work based on the NGC PyTorch container, if you want to compare notes.
Has anyone here tried the EssentialAI/rnj-1-instruct model? I wanted to run it in parallel, but the performance is so much worse than qwen3.5-122. Has anyone tried to run a model like this?
I am trying to run the new Nemotron 3 Super on DGX Spark, but vLLM has issues with the quantization method, since the model's hf_quant.json declares a mixed scheme. Any suggestions on how to get it running?
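For anyone debugging this, here is a small sketch of how one might spot the problem before handing the checkpoint to vLLM. The field names (`quantization`, `quant_algo`) follow the ModelOpt-style hf_quant.json layout, but treat them and the `MIXED_PRECISION` marker as assumptions and check against your actual file:

```python
import json

# Hypothetical checker: report whether an hf_quant.json declares a single
# quant algorithm or a mixed/per-layer scheme (which vLLM may reject).
def describe_quant_config(text: str) -> str:
    cfg = json.loads(text).get("quantization", {})
    algo = cfg.get("quant_algo")
    if algo is None:
        return "no quant_algo declared"
    if algo == "MIXED_PRECISION" or "quantized_layers" in cfg:
        return f"mixed scheme ({algo}); vLLM support is limited"
    return f"uniform scheme: {algo}"

# Example with a made-up uniform config:
sample = '{"quantization": {"quant_algo": "NVFP4", "kv_cache_quant_algo": "FP8"}}'
print(describe_quant_config(sample))  # uniform scheme: NVFP4
```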
I've just pushed a recipe to our community Docker: eugr/spark-vllm-docker on GitHub (Docker configuration for running vLLM on dual DGX Sparks).
… still stuck on Qwen3.5 at tp=2 with REAP. It's format hell, and vLLM is once again in the lead role: the patient.
| # | Format | Example key | Where seen |
|---|---|---|---|
| 1 | Standard per-expert | `experts.{id}.gate_proj.qweight` | 122B AutoRound, vLLM native |
| 2 | Stacked fused 3D | `experts.gate_up_proj` [E, in, out] | BF16 REAP 262B source |
| 3 | Per-expert fused 2D | `experts.gate_up_proj.{id}.qweight` | INT4 REAP 262B AutoRound |
| me | con-fused | … | |
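As a sanity check when juggling these layouts, something like the following can classify which format a checkpoint's key names use. This is a hypothetical helper, not anything vLLM ships: the regex patterns mirror the example keys in the table above, and real checkpoints may add extra prefixes or suffixes.

```python
import re

# Hypothetical classifier for the three MoE expert-weight layouts in the table.
# Order matters: the fused 2D pattern is checked before the stacked 3D one.
PATTERNS = [
    ("per_expert_fused_2d", re.compile(r"experts\.(gate_up_proj|down_proj)\.\d+\.(qweight|weight)$")),
    ("stacked_fused_3d",    re.compile(r"experts\.(gate_up_proj|down_proj)$")),
    ("standard_per_expert", re.compile(r"experts\.\d+\.(gate|up|down)_proj\.(qweight|weight)$")),
]

def classify_expert_layout(keys):
    """Return the first layout whose pattern matches any checkpoint key."""
    for name, pat in PATTERNS:
        if any(pat.search(k) for k in keys):
            return name
    return "unknown"

print(classify_expert_layout(["model.layers.0.mlp.experts.0.gate_proj.qweight"]))
# standard_per_expert
```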