I expect this to be a very attractive model for single-Spark use, especially at 4 bits with MTP already set up. Given the recent active discussion around ideal 4-bit quants, comparing AutoRound vs. AWQ vs. NVFP4 will be very interesting.
The 120B is 250 GB. Not sure if my HW can handle this.
Currently trying to push the 35B through llm-compressor, but:
File "/data/quant/src/llm-compressor/src/llmcompressor/utils/dev.py", line 15, in <module>
from transformers.modeling_utils import TORCH_INIT_FUNCTIONS
ImportError: cannot import name 'TORCH_INIT_FUNCTIONS' from 'transformers.modeling_utils' (/data/quant/src/llm-compressor/.venv/lib/python3.12/site-packages/transformers/modeling_utils.py). Did you mean: 'ROPE_INIT_FUNCTIONS'?
Even with the latest transformers (5.3.0.dev0). Funny. Red Hat AI managed to quantize the big beast Qwen/Qwen3.5-397B-A17B to FP8.
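For that ImportError, one possible stopgap (a sketch, not a vetted fix) is to inject the missing `TORCH_INIT_FUNCTIONS` name into the module before llm-compressor imports it. The pattern is shown on a stand-in module so it runs standalone; with the real libraries you would patch `transformers.modeling_utils` instead, and the empty dict is a placeholder for the actual mapping of `torch.nn.init` functions:

```python
# Sketch of a monkeypatch workaround for the missing-symbol ImportError.
# A stand-in module is used here so the pattern is self-contained; in
# practice you would patch transformers.modeling_utils before importing
# llmcompressor. The empty dict is a placeholder, not the real mapping.
import sys
import types

# Stand-in for a modeling_utils that no longer exports the symbol.
modeling_utils = types.ModuleType("modeling_utils_standin")
sys.modules["modeling_utils_standin"] = modeling_utils

# Provide the name only if the installed version lacks it.
if not hasattr(modeling_utils, "TORCH_INIT_FUNCTIONS"):
    modeling_utils.TORCH_INIT_FUNCTIONS = {}  # placeholder mapping

# The downstream import that previously failed now succeeds:
from modeling_utils_standin import TORCH_INIT_FUNCTIONS
print(TORCH_INIT_FUNCTIONS)  # {}
```

Pinning transformers to a release that still exports the symbol is the cleaner route if your other dependencies allow it; the shim is only for when you need the dev build and llm-compressor in the same environment.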
I am on my way to test it for the Spark arena and publish the recipe for the @eugr Spark vLLM container.
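For reference, an FP8 recipe of the kind mentioned above typically looks like the sketch below, modeled on llm-compressor's published examples. The model ID is just the one from this thread, and the exact scheme/ignore settings are assumptions, not the published Red Hat recipe:

```python
# Hedged sketch of an FP8-dynamic llm-compressor recipe (settings are
# assumptions; the actual published recipe may differ).
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3.5-397B-A17B"  # placeholder from the thread

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: static per-channel weight scales, dynamic per-token
# activation scales; no calibration dataset required for this scheme.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

oneshot(model=model, recipe=recipe)

save_dir = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```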
It apparently needs some adjustments to run (the latest transformers, indeed).
Does anybody have links to guides that will help me get ready to run this once the quant is available? I had been running qwen3-vl-30b and would love to replace it with this. I'll confess, though: getting 3-vl up and running was a nightmare a few months ago, and I had hoped there were simpler options by now (it just seems like every version of everything I install is wrong for the chipset, etc.).
Something is wrong with that quant. It's 30 GB for a 27B model. It could be that more than half of the weights were left unquantized, or there was some issue during quantization.
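A quick back-of-envelope check supports the suspicion, assuming the quant targets 4 bits per weight:

```python
# Sanity check on the file size: 27B parameters at 4 bits/weight should
# come to roughly 13.5 GB, so a 30 GB file implies ~8.9 bits per
# parameter on average, i.e. a large share of tensors likely stayed
# at 16 bits (or something went wrong in the export).
params = 27e9
expected_gb = params * 4 / 8 / 1e9      # 4-bit weights, 8 bits per byte
implied_bits = 30e9 * 8 / params        # what 30 GB actually works out to
print(expected_gb)              # 13.5
print(round(implied_bits, 1))   # 8.9
```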