Qwen/Qwen3.5-122B-A10B - Alibaba/Qwen thought about us... :-D

Seems Qwen also did something for us (single) Spark users this time:

Interesting sizes that have popped up so far: 122B-A10B, 35B-A3B, 27B.

Will try my luck with llm-compressor again to get the 122B squeezed into one Spark.

I assume GGUFs will be available shortly from unsloth & Co.

2 Likes

I expect this to be a very attractive model for single Spark use. Especially at 4 bits, with MTP already set up. With the recent active discussion around ideal 4 bit quants, Autoround vs AWQ vs NVFP4 will be very interesting to explore.
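As a rough illustration of why MTP is attractive here: with k draft tokens per step and a per-token acceptance probability p, speculative decoding yields about (1 - p^(k+1)) / (1 - p) tokens per target-model pass, under the usual simplification that acceptances are independent. A minimal sketch (the acceptance probabilities are made-up illustrations, not benchmarks):

```python
# Expected tokens emitted per target-model forward pass with k speculative
# (MTP-style) draft tokens and per-token acceptance probability p,
# assuming independent acceptances: (1 - p**(k+1)) / (1 - p).
def expected_tokens(p: float, k: int) -> float:
    return (1 - p ** (k + 1)) / (1 - p)

for p in (0.6, 0.8):
    print(f"p={p}: {expected_tokens(p, 2):.2f} tokens per pass")
# p=0.6: 1.96 tokens per pass
# p=0.8: 2.44 tokens per pass
```

So with two speculative tokens, even a modest acceptance rate roughly doubles the tokens you get per pass of the big model.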

1 Like

Indeed, the GGUF fine-tuned model is already in.

I’m starting to test the 35B model for inclusion in sparkrun’s default recipes. Been waiting for the smaller size Qwen3.5 models to drop!

The 122B one should be a good replacement for gpt-oss-120b, but we need to wait for suitable quants.

3 Likes

The 120B is 250 GB. Not sure if my HW can handle this.
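A quick back-of-envelope check on that figure (pure arithmetic; it ignores KV cache, activations, and runtime overhead, and the 4.25 bits/weight for 4-bit formats is my assumption to account for group scales):

```python
PARAMS = 122e9  # approximate total parameter count of the 122B-A10B

def weights_gb(bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("BF16", 16), ("FP8", 8), ("~4-bit + scales", 4.25)]:
    print(f"{name:>15}: {weights_gb(bits):6.1f} GB")
# BF16 comes out around 244 GB, consistent with the ~250 GB checkpoint;
# FP8 (~122 GB) or a 4-bit quant (~65 GB) is what fits in 128 GB.
```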

Currently trying to push the 35B through llm-compressor, but:

  File "/data/quant/src/llm-compressor/src/llmcompressor/utils/dev.py", line 15, in <module>
    from transformers.modeling_utils import TORCH_INIT_FUNCTIONS
ImportError: cannot import name 'TORCH_INIT_FUNCTIONS' from 'transformers.modeling_utils' (/data/quant/src/llm-compressor/.venv/lib/python3.12/site-packages/transformers/modeling_utils.py). Did you mean: 'ROPE_INIT_FUNCTIONS'?

Even with the latest transformers (5.3.0.dev0). Funny. Red Hat AI managed to quantize the big beast Qwen/Qwen3.5-397B-A17B to FP8.

EDIT: No Transformers v5 support yet.
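If the breakage really is just that constant being renamed/removed in v5, a compatibility shim along these lines might tide things over until llm-compressor gains v5 support. Purely a sketch under my assumptions: `TORCH_INIT_FUNCTIONS` existed in Transformers 4.x as a dict of torch init functions, and I'm assuming nothing downstream actually needs its contents:

```python
# Compatibility sketch: tolerate the constant missing on Transformers v5
# (in 4.x it was a dict mapping init-function names to torch callables).
try:
    from transformers.modeling_utils import TORCH_INIT_FUNCTIONS
except ImportError:  # renamed/removed upstream, or transformers absent
    TORCH_INIT_FUNCTIONS = {}
```

For llm-compressor itself you would have to inject the fallback into `transformers.modeling_utils` before importing `llmcompressor`; pinning `transformers<5` is the less hacky route.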

try mxfp6?

I’d wait for the quants, at least FP8 - we should get a native one from Qwen soon, I believe.

1 Like

I asked the Red Hat AI team what they did with the 397B.

I assume Alibaba’s Qwen team will also provide an FP8, as they did for that beast.


I am on my way to test it for spark arena & publish the recipe for the @eugr Spark VLLM container.
Apparently it needs some adjustments to run (the latest transformers, indeed).

3 Likes

Let’s go! @raphael.amorim

3 Likes

LET’S GOOOOOOOOO 😂

2 Likes

Day 0 😂

1 Like

I can’t wait for NVFP4 quant!

1 Like

Does anybody have links to guides that will help me get ready to run this once the quant is available? I had been running qwen3-vl-30b and would love to replace it with this, but I’ll confess, getting 3-vl up and running was a nightmare a few months ago, and I’d hoped there were simpler options by now (it just seems like every version of everything I install is wrong for the chipset, etc.).

cyanwiki already has the first quant up:

You could try that one:

vllm serve Qwen/Qwen3.5-27B --port 8000  --max-model-len 262144 --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
2 Likes

You can use our community docker builds: GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks

1 Like

Something is wrong with that quant. It’s 30 GB for a 27B model. It could be that more than half of the weights were left unquantized, or there was some issue during quantization.
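One quick way to sanity-check a quant is to back out the effective bits per weight from the file size (rough math; it ignores metadata and embeddings):

```python
def bits_per_weight(size_gb: float, params_b: float) -> float:
    """Effective bits per weight implied by checkpoint size on disk."""
    return size_gb * 1e9 * 8 / (params_b * 1e9)

print(f"{bits_per_weight(30, 27):.1f} bits/weight")
# ~8.9 bits/weight: that is 8-bit territory, not the ~4.5 you'd expect
# from a 4-bit quant with scales (27B at 4 bits would be ~13.5 GB).
```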

Thanks, I’ll try out the docker approach; that should help with stability. I’d imagined things had gotten a bit more mature by now.

This is the dense model, isn’t it? Not the MoE model. Would love an FP8 or NVFP4 quant of Qwen3.5-35B-A3B.