Hi - I’m not having much luck getting a LoRA for recent SOTA MoE models to work with vLLM.
I tried this solution: "Bf16 LoRA Fine-Tuning of Qwen3.5-35B-A3B on DGX Spark — No Quantization Required", but the resulting adapter didn't load into vLLM, apparently because of a mismatch between the Unsloth LoRA format, which fuses the expert tensors, and the layout vLLM expects. I also tried training a LoRA on just the attention layers, but the loss stayed very high. And if you try to use Unsloth Studio directly, you run out of memory (at least for Qwen3.5-35B-A3B and Gemma-4-26B-A4B).
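For context, here's a stdlib-only sketch of the kind of key-name mismatch I mean. The tensor names below are illustrative, not the exact checkpoint keys from either tool; the point is just that a fused-expert adapter has one LoRA tensor per MoE layer with no per-expert index, while a per-expert layout has an index after `experts`:

```python
def looks_fused(key: str) -> bool:
    """Heuristic: a fused expert LoRA tensor has no per-expert index
    immediately after the 'experts' segment of its name."""
    parts = key.split(".")
    i = parts.index("experts")
    return not parts[i + 1].isdigit()

# Hypothetical Unsloth-style fused key (one tensor covering all experts)
fused_key = "model.layers.0.mlp.experts.gate_up_proj.lora_A.weight"

# Hypothetical per-expert keys (one tensor per expert)
per_expert_keys = [
    f"model.layers.0.mlp.experts.{i}.gate_proj.lora_A.weight" for i in range(4)
]

print(looks_fused(fused_key))                         # True
print(any(looks_fused(k) for k in per_expert_keys))   # False
```

Dumping the actual key names out of the adapter's safetensors file and eyeballing them against what vLLM logs when it rejects the adapter is how I concluded the formats disagree.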
If anyone has worked out how to do this without going OOM, producing a LoRA that actually affects the model's output and loads into vLLM, I would love to copy your approach.
thanks