Fine-tuning Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 with QLoRA on DGX Spark

You can try to merge the LoRA into a full Qwen2.5-VL checkpoint and serve that one instead.

On a HF/PEFT stack, merge the LoRA into the base model weights to produce a new, standalone Qwen2.5-VL checkpoint (no adapter at runtime). Then you can serve that merged checkpoint with vLLM, but without --lora flags, just as a regular model. The vLLM warning goes away, because there’s no LoRA to apply; the visual weights are already baked into the base.

The PEFT script would be something like this:

import torch

from transformers import AutoModelForVision2Seq, AutoProcessor

from peft import PeftModel



base_id = “Qwen/Qwen2.5-VL-7B-Instruct”

lora_path = “/path/to/your/qwen25vl_lora”  # local or HF repo



#1. Load base model



base_model = AutoModelForVision2Seq.from_pretrained(

base_id,

torch_dtype=torch.bfloat16,

device_map=“auto”,

)



#2. Attach LoRA



lora_model = PeftModel.from_pretrained(base_model, lora_path)



#3. Merge LoRA into base weights



merged_model = lora_model.merge_and_unload()  # ← key step



#4. Save as a new full checkpoint



save_dir = “/models/qwen25vl-7b-instruct-myft”

merged_model.save_pretrained(save_dir)



#Processor is unchanged – just re-save it with the model



processor = AutoProcessor.from_pretrained(base_id)

processor.save_pretrained(save_dir)


Then you run vLLM with:

vllm serve \

  --model /models/qwen2.5-VL-7B-instruct-myft \

  --dtype bfloat16 \

  --max-model-len 8192