You can try to merge the LoRA into a full Qwen2.5-VL checkpoint and serve that one instead.
On a HF/PEFT stack, merge the LoRA into the base model weights to produce a new, standalone Qwen2.5-VL checkpoint (no adapter at runtime). Then you can serve that merged checkpoint with vLLM, but without --lora flags, just as a regular model. The vLLM warning goes away, because there’s no LoRA to apply; the visual weights are already baked into the base.
The PEFT script would be something like this:
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import PeftModel
base_id = “Qwen/Qwen2.5-VL-7B-Instruct”
lora_path = “/path/to/your/qwen25vl_lora” # local or HF repo
#1. Load base model
base_model = AutoModelForVision2Seq.from_pretrained(
base_id,
torch_dtype=torch.bfloat16,
device_map=“auto”,
)
#2. Attach LoRA
lora_model = PeftModel.from_pretrained(base_model, lora_path)
#3. Merge LoRA into base weights
merged_model = lora_model.merge_and_unload() # ← key step
#4. Save as a new full checkpoint
save_dir = “/models/qwen25vl-7b-instruct-myft”
merged_model.save_pretrained(save_dir)
#Processor is unchanged – just re-save it with the model
processor = AutoProcessor.from_pretrained(base_id)
processor.save_pretrained(save_dir)
Then you run vLLM with:
vllm serve \
--model /models/qwen2.5-VL-7B-instruct-myft \
--dtype bfloat16 \
--max-model-len 8192