Fine-tuning Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 with QLoRA on DGX Spark

raphael.amorim · November 16, 2025, 6:28pm

You can try to merge the LoRA into a full Qwen2.5-VL checkpoint and serve that one instead.

On a HF/PEFT stack, merge the LoRA into the base model weights to produce a new, standalone Qwen2.5-VL checkpoint (no adapter at runtime). Then you can serve that merged checkpoint with vLLM, but without --lora flags, just as a regular model. The vLLM warning goes away, because there’s no LoRA to apply; the visual weights are already baked into the base.

The PEFT script would be something like this:

import torch

from transformers import AutoModelForVision2Seq, AutoProcessor

from peft import PeftModel



base_id = “Qwen/Qwen2.5-VL-7B-Instruct”

lora_path = “/path/to/your/qwen25vl_lora”  # local or HF repo



#1. Load base model



base_model = AutoModelForVision2Seq.from_pretrained(

base_id,

torch_dtype=torch.bfloat16,

device_map=“auto”,

)



#2. Attach LoRA



lora_model = PeftModel.from_pretrained(base_model, lora_path)



#3. Merge LoRA into base weights



merged_model = lora_model.merge_and_unload()  # ← key step



#4. Save as a new full checkpoint



save_dir = “/models/qwen25vl-7b-instruct-myft”

merged_model.save_pretrained(save_dir)



#Processor is unchanged – just re-save it with the model



processor = AutoProcessor.from_pretrained(base_id)

processor.save_pretrained(save_dir)

Then you run vLLM with:

vllm serve \

  --model /models/qwen2.5-VL-7B-instruct-myft \

  --dtype bfloat16 \

  --max-model-len 8192

Topic		Replies	Views
Bf16 LoRA Fine-Tuning of Qwen3.5-35B-A3B on DGX Spark — No Quantization Required DGX Spark / GB10 Projects training , ai-model-training	5	973	April 6, 2026
Qwen3.5-122B-A10B NVFP4 Quantized for DGX Spark — 234GB → 75GB, Runs on 128GB DGX Spark / GB10 Projects	44	11318	April 9, 2026
NVFP4 LoRA training on Nemotron-3-Super-120B recipe DGX Spark / GB10 Projects cuda , ai-training , training , gpu , spark , jetson , nemotron , dgx	3	392	May 29, 2026
Running QuantTrio/Qwen3-VL-235B-A22B-Instruct-AWQ on 2 node spark DGX Spark / GB10	2	252	April 21, 2026
Qwen3.5-397B-A17B + DGX Spark (duo) DGX Spark / GB10 Projects	62	6087	June 14, 2026
Qwen3.5-122B-A10B on single Spark: 15 → 21.5 tok/s with hybrid GPTQ-INT4 + FP8 dense layers (https://github.com/rmstxrx/vllm-hybrid-quant) DGX Spark / GB10 cuda	9	771	March 20, 2026
HOW-TO: Run Qwen3-Coder-Next on Spark DGX Spark / GB10 llama	92	10220	March 24, 2026
Qwen/Qwen3.5-122B-A10B - Alibaba/Qwen thought about us... :-D DGX Spark / GB10	340	16895	March 24, 2026
Can not run Qwen3-VL-8B-Instruct-FP8 on Jetson AGX Thor using vllm Jetson Thor llm	7	219	April 13, 2026
NeMo AutoModel → NIM: Export path for Qwen3-VL-30B-A3B MoE after LoRA training DGX Spark / GB10 nim , nemo-framework	0	45	March 19, 2026

Fine-tuning Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 with QLoRA on DGX Spark

Related topics