Fine-tuning Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 with QLoRA on DGX Spark

Hi,

I’m trying to Fine-tuning the Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 model with a dataset I created.

To do this, I ran the following command on my DGX Spark:

$ docker run --gpus all -it --rm -v $HOME/.cache/huggingface:/root/.cache/huggingface -v ${PWD}:/workspace -e HF_TOKEN=### -w /workspace nvcr.io/nvidia/pytorch:25.09-py3

Then, in the container:

$ pip install transformers accelerate bitsandbytes
$ pip install peft datasets Pillow
$ pip install --upgrade transformers peft datasets

When I run my training script, I’m getting the following error:

— Démarrage de la Préparation —

[Étape 1/4] 📦 Chargement du tokenizer et du modèle de base… → Le modèle Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 va être chargé. → Si le modèle n’est pas en cache, le téléchargement (environ 70 Go) va commencer ici. → Attendez que les logs de chargement des ‘checkpoint shards’ apparaissent.torch_dtype is deprecated! Use dtype instead!Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 120.87it/s]Some weights of Qwen3VLMoeModel were not initialized from the model checkpoint at Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 and are newly initialized: [‘language_model.embed_tokens.weight’, ‘language_model.layers.0.input_layernorm.weight’, ‘language_model.layers.0.mlp.experts.down_proj’, ‘language_model.layers.0.mlp.experts.gate_up_proj’, ‘language_model.layers.0.mlp.gate.weight’, ‘language_model.layers.0.post_attention_layernorm.weight’, ‘language_model.layers.0.self_attn.k_norm.weight’, ‘language_model.layers.0.self_attn.k_proj.weight’, ‘language_model.layers.0.self_attn.o_proj.weight’, ‘language_model.layers.0.self_attn.q_norm.weight’, ‘language_model.layers.0.self_attn.q_proj.weight’, ‘language_model.layers.0.self_attn.v_proj.weight’, ‘language_model.layers.1.input_layernorm.weight’, ‘language_model.layers.1.mlp.experts.down_proj’, ‘language_model.layers.1.mlp.experts.gate_up_proj’, ‘language_model.layers.1.mlp.gate.weight’, ‘language_model.layers.1.post_attention_layernorm.weight’, ‘language_model.layers.1.self_attn.k_norm.weight’, ‘language_model.layers.1.self_attn.k_proj.weight’, ‘language_model.layers.1.self_attn.o_proj.weight’, ‘language_model.layers.1.self_attn.q_norm.weight’, ‘language_model.layers.1.self_attn.q_proj.weight’, ‘language_model.layers.1.self_attn.v_proj.weight’, ‘language_model.layers.10.input_layernorm.weight’, ‘language_model.layers.10.mlp.experts.down_proj’, ‘language_model.layers.10.mlp.experts.gate_up_proj’, ‘language_model.layers.10.mlp.gate.weight’, ‘language_model.layers.10.post_attention_layernorm.weight’, ‘language_model.layers.10.self_attn.k_norm.weight’, ‘language_model.layers.10.self_attn.k_proj.weight’, ‘language_model.layers.10.self_attn.o_proj.weight’, ‘language_model.layers.10.self_attn.q_norm.weight’, ‘language_model.layers.10.self_attn.q_proj.weight’, ‘language_model.layers.10.self_attn.v_proj.weight’, ‘language_model.layers.11.input_layernorm.weight’, ‘language_model.layers.11.mlp.experts.down_proj’, ‘language_model.layers.11.mlp.experts.gate_up_proj’, ‘language_model.layers.11.mlp.gate.weight’, ‘language_model.layers.11.post_attention_layernorm.weight’, ‘language_model.layers.11.self_attn.k_norm.weight’, ‘language_model.layers.11.self_attn.k_proj.weight’, ‘language_model.layers.11.self_attn.o_proj.weight’, ‘language_model.layers.11.self_attn.q_norm.weight’, ‘language_model.layers.11.self_attn.q_proj.weight’, ‘language_model.layers.11.self_attn.v_proj.weight’, ‘language_model.layers.12.input_layernorm.weight’, ‘language_model.layers.12.mlp.experts.down_proj’, ‘language_model.layers.12.mlp.experts.gate_up_proj’, ‘language_model.layers.12.mlp.gate.weight’, ‘language_model.layers.12.post_attention_layernorm.weight’, ‘language_model.layers.12.self_attn.k_norm.weight’, ‘language_model.layers.12.self_attn.k_proj.weight’, ‘language_model.layers.12.self_attn.o_proj.weight’, ‘language_model.layers.12.self_attn.q_norm.weight’, ‘language_model.layers.12.self_attn.q_proj.weight’, ‘language_model.layers.12.self_attn.v_proj.weight’, ‘language_model.layers.13.input_layernorm.weight’, ‘language_model.layers.13.mlp.experts.down_proj’, ‘language_model.layers.13.mlp.experts.gate_up_proj’, ‘language_model.layers.13.mlp.gate.weight’, ‘language_model.layers.13.post_attention_layernorm.weight’, ‘language_model.layers.13.self_attn.k_norm.weight’, ‘language_model.layers.13.self_attn.k_proj.weight’, ‘language_model.layers.13.self_attn.o_proj.weight’, ‘language_model.layers.13.self_attn.q_norm.weight’, ‘language_model.layers.13.self_attn.q_proj.weight’, ‘language_model.layers.13.self_attn.v_proj.weight’, ‘language_model.layers.14.input_layernorm.weight’, ‘language_model.layers.14.mlp.experts.down_proj’, ‘language_model.layers.14.mlp.experts.gate_up_proj’, ‘language_model.layers.14.mlp.gate.weight’, ‘language_model.layers.14.post_attention_layernorm.weight’, ‘language_model.layers.14.self_attn.k_norm.weight’, ‘language_model.layers.14.self_attn.k_proj.weight’, ‘language_model.layers.14.self_attn.o_proj.weight’, ‘language_model.layers.14.self_attn.q_norm.weight’, ‘language_model.layers.14.self_attn.q_proj.weight’, ‘language_model.layers.14.self_attn.v_proj.weight’, ‘language_model.layers.15.input_layernorm.weight’, ‘language_model.layers.15.mlp.experts.down_proj’, ‘language_model.layers.15.mlp.experts.gate_up_proj’, ‘language_model.layers.15.mlp.gate.weight’, ‘language_model.layers.15.post_attention_layernorm.weight’, ‘language_model.layers.15.self_attn.k_norm.weight’, ‘language_model.layers.15.self_attn.k_proj.weight’, ‘language_model.layers.15.self_attn.o_proj.weight’, ‘language_model.layers.15.self_attn.q_norm.weight’, ‘language_model.layers.15.self_attn.q_proj.weight’, ‘language_model.layers.15.self_attn.v_proj.weight’, ‘language_model.layers.16.input_layernorm.weight’, ‘language_model.layers.16.mlp.experts.down_proj’, ‘language_model.layers.16.mlp.experts.gate_up_proj’, ‘language_model.layers.16.mlp.gate.weight’, ‘language_model.layers.16.post_attention_layernorm.weight’, ‘language_model.layers.16.self_attn.k_norm.weight’, ‘language_model.layers.16.self_attn.k_proj.weight’, ‘language_model.layers.16.self_attn.o_proj.weight’, ‘language_model.layers.16.self_attn.q_norm.weight’, ‘language_model.layers.16.self_attn.q_proj.weight’, ‘language_model.layers.16.self_attn.v_proj.weight’, ‘language_model.layers.17.input_layernorm.weight’, ‘language_model.layers.17.mlp.experts.down_proj’, ‘language_model.layers.17.mlp.experts.gate_up_proj’, ‘language_model.layers.17.mlp.gate.weight’, ‘language_model.layers.17.post_attention_layernorm.weight’, ‘language_model.layers.17.self_attn.k_norm.weight’, ‘language_model.layers.17.self_attn.k_proj.weight’, ‘language_model.layers.17.self_attn.o_proj.weight’, ‘language_model.layers.17.self_attn.q_norm.weight’, ‘language_model.layers.17.self_attn.q_proj.weight’, ‘language_model.layers.17.self_attn.v_proj.weight’, ‘language_model.layers.18.input_layernorm.weight’, ‘language_model.layers.18.mlp.experts.down_proj’, ‘language_model.layers.18.mlp.experts.gate_up_proj’, ‘language_model.layers.18.mlp.gate.weight’, ‘language_model.layers.18.post_attention_layernorm.weight’, ‘language_model.layers.18.self_attn.k_norm.weight’, ‘language_model.layers.18.self_attn.k_proj.weight’, ‘language_model.layers.18.self_attn.o_proj.weight’, ‘language_model.layers.18.self_attn.q_norm.weight’, ‘language_model.layers.18.self_attn.q_proj.weight’, ‘language_model.layers.18.self_attn.v_proj.weight’, ‘language_model.layers.19.input_layernorm.weight’, ‘language_model.layers.19.mlp.experts.down_proj’, ‘language_model.layers.19.mlp.experts.gate_up_proj’, ‘language_model.layers.19.mlp.gate.weight’, ‘language_model.layers.19.post_attention_layernorm.weight’, ‘language_model.layers.19.self_attn.k_norm.weight’, ‘language_model.layers.19.self_attn.k_proj.weight’, ‘language_model.layers.19.self_attn.o_proj.weight’, ‘language_model.layers.19.self_attn.q_norm.weight’, ‘language_model.layers.19.self_attn.q_proj.weight’, ‘language_model.layers.19.self_attn.v_proj.weight’, ‘language_model.layers.2.input_layernorm.weight’, ‘language_model.layers.2.mlp.experts.down_proj’, ‘language_model.layers.2.mlp.experts.gate_up_proj’, ‘language_model.layers.2.mlp.gate.weight’, ‘language_model.layers.2.post_attention_layernorm.weight’, ‘language_model.layers.2.self_attn.k_norm.weight’, ‘language_model.layers.2.self_attn.k_proj.weight’, ‘language_model.layers.2.self_attn.o_proj.weight’, ‘language_model.layers.2.self_attn.q_norm.weight’, ‘language_model.layers.2.self_attn.q_proj.weight’, ‘language_model.layers.2.self_attn.v_proj.weight’, ‘language_model.layers.20.input_layernorm.weight’, ‘language_model.layers.20.mlp.experts.down_proj’, ‘language_model.layers.20.mlp.experts.gate_up_proj’, ‘language_model.layers.20.mlp.gate.weight’, ‘language_model.layers.20.post_attention_layernorm.weight’, ‘language_model.layers.20.self_attn.k_norm.weight’, ‘language_model.layers.20.self_attn.k_proj.weight’, ‘language_model.layers.20.self_attn.o_proj.weight’, ‘language_model.layers.20.self_attn.q_norm.weight’, ‘language_model.layers.20.self_attn.q_proj.weight’, ‘language_model.layers.20.self_attn.v_proj.weight’, ‘language_model.layers.21.input_layernorm.weight’, ‘language_model.layers.21.mlp.experts.down_proj’, ‘language_model.layers.21.mlp.experts.gate_up_proj’, ‘language_model.layers.21.mlp.gate.weight’, ‘language_model.layers.21.post_attention_layernorm.weight’, ‘language_model.layers.21.self_attn.k_norm.weight’, ‘language_model.layers.21.self_attn.k_proj.weight’, ‘language_model.layers.21.self_attn.o_proj.weight’, ‘language_model.layers.21.self_attn.q_norm.weight’, ‘language_model.layers.21.self_attn.q_proj.weight’, ‘language_model.layers.21.self_attn.v_proj.weight’, ‘language_model.layers.22.input_layernorm.weight’, ‘language_model.layers.22.mlp.experts.down_proj’, ‘language_model.layers.22.mlp.experts.gate_up_proj’, ‘language_model.layers.22.mlp.gate.weight’, ‘language_model.layers.22.post_attention_layernorm.weight’, ‘language_model.layers.22.self_attn.k_norm.weight’, ‘language_model.layers.22.self_attn.k_proj.weight’, ‘language_model.layers.22.self_attn.o_proj.weight’, ‘language_model.layers.22.self_attn.q_norm.weight’, ‘language_model.layers.22.self_attn.q_proj.weight’, ‘language_model.layers.22.self_attn.v_proj.weight’, ‘language_model.layers.23.input_layernorm.weight’, ‘language_model.layers.23.mlp.experts.down_proj’, ‘language_model.layers.23.mlp.experts.gate_up_proj’, ‘language_model.layers.23.mlp.gate.weight’, ‘language_model.layers.23.post_attention_layernorm.weight’, ‘language_model.layers.23.self_attn.k_norm.weight’, ‘language_model.layers.23.self_attn.k_proj.weight’, ‘language_model.layers.23.self_attn.o_proj.weight’, ‘language_model.layers.23.self_attn.q_norm.weight’, ‘language_model.layers.23.self_attn.q_proj.weight’, ‘language_model.layers.23.self_attn.v_proj.weight’, ‘language_model.layers.24.input_layernorm.weight’, ‘language_model.layers.24.mlp.experts.down_proj’, ‘language_model.layers.24.mlp.experts.gate_up_proj’, ‘language_model.layers.24.mlp.gate.weight’, ‘language_model.layers.24.post_attention_layernorm.weight’, ‘language_model.layers.24.self_attn.k_norm.weight’, ‘language_model.layers.24.self_attn.k_proj.weight’, ‘language_model.layers.24.self_attn.o_proj.weight’, ‘language_model.layers.24.self_attn.q_norm.weight’, ‘language_model.layers.24.self_attn.q_proj.weight’, ‘language_model.layers.24.self_attn.v_proj.weight’, ‘language_model.layers.25.input_layernorm.weight’, ‘language_model.layers.25.mlp.experts.down_proj’, ‘language_model.layers.25.mlp.experts.gate_up_proj’, ‘language_model.layers.25.mlp.gate.weight’, ‘language_model.layers.25.post_attention_layernorm.weight’, ‘language_model.layers.25.self_attn.k_norm.weight’, ‘language_model.layers.25.self_attn.k_proj.weight’, ‘language_model.layers.25.self_attn.o_proj.weight’, ‘language_model.layers.25.self_attn.q_norm.weight’, ‘language_model.layers.25.self_attn.q_proj.weight’, ‘language_model.layers.25.self_attn.v_proj.weight’, ‘language_model.layers.26.input_layernorm.weight’, ‘language_model.layers.26.mlp.experts.down_proj’, ‘language_model.layers.26.mlp.experts.gate_up_proj’, ‘language_model.layers.26.mlp.gate.weight’, ‘language_model.layers.26.post_attention_layernorm.weight’, ‘language_model.layers.26.self_attn.k_norm.weight’, ‘language_model.layers.26.self_attn.k_proj.weight’, ‘language_model.layers.26.self_attn.o_proj.weight’, ‘language_model.layers.26.self_attn.q_norm.weight’, ‘language_model.layers.26.self_attn.q_proj.weight’, ‘language_model.layers.26.self_attn.v_proj.weight’, ‘language_model.layers.27.input_layernorm.weight’, ‘language_model.layers.27.mlp.experts.down_proj’, ‘language_model.layers.27.mlp.experts.gate_up_proj’, ‘language_model.layers.27.mlp.gate.weight’, ‘language_model.layers.27.post_attention_layernorm.weight’, ‘language_model.layers.27.self_attn.k_norm.weight’, ‘language_model.layers.27.self_attn.k_proj.weight’, ‘language_model.layers.27.self_attn.o_proj.weight’, ‘language_model.layers.27.self_attn.q_norm.weight’, ‘language_model.layers.27.self_attn.q_proj.weight’, ‘language_model.layers.27.self_attn.v_proj.weight’, ‘language_model.layers.28.input_layernorm.weight’, ‘language_model.layers.28.mlp.experts.down_proj’, ‘language_model.layers.28.mlp.experts.gate_up_proj’, ‘language_model.layers.28.mlp.gate.weight’, ‘language_model.layers.28.post_attention_layernorm.weight’, ‘language_model.layers.28.self_attn.k_norm.weight’, ‘language_model.layers.28.self_attn.k_proj.weight’, ‘language_model.layers.28.self_attn.o_proj.weight’, ‘language_model.layers.28.self_attn.q_norm.weight’, ‘language_model.layers.28.self_attn.q_proj.weight’, ‘language_model.layers.28.self_attn.v_proj.weight’, ‘language_model.layers.29.input_layernorm.weight’, ‘language_model.layers.29.mlp.experts.down_proj’, ‘language_model.layers.29.mlp.experts.gate_up_proj’, ‘language_model.layers.29.mlp.gate.weight’, ‘language_model.layers.29.post_attention_layernorm.weight’, ‘language_model.layers.29.self_attn.k_norm.weight’, ‘language_model.layers.29.self_attn.k_proj.weight’, ‘language_model.layers.29.self_attn.o_proj.weight’, ‘language_model.layers.29.self_attn.q_norm.weight’, ‘language_model.layers.29.self_attn.q_proj.weight’, ‘language_model.layers.29.self_attn.v_proj.weight’, ‘language_model.layers.3.input_layernorm.weight’, ‘language_model.layers.3.mlp.experts.down_proj’, ‘language_model.layers.3.mlp.experts.gate_up_proj’, ‘language_model.layers.3.mlp.gate.weight’, ‘language_model.layers.3.post_attention_layernorm.weight’, ‘language_model.layers.3.self_attn.k_norm.weight’, ‘language_model.layers.3.self_attn.k_proj.weight’, ‘language_model.layers.3.self_attn.o_proj.weight’, ‘language_model.layers.3.self_attn.q_norm.weight’, ‘language_model.layers.3.self_attn.q_proj.weight’, ‘language_model.layers.3.self_attn.v_proj.weight’, ‘language_model.layers.30.input_layernorm.weight’, ‘language_model.layers.30.mlp.experts.down_proj’, ‘language_model.layers.30.mlp.experts.gate_up_proj’, ‘language_model.layers.30.mlp.gate.weight’, ‘language_model.layers.30.post_attention_layernorm.weight’, ‘language_model.layers.30.self_attn.k_norm.weight’, ‘language_model.layers.30.self_attn.k_proj.weight’, ‘language_model.layers.30.self_attn.o_proj.weight’, ‘language_model.layers.30.self_attn.q_norm.weight’, ‘language_model.layers.30.self_attn.q_proj.weight’, ‘language_model.layers.30.self_attn.v_proj.weight’, ‘language_model.layers.31.input_layernorm.weight’, ‘language_model.layers.31.mlp.experts.down_proj’, ‘language_model.layers.31.mlp.experts.gate_up_proj’, ‘language_model.layers.31.mlp.gate.weight’, ‘language_model.layers.31.post_attention_layernorm.weight’, ‘language_model.layers.31.self_attn.k_norm.weight’, ‘language_model.layers.31.self_attn.k_proj.weight’, ‘language_model.layers.31.self_attn.o_proj.weight’, ‘language_model.layers.31.self_attn.q_norm.weight’, ‘language_model.layers.31.self_attn.q_proj.weight’, ‘language_model.layers.31.self_attn.v_proj.weight’, ‘language_model.layers.32.input_layernorm.weight’, ‘language_model.layers.32.mlp.experts.down_proj’, ‘language_model.layers.32.mlp.experts.gate_up_proj’, ‘language_model.layers.32.mlp.gate.weight’, ‘language_model.layers.32.post_attention_layernorm.weight’, ‘language_model.layers.32.self_attn.k_norm.weight’, ‘language_model.layers.32.self_attn.k_proj.weight’, ‘language_model.layers.32.self_attn.o_proj.weight’, ‘language_model.layers.32.self_attn.q_norm.weight’, ‘language_model.layers.32.self_attn.q_proj.weight’, ‘language_model.layers.32.self_attn.v_proj.weight’, ‘language_model.layers.33.input_layernorm.weight’, ‘language_model.layers.33.mlp.experts.down_proj’, ‘language_model.layers.33.mlp.experts.gate_up_proj’, ‘language_model.layers.33.mlp.gate.weight’, ‘language_model.layers.33.post_attention_layernorm.weight’, ‘language_model.layers.33.self_attn.k_norm.weight’, ‘language_model.layers.33.self_attn.k_proj.weight’, ‘language_model.layers.33.self_attn.o_proj.weight’, ‘language_model.layers.33.self_attn.q_norm.weight’, ‘language_model.layers.33.self_attn.q_proj.weight’, ‘language_model.layers.33.self_attn.v_proj.weight’, ‘language_model.layers.34.input_layernorm.weight’, ‘language_model.layers.34.mlp.experts.down_proj’, ‘language_model.layers.34.mlp.experts.gate_up_proj’, ‘language_model.layers.34.mlp.gate.weight’, ‘language_model.layers.34.post_attention_layernorm.weight’, ‘language_model.layers.34.self_attn.k_norm.weight’, ‘language_model.layers.34.self_attn.k_proj.weight’, ‘language_model.layers.34.self_attn.o_proj.weight’, ‘language_model.layers.34.self_attn.q_norm.weight’, ‘language_model.layers.34.self_attn.q_proj.weight’, ‘language_model.layers.34.self_attn.v_proj.weight’, ‘language_model.layers.35.input_layernorm.weight’, ‘language_model.layers.35.mlp.experts.down_proj’, ‘language_model.layers.35.mlp.experts.gate_up_proj’, ‘language_model.layers.35.mlp.gate.weight’, ‘language_model.layers.35.post_attention_layernorm.weight’, ‘language_model.layers.35.self_attn.k_norm.weight’, ‘language_model.layers.35.self_attn.k_proj.weight’, ‘language_model.layers.35.self_attn.o_proj.weight’, ‘language_model.layers.35.self_attn.q_norm.weight’, ‘language_model.layers.35.self_attn.q_proj.weight’, ‘language_model.layers.35.self_attn.v_proj.weight’, ‘language_model.layers.36.input_layernorm.weight’, ‘language_model.layers.36.mlp.experts.down_proj’, ‘language_model.layers.36.mlp.experts.gate_up_proj’, ‘language_model.layers.36.mlp.gate.weight’, ‘language_model.layers.36.post_attention_layernorm.weight’, ‘language_model.layers.36.self_attn.k_norm.weight’, ‘language_model.layers.36.self_attn.k_proj.weight’, ‘language_model.layers.36.self_attn.o_proj.weight’, ‘language_model.layers.36.self_attn.q_norm.weight’, ‘language_model.layers.36.self_attn.q_proj.weight’, ‘language_model.layers.36.self_attn.v_proj.weight’, ‘language_model.layers.37.input_layernorm.weight’, ‘language_model.layers.37.mlp.experts.down_proj’, ‘language_model.layers.37.mlp.experts.gate_up_proj’, ‘language_model.layers.37.mlp.gate.weight’, ‘language_model.layers.37.post_attention_layernorm.weight’, ‘language_model.layers.37.self_attn.k_norm.weight’, ‘language_model.layers.37.self_attn.k_proj.weight’, ‘language_model.layers.37.self_attn.o_proj.weight’, ‘language_model.layers.37.self_attn.q_norm.weight’, ‘language_model.layers.37.self_attn.q_proj.weight’, ‘language_model.layers.37.self_attn.v_proj.weight’, ‘language_model.layers.38.input_layernorm.weight’, ‘language_model.layers.38.mlp.experts.down_proj’, ‘language_model.layers.38.mlp.experts.gate_up_proj’, ‘language_model.layers.38.mlp.gate.weight’, ‘language_model.layers.38.post_attention_layernorm.weight’, ‘language_model.layers.38.self_attn.k_norm.weight’, ‘language_model.layers.38.self_attn.k_proj.weight’, ‘language_model.layers.38.self_attn.o_proj.weight’, ‘language_model.layers.38.self_attn.q_norm.weight’, ‘language_model.layers.38.self_attn.q_proj.weight’, ‘language_model.layers.38.self_attn.v_proj.weight’, ‘language_model.layers.39.input_layernorm.weight’, ‘language_model.layers.39.mlp.experts.down_proj’, ‘language_model.layers.39.mlp.experts.gate_up_proj’, ‘language_model.layers.39.mlp.gate.weight’, ‘language_model.layers.39.post_attention_layernorm.weight’, ‘language_model.layers.39.self_attn.k_norm.weight’, ‘language_model.layers.39.self_attn.k_proj.weight’, ‘language_model.layers.39.self_attn.o_proj.weight’, ‘language_model.layers.39.self_attn.q_norm.weight’, ‘language_model.layers.39.self_attn.q_proj.weight’, ‘language_model.layers.39.self_attn.v_proj.weight’, ‘language_model.layers.4.input_layernorm.weight’, ‘language_model.layers.4.mlp.experts.down_proj’, ‘language_model.layers.4.mlp.experts.gate_up_proj’, ‘language_model.layers.4.mlp.gate.weight’, ‘language_model.layers.4.post_attention_layernorm.weight’, ‘language_model.layers.4.self_attn.k_norm.weight’, ‘language_model.layers.4.self_attn.k_proj.weight’, ‘language_model.layers.4.self_attn.o_proj.weight’, ‘language_model.layers.4.self_attn.q_norm.weight’, ‘language_model.layers.4.self_attn.q_proj.weight’, ‘language_model.layers.4.self_attn.v_proj.weight’, ‘language_model.layers.40.input_layernorm.weight’, ‘language_model.layers.40.mlp.experts.down_proj’, ‘language_model.layers.40.mlp.experts.gate_up_proj’, ‘language_model.layers.40.mlp.gate.weight’, ‘language_model.layers.40.post_attention_layernorm.weight’, ‘language_model.layers.40.self_attn.k_norm.weight’, ‘language_model.layers.40.self_attn.k_proj.weight’, ‘language_model.layers.40.self_attn.o_proj.weight’, ‘language_model.layers.40.self_attn.q_norm.weight’, ‘language_model.layers.40.self_attn.q_proj.weight’, ‘language_model.layers.40.self_attn.v_proj.weight’, ‘language_model.layers.41.input_layernorm.weight’, ‘language_model.layers.41.mlp.experts.down_proj’, ‘language_model.layers.41.mlp.experts.gate_up_proj’, ‘language_model.layers.41.mlp.gate.weight’, ‘language_model.layers.41.post_attention_layernorm.weight’, ‘language_model.layers.41.self_attn.k_norm.weight’, ‘language_model.layers.41.self_attn.k_proj.weight’, ‘language_model.layers.41.self_attn.o_proj.weight’, ‘language_model.layers.41.self_attn.q_norm.weight’, ‘language_model.layers.41.self_attn.q_proj.weight’, ‘language_model.layers.41.self_attn.v_proj.weight’, ‘language_model.layers.42.input_layernorm.weight’, ‘language_model.layers.42.mlp.experts.down_proj’, ‘language_model.layers.42.mlp.experts.gate_up_proj’, ‘language_model.layers.42.mlp.gate.weight’, ‘language_model.layers.42.post_attention_layernorm.weight’, ‘language_model.layers.42.self_attn.k_norm.weight’, ‘language_model.layers.42.self_attn.k_proj.weight’, ‘language_model.layers.42.self_attn.o_proj.weight’, ‘language_model.layers.42.self_attn.q_norm.weight’, ‘language_model.layers.42.self_attn.q_proj.weight’, ‘language_model.layers.42.self_attn.v_proj.weight’, ‘language_model.layers.43.input_layernorm.weight’, ‘language_model.layers.43.mlp.experts.down_proj’, ‘language_model.layers.43.mlp.experts.gate_up_proj’, ‘language_model.layers.43.mlp.gate.weight’, ‘language_model.layers.43.post_attention_layernorm.weight’, ‘language_model.layers.43.self_attn.k_norm.weight’, ‘language_model.layers.43.self_attn.k_proj.weight’, ‘language_model.layers.43.self_attn.o_proj.weight’, ‘language_model.layers.43.self_attn.q_norm.weight’, ‘language_model.layers.43.self_attn.q_proj.weight’, ‘language_model.layers.43.self_attn.v_proj.weight’, ‘language_model.layers.44.input_layernorm.weight’, ‘language_model.layers.44.mlp.experts.down_proj’, ‘language_model.layers.44.mlp.experts.gate_up_proj’, ‘language_model.layers.44.mlp.gate.weight’, ‘language_model.layers.44.post_attention_layernorm.weight’, ‘language_model.layers.44.self_attn.k_norm.weight’, ‘language_model.layers.44.self_attn.k_proj.weight’, ‘language_model.layers.44.self_attn.o_proj.weight’, ‘language_model.layers.44.self_attn.q_norm.weight’, ‘language_model.layers.44.self_attn.q_proj.weight’, ‘language_model.layers.44.self_attn.v_proj.weight’, ‘language_model.layers.45.input_layernorm.weight’, ‘language_model.layers.45.mlp.experts.down_proj’, ‘language_model.layers.45.mlp.experts.gate_up_proj’, ‘language_model.layers.45.mlp.gate.weight’, ‘language_model.layers.45.post_attention_layernorm.weight’, ‘language_model.layers.45.self_attn.k_norm.weight’, ‘language_model.layers.45.self_attn.k_proj.weight’, ‘language_model.layers.45.self_attn.o_proj.weight’, ‘language_model.layers.45.self_attn.q_norm.weight’, ‘language_model.layers.45.self_attn.q_proj.weight’, ‘language_model.layers.45.self_attn.v_proj.weight’, ‘language_model.layers.46.input_layernorm.weight’, ‘language_model.layers.46.mlp.experts.down_proj’, ‘language_model.layers.46.mlp.experts.gate_up_proj’, ‘language_model.layers.46.mlp.gate.weight’, ‘language_model.layers.46.post_attention_layernorm.weight’, ‘language_model.layers.46.self_attn.k_norm.weight’, ‘language_model.layers.46.self_attn.k_proj.weight’, ‘language_model.layers.46.self_attn.o_proj.weight’, ‘language_model.layers.46.self_attn.q_norm.weight’, ‘language_model.layers.46.self_attn.q_proj.weight’, ‘language_model.layers.46.self_attn.v_proj.weight’, ‘language_model.layers.47.input_layernorm.weight’, ‘language_model.layers.47.mlp.experts.down_proj’, ‘language_model.layers.47.mlp.experts.gate_up_proj’, ‘language_model.layers.47.mlp.gate.weight’, ‘language_model.layers.47.post_attention_layernorm.weight’, ‘language_model.layers.47.self_attn.k_norm.weight’, ‘language_model.layers.47.self_attn.k_proj.weight’, ‘language_model.layers.47.self_attn.o_proj.weight’, ‘language_model.layers.47.self_attn.q_norm.weight’, ‘language_model.layers.47.self_attn.q_proj.weight’, ‘language_model.layers.47.self_attn.v_proj.weight’, ‘language_model.layers.5.input_layernorm.weight’, ‘language_model.layers.5.mlp.experts.down_proj’, ‘language_model.layers.5.mlp.experts.gate_up_proj’, ‘language_model.layers.5.mlp.gate.weight’, ‘language_model.layers.5.post_attention_layernorm.weight’, ‘language_model.layers.5.self_attn.k_norm.weight’, ‘language_model.layers.5.self_attn.k_proj.weight’, ‘language_model.layers.5.self_attn.o_proj.weight’, ‘language_model.layers.5.self_attn.q_norm.weight’, ‘language_model.layers.5.self_attn.q_proj.weight’, ‘language_model.layers.5.self_attn.v_proj.weight’, ‘language_model.layers.6.input_layernorm.weight’, ‘language_model.layers.6.mlp.experts.down_proj’, ‘language_model.layers.6.mlp.experts.gate_up_proj’, ‘language_model.layers.6.mlp.gate.weight’, ‘language_model.layers.6.post_attention_layernorm.weight’, ‘language_model.layers.6.self_attn.k_norm.weight’, ‘language_model.layers.6.self_attn.k_proj.weight’, ‘language_model.layers.6.self_attn.o_proj.weight’, ‘language_model.layers.6.self_attn.q_norm.weight’, ‘language_model.layers.6.self_attn.q_proj.weight’, ‘language_model.layers.6.self_attn.v_proj.weight’, ‘language_model.layers.7.input_layernorm.weight’, ‘language_model.layers.7.mlp.experts.down_proj’, ‘language_model.layers.7.mlp.experts.gate_up_proj’, ‘language_model.layers.7.mlp.gate.weight’, ‘language_model.layers.7.post_attention_layernorm.weight’, ‘language_model.layers.7.self_attn.k_norm.weight’, ‘language_model.layers.7.self_attn.k_proj.weight’, ‘language_model.layers.7.self_attn.o_proj.weight’, ‘language_model.layers.7.self_attn.q_norm.weight’, ‘language_model.layers.7.self_attn.q_proj.weight’, ‘language_model.layers.7.self_attn.v_proj.weight’, ‘language_model.layers.8.input_layernorm.weight’, ‘language_model.layers.8.mlp.experts.down_proj’, ‘language_model.layers.8.mlp.experts.gate_up_proj’, ‘language_model.layers.8.mlp.gate.weight’, ‘language_model.layers.8.post_attention_layernorm.weight’, ‘language_model.layers.8.self_attn.k_norm.weight’, ‘language_model.layers.8.self_attn.k_proj.weight’, ‘language_model.layers.8.self_attn.o_proj.weight’, ‘language_model.layers.8.self_attn.q_norm.weight’, ‘language_model.layers.8.self_attn.q_proj.weight’, ‘language_model.layers.8.self_attn.v_proj.weight’, ‘language_model.layers.9.input_layernorm.weight’, ‘language_model.layers.9.mlp.experts.down_proj’, ‘language_model.layers.9.mlp.experts.gate_up_proj’, ‘language_model.layers.9.mlp.gate.weight’, ‘language_model.layers.9.post_attention_layernorm.weight’, ‘language_model.layers.9.self_attn.k_norm.weight’, ‘language_model.layers.9.self_attn.k_proj.weight’, ‘language_model.layers.9.self_attn.o_proj.weight’, ‘language_model.layers.9.self_attn.q_norm.weight’, ‘language_model.layers.9.self_attn.q_proj.weight’, ‘language_model.layers.9.self_attn.v_proj.weight’, ‘language_model.norm.weight’, ‘visual.blocks.0.attn.proj.bias’, ‘visual.blocks.0.attn.proj.weight’, ‘visual.blocks.0.attn.qkv.bias’, ‘visual.blocks.0.attn.qkv.weight’, ‘visual.blocks.0.mlp.linear_fc1.bias’, ‘visual.blocks.0.mlp.linear_fc1.weight’, ‘visual.blocks.0.mlp.linear_fc2.bias’, ‘visual.blocks.0.mlp.linear_fc2.weight’, ‘visual.blocks.0.norm1.bias’, ‘visual.blocks.0.norm1.weight’, ‘visual.blocks.0.norm2.bias’, ‘visual.blocks.0.norm2.weight’, ‘visual.blocks.1.attn.proj.bias’, ‘visual.blocks.1.attn.proj.weight’, ‘visual.blocks.1.attn.qkv.bias’, ‘visual.blocks.1.attn.qkv.weight’, ‘visual.blocks.1.mlp.linear_fc1.bias’, ‘visual.blocks.1.mlp.linear_fc1.weight’, ‘visual.blocks.1.mlp.linear_fc2.bias’, ‘visual.blocks.1.mlp.linear_fc2.weight’, ‘visual.blocks.1.norm1.bias’, ‘visual.blocks.1.norm1.weight’, ‘visual.blocks.1.norm2.bias’, ‘visual.blocks.1.norm2.weight’, ‘visual.blocks.10.attn.proj.bias’, ‘visual.blocks.10.attn.proj.weight’, ‘visual.blocks.10.attn.qkv.bias’, ‘visual.blocks.10.attn.qkv.weight’, ‘visual.blocks.10.mlp.linear_fc1.bias’, ‘visual.blocks.10.mlp.linear_fc1.weight’, ‘visual.blocks.10.mlp.linear_fc2.bias’, ‘visual.blocks.10.mlp.linear_fc2.weight’, ‘visual.blocks.10.norm1.bias’, ‘visual.blocks.10.norm1.weight’, ‘visual.blocks.10.norm2.bias’, ‘visual.blocks.10.norm2.weight’, ‘visual.blocks.11.attn.proj.bias’, ‘visual.blocks.11.attn.proj.weight’, ‘visual.blocks.11.attn.qkv.bias’, ‘visual.blocks.11.attn.qkv.weight’, ‘visual.blocks.11.mlp.linear_fc1.bias’, ‘visual.blocks.11.mlp.linear_fc1.weight’, ‘visual.blocks.11.mlp.linear_fc2.bias’, ‘visual.blocks.11.mlp.linear_fc2.weight’, ‘visual.blocks.11.norm1.bias’, ‘visual.blocks.11.norm1.weight’, ‘visual.blocks.11.norm2.bias’, ‘visual.blocks.11.norm2.weight’, ‘visual.blocks.12.attn.proj.bias’, ‘visual.blocks.12.attn.proj.weight’, ‘visual.blocks.12.attn.qkv.bias’, ‘visual.blocks.12.attn.qkv.weight’, ‘visual.blocks.12.mlp.linear_fc1.bias’, ‘visual.blocks.12.mlp.linear_fc1.weight’, ‘visual.blocks.12.mlp.linear_fc2.bias’, ‘visual.blocks.12.mlp.linear_fc2.weight’, ‘visual.blocks.12.norm1.bias’, ‘visual.blocks.12.norm1.weight’, ‘visual.blocks.12.norm2.bias’, ‘visual.blocks.12.norm2.weight’, ‘visual.blocks.13.attn.proj.bias’, ‘visual.blocks.13.attn.proj.weight’, ‘visual.blocks.13.attn.qkv.bias’, ‘visual.blocks.13.attn.qkv.weight’, ‘visual.blocks.13.mlp.linear_fc1.bias’, ‘visual.blocks.13.mlp.linear_fc1.weight’, ‘visual.blocks.13.mlp.linear_fc2.bias’, ‘visual.blocks.13.mlp.linear_fc2.weight’, ‘visual.blocks.13.norm1.bias’, ‘visual.blocks.13.norm1.weight’, ‘visual.blocks.13.norm2.bias’, ‘visual.blocks.13.norm2.weight’, ‘visual.blocks.14.attn.proj.bias’, ‘visual.blocks.14.attn.proj.weight’, ‘visual.blocks.14.attn.qkv.bias’, ‘visual.blocks.14.attn.qkv.weight’, ‘visual.blocks.14.mlp.linear_fc1.bias’, ‘visual.blocks.14.mlp.linear_fc1.weight’, ‘visual.blocks.14.mlp.linear_fc2.bias’, ‘visual.blocks.14.mlp.linear_fc2.weight’, ‘visual.blocks.14.norm1.bias’, ‘visual.blocks.14.norm1.weight’, ‘visual.blocks.14.norm2.bias’, ‘visual.blocks.14.norm2.weight’, ‘visual.blocks.15.attn.proj.bias’, ‘visual.blocks.15.attn.proj.weight’, ‘visual.blocks.15.attn.qkv.bias’, ‘visual.blocks.15.attn.qkv.weight’, ‘visual.blocks.15.mlp.linear_fc1.bias’, ‘visual.blocks.15.mlp.linear_fc1.weight’, ‘visual.blocks.15.mlp.linear_fc2.bias’, ‘visual.blocks.15.mlp.linear_fc2.weight’, ‘visual.blocks.15.norm1.bias’, ‘visual.blocks.15.norm1.weight’, ‘visual.blocks.15.norm2.bias’, ‘visual.blocks.15.norm2.weight’, ‘visual.blocks.16.attn.proj.bias’, ‘visual.blocks.16.attn.proj.weight’, ‘visual.blocks.16.attn.qkv.bias’, ‘visual.blocks.16.attn.qkv.weight’, ‘visual.blocks.16.mlp.linear_fc1.bias’, ‘visual.blocks.16.mlp.linear_fc1.weight’, ‘visual.blocks.16.mlp.linear_fc2.bias’, ‘visual.blocks.16.mlp.linear_fc2.weight’, ‘visual.blocks.16.norm1.bias’, ‘visual.blocks.16.norm1.weight’, ‘visual.blocks.16.norm2.bias’, ‘visual.blocks.16.norm2.weight’, ‘visual.blocks.17.attn.proj.bias’, ‘visual.blocks.17.attn.proj.weight’, ‘visual.blocks.17.attn.qkv.bias’, ‘visual.blocks.17.attn.qkv.weight’, ‘visual.blocks.17.mlp.linear_fc1.bias’, ‘visual.blocks.17.mlp.linear_fc1.weight’, ‘visual.blocks.17.mlp.linear_fc2.bias’, ‘visual.blocks.17.mlp.linear_fc2.weight’, ‘visual.blocks.17.norm1.bias’, ‘visual.blocks.17.norm1.weight’, ‘visual.blocks.17.norm2.bias’, ‘visual.blocks.17.norm2.weight’, ‘visual.blocks.18.attn.proj.bias’, ‘visual.blocks.18.attn.proj.weight’, ‘visual.blocks.18.attn.qkv.bias’, ‘visual.blocks.18.attn.qkv.weight’, ‘visual.blocks.18.mlp.linear_fc1.bias’, ‘visual.blocks.18.mlp.linear_fc1.weight’, ‘visual.blocks.18.mlp.linear_fc2.bias’, ‘visual.blocks.18.mlp.linear_fc2.weight’, ‘visual.blocks.18.norm1.bias’, ‘visual.blocks.18.norm1.weight’, ‘visual.blocks.18.norm2.bias’, ‘visual.blocks.18.norm2.weight’, ‘visual.blocks.19.attn.proj.bias’, ‘visual.blocks.19.attn.proj.weight’, ‘visual.blocks.19.attn.qkv.bias’, ‘visual.blocks.19.attn.qkv.weight’, ‘visual.blocks.19.mlp.linear_fc1.bias’, ‘visual.blocks.19.mlp.linear_fc1.weight’, ‘visual.blocks.19.mlp.linear_fc2.bias’, ‘visual.blocks.19.mlp.linear_fc2.weight’, ‘visual.blocks.19.norm1.bias’, ‘visual.blocks.19.norm1.weight’, ‘visual.blocks.19.norm2.bias’, ‘visual.blocks.19.norm2.weight’, ‘visual.blocks.2.attn.proj.bias’, ‘visual.blocks.2.attn.proj.weight’, ‘visual.blocks.2.attn.qkv.bias’, ‘visual.blocks.2.attn.qkv.weight’, ‘visual.blocks.2.mlp.linear_fc1.bias’, ‘visual.blocks.2.mlp.linear_fc1.weight’, ‘visual.blocks.2.mlp.linear_fc2.bias’, ‘visual.blocks.2.mlp.linear_fc2.weight’, ‘visual.blocks.2.norm1.bias’, ‘visual.blocks.2.norm1.weight’, ‘visual.blocks.2.norm2.bias’, ‘visual.blocks.2.norm2.weight’, ‘visual.blocks.20.attn.proj.bias’, ‘visual.blocks.20.attn.proj.weight’, ‘visual.blocks.20.attn.qkv.bias’, ‘visual.blocks.20.attn.qkv.weight’, ‘visual.blocks.20.mlp.linear_fc1.bias’, ‘visual.blocks.20.mlp.linear_fc1.weight’, ‘visual.blocks.20.mlp.linear_fc2.bias’, ‘visual.blocks.20.mlp.linear_fc2.weight’, ‘visual.blocks.20.norm1.bias’, ‘visual.blocks.20.norm1.weight’, ‘visual.blocks.20.norm2.bias’, ‘visual.blocks.20.norm2.weight’, ‘visual.blocks.21.attn.proj.bias’, ‘visual.blocks.21.attn.proj.weight’, ‘visual.blocks.21.attn.qkv.bias’, ‘visual.blocks.21.attn.qkv.weight’, ‘visual.blocks.21.mlp.linear_fc1.bias’, ‘visual.blocks.21.mlp.linear_fc1.weight’, ‘visual.blocks.21.mlp.linear_fc2.bias’, ‘visual.blocks.21.mlp.linear_fc2.weight’, ‘visual.blocks.21.norm1.bias’, ‘visual.blocks.21.norm1.weight’, ‘visual.blocks.21.norm2.bias’, ‘visual.blocks.21.norm2.weight’, ‘visual.blocks.22.attn.proj.bias’, ‘visual.blocks.22.attn.proj.weight’, ‘visual.blocks.22.attn.qkv.bias’, ‘visual.blocks.22.attn.qkv.weight’, ‘visual.blocks.22.mlp.linear_fc1.bias’, ‘visual.blocks.22.mlp.linear_fc1.weight’, ‘visual.blocks.22.mlp.linear_fc2.bias’, ‘visual.blocks.22.mlp.linear_fc2.weight’, ‘visual.blocks.22.norm1.bias’, ‘visual.blocks.22.norm1.weight’, ‘visual.blocks.22.norm2.bias’, ‘visual.blocks.22.norm2.weight’, ‘visual.blocks.23.attn.proj.bias’, ‘visual.blocks.23.attn.proj.weight’, ‘visual.blocks.23.attn.qkv.bias’, ‘visual.blocks.23.attn.qkv.weight’, ‘visual.blocks.23.mlp.linear_fc1.bias’, ‘visual.blocks.23.mlp.linear_fc1.weight’, ‘visual.blocks.23.mlp.linear_fc2.bias’, ‘visual.blocks.23.mlp.linear_fc2.weight’, ‘visual.blocks.23.norm1.bias’, ‘visual.blocks.23.norm1.weight’, ‘visual.blocks.23.norm2.bias’, ‘visual.blocks.23.norm2.weight’, ‘visual.blocks.24.attn.proj.bias’, ‘visual.blocks.24.attn.proj.weight’, ‘visual.blocks.24.attn.qkv.bias’, ‘visual.blocks.24.attn.qkv.weight’, ‘visual.blocks.24.mlp.linear_fc1.bias’, ‘visual.blocks.24.mlp.linear_fc1.weight’, ‘visual.blocks.24.mlp.linear_fc2.bias’, ‘visual.blocks.24.mlp.linear_fc2.weight’, ‘visual.blocks.24.norm1.bias’, ‘visual.blocks.24.norm1.weight’, ‘visual.blocks.24.norm2.bias’, ‘visual.blocks.24.norm2.weight’, ‘visual.blocks.25.attn.proj.bias’, ‘visual.blocks.25.attn.proj.weight’, ‘visual.blocks.25.attn.qkv.bias’, ‘visual.blocks.25.attn.qkv.weight’, ‘visual.blocks.25.mlp.linear_fc1.bias’, ‘visual.blocks.25.mlp.linear_fc1.weight’, ‘visual.blocks.25.mlp.linear_fc2.bias’, ‘visual.blocks.25.mlp.linear_fc2.weight’, ‘visual.blocks.25.norm1.bias’, ‘visual.blocks.25.norm1.weight’, ‘visual.blocks.25.norm2.bias’, ‘visual.blocks.25.norm2.weight’, ‘visual.blocks.26.attn.proj.bias’, ‘visual.blocks.26.attn.proj.weight’, ‘visual.blocks.26.attn.qkv.bias’, ‘visual.blocks.26.attn.qkv.weight’, ‘visual.blocks.26.mlp.linear_fc1.bias’, ‘visual.blocks.26.mlp.linear_fc1.weight’, ‘visual.blocks.26.mlp.linear_fc2.bias’, ‘visual.blocks.26.mlp.linear_fc2.weight’, ‘visual.blocks.26.norm1.bias’, ‘visual.blocks.26.norm1.weight’, ‘visual.blocks.26.norm2.bias’, ‘visual.blocks.26.norm2.weight’, ‘visual.blocks.3.attn.proj.bias’, ‘visual.blocks.3.attn.proj.weight’, ‘visual.blocks.3.attn.qkv.bias’, ‘visual.blocks.3.attn.qkv.weight’, ‘visual.blocks.3.mlp.linear_fc1.bias’, ‘visual.blocks.3.mlp.linear_fc1.weight’, ‘visual.blocks.3.mlp.linear_fc2.bias’, ‘visual.blocks.3.mlp.linear_fc2.weight’, ‘visual.blocks.3.norm1.bias’, ‘visual.blocks.3.norm1.weight’, ‘visual.blocks.3.norm2.bias’, ‘visual.blocks.3.norm2.weight’, ‘visual.blocks.4.attn.proj.bias’, ‘visual.blocks.4.attn.proj.weight’, ‘visual.blocks.4.attn.qkv.bias’, ‘visual.blocks.4.attn.qkv.weight’, ‘visual.blocks.4.mlp.linear_fc1.bias’, ‘visual.blocks.4.mlp.linear_fc1.weight’, ‘visual.blocks.4.mlp.linear_fc2.bias’, ‘visual.blocks.4.mlp.linear_fc2.weight’, ‘visual.blocks.4.norm1.bias’, ‘visual.blocks.4.norm1.weight’, ‘visual.blocks.4.norm2.bias’, ‘visual.blocks.4.norm2.weight’, ‘visual.blocks.5.attn.proj.bias’, ‘visual.blocks.5.attn.proj.weight’, ‘visual.blocks.5.attn.qkv.bias’, ‘visual.blocks.5.attn.qkv.weight’, ‘visual.blocks.5.mlp.linear_fc1.bias’, ‘visual.blocks.5.mlp.linear_fc1.weight’, ‘visual.blocks.5.mlp.linear_fc2.bias’, ‘visual.blocks.5.mlp.linear_fc2.weight’, ‘visual.blocks.5.norm1.bias’, ‘visual.blocks.5.norm1.weight’, ‘visual.blocks.5.norm2.bias’, ‘visual.blocks.5.norm2.weight’, ‘visual.blocks.6.attn.proj.bias’, ‘visual.blocks.6.attn.proj.weight’, ‘visual.blocks.6.attn.qkv.bias’, ‘visual.blocks.6.attn.qkv.weight’, ‘visual.blocks.6.mlp.linear_fc1.bias’, ‘visual.blocks.6.mlp.linear_fc1.weight’, ‘visual.blocks.6.mlp.linear_fc2.bias’, ‘visual.blocks.6.mlp.linear_fc2.weight’, ‘visual.blocks.6.norm1.bias’, ‘visual.blocks.6.norm1.weight’, ‘visual.blocks.6.norm2.bias’, ‘visual.blocks.6.norm2.weight’, ‘visual.blocks.7.attn.proj.bias’, ‘visual.blocks.7.attn.proj.weight’, ‘visual.blocks.7.attn.qkv.bias’, ‘visual.blocks.7.attn.qkv.weight’, ‘visual.blocks.7.mlp.linear_fc1.bias’, ‘visual.blocks.7.mlp.linear_fc1.weight’, ‘visual.blocks.7.mlp.linear_fc2.bias’, ‘visual.blocks.7.mlp.linear_fc2.weight’, ‘visual.blocks.7.norm1.bias’, ‘visual.blocks.7.norm1.weight’, ‘visual.blocks.7.norm2.bias’, ‘visual.blocks.7.norm2.weight’, ‘visual.blocks.8.attn.proj.bias’, ‘visual.blocks.8.attn.proj.weight’, ‘visual.blocks.8.attn.qkv.bias’, ‘visual.blocks.8.attn.qkv.weight’, ‘visual.blocks.8.mlp.linear_fc1.bias’, ‘visual.blocks.8.mlp.linear_fc1.weight’, ‘visual.blocks.8.mlp.linear_fc2.bias’, ‘visual.blocks.8.mlp.linear_fc2.weight’, ‘visual.blocks.8.norm1.bias’, ‘visual.blocks.8.norm1.weight’, ‘visual.blocks.8.norm2.bias’, ‘visual.blocks.8.norm2.weight’, ‘visual.blocks.9.attn.proj.bias’, ‘visual.blocks.9.attn.proj.weight’, ‘visual.blocks.9.attn.qkv.bias’, ‘visual.blocks.9.attn.qkv.weight’, ‘visual.blocks.9.mlp.linear_fc1.bias’, ‘visual.blocks.9.mlp.linear_fc1.weight’, ‘visual.blocks.9.mlp.linear_fc2.bias’, ‘visual.blocks.9.mlp.linear_fc2.weight’, ‘visual.blocks.9.norm1.bias’, ‘visual.blocks.9.norm1.weight’, ‘visual.blocks.9.norm2.bias’, ‘visual.blocks.9.norm2.weight’, ‘visual.deepstack_merger_list.0.linear_fc1.bias’, ‘visual.deepstack_merger_list.0.linear_fc1.weight’, ‘visual.deepstack_merger_list.0.linear_fc2.bias’, ‘visual.deepstack_merger_list.0.linear_fc2.weight’, ‘visual.deepstack_merger_list.0.norm.bias’, ‘visual.deepstack_merger_list.0.norm.weight’, ‘visual.deepstack_merger_list.1.linear_fc1.bias’, ‘visual.deepstack_merger_list.1.linear_fc1.weight’, ‘visual.deepstack_merger_list.1.linear_fc2.bias’, ‘visual.deepstack_merger_list.1.linear_fc2.weight’, ‘visual.deepstack_merger_list.1.norm.bias’, ‘visual.deepstack_merger_list.1.norm.weight’, ‘visual.deepstack_merger_list.2.linear_fc1.bias’, ‘visual.deepstack_merger_list.2.linear_fc1.weight’, ‘visual.deepstack_merger_list.2.linear_fc2.bias’, ‘visual.deepstack_merger_list.2.linear_fc2.weight’, ‘visual.deepstack_merger_list.2.norm.bias’, ‘visual.deepstack_merger_list.2.norm.weight’, ‘visual.merger.linear_fc1.bias’, ‘visual.merger.linear_fc1.weight’, ‘visual.merger.linear_fc2.bias’, ‘visual.merger.linear_fc2.weight’, ‘visual.merger.norm.bias’, ‘visual.merger.norm.weight’, ‘visual.patch_embed.proj.bias’, ‘visual.patch_embed.proj.weight’, ‘visual.pos_embed.weight’]You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.Traceback (most recent call last):File “/workspace/scripts/train_lora.py”, line 165, in model = AutoModel.from_pretrained(^^^^^^^^^^^^^^^^^^^^^^^^^^File “/usr/local/lib/python3.12/dist-packages/transformers/models/auto/auto_factory.py”, line 604, in from_pretrainedreturn model_class.from_pretrained(^^^^^^^^^^^^^^^^^^^^^^^^^^^^File “/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py”, line 277, in _wrapperreturn func(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^File “/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py”, line 5140, in from_pretraineddispatch_model(model, **device_map_kwargs)File “/usr/local/lib/python3.12/dist-packages/accelerate/big_modeling.py”, line 502, in dispatch_modelmodel.to(device)File “/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py”, line 4343, in toreturn super().to(*args, **kwargs)^^^^^^^^^^^^^^^^^^^^^^^^^^^File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1371, in toreturn self._apply(convert)^^^^^^^^^^^^^^^^^^^^File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 930, in _applymodule._apply(fn)File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 930, in _applymodule._apply(fn)File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 930, in _applymodule._apply(fn)[Previous line repeated 2 more times]File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 957, in _applyparam_applied = fn(param)^^^^^^^^^File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1364, in convertraise NotImplementedError(NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.

Can someone explain why I’m getting this error? And how can I fix the problem?

Thanks

You should try Qwen/Qwen3-VL-30B-A3B-Instruct, because on Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 · Hugging Face , it says: “Currently, Transformers does not support loading these weights directly. We recommend deploying the model using vLLM or SGLang…”.

The FP8 checkpoint is designed for inference with vLLM / SGLang, not for generic Transformers loading.

You didn’t do anything “wrong” on the DGX, you just ran into a limitation of the FP8 checkpoint.
When you call AutoModel.from_pretrained("Qwen/Qwen3-VL-30B-A3B-Instruct-FP8", device_map="auto", torch_dtype="auto"), Transformers tries to interpret this special FP8 format as a normal HF model → fails to load most weights → meta-tensor + NotImplementedError.

Also, instead of using something like this:

model = AutoModel.from_pretrained(
    "Qwen/Qwen3-VL-30B-A3B-Instruct-FP8",
    ...
)

use this instead:

from transformers import AutoModelForVision2Seq, AutoProcessor

model = AutoModelForVision2Seq.from_pretrained(
MODEL_ID,
torch_dtype=torch.bfloat16,
device_map=“auto”,           # or {“”: 0} etc.
trust_remote_code=True,
attn_implementation=“flash_attention_2”,  # or “sdpa”
)

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer = processor.tokenizer

Also check if transformers, accelerate and peft have the recommended versions.

You can also try to use unsloth to fine-tune it: Qwen3-VL: How to Run & Fine-tune | Unsloth Documentation

2 Likes

Thank you for your response.

As you said, I replaced the Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 model with Qwen/Qwen3-VL-30B-A3B-Instruct.

But as you can see below, the progress remains at 0% even after 30 minutes.

$ python train_lora.py
--- Démarrage de la Préparation ---

[Étape 1/4] 📦 Chargement du tokenizer et du modèle de base...
    -> Le modèle Qwen/Qwen3-VL-30B-A3B-Instruct va être chargé.
    -> Si le modèle n'est pas en cache, le téléchargement va commencer ici.
    -> Attendez que les logs de chargement des 'checkpoint shards' apparaissent.
/usr/local/lib/python3.12/dist-packages/transformers/models/auto/modeling_auto.py:2284: FutureWarning: The class `AutoModelForVision2Seq` is deprecated and will be removed in v5.0. Please use `AutoModelForImageTextToText` instead.
  warnings.warn(
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [04:21<00:00, 20.10s/it]
    ✅ Modèle de base chargé en VRAM et initialisé.

[Étape 2/4] 🧩 Application des adaptateurs LoRA...
    ✅ Adaptateurs LoRA appliqués. Seuls les poids suivants seront entraînés :
trainable params: 8,650,752 || all params: 31,079,404,784 || trainable%: 0.0278

[Étape 3/4] 🖼️ Chargement, séparation et prétraitement du Dataset...
Dataset d'entraînement: 94 exemples
Dataset de validation: 11 exemples
    -> Démarrage du prétraitement (tokenisation et chargement des images).
Map (num_proc=1):   0%|                                                                                                                                                       | 0/94 [00:00<?, ? examples/s]

I ran the nvidia-smi command in the container but I don’t see any activity at the VRAM level.

$ nvidia-smi
Fri Nov 14 01:14:02 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   38C    P0             10W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            1504      C   python                                59522MiB |
+-----------------------------------------------------------------------------------------+

My script python:

import os
import torch
from transformers import (
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    logging as hf_logging,
    AutoModel
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset, Features, Value, Image as DsImage
from PIL import Image
from transformers.image_utils import is_torch_tensor
from transformers import AutoModelForVision2Seq, Qwen3VLMoeForConditionalGeneration, AutoProcessor

# --- CONFIGURATION ---

# Chemins et Identifiants
MODEL_ID = "Qwen/Qwen3-VL-30B-A3B-Instruct"
DATASET_PATH = "../data/training_dataset.jsonl"  # Assurez-vous que ce chemin est correct
OUTPUT_DIR = "./qwen_lora_invoice_finetuned"

# Paramètres LoRA
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05
# Modules ciblés : Qwen utilise les modules standard d'attention.
TARGET_MODULES = ["q_proj", "v_proj", "k_proj"]

# Paramètres d'Entraînement
BATCH_SIZE = 1                    # I also tried with 2
GRADIENT_ACCUMULATION_STEPS = 4
LEARNING_RATE = 2e-4
NUM_TRAIN_EPOCHS = 3
SAVE_STRATEGY = "epoch"
LOGGING_STEPS = 10

def preprocess_function(examples, tokenizer, processor):
    """
    Prétraite un batch d'exemples : charge l'image, tokenise le texte, et combine les entrées.
    """
    base_dir = os.path.dirname(DATASET_PATH)
    if isinstance(examples['image_path'][0], str):
        images = [Image.open(os.path.join(base_dir, path)).convert("RGB")
                  for path in examples['image_path']]
    else:
        images = examples['image_path']
    pixel_values = processor(images=images, return_tensors="pt")['pixel_values']
    text_inputs = tokenizer(
        examples['text'],
        padding="longest",
        truncation=True,
        max_length=2048,
        return_tensors="pt"
    )
    labels = text_inputs.input_ids.clone()
    return {
        'input_ids': text_inputs.input_ids,
        'attention_mask': text_inputs.attention_mask,
        'pixel_values': pixel_values,
        'labels': labels
    }

def custom_data_collator(features):
    """
    Collation des données. Essentiel pour gérer les tenseurs de vision et de langage ensemble.
    """
    input_ids = [f["input_ids"] for f in features]
    attention_mask = [f["attention_mask"] for f in features]
    labels = [f["labels"] for f in features]
    pixel_values = torch.stack([f["pixel_values"] for f in features])
    padded_input_ids = torch.nn.utils.rnn.pad_sequence(
        input_ids, batch_first=True, padding_value=tokenizer.pad_token_id
    )
    padded_attention_mask = torch.nn.utils.rnn.pad_sequence(
        attention_mask, batch_first=True, padding_value=0
    )
    padded_labels = torch.nn.utils.rnn.pad_sequence(
        labels, batch_first=True, padding_value=-100 # -100 est l'index ignoré par le loss
    )
    return {
        'input_ids': padded_input_ids.to(torch.bfloat16) if torch.cuda.is_available() else padded_input_ids,
        'attention_mask': padded_attention_mask,
        'pixel_values': pixel_values,
        'labels': padded_labels,
    }

# --- MAIN EXECUTION ---

if __name__ == "__main__":
    hf_logging.set_verbosity_warning()

    print("--- Démarrage de la Préparation ---")

    # 1. Configuration QLoRA et chargement du modèle
    print("\n[Étape 1/4] 📦 Chargement du tokenizer et du modèle de base...")

    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    tokenizer = processor.tokenizer

    print(f"    -> Le modèle {MODEL_ID} va être chargé.")
    print("    -> Si le modèle n'est pas en cache, le téléchargement va commencer ici.")
    print("    -> Attendez que les logs de chargement des 'checkpoint shards' apparaissent.")

    # Charger le modèle
    model = AutoModelForVision2Seq.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
        attn_implementation="flash_attention_2",  # or “sdpa"
    )
    print("    ✅ Modèle de base chargé en VRAM et initialisé.")

    # 2. Préparation du modèle LoRA
    # Prépare le modèle quantifié pour l'entraînement (ajoute les 'requires_grad')

    lora_config = LoraConfig(
        r=LORA_R,
        lora_alpha=LORA_ALPHA,
        target_modules=TARGET_MODULES,
        lora_dropout=LORA_DROPOUT,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)

    print("\n[Étape 2/4] 🧩 Application des adaptateurs LoRA...")
    print("    ✅ Adaptateurs LoRA appliqués. Seuls les poids suivants seront entraînés :")
    model.print_trainable_parameters()

    # 3. Chargement et Prétraitement du Dataset
    print("\n[Étape 3/4] 🖼️ Chargement, séparation et prétraitement du Dataset...")

    # Définition des features (pour charger l'image_path comme chemin d'image)
    dataset_features = Features({
        'image_path': Value('string'),
        'text': Value('string'),
    })

    # Le split 'train' est obligatoire pour load_dataset
    raw_dataset = load_dataset(
        'json',
        data_files=DATASET_PATH,
        split='train',
        features=dataset_features
    )
    split_datasets = raw_dataset.train_test_split(test_size=0.1, seed=42)
    train_dataset = split_datasets['train']
    eval_dataset = split_datasets['test']
    print(f"Dataset d'entraînement: {len(train_dataset)} exemples")
    print(f"Dataset de validation: {len(eval_dataset)} exemples")

    # ... Prétraitement du Dataset ...
    print("    -> Démarrage du prétraitement (tokenisation et chargement des images).")
    processed_train_dataset = train_dataset.map(
        lambda examples: preprocess_function(examples, tokenizer, processor),
        batched=True,
        remove_columns=train_dataset.column_names,
        num_proc=1,
        load_from_cache_file=False
    )
    processed_eval_dataset = eval_dataset.map(
        lambda examples: preprocess_function(examples, tokenizer, processor),
        batched=True,
        remove_columns=eval_dataset.column_names,
        num_proc=1,
        load_from_cache_file=False
    )
    print("    ✅ Prétraitement terminé. Datasets prêts pour l'entraînement.")

    # 4. Arguments et Lancement de l'Entraînement
    training_args = TrainingArguments(
        output_dir=OUTPUT_DIR,
        num_train_epochs=NUM_TRAIN_EPOCHS,
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        warmup_steps=100,
        learning_rate=LEARNING_RATE,
        logging_steps=LOGGING_STEPS,
        save_strategy=SAVE_STRATEGY,
        fp16=False, # Utiliser BFLOAT16 si possible sur le DGX Spark, sinon laissons le torch_dtype géré par bnb
        bf16=True,  # Activer bfloat16 si supporté par votre hardware NVIDIA
        gradient_checkpointing=True,
        ddp_find_unused_parameters=False, # Optimisation pour DDP
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=processed_train_dataset, # Maintenant, utilise le jeu d'entraînement
        eval_dataset=processed_eval_dataset,   # ⭐️ Ajout du jeu de validation pour le suivi
        data_collator=custom_data_collator,
        tokenizer=tokenizer,
    )

    print("\n[Étape 4/4] 🏃 Démarrage du Fine-tuning LoRA...")
    print(f"    -> Nombre d'époques : {NUM_TRAIN_EPOCHS}")
    print(f"    -> Taille du batch effectif : {BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")
    print("\n*** L'entraînement commence. Suivez les logs du Trainer ci-dessous. ***\n")

    trainer.train()

    final_output_path = os.path.join(OUTPUT_DIR, "final_lora_adapters")
    trainer.model.save_pretrained(final_output_path)
    tokenizer.save_pretrained(final_output_path)
    print(f"\n✅ Fine-tuning terminé. Adaptateurs LoRA sauvegardés dans {final_output_path}")

Any idea?

Thx

I’m not on my spark right now, but could you try:

import os
from pathlib import Path
from dataclasses import dataclass
from typing import List, Dict, Any

import torch
from datasets import load_dataset, Features, Value
from PIL import Image
from transformers import (
    AutoProcessor,
    TrainingArguments,
    Trainer,
    logging as hf_logging,
)
from transformers.models.qwen3_vl_moe import Qwen3VLMoeForConditionalGeneration
from peft import LoraConfig, get_peft_model, TaskType


# --- CONFIGURATION --------------------------------------------------------

MODEL_ID = "Qwen/Qwen3-VL-30B-A3B-Instruct"
DATASET_PATH = "../data/training_dataset.jsonl"
OUTPUT_DIR = "./qwen_lora_invoice_finetuned"

# LoRA config (similar to official sft_30a3b_lora.sh defaults) 
LORA_R = 64
LORA_ALPHA = 128
LORA_DROPOUT = 0.05
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]

# Training hyperparameters (you can tune later)
BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 4
LEARNING_RATE = 1e-5
NUM_TRAIN_EPOCHS = 1.0
SAVE_STRATEGY = "epoch"
LOGGING_STEPS = 10
MAX_SEQ_LEN = 2048
IGNORE_INDEX = -100


# --- PREPROCESSING --------------------------------------------------------


def build_messages(image_placeholder: str, answer_text: str):
    """
    Build a 1-turn conversation in the same style as data_processor.py:
      user: <image> + generic instruction
      assistant: your target answer.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_placeholder},
                {
                    "type": "text",
                    "text": "Please read this invoice and extract all relevant information.",
                },
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": answer_text}],
        },
    ]


def preprocess_function(examples, processor):
    """
    Batch preprocessing:
    - Load images from `image_path`
    - Build chat-style messages (user + assistant)
    - Use processor.apply_chat_template + processor(...) to get
      input_ids, attention_mask, pixel_values, labels.
    """
    base_dir = Path(os.path.dirname(DATASET_PATH))

    # Load images as PIL
    images = [
        Image.open(base_dir / p).convert("RGB")
        for p in examples["image_path"]
    ]

    # For each example, create messages and a fake "<image>" placeholder
    # (the actual image tensor is passed via processor(images=...)).
    batch_messages = []
    for text in examples["text"]:
        # The placeholder value here is arbitrary; the important part is that
        # the content has {"type": "image", "image": ...} so the processor
        # knows there is an image.
        msgs = build_messages("<image>", text)
        batch_messages.append(msgs)

    # Build text inputs using the Qwen chat template
    texts = [
        processor.apply_chat_template(
            msgs,
            tokenize=False,
            add_generation_prompt=False,
        )
        for msgs in batch_messages
    ]

    # Processor builds input_ids, attention_mask, pixel_values
    # We pass a list of lists of images, one per sample.
    model_inputs = processor(
        text=texts,
        images=[[img] for img in images],
        return_tensors="pt",
        padding="longest",
        truncation=True,
        max_length=MAX_SEQ_LEN,
    )

    input_ids = model_inputs["input_ids"]
    attention_mask = model_inputs["attention_mask"]
    pixel_values = model_inputs["pixel_values"]

    # Labels: copy input_ids, ignore padding
    labels = input_ids.clone()
    if processor.tokenizer.pad_token_id is not None:
        labels[labels == processor.tokenizer.pad_token_id] = IGNORE_INDEX

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "pixel_values": pixel_values,
        "labels": labels,
    }


# --- DATA COLLATOR --------------------------------------------------------


@dataclass
class QwenVLDataCollator:
    pad_token_id: int

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        """
        Collate already-tensorized features into a batch.
        DO NOT cast input_ids/labels to bf16 – they must stay as long.
        """
        input_ids = torch.stack([f["input_ids"] for f in features])
        attention_mask = torch.stack([f["attention_mask"] for f in features])
        pixel_values = torch.stack([f["pixel_values"] for f in features])
        labels = torch.stack([f["labels"] for f in features])

        input_ids = input_ids.long()
        labels = labels.long()

        batch = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "pixel_values": pixel_values,
            "labels": labels,
        }
        return batch


# --- MAIN -----------------------------------------------------------------


if __name__ == "__main__":
    hf_logging.set_verbosity_warning()
    torch.backends.cuda.matmul.allow_tf32 = True

    print("--- Démarrage de la Préparation ---")

    # 1. Processor & tokenizer
    print("\n[Étape 1/4] 📦 Chargement du processor et du modèle de base...")
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    tokenizer = processor.tokenizer

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id

    print(f"    -> Le modèle {MODEL_ID} va être chargé (MoE).")

    # 2. Load Qwen3-VL MoE model (as in train_qwen.py) 
    model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
        MODEL_ID,
        attn_implementation="flash_attention_2",
        torch_dtype=torch.bfloat16,
    )
    model.config.use_cache = False
    model.to("cuda")

    # 3. Apply LoRA in the same style as official repo
    print("\n[Étape 2/4] 🧩 Application des adaptateurs LoRA...")

    # Freeze all base params first
    for p in model.parameters():
        p.requires_grad = False

    lora_config = LoraConfig(
        r=LORA_R,
        lora_alpha=LORA_ALPHA,
        lora_dropout=LORA_DROPOUT,
        target_modules=TARGET_MODULES,
        bias="none",
        task_type=TaskType.CAUSAL_LM,
    )

    # Required for gradient checkpointing with LoRA
    if hasattr(model, "enable_input_require_grads"):
        model.enable_input_require_grads()

    model = get_peft_model(model, lora_config)
    print("    ✅ LoRA appliqué. Paramètres entraînables :")
    model.print_trainable_parameters()

    # 4. Load and preprocess dataset
    print("\n[Étape 3/4] 🖼️ Chargement, split et prétraitement du Dataset...")

    dataset_features = Features({
        "image_path": Value("string"),
        "text": Value("string"),
    })

    raw_dataset = load_dataset(
        "json",
        data_files=DATASET_PATH,
        split="train",
        features=dataset_features,
    )

    split_datasets = raw_dataset.train_test_split(test_size=0.1, seed=42)
    train_dataset = split_datasets["train"]
    eval_dataset = split_datasets["test"]

    print(f"Dataset d'entraînement: {len(train_dataset)} exemples")
    print(f"Dataset de validation: {len(eval_dataset)} exemples")
    print("    -> Démarrage du prétraitement (chat template + images).")

    processed_train_dataset = train_dataset.map(
        lambda batch: preprocess_function(batch, processor),
        batched=True,
        remove_columns=train_dataset.column_names,
        num_proc=1,
        load_from_cache_file=False,
    )

    processed_eval_dataset = eval_dataset.map(
        lambda batch: preprocess_function(batch, processor),
        batched=True,
        remove_columns=eval_dataset.column_names,
        num_proc=1,
        load_from_cache_file=False,
    )

    processed_train_dataset.set_format(type="torch")
    processed_eval_dataset.set_format(type="torch")

    print("    ✅ Prétraitement terminé. Datasets prêts pour l'entraînement.")

    # 5. TrainingArguments + Trainer
    training_args = TrainingArguments(
        output_dir=OUTPUT_DIR,
        num_train_epochs=NUM_TRAIN_EPOCHS,
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        warmup_steps=100,
        learning_rate=LEARNING_RATE,
        logging_steps=LOGGING_STEPS,
        save_strategy=SAVE_STRATEGY,
        bf16=True,
        fp16=False,
        gradient_checkpointing=True,
        ddp_find_unused_parameters=False,
        remove_unused_columns=False,
        dataloader_num_workers=4,
        report_to="none",
    )

    data_collator = QwenVLDataCollator(pad_token_id=tokenizer.pad_token_id)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=processed_train_dataset,
        eval_dataset=processed_eval_dataset,
        data_collator=data_collator,
        tokenizer=tokenizer,
    )

    print("\n[Étape 4/4] 🏃 Démarrage du Fine-tuning LoRA...")
    print(f"    -> Nombre d'époques : {NUM_TRAIN_EPOCHS}")
    print(f"    -> Taille du batch effectif : {BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")
    print("\n*** L'entraînement commence. Surveillez maintenant l'utilisation GPU dans nvidia-smi. ***\n")

    trainer.train()

    final_output_path = os.path.join(OUTPUT_DIR, "final_lora_adapters")
    trainer.model.save_pretrained(final_output_path)
    tokenizer.save_pretrained(final_output_path)
    processor.save_pretrained(final_output_path)

    print(f"\n✅ Fine-tuning terminé. Adaptateurs LoRA sauvegardés dans {final_output_path}")

check also: Qwen3-VL/qwen-vl-finetune at main · QwenLM/Qwen3-VL · GitHub

I chose the qwen2.5-VL-7B-Instruct model because qwen3-VL-30B-A3B-Instruct is too large. I was able to create a LoRA. Then I wanted to use vllm, but when I start the server, I get the following warning:

02:03:55 [models.py:477] Regarding multimodal models, vLLM currently only supports adding LoRA to the language model; visual.patch_embed.proj will be ignored.

This warning indicates that vllm ignores the visual part of the LoRa model.

So I looked into using another inference server like huggingface TGI or TensorRT LLM.

But there isn’t a compatible version of TGI for ARM architecture.

And for TensorRT LLM, you have to convert the file size from SafeTensors format to a format compatible with TensorRT LLM.

Does anyone know how to convert weights?

You can try to merge the LoRA into a full Qwen2.5-VL checkpoint and serve that one instead.

On a HF/PEFT stack, merge the LoRA into the base model weights to produce a new, standalone Qwen2.5-VL checkpoint (no adapter at runtime). Then you can serve that merged checkpoint with vLLM, but without --lora flags, just as a regular model. The vLLM warning goes away, because there’s no LoRA to apply; the visual weights are already baked into the base.

The PEFT script would be something like this:

import torch

from transformers import AutoModelForVision2Seq, AutoProcessor

from peft import PeftModel



base_id = “Qwen/Qwen2.5-VL-7B-Instruct”

lora_path = “/path/to/your/qwen25vl_lora”  # local or HF repo



#1. Load base model



base_model = AutoModelForVision2Seq.from_pretrained(

base_id,

torch_dtype=torch.bfloat16,

device_map=“auto”,

)



#2. Attach LoRA



lora_model = PeftModel.from_pretrained(base_model, lora_path)



#3. Merge LoRA into base weights



merged_model = lora_model.merge_and_unload()  # ← key step



#4. Save as a new full checkpoint



save_dir = “/models/qwen25vl-7b-instruct-myft”

merged_model.save_pretrained(save_dir)



#Processor is unchanged – just re-save it with the model



processor = AutoProcessor.from_pretrained(base_id)

processor.save_pretrained(save_dir)


Then you run vLLM with:

vllm serve \

  --model /models/qwen2.5-VL-7B-instruct-myft \

  --dtype bfloat16 \

  --max-model-len 8192
1 Like

it works.

Thx a lot

1 Like

Glad to help!

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.