Hello.
I have written code that is supposed to fine-tune an LLM to write reports. I gave it about 18 reports to train on, nothing major. However, every time I run the code, no matter which LLM I use, NVIDIA AI Workbench freezes and then the entire Spark crashes and restarts. Is it even possible to fine-tune an LLM on DGX Spark? My code is below. Please tell me what I am doing wrong.
import json
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model
from dataset_llama32_vision import LlamaVisionDataset

MODEL_NAME = "Meta-LLaMA-3-2-11B-Vision"

# Load model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    load_in_4bit=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
processor = AutoProcessor.from_pretrained(MODEL_NAME)

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# Dataset
train_dataset = LlamaVisionDataset("dataset.jsonl", tokenizer, processor)

# Training arguments
args = TrainingArguments(
    output_dir="lora_output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=False,
    bf16=True,
    logging_steps=10,
    save_steps=500,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
)
trainer.train()
model.save_pretrained("lora_output/final_lora")
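For what it's worth, here is my rough back-of-envelope memory math. The layer count and hidden size below are my guesses, not measured values, but it made me think an 11B model with LoRA should fit comfortably in the Spark's 128 GB of unified memory (as I understand the spec):

```python
# Rough memory estimate for the setup above (all numbers are my assumptions).
params = 11e9                      # ~11B parameters for the vision model
weights_gb = params * 2 / 1e9      # bf16 stores 2 bytes per parameter -> ~22 GB

# With LoRA (r=16 on q_proj and v_proj) only the adapters train.
# Guessing ~40 attention blocks and hidden size 4096:
lora_params = 40 * 2 * (4096 * 16 + 16 * 4096)
optimizer_gb = lora_params * 8 / 1e9   # Adam keeps ~8 bytes per trainable param

print(f"weights ~{weights_gb:.0f} GB, LoRA optimizer state ~{optimizer_gb:.2f} GB")
```

Since 4-bit quantization should shrink the weights even further below that, I don't understand why the whole machine hard-resets instead of just raising an out-of-memory error.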