How to Transfer a LoRA Model from NeMo to NIM After Fine-Tuning with Megatron's Script?

Hi everyone,

I’ve been working with NVIDIA NeMo to fine-tune a LLaMA 3.1 8B model using the megatron_finetune.py script. The training went well, and I successfully saved the LoRA weights (.nemo file) in the results/ directory. Now, I want to use this fine-tuned LoRA model with NVIDIA NIM, but I’m not sure where exactly to place the .nemo file or how to register it with the NIM server.

Here’s a summary of what I did:

  1. Fine-Tuning Process:
  • Used the megatron_finetune.py script in NeMo to fine-tune the LLaMA 3.1 8B model.
  • The resulting LoRA weights were saved as megatron_gpt_peft_lora_tuning.nemo in the directory /workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/.
  2. Moving the Model to NIM:
  • I copied the .nemo file to the NIM container using the command:
docker cp /workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning.nemo nim:/workspace/loras/
  • I verified the file was present in the NIM container:
docker exec -it nim ls /workspace/loras
  • The output shows that the file is indeed there.
  3. Setting Up NIM:
  • My Docker Compose configuration includes:
services:
  nim:
    image: nim_custom:v1
    ports:
      - "8000:8000"
    environment:
      - NIM_PEFT_SOURCE=/workspace/loras
      - NIM_PEFT_REFRESH_INTERVAL=3600
    volumes:
      - /path/to/loras:/workspace/loras
    networks:
      - verb-network
  4. Problem: When I try to use the LoRA model in NIM by making a request, I get a 404 error saying that the model llama3.1-8b-law-titlegen does not exist:
import requests

url = 'http://0.0.0.0:8000/v1/completions'
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json'
}
data = {
    "model": "llama3.1-8b-law-titlegen",
    "prompt": "Generate a concise, engaging title for the following legal question...",
    "max_tokens": 50
}
response = requests.post(url, headers=headers, json=data)
print(response.json())

The response I get is:

{
  "object": "error",
  "message": "The model `llama3.1-8b-law-titlegen` does not exist.",
  "type": "NotFoundError",
  "param": null,
  "code": 404
}

I suspect that I may need to configure or register the .nemo file in NIM differently. Does anyone know the correct location and method to register the fine-tuned LoRA weights in NIM so that it can recognize and serve the model?

Any help or suggestions would be greatly appreciated!

How should I properly register the fine-tuned .nemo file so that NIM can serve the model without errors? Or, more to the point, in which directory within the NIM container should I place the .nemo file produced by fine-tuning so that it is properly loaded and recognized?

Hi @marcelosousa – we have an example file structure here: Parameter-Efficient Fine-Tuning - NVIDIA Docs.

Basically, you need to place the adapter in a subdirectory of your NIM_PEFT_SOURCE, where the name of that subdirectory is the model name that you want to use to send requests to it. So in your example, it would look something like

/workspace/loras
└── llama3.1-8b-law-titlegen
    └── megatron_gpt_peft_lora_tuning.nemo
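
For example, assuming the container is still named nim and reusing the paths from your post, you could create that subdirectory and copy the adapter into it roughly like this (just a sketch; you could equally create the directory under the host-side mount, /path/to/loras in your compose file):

docker exec nim mkdir -p /workspace/loras/llama3.1-8b-law-titlegen
docker cp /workspace/results/Meta-llama3.1-8B-Instruct-titlegen/checkpoints/megatron_gpt_peft_lora_tuning.nemo nim:/workspace/loras/llama3.1-8b-law-titlegen/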

Additionally, NIM scans the contents of that directory when it starts up, and then again every NIM_PEFT_REFRESH_INTERVAL seconds. So if you copy the adapter weights into an already running container, you’ll have to wait up to NIM_PEFT_REFRESH_INTERVAL seconds before NIM recognizes that it’s there.
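
If you don’t want to guess when that refresh has happened, one rough way is to poll the models endpoint (described below) from the host until the adapter name shows up; a sketch, assuming curl and grep are available:

while ! curl -s http://0.0.0.0:8000/v1/models | grep -q "llama3.1-8b-law-titlegen"; do
    echo "adapter not listed yet, waiting..."
    sleep 30
done
echo "adapter is now listed"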

You can check which adapters are recognized by sending a request to 0.0.0.0:8000/v1/models – that should list the base model alongside any adapters that have been picked up.
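
For example, from the host (assuming python3 is available for pretty-printing; plain curl works too):

curl -s http://0.0.0.0:8000/v1/models | python3 -m json.tool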

Let me know if that helps!

I followed your instructions regarding the directory, but I received the following error:

root@ncc17362:/mnt/c/Users/marcelosousa/documents/NlpGeresim# 
curl -X 'POST' 'http://0.0.0.0:8000/v1/chat/completions' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{
    "model": "llama3.1-8b-law-titlegen",
    "messages": [{"role":"user", "content":"Write a limerick about the wonders of GPU computing."}],
    "max_tokens": 64
}'
{"object":"error","message":"The model `llama3.1-8b-law-titlegen` does not exist.","type":"NotFoundError","param":null,"code":404}

Can you share the contents of your NIM_PEFT_SOURCE directory? Also, can you share the output of the curl http://0.0.0.0:8000/v1/models command?
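
For reference, something along these lines (assuming the container is still named nim) should capture both:

docker exec nim ls -lR /workspace/loras
curl -s http://0.0.0.0:8000/v1/models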
