Llama-3.1-70b-instruct

For Llama 3.1, how do I convert the model_weights folder to a .nemo format?
I see utilities such as megatron_lm_ckpt_to_nemo, but I have been unsuccessful with them because they require a checkpoint file.

Hi @msethi2. .nemo files are basically just tar files that contain the model_weights folder and the model_config.yaml file. So if you download the NeMo weights (i.e. from Llama 3.1 70B | NVIDIA NGC), you can point to the downloaded directory as if it were a single .nemo file and get the same result.
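If you ever need an actual single .nemo file for a tool that insists on one, you can create it yourself, since a .nemo file is just a tar archive of that directory's contents. A minimal sketch, assuming the layout from the NGC download (model_config.yaml plus model_weights/); the output filename below is only an example:

# Pack the contents of the downloaded directory into one .nemo archive.
# Run from inside the directory so model_config.yaml and model_weights/
# sit at the root of the archive, matching what NeMo expects on restore.
cd llama-3_1-70b-instruct-nemo_v1.0
tar -cf ../llama-3_1-70b-instruct.nemo model_config.yaml model_weights

Depending on the NeMo version the archive may or may not be gzip-compressed; an uncompressed tar is usually accepted and is much faster to create for a 70B checkpoint.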


Thank you for clarifying.

Hello Neal,

When fine-tuning Llama 3.1 70B Instruct with SCHEME="qlora" on NeMo 24.07 and 4 A10 GPUs (PP_size is 4), I got the ValueError below.

Why is there a discrepancy in bytes for the model weights? Any clues? Is the checkpoint corrupted?

I used the CLI to download nvidia/nemo/llama-3_1-70b-instruct-nemo:1.0 and loaded the model as below. (Note there is no .nemo checkpoint as there is in llama-3_1-8b-instruct-nemo:1.0, so I pointed it at the downloaded directory as you suggested.)

MODEL="./llama-3_1-70b-instruct-nemo_v1.0/"
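For context, the launch command has roughly this shape (a sketch only: the data paths and most override values below are placeholders following the standard megatron_gpt_finetuning.py hydra options, not my exact settings):

torchrun --nproc_per_node=4 \
    /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    trainer.devices=4 \
    trainer.num_nodes=1 \
    model.restore_from_path=${MODEL} \
    model.peft.peft_scheme=qlora \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=4 \
    model.micro_batch_size=1 \
    model.global_batch_size=8 \
    model.data.train_ds.file_names=[/path/to/train.jsonl] \
    model.data.validation_ds.file_names=[/path/to/val.jsonl]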


ValueError: FAILED_PRECONDITION: Error reading local file "llama-3_1-70b-instruct-nemo_v1.0/model_weights/model.decoder.layers.mlp.linear_fc1.weight/23.1.0": Uncompressed chunk is 425654916 bytes, but should be 469762048 bytes [source locations='tensorstore/internal/cache/kvs_backed_chunk_cache.cc:52\ntensorstore/kvstore/kvstore.cc:268']

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[2024-12-02 15:53:04,115] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3047 closing signal SIGTERM
[2024-12-02 15:53:04,116] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3048 closing signal SIGTERM
[2024-12-02 15:53:08,507] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3045) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.3.0a0+40ec155e58.nv24.3', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 834, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 825, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 137, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 271, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py FAILED

Failures:
[1]:
time : 2024-12-02_15:53:04
host : jupyter-n26130841
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3046)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-12-02_15:53:04
host : jupyter-n26130841
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3045)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
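One thing I plan to check is whether the download itself is truncated, since the error says one chunk is smaller than expected. A quick sanity check (a sketch, using the tensor path taken straight from the error message):

# List the zarr chunk files for the tensor named in the error, sorted by size.
# A chunk noticeably smaller than its siblings points to a truncated download;
# re-downloading that file (or the whole model_weights folder) should fix it.
ls -l llama-3_1-70b-instruct-nemo_v1.0/model_weights/model.decoder.layers.mlp.linear_fc1.weight/ | sort -n -k5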

Changing PP_size to 2 made the ValueError disappear, but the ChildFailedError persists.

[2024-12-02 16:09:47,734] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3821 closing signal SIGTERM
[2024-12-02 16:09:47,735] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3822 closing signal SIGTERM
[2024-12-02 16:09:47,738] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3823 closing signal SIGTERM
[2024-12-02 16:09:51,364] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 3820) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.3.0a0+40ec155e58.nv24.3', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 834, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 825, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 137, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 271, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-12-02_16:09:47
host : jupyter-n26130841
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 3820)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 3820
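
Exit code -9 means rank 0 was killed with SIGKILL, so my next step is to check whether the host ran out of RAM while loading and quantizing the 70B checkpoint (on this Jupyter host that would show up as the kernel OOM killer). A quick check, assuming I can read the kernel log:

# Look for OOM-killer activity around the time of the failure
dmesg -T | grep -i -E 'out of memory|killed process' | tail -n 20
# Watch host memory headroom while the checkpoint is being loaded
free -h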