Llama-3.1-70b-instruct

For Llama 3.1, how do I convert the model_weights folder to a .nemo format?
I see utilities such as megatron_lm_ckpt_to_nemo, but I have been unsuccessful with them because they require a checkpoint file.

Hi @msethi2. .nemo files are basically just tar files that contain the model_weights folder and the model_config.yaml file. So if you download the NeMo weights (i.e. from Llama 3.1 70B | NVIDIA NGC), you can point to the downloaded directory as if it were a single .nemo file and get the same result.
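If you ever need an actual single .nemo file for a tool that insists on one, you can create it yourself, since a .nemo file is just a tar archive of that directory's contents. A minimal sketch, assuming the layout from the NGC download (model_config.yaml plus model_weights/); the output filename below is only an example:

# Pack the contents of the downloaded directory into one .nemo archive.
# Run from inside the directory so model_config.yaml and model_weights/
# sit at the root of the archive, matching what NeMo expects on restore.
cd llama-3_1-70b-instruct-nemo_v1.0
tar -cf ../llama-3_1-70b-instruct.nemo model_config.yaml model_weights

Depending on the NeMo version the archive may or may not be gzip-compressed; an uncompressed tar is usually accepted and is much faster to create for a 70B checkpoint.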


Thank you for clarifying.

Hello Neal,

When fine-tuning Llama 3.1 70B Instruct with SCHEME="qlora" on NeMo 24.07 and 4 A10 GPUs (PP_size is 4), I got the ValueError below.

Why is there a discrepancy in bytes for the model weights? Any clues? Is the checkpoint corrupted?

I used the CLI to download nvidia/nemo/llama-3_1-70b-instruct-nemo:1.0 and loaded the model as below. (Note there is no .nemo checkpoint as there is in llama-3_1-8b-instruct-nemo:1.0, so I pointed it at the downloaded directory as you suggested.)

MODEL="./llama-3_1-70b-instruct-nemo_v1.0/"
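For context, the launch command has roughly this shape (a sketch only: the data paths and most override values below are placeholders following the standard megatron_gpt_finetuning.py hydra options, not my exact settings):

torchrun --nproc_per_node=4 \
    /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    trainer.devices=4 \
    trainer.num_nodes=1 \
    model.restore_from_path=${MODEL} \
    model.peft.peft_scheme=qlora \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=4 \
    model.micro_batch_size=1 \
    model.global_batch_size=8 \
    model.data.train_ds.file_names=[/path/to/train.jsonl] \
    model.data.validation_ds.file_names=[/path/to/val.jsonl]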


ValueError: FAILED_PRECONDITION: Error reading local file "llama-3_1-70b-instruct-nemo_v1.0/model_weights/model.decoder.layers.mlp.linear_fc1.weight/23.1.0": Uncompressed chunk is 425654916 bytes, but should be 469762048 bytes [source locations='tensorstore/internal/cache/kvs_backed_chunk_cache.cc:52\ntensorstore/kvstore/kvstore.cc:268']

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[2024-12-02 15:53:04,115] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3047 closing signal SIGTERM
[2024-12-02 15:53:04,116] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3048 closing signal SIGTERM
[2024-12-02 15:53:08,507] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3045) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.3.0a0+40ec155e58.nv24.3', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 834, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 825, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 137, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 271, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py FAILED

Failures:
[1]:
time : 2024-12-02_15:53:04
host : jupyter-n26130841
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3046)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-12-02_15:53:04
host : jupyter-n26130841
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3045)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
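One thing I plan to check is whether the download itself is truncated, since the error says one chunk is smaller than expected. A quick sanity check (a sketch, using the tensor path taken straight from the error message):

# List the zarr chunk files for the tensor named in the error, sorted by size.
# A chunk noticeably smaller than its siblings points to a truncated download;
# re-downloading that file (or the whole model_weights folder) should fix it.
ls -l llama-3_1-70b-instruct-nemo_v1.0/model_weights/model.decoder.layers.mlp.linear_fc1.weight/ | sort -n -k5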

Changing PP_size to 2 made the ValueError disappear, but the ChildFailedError persists.

[2024-12-02 16:09:47,734] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3821 closing signal SIGTERM
[2024-12-02 16:09:47,735] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3822 closing signal SIGTERM
[2024-12-02 16:09:47,738] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3823 closing signal SIGTERM
[2024-12-02 16:09:51,364] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 3820) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.3.0a0+40ec155e58.nv24.3', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 834, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 825, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 137, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 271, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-12-02_16:09:47
host : jupyter-n26130841
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 3820)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 3820
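
Exit code -9 means rank 0 was killed with SIGKILL, so my next step is to check whether the host ran out of RAM while loading and quantizing the 70B checkpoint (on this Jupyter host that would show up as the kernel OOM killer). A quick check, assuming I can read the kernel log:

# Look for OOM-killer activity around the time of the failure
dmesg -T | grep -i -E 'out of memory|killed process' | tail -n 20
# Watch host memory headroom while the checkpoint is being loaded
free -h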