For Llama 3.1, how do I convert the model_weights directory to a .nemo format?
I see utilities such as megatron_lm_ckpt_to_nemo, but I have been unsuccessful using them since they require a checkpoint file.
Hi @msethi2 – .nemo files are basically just tar files that contain the model_weights folder and the model_config.yaml file. So if you download the NeMo weights (e.g. from Llama 3.1 70B | NVIDIA NGC), you can just point to the downloaded directory as if it were a single .nemo file and get the same result.
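If a tool insists on a single .nemo file, here is a minimal sketch of packing the downloaded directory yourself, based on the tar layout described above (the archive name is illustrative):

```bash
# Assumption: the NGC download directory (name from this thread) already has the
# layout a .nemo archive expects: model_config.yaml plus the model_weights/ folder.
ls llama-3_1-70b-instruct-nemo_v1.0
# -> model_config.yaml  model_weights/

# Pack the directory *contents* (not the directory itself) into a tar archive
# named with a .nemo extension:
tar -cf llama-3_1-70b-instruct.nemo -C llama-3_1-70b-instruct-nemo_v1.0 .

# Sanity check: model_config.yaml and model_weights/ should sit at the archive root.
tar -tf llama-3_1-70b-instruct.nemo | head
```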
Thank you for clarifying.
Hello Neal,
When fine-tuning Llama 3.1 70B Instruct with SCHEME="qlora" on NeMo 24.07 and 4 A10 GPUs (PP_size is 4), I got the ValueError below.
Why is there a discrepancy in bytes for the model weights? Any clues? Is the checkpoint corrupted?
I used the CLI to download nvidia/nemo/llama-3_1-70b-instruct-nemo:1.0
and loaded the model as below (note that there is no .nemo checkpoint as in llama-3_1-8b-instruct-nemo:1.0, so I pointed it to the downloaded directory as you suggested):
MODEL="./llama-3_1-70b-instruct-nemo_v1.0/"
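For reference, a hedged sketch of how a run like this is typically launched in the 24.07 container; the Hydra override names are my assumption based on the stock megatron_gpt_finetuning config, not taken from this post, so check them against your local config:

```bash
MODEL="./llama-3_1-70b-instruct-nemo_v1.0/"   # extracted directory used in place of a single .nemo file

# Override names below (restore_from_path, peft_scheme, *_model_parallel_size) are
# assumptions from the standard finetuning config, not the exact command used here.
torchrun --nproc_per_node=4 \
  /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    trainer.devices=4 \
    model.restore_from_path="${MODEL}" \
    model.peft.peft_scheme="qlora" \
    model.tensor_model_parallel_size=1 \
    model.pipeline_model_parallel_size=4
```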
ValueError: FAILED_PRECONDITION: Error reading local file "llama-3_1-70b-instruct-nemo_v1.0/model_weights/model.decoder.layers.mlp.linear_fc1.weight/23.1.0": Uncompressed chunk is 425654916 bytes, but should be 469762048 bytes [source locations='tensorstore/internal/cache/kvs_backed_chunk_cache.cc:52\ntensorstore/kvstore/kvstore.cc:268']
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[2024-12-02 15:53:04,115] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3047 closing signal SIGTERM
[2024-12-02 15:53:04,116] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3048 closing signal SIGTERM
[2024-12-02 15:53:08,507] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 3045) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.3.0a0+40ec155e58.nv24.3', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 834, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 825, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 137, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 271, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py FAILED
Failures:
[1]:
time : 2024-12-02_15:53:04
host : jupyter-n26130841
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3046)
error_file: <N/A>
traceback : To enable traceback see: Error Propagation — PyTorch 2.5 documentation
Root Cause (first observed failure):
[0]:
time : 2024-12-02_15:53:04
host : jupyter-n26130841
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 3045)
error_file: <N/A>
traceback : To enable traceback see: Error Propagation — PyTorch 2.5 documentation
I changed PP_size to 2 and the ValueError disappeared, but the ChildFailedError persists.
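Since the ValueError reports a tensorstore chunk decoding to fewer bytes than its metadata expects, a hedged way to check whether the download is simply truncated or corrupted is sketched below (the size threshold is arbitrary):

```bash
# Look at the exact chunk file the error names; a missing or clearly undersized
# file points to an incomplete/corrupted download rather than a code problem.
ls -l llama-3_1-70b-instruct-nemo_v1.0/model_weights/model.decoder.layers.mlp.linear_fc1.weight/23.1.0

# Heuristic: list suspiciously small files under model_weights/ (threshold is arbitrary).
find llama-3_1-70b-instruct-nemo_v1.0/model_weights -type f -size -1k -print

# If anything looks truncated, re-download the checkpoint and retry before
# changing parallelism settings.
```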