Example ran out of memory on DGX Spark

The example at PyTorch | NVIDIA NGC ran out of memory, as shown below. The 8B LoRA script finished, but the 70B QLoRA script failed while loading the model. Is that expected?
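
For context, here is a rough back-of-the-envelope estimate built from the shard sizes printed in the 70B download log below. It is only a sketch: the 2 bytes per parameter for the checkpoint, the 0.5 bytes per parameter for 4-bit weights, and the 128 GB of unified memory are assumptions I am making, not values reported by the script.

```python
# Shard sizes in GB as printed by the download progress bars of the 70B run
# (30 safetensors shards), grouped by size.
shard_sizes_gb = [4.58] + [4.66] * 16 + [4.97] * 6 + [5.00] * 6 + [2.10]
assert len(shard_sizes_gb) == 30

checkpoint_gb = sum(shard_sizes_gb)

# Assumption: the checkpoint stores weights in a 16-bit dtype (2 bytes per parameter).
BYTES_PER_PARAM_16BIT = 2
params_billion = checkpoint_gb / BYTES_PER_PARAM_16BIT

# Assumption: 4-bit quantization needs about 0.5 bytes per parameter; this ignores
# quantization constants, activations, and any layers kept in higher precision.
BYTES_PER_PARAM_4BIT = 0.5
quantized_gb = params_billion * BYTES_PER_PARAM_4BIT

# Assumption: 128 GB of unified memory shared by CPU and GPU; adjust for your system.
SYSTEM_MEMORY_GB = 128

print(f"16-bit checkpoint:      {checkpoint_gb:.1f} GB")
print(f"estimated parameters:   {params_billion:.1f} B")
print(f"estimated 4-bit weights: {quantized_gb:.1f} GB")
print(f"16-bit weights fit in memory: {checkpoint_gb < SYSTEM_MEMORY_GB}")
print(f"4-bit weights fit in memory:  {quantized_gb < SYSTEM_MEMORY_GB}")
```

Under these assumptions the 4-bit weights should fit comfortably, while the full 16-bit checkpoint would not. If the loader reserves or stages memory for the unquantized weights at any point, that could explain a failure during loading, but the log does not confirm this.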

root@93e69296bb14:/workspace# huggingface-cli login
# <input your Hugging Face token>

⚠️ Warning: ‘huggingface-cli login’ is deprecated. Use ‘hf auth login’ instead.

_|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
_|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
_|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
_|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
_|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

A token is already saved on your machine. Run `hf auth whoami` to get more information or `hf auth logout` if you want to log out.
Setting a new token will erase the existing one.
To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .

Enter your token (input will not be visible):
Add token as git credential? (Y/n)
Token is valid (permission: read).
The token readonly has been saved to /root/.cache/huggingface/stored_tokens
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the ‘store’ credential helper as default.

git config --global credential.helper store

Read Git - Credential Storage for more details.
Token has not been saved to git credential helper.
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: readonly
root@93e69296bb14:/workspace# git clone https://github.com/NVIDIA/dgx-spark-playbooks
cd dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets
fatal: destination path ‘dgx-spark-playbooks’ already exists and is not an empty directory.
root@93e69296bb14:/workspace/dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets# python Llama3_8B_LoRA_finetuning.py
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]

============================================================

LLAMA 3.1 8B LoRA FINE-TUNING CONFIGURATION

Model: meta-llama/Llama-3.1-8B-Instruct

Batch size: 4
Sequence length: 2048
Number of epochs: 1
Learning rate: 0.0001
LoRA rank: 8
Dataset size: 500
Torch compile: False

Loading model: meta-llama/Llama-3.1-8B-Instruct
config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 855/855 [00:00<00:00, 11.7MB/s]
model.safetensors.index.json: 100%|██████████████████████████████████████████████████████████████████████| 23.9k/23.9k [00:00<00:00, 213MB/s]
model-00004-of-00004.safetensors: 100%|█████████████████████████████████████████████████████████████████| 1.17G/1.17G [00:56<00:00, 20.5MB/s]
model-00002-of-00004.safetensors: 100%|█████████████████████████████████████████████████████████████████| 5.00G/5.00G [01:59<00:00, 41.9MB/s]
model-00003-of-00004.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.92G/4.92G [01:59<00:00, 41.1MB/s]
model-00001-of-00004.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.98G/4.98G [01:59<00:00, 41.6MB/s]
Fetching 4 files: 100%|████████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:59<00:00, 29.97s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████| 4/4 [01:23<00:00, 20.95s/it]
generation_config.json: 100%|███████████████████████████████████████████████████████████████████████████████| 184/184 [00:00<00:00, 3.46MB/s]
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████| 55.4k/55.4k [00:00<00:00, 29.0MB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 37.1MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████| 296/296 [00:00<00:00, 5.47MB/s]
Trainable parameters = 8,030,261,248
Loading dataset with 500 samples…
README.md: 7.47kB [00:00, 48.1MB/s]
data/train-00000-of-00001-a09b74b3ef9c3b(…): 100%|██████████████████████████████████████████████████████| 24.2M/24.2M [00:00<00:00, 45.6MB/s]
Generating train split: 100%|███████████████████████████████████████████████████████████████| 52002/52002 [00:00<00:00, 866811.05 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 63848.02 examples/s]

Starting LoRA fine-tuning for 1 epoch(s)…
Adding EOS to train dataset: 100%|███████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 55085.29 examples/s]
Tokenizing train dataset: 100%|███████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 7510.00 examples/s]
Truncating train dataset: 100%|█████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 364215.35 examples/s]
The model is already on multiple devices. Skipping the move to device specified in args.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer’s values. Updated tokens: {‘eos_token_id’: 128009, ‘pad_token_id’: 128009}.
{‘loss’: 2.4297, ‘grad_norm’: 2.4276559352874756, ‘learning_rate’: 0.0001, ‘num_tokens’: 327.0, ‘mean_token_accuracy’: 0.44272446632385254, ‘epoch’: 0.01}
{‘loss’: 2.4489, ‘grad_norm’: 2.21441912651062, ‘learning_rate’: 9.92e-05, ‘num_tokens’: 607.0, ‘mean_token_accuracy’: 0.4420289993286133, ‘epoch’: 0.02}
{‘loss’: 2.2727, ‘grad_norm’: 1.8421859741210938, ‘learning_rate’: 9.84e-05, ‘num_tokens’: 972.0, ‘mean_token_accuracy’: 0.4653739631175995, ‘epoch’: 0.02}
{‘loss’: 1.9471, ‘grad_norm’: 2.0293779373168945, ‘learning_rate’: 9.76e-05, ‘num_tokens’: 1310.0, ‘mean_token_accuracy’: 0.544910192489624, ‘epoch’: 0.03}
{‘loss’: 1.8222, ‘grad_norm’: 1.555105447769165, ‘learning_rate’: 9.680000000000001e-05, ‘num_tokens’: 1736.0, ‘mean_token_accuracy’: 0.5379146933555603, ‘epoch’: 0.04}
{‘loss’: 1.9147, ‘grad_norm’: 1.879024863243103, ‘learning_rate’: 9.6e-05, ‘num_tokens’: 2073.0, ‘mean_token_accuracy’: 0.5255255103111267, ‘epoch’: 0.05}
{‘loss’: 1.5928, ‘grad_norm’: 1.8202698230743408, ‘learning_rate’: 9.52e-05, ‘num_tokens’: 2475.0, ‘mean_token_accuracy’: 0.5678392052650452, ‘epoch’: 0.06}
{‘loss’: 1.3482, ‘grad_norm’: 1.1439828872680664, ‘learning_rate’: 9.44e-05, ‘num_tokens’: 3194.0, ‘mean_token_accuracy’: 0.6321678161621094, ‘epoch’: 0.06}
{‘loss’: 1.1313, ‘grad_norm’: 2.1846466064453125, ‘learning_rate’: 9.360000000000001e-05, ‘num_tokens’: 3596.0, ‘mean_token_accuracy’: 0.7185929417610168, ‘epoch’: 0.07}
{‘loss’: 0.9036, ‘grad_norm’: 1.8318400382995605, ‘learning_rate’: 9.28e-05, ‘num_tokens’: 3896.0, ‘mean_token_accuracy’: 0.7905405163764954, ‘epoch’: 0.08}
{‘loss’: 1.0326, ‘grad_norm’: 1.6926945447921753, ‘learning_rate’: 9.200000000000001e-05, ‘num_tokens’: 4204.0, ‘mean_token_accuracy’: 0.7598684430122375, ‘epoch’: 0.09}
{‘loss’: 0.7486, ‘grad_norm’: 1.8774569034576416, ‘learning_rate’: 9.120000000000001e-05, ‘num_tokens’: 4457.0, ‘mean_token_accuracy’: 0.827309250831604, ‘epoch’: 0.1}
{‘loss’: 1.2772, ‘grad_norm’: 1.2132618427276611, ‘learning_rate’: 9.04e-05, ‘num_tokens’: 4954.0, ‘mean_token_accuracy’: 0.6612576246261597, ‘epoch’: 0.1}
{‘loss’: 0.8355, ‘grad_norm’: 1.5392227172851562, ‘learning_rate’: 8.960000000000001e-05, ‘num_tokens’: 5324.0, ‘mean_token_accuracy’: 0.8005464673042297, ‘epoch’: 0.11}
{‘loss’: 1.2723, ‘grad_norm’: 1.4280879497528076, ‘learning_rate’: 8.88e-05, ‘num_tokens’: 5842.0, ‘mean_token_accuracy’: 0.6692606806755066, ‘epoch’: 0.12}
{‘loss’: 0.8924, ‘grad_norm’: 1.839540719985962, ‘learning_rate’: 8.800000000000001e-05, ‘num_tokens’: 6152.0, ‘mean_token_accuracy’: 0.7908496856689453, ‘epoch’: 0.13}
{‘loss’: 1.0591, ‘grad_norm’: 2.2333710193634033, ‘learning_rate’: 8.72e-05, ‘num_tokens’: 6451.0, ‘mean_token_accuracy’: 0.7728813290596008, ‘epoch’: 0.14}
{‘loss’: 0.9101, ‘grad_norm’: 1.9379322528839111, ‘learning_rate’: 8.64e-05, ‘num_tokens’: 6771.0, ‘mean_token_accuracy’: 0.7753164768218994, ‘epoch’: 0.14}
{‘loss’: 1.0684, ‘grad_norm’: 1.392130732536316, ‘learning_rate’: 8.560000000000001e-05, ‘num_tokens’: 7275.0, ‘mean_token_accuracy’: 0.7179999947547913, ‘epoch’: 0.15}
{‘loss’: 1.2026, ‘grad_norm’: 1.2867733240127563, ‘learning_rate’: 8.48e-05, ‘num_tokens’: 7803.0, ‘mean_token_accuracy’: 0.6908397078514099, ‘epoch’: 0.16}
{‘loss’: 1.1573, ‘grad_norm’: 1.793595790863037, ‘learning_rate’: 8.4e-05, ‘num_tokens’: 8234.0, ‘mean_token_accuracy’: 0.7377049326896667, ‘epoch’: 0.17}
{‘loss’: 1.2218, ‘grad_norm’: 1.4026941061019897, ‘learning_rate’: 8.32e-05, ‘num_tokens’: 8763.0, ‘mean_token_accuracy’: 0.668571412563324, ‘epoch’: 0.18}
{‘loss’: 1.1421, ‘grad_norm’: 1.4774272441864014, ‘learning_rate’: 8.24e-05, ‘num_tokens’: 9222.0, ‘mean_token_accuracy’: 0.7230769395828247, ‘epoch’: 0.18}
{‘loss’: 1.0783, ‘grad_norm’: 1.753013253211975, ‘learning_rate’: 8.16e-05, ‘num_tokens’: 9566.0, ‘mean_token_accuracy’: 0.7676470875740051, ‘epoch’: 0.19}
{‘loss’: 0.9833, ‘grad_norm’: 1.2679294347763062, ‘learning_rate’: 8.080000000000001e-05, ‘num_tokens’: 9971.0, ‘mean_token_accuracy’: 0.7431421279907227, ‘epoch’: 0.2}
{‘loss’: 1.1269, ‘grad_norm’: 1.1676483154296875, ‘learning_rate’: 8e-05, ‘num_tokens’: 10339.0, ‘mean_token_accuracy’: 0.7445054650306702, ‘epoch’: 0.21}
{‘loss’: 1.1686, ‘grad_norm’: 0.8117290735244751, ‘learning_rate’: 7.920000000000001e-05, ‘num_tokens’: 10974.0, ‘mean_token_accuracy’: 0.6988906264305115, ‘epoch’: 0.22}
{‘loss’: 0.6824, ‘grad_norm’: 1.1198982000350952, ‘learning_rate’: 7.840000000000001e-05, ‘num_tokens’: 11290.0, ‘mean_token_accuracy’: 0.817307710647583, ‘epoch’: 0.22}
{‘loss’: 0.8553, ‘grad_norm’: 0.8958135843276978, ‘learning_rate’: 7.76e-05, ‘num_tokens’: 11719.0, ‘mean_token_accuracy’: 0.7694117426872253, ‘epoch’: 0.23}
{‘loss’: 1.0595, ‘grad_norm’: 0.9197903275489807, ‘learning_rate’: 7.680000000000001e-05, ‘num_tokens’: 12160.0, ‘mean_token_accuracy’: 0.7437071204185486, ‘epoch’: 0.24}
{‘loss’: 0.9603, ‘grad_norm’: 0.844883918762207, ‘learning_rate’: 7.6e-05, ‘num_tokens’: 12587.0, ‘mean_token_accuracy’: 0.73758864402771, ‘epoch’: 0.25}
{‘loss’: 1.0903, ‘grad_norm’: 0.77141934633255, ‘learning_rate’: 7.52e-05, ‘num_tokens’: 13115.0, ‘mean_token_accuracy’: 0.6965649127960205, ‘epoch’: 0.26}
{‘loss’: 0.9826, ‘grad_norm’: 0.8744301795959473, ‘learning_rate’: 7.44e-05, ‘num_tokens’: 13500.0, ‘mean_token_accuracy’: 0.7427821755409241, ‘epoch’: 0.26}
{‘loss’: 0.8191, ‘grad_norm’: 0.8501279354095459, ‘learning_rate’: 7.36e-05, ‘num_tokens’: 13943.0, ‘mean_token_accuracy’: 0.7972665429115295, ‘epoch’: 0.27}
{‘loss’: 0.6783, ‘grad_norm’: 0.8382713794708252, ‘learning_rate’: 7.280000000000001e-05, ‘num_tokens’: 14355.0, ‘mean_token_accuracy’: 0.8259803652763367, ‘epoch’: 0.28}
{‘loss’: 0.9703, ‘grad_norm’: 1.1469954252243042, ‘learning_rate’: 7.2e-05, ‘num_tokens’: 14649.0, ‘mean_token_accuracy’: 0.7724137902259827, ‘epoch’: 0.29}
{‘loss’: 0.9933, ‘grad_norm’: 0.8908654451370239, ‘learning_rate’: 7.12e-05, ‘num_tokens’: 15066.0, ‘mean_token_accuracy’: 0.7554479241371155, ‘epoch’: 0.3}
{‘loss’: 0.8967, ‘grad_norm’: 1.1093345880508423, ‘learning_rate’: 7.04e-05, ‘num_tokens’: 15415.0, ‘mean_token_accuracy’: 0.7681159377098083, ‘epoch’: 0.3}
{‘loss’: 1.0946, ‘grad_norm’: 0.8427884578704834, ‘learning_rate’: 6.96e-05, ‘num_tokens’: 15833.0, ‘mean_token_accuracy’: 0.717391312122345, ‘epoch’: 0.31}
{‘loss’: 0.793, ‘grad_norm’: 0.9886052012443542, ‘learning_rate’: 6.879999999999999e-05, ‘num_tokens’: 16155.0, ‘mean_token_accuracy’: 0.7767295837402344, ‘epoch’: 0.32}
{‘loss’: 0.9634, ‘grad_norm’: 1.4802045822143555, ‘learning_rate’: 6.800000000000001e-05, ‘num_tokens’: 16499.0, ‘mean_token_accuracy’: 0.7647058963775635, ‘epoch’: 0.33}
{‘loss’: 1.1896, ‘grad_norm’: 0.8107177019119263, ‘learning_rate’: 6.720000000000001e-05, ‘num_tokens’: 16994.0, ‘mean_token_accuracy’: 0.6883910298347473, ‘epoch’: 0.34}
{‘loss’: 0.824, ‘grad_norm’: 0.8903188109397888, ‘learning_rate’: 6.64e-05, ‘num_tokens’: 17294.0, ‘mean_token_accuracy’: 0.7972972989082336, ‘epoch’: 0.34}
{‘loss’: 1.0551, ‘grad_norm’: 0.8312821388244629, ‘learning_rate’: 6.560000000000001e-05, ‘num_tokens’: 17680.0, ‘mean_token_accuracy’: 0.727748692035675, ‘epoch’: 0.35}
{‘loss’: 1.2114, ‘grad_norm’: 0.796442449092865, ‘learning_rate’: 6.48e-05, ‘num_tokens’: 18069.0, ‘mean_token_accuracy’: 0.6935064792633057, ‘epoch’: 0.36}
{‘loss’: 0.8082, ‘grad_norm’: 0.843403697013855, ‘learning_rate’: 6.400000000000001e-05, ‘num_tokens’: 18487.0, ‘mean_token_accuracy’: 0.8236715197563171, ‘epoch’: 0.37}
{‘loss’: 1.1618, ‘grad_norm’: 0.6775524616241455, ‘learning_rate’: 6.32e-05, ‘num_tokens’: 19020.0, ‘mean_token_accuracy’: 0.6843100190162659, ‘epoch’: 0.38}
{‘loss’: 0.8947, ‘grad_norm’: 0.8043442368507385, ‘learning_rate’: 6.24e-05, ‘num_tokens’: 19444.0, ‘mean_token_accuracy’: 0.7666666507720947, ‘epoch’: 0.38}
{‘loss’: 1.0764, ‘grad_norm’: 0.8690796494483948, ‘learning_rate’: 6.16e-05, ‘num_tokens’: 19818.0, ‘mean_token_accuracy’: 0.708108127117157, ‘epoch’: 0.39}
{‘loss’: 1.0017, ‘grad_norm’: 0.8811030983924866, ‘learning_rate’: 6.08e-05, ‘num_tokens’: 20166.0, ‘mean_token_accuracy’: 0.75, ‘epoch’: 0.4}
{‘loss’: 1.0592, ‘grad_norm’: 0.6704208850860596, ‘learning_rate’: 6e-05, ‘num_tokens’: 20702.0, ‘mean_token_accuracy’: 0.6936089992523193, ‘epoch’: 0.41}
{‘loss’: 1.0387, ‘grad_norm’: 0.8671566247940063, ‘learning_rate’: 5.92e-05, ‘num_tokens’: 21129.0, ‘mean_token_accuracy’: 0.7446808218955994, ‘epoch’: 0.42}
{‘loss’: 1.1665, ‘grad_norm’: 0.5534923076629639, ‘learning_rate’: 5.8399999999999997e-05, ‘num_tokens’: 21892.0, ‘mean_token_accuracy’: 0.6930171251296997, ‘epoch’: 0.42}
{‘loss’: 1.0165, ‘grad_norm’: 0.8867106437683105, ‘learning_rate’: 5.76e-05, ‘num_tokens’: 22289.0, ‘mean_token_accuracy’: 0.732824444770813, ‘epoch’: 0.43}
{‘loss’: 0.6073, ‘grad_norm’: 0.9316121935844421, ‘learning_rate’: 5.68e-05, ‘num_tokens’: 22577.0, ‘mean_token_accuracy’: 0.8239436745643616, ‘epoch’: 0.44}
{‘loss’: 1.0502, ‘grad_norm’: 0.8222288489341736, ‘learning_rate’: 5.6000000000000006e-05, ‘num_tokens’: 23120.0, ‘mean_token_accuracy’: 0.7050092816352844, ‘epoch’: 0.45}
{‘loss’: 0.8943, ‘grad_norm’: 0.7794366478919983, ‘learning_rate’: 5.520000000000001e-05, ‘num_tokens’: 23545.0, ‘mean_token_accuracy’: 0.7434679269790649, ‘epoch’: 0.46}
{‘loss’: 0.9137, ‘grad_norm’: 0.7706211805343628, ‘learning_rate’: 5.440000000000001e-05, ‘num_tokens’: 23965.0, ‘mean_token_accuracy’: 0.7475961446762085, ‘epoch’: 0.46}
{‘loss’: 1.2693, ‘grad_norm’: 0.8112915754318237, ‘learning_rate’: 5.360000000000001e-05, ‘num_tokens’: 24458.0, ‘mean_token_accuracy’: 0.7014315128326416, ‘epoch’: 0.47}
{‘loss’: 0.8237, ‘grad_norm’: 0.8800790905952454, ‘learning_rate’: 5.28e-05, ‘num_tokens’: 24777.0, ‘mean_token_accuracy’: 0.7936508059501648, ‘epoch’: 0.48}
{‘loss’: 1.1128, ‘grad_norm’: 0.7878955006599426, ‘learning_rate’: 5.2000000000000004e-05, ‘num_tokens’: 25227.0, ‘mean_token_accuracy’: 0.7152466177940369, ‘epoch’: 0.49}
{‘loss’: 0.9216, ‘grad_norm’: 0.5940482020378113, ‘learning_rate’: 5.1200000000000004e-05, ‘num_tokens’: 25821.0, ‘mean_token_accuracy’: 0.7576271295547485, ‘epoch’: 0.5}
{‘loss’: 1.0222, ‘grad_norm’: 0.6234289407730103, ‘learning_rate’: 5.0400000000000005e-05, ‘num_tokens’: 26336.0, ‘mean_token_accuracy’: 0.7260273694992065, ‘epoch’: 0.5}
{‘loss’: 0.6002, ‘grad_norm’: 1.3347560167312622, ‘learning_rate’: 4.96e-05, ‘num_tokens’: 26601.0, ‘mean_token_accuracy’: 0.8275862336158752, ‘epoch’: 0.51}
{‘loss’: 0.7721, ‘grad_norm’: 0.7469112277030945, ‘learning_rate’: 4.88e-05, ‘num_tokens’: 26994.0, ‘mean_token_accuracy’: 0.7789202928543091, ‘epoch’: 0.52}
{‘loss’: 0.952, ‘grad_norm’: 0.7399258017539978, ‘learning_rate’: 4.8e-05, ‘num_tokens’: 27418.0, ‘mean_token_accuracy’: 0.738095223903656, ‘epoch’: 0.53}
{‘loss’: 1.0597, ‘grad_norm’: 0.6677412986755371, ‘learning_rate’: 4.72e-05, ‘num_tokens’: 27899.0, ‘mean_token_accuracy’: 0.7316561937332153, ‘epoch’: 0.54}
{‘loss’: 0.8557, ‘grad_norm’: 0.8391092419624329, ‘learning_rate’: 4.64e-05, ‘num_tokens’: 28319.0, ‘mean_token_accuracy’: 0.7884615659713745, ‘epoch’: 0.54}
{‘loss’: 0.8582, ‘grad_norm’: 0.8569543361663818, ‘learning_rate’: 4.5600000000000004e-05, ‘num_tokens’: 28650.0, ‘mean_token_accuracy’: 0.7706422209739685, ‘epoch’: 0.55}
{‘loss’: 0.5632, ‘grad_norm’: 0.9192879796028137, ‘learning_rate’: 4.4800000000000005e-05, ‘num_tokens’: 28902.0, ‘mean_token_accuracy’: 0.850806474685669, ‘epoch’: 0.56}
{‘loss’: 0.8237, ‘grad_norm’: 0.8685383796691895, ‘learning_rate’: 4.4000000000000006e-05, ‘num_tokens’: 29323.0, ‘mean_token_accuracy’: 0.8105515837669373, ‘epoch’: 0.57}
{‘loss’: 0.8744, ‘grad_norm’: 0.8276363611221313, ‘learning_rate’: 4.32e-05, ‘num_tokens’: 29752.0, ‘mean_token_accuracy’: 0.7623529434204102, ‘epoch’: 0.58}
{‘loss’: 1.1644, ‘grad_norm’: 0.7240146994590759, ‘learning_rate’: 4.24e-05, ‘num_tokens’: 30361.0, ‘mean_token_accuracy’: 0.6826446056365967, ‘epoch’: 0.58}
{‘loss’: 1.0378, ‘grad_norm’: 0.687714695930481, ‘learning_rate’: 4.16e-05, ‘num_tokens’: 30784.0, ‘mean_token_accuracy’: 0.7279236316680908, ‘epoch’: 0.59}
{‘loss’: 1.1608, ‘grad_norm’: 0.7010581493377686, ‘learning_rate’: 4.08e-05, ‘num_tokens’: 31333.0, ‘mean_token_accuracy’: 0.6825687885284424, ‘epoch’: 0.6}
{‘loss’: 0.9305, ‘grad_norm’: 0.5929700136184692, ‘learning_rate’: 4e-05, ‘num_tokens’: 31910.0, ‘mean_token_accuracy’: 0.7591623067855835, ‘epoch’: 0.61}
{‘loss’: 0.9421, ‘grad_norm’: 0.7968059778213501, ‘learning_rate’: 3.9200000000000004e-05, ‘num_tokens’: 32315.0, ‘mean_token_accuracy’: 0.7581047415733337, ‘epoch’: 0.62}
{‘loss’: 0.7236, ‘grad_norm’: 0.7512580156326294, ‘learning_rate’: 3.8400000000000005e-05, ‘num_tokens’: 32689.0, ‘mean_token_accuracy’: 0.7972972989082336, ‘epoch’: 0.62}
{‘loss’: 1.2601, ‘grad_norm’: 0.824474573135376, ‘learning_rate’: 3.76e-05, ‘num_tokens’: 33111.0, ‘mean_token_accuracy’: 0.6961722373962402, ‘epoch’: 0.63}
{‘loss’: 0.6462, ‘grad_norm’: 1.0161582231521606, ‘learning_rate’: 3.68e-05, ‘num_tokens’: 33417.0, ‘mean_token_accuracy’: 0.8543046116828918, ‘epoch’: 0.64}
{‘loss’: 1.0452, ‘grad_norm’: 0.7181888818740845, ‘learning_rate’: 3.6e-05, ‘num_tokens’: 33900.0, ‘mean_token_accuracy’: 0.7223381996154785, ‘epoch’: 0.65}
{‘loss’: 0.8724, ‘grad_norm’: 0.777743399143219, ‘learning_rate’: 3.52e-05, ‘num_tokens’: 34284.0, ‘mean_token_accuracy’: 0.7868421077728271, ‘epoch’: 0.66}
{‘loss’: 1.0664, ‘grad_norm’: 0.5561301112174988, ‘learning_rate’: 3.4399999999999996e-05, ‘num_tokens’: 34924.0, ‘mean_token_accuracy’: 0.7122641801834106, ‘epoch’: 0.66}
{‘loss’: 0.868, ‘grad_norm’: 0.741447389125824, ‘learning_rate’: 3.3600000000000004e-05, ‘num_tokens’: 35335.0, ‘mean_token_accuracy’: 0.7665847539901733, ‘epoch’: 0.67}
{‘loss’: 1.0478, ‘grad_norm’: 0.7623888850212097, ‘learning_rate’: 3.2800000000000004e-05, ‘num_tokens’: 35757.0, ‘mean_token_accuracy’: 0.739234447479248, ‘epoch’: 0.68}
{‘loss’: 0.7051, ‘grad_norm’: 0.8632932901382446, ‘learning_rate’: 3.2000000000000005e-05, ‘num_tokens’: 36051.0, ‘mean_token_accuracy’: 0.8034482598304749, ‘epoch’: 0.69}
{‘loss’: 0.866, ‘grad_norm’: 0.785815417766571, ‘learning_rate’: 3.12e-05, ‘num_tokens’: 36408.0, ‘mean_token_accuracy’: 0.776203989982605, ‘epoch’: 0.7}
{‘loss’: 0.5515, ‘grad_norm’: 0.8568090200424194, ‘learning_rate’: 3.04e-05, ‘num_tokens’: 36726.0, ‘mean_token_accuracy’: 0.843949019908905, ‘epoch’: 0.7}
{‘loss’: 0.8848, ‘grad_norm’: 1.025905728340149, ‘learning_rate’: 2.96e-05, ‘num_tokens’: 37016.0, ‘mean_token_accuracy’: 0.7972028255462646, ‘epoch’: 0.71}
{‘loss’: 0.9432, ‘grad_norm’: 1.1072651147842407, ‘learning_rate’: 2.88e-05, ‘num_tokens’: 37363.0, ‘mean_token_accuracy’: 0.7638484239578247, ‘epoch’: 0.72}
{‘loss’: 1.2575, ‘grad_norm’: 0.7587688565254211, ‘learning_rate’: 2.8000000000000003e-05, ‘num_tokens’: 37881.0, ‘mean_token_accuracy’: 0.6789883375167847, ‘epoch’: 0.73}
{‘loss’: 1.1379, ‘grad_norm’: 0.8427833318710327, ‘learning_rate’: 2.7200000000000004e-05, ‘num_tokens’: 38231.0, ‘mean_token_accuracy’: 0.7803468108177185, ‘epoch’: 0.74}
{‘loss’: 1.1752, ‘grad_norm’: 0.6902854442596436, ‘learning_rate’: 2.64e-05, ‘num_tokens’: 38779.0, ‘mean_token_accuracy’: 0.6985294222831726, ‘epoch’: 0.74}
{‘loss’: 0.8447, ‘grad_norm’: 0.9362207651138306, ‘learning_rate’: 2.5600000000000002e-05, ‘num_tokens’: 39109.0, ‘mean_token_accuracy’: 0.8220859169960022, ‘epoch’: 0.75}
{‘loss’: 0.6088, ‘grad_norm’: 0.8514158725738525, ‘learning_rate’: 2.48e-05, ‘num_tokens’: 39449.0, ‘mean_token_accuracy’: 0.8303571343421936, ‘epoch’: 0.76}
{‘loss’: 0.6748, ‘grad_norm’: 0.7468445897102356, ‘learning_rate’: 2.4e-05, ‘num_tokens’: 39833.0, ‘mean_token_accuracy’: 0.821052610874176, ‘epoch’: 0.77}
{‘loss’: 1.0145, ‘grad_norm’: 0.7340139150619507, ‘learning_rate’: 2.32e-05, ‘num_tokens’: 40218.0, ‘mean_token_accuracy’: 0.7427821755409241, ‘epoch’: 0.78}
{‘loss’: 0.7727, ‘grad_norm’: 0.8198543190956116, ‘learning_rate’: 2.2400000000000002e-05, ‘num_tokens’: 40577.0, ‘mean_token_accuracy’: 0.8084506988525391, ‘epoch’: 0.78}
{‘loss’: 0.6121, ‘grad_norm’: 0.9006021618843079, ‘learning_rate’: 2.16e-05, ‘num_tokens’: 40915.0, ‘mean_token_accuracy’: 0.817365288734436, ‘epoch’: 0.79}
{‘loss’: 0.8519, ‘grad_norm’: 1.0014283657073975, ‘learning_rate’: 2.08e-05, ‘num_tokens’: 41262.0, ‘mean_token_accuracy’: 0.795918345451355, ‘epoch’: 0.8}
{‘loss’: 1.5224, ‘grad_norm’: 0.8215137124061584, ‘learning_rate’: 2e-05, ‘num_tokens’: 41755.0, ‘mean_token_accuracy’: 0.6441717743873596, ‘epoch’: 0.81}
{‘loss’: 1.2125, ‘grad_norm’: 0.7653736472129822, ‘learning_rate’: 1.9200000000000003e-05, ‘num_tokens’: 42158.0, ‘mean_token_accuracy’: 0.7067669034004211, ‘epoch’: 0.82}
{‘loss’: 0.9036, ‘grad_norm’: 0.807945966720581, ‘learning_rate’: 1.84e-05, ‘num_tokens’: 42505.0, ‘mean_token_accuracy’: 0.7580174803733826, ‘epoch’: 0.82}
{‘loss’: 0.9499, ‘grad_norm’: 0.8192100524902344, ‘learning_rate’: 1.76e-05, ‘num_tokens’: 42851.0, ‘mean_token_accuracy’: 0.7602339386940002, ‘epoch’: 0.83}
{‘loss’: 1.3851, ‘grad_norm’: 0.8119988441467285, ‘learning_rate’: 1.6800000000000002e-05, ‘num_tokens’: 43284.0, ‘mean_token_accuracy’: 0.687645673751831, ‘epoch’: 0.84}
{‘loss’: 1.3119, ‘grad_norm’: 0.523651123046875, ‘learning_rate’: 1.6000000000000003e-05, ‘num_tokens’: 44154.0, ‘mean_token_accuracy’: 0.6501154899597168, ‘epoch’: 0.85}
{‘loss’: 0.7855, ‘grad_norm’: 0.6848351955413818, ‘learning_rate’: 1.52e-05, ‘num_tokens’: 44562.0, ‘mean_token_accuracy’: 0.7945544719696045, ‘epoch’: 0.86}
{‘loss’: 1.0922, ‘grad_norm’: 0.6767337918281555, ‘learning_rate’: 1.44e-05, ‘num_tokens’: 45124.0, ‘mean_token_accuracy’: 0.7114695310592651, ‘epoch’: 0.86}
{‘loss’: 0.8144, ‘grad_norm’: 0.7469391226768494, ‘learning_rate’: 1.3600000000000002e-05, ‘num_tokens’: 45455.0, ‘mean_token_accuracy’: 0.7737002968788147, ‘epoch’: 0.87}
{‘loss’: 0.6775, ‘grad_norm’: 1.0495522022247314, ‘learning_rate’: 1.2800000000000001e-05, ‘num_tokens’: 45729.0, ‘mean_token_accuracy’: 0.8259259462356567, ‘epoch’: 0.88}
{‘loss’: 1.1363, ‘grad_norm’: 0.8359189033508301, ‘learning_rate’: 1.2e-05, ‘num_tokens’: 46066.0, ‘mean_token_accuracy’: 0.759759783744812, ‘epoch’: 0.89}
{‘loss’: 0.978, ‘grad_norm’: 0.6999582648277283, ‘learning_rate’: 1.1200000000000001e-05, ‘num_tokens’: 46518.0, ‘mean_token_accuracy’: 0.7232142686843872, ‘epoch’: 0.9}
{‘loss’: 1.3642, ‘grad_norm’: 0.9769337177276611, ‘learning_rate’: 1.04e-05, ‘num_tokens’: 46894.0, ‘mean_token_accuracy’: 0.6881720423698425, ‘epoch’: 0.9}
{‘loss’: 0.836, ‘grad_norm’: 0.7793369889259338, ‘learning_rate’: 9.600000000000001e-06, ‘num_tokens’: 47297.0, ‘mean_token_accuracy’: 0.7744361162185669, ‘epoch’: 0.91}
{‘loss’: 1.2216, ‘grad_norm’: 0.6924943923950195, ‘learning_rate’: 8.8e-06, ‘num_tokens’: 47797.0, ‘mean_token_accuracy’: 0.6875, ‘epoch’: 0.92}
{‘loss’: 0.7638, ‘grad_norm’: 0.8773249387741089, ‘learning_rate’: 8.000000000000001e-06, ‘num_tokens’: 48132.0, ‘mean_token_accuracy’: 0.7945619225502014, ‘epoch’: 0.93}
{‘loss’: 0.9549, ‘grad_norm’: 0.6660798788070679, ‘learning_rate’: 7.2e-06, ‘num_tokens’: 48579.0, ‘mean_token_accuracy’: 0.740406334400177, ‘epoch’: 0.94}
{‘loss’: 0.9976, ‘grad_norm’: 0.6012172698974609, ‘learning_rate’: 6.4000000000000006e-06, ‘num_tokens’: 49320.0, ‘mean_token_accuracy’: 0.7245590090751648, ‘epoch’: 0.94}
{‘loss’: 0.8786, ‘grad_norm’: 0.7032813429832458, ‘learning_rate’: 5.600000000000001e-06, ‘num_tokens’: 49703.0, ‘mean_token_accuracy’: 0.7704485654830933, ‘epoch’: 0.95}
{‘loss’: 1.0169, ‘grad_norm’: 0.6385502815246582, ‘learning_rate’: 4.800000000000001e-06, ‘num_tokens’: 50175.0, ‘mean_token_accuracy’: 0.7414529919624329, ‘epoch’: 0.96}
{‘loss’: 0.928, ‘grad_norm’: 0.8942468762397766, ‘learning_rate’: 4.000000000000001e-06, ‘num_tokens’: 50556.0, ‘mean_token_accuracy’: 0.7692307829856873, ‘epoch’: 0.97}
{‘loss’: 1.1939, ‘grad_norm’: 0.7218092679977417, ‘learning_rate’: 3.2000000000000003e-06, ‘num_tokens’: 51055.0, ‘mean_token_accuracy’: 0.6949495077133179, ‘epoch’: 0.98}
{‘loss’: 1.0075, ‘grad_norm’: 0.7607448697090149, ‘learning_rate’: 2.4000000000000003e-06, ‘num_tokens’: 51487.0, ‘mean_token_accuracy’: 0.7383177280426025, ‘epoch’: 0.98}
{‘loss’: 1.2592, ‘grad_norm’: 0.6758527159690857, ‘learning_rate’: 1.6000000000000001e-06, ‘num_tokens’: 51980.0, ‘mean_token_accuracy’: 0.6973415017127991, ‘epoch’: 0.99}
{‘loss’: 0.9474, ‘grad_norm’: 0.7013502717018127, ‘learning_rate’: 8.000000000000001e-07, ‘num_tokens’: 52435.0, ‘mean_token_accuracy’: 0.7538802623748779, ‘epoch’: 1.0}
{‘train_runtime’: 98.9744, ‘train_samples_per_second’: 5.052, ‘train_steps_per_second’: 1.263, ‘train_loss’: 1.0408768682479859, ‘epoch’: 1.0}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 125/125 [01:38<00:00, 1.26it/s]

============================================================

TRAINING COMPLETED

Training runtime: 98.97 seconds

Samples per second: 5.05
Steps per second: 1.26
Train loss: 1.0409
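
As a sanity check that the 8B run behaved normally, the summary numbers above can be reproduced from the printed configuration. This is a minimal sketch: the helper `logged_lr` is hypothetical, and the linear decay to zero with no warmup is an assumption inferred from the logged learning rates, not read from the script.

```python
import math

dataset_size = 500         # "Dataset size: 500"
batch_size = 4             # "Batch size: 4"
base_lr = 1e-4             # "Learning rate: 0.0001"
train_runtime_s = 98.9744  # 'train_runtime' from the final log line

# One epoch over 500 samples with batch size 4.
steps = math.ceil(dataset_size / batch_size)
samples_per_second = dataset_size / train_runtime_s
steps_per_second = steps / train_runtime_s


def logged_lr(step: int) -> float:
    """Learning rate printed at 1-based `step`, assuming linear decay to zero."""
    return base_lr * (steps - (step - 1)) / steps


print(f"steps: {steps}")
print(f"samples/s: {samples_per_second:.3f}")
print(f"steps/s: {steps_per_second:.3f}")
print(f"lr at step 2: {logged_lr(2):.3e}")
print(f"lr at step {steps}: {logged_lr(steps):.3e}")
```

The computed step count, throughput, and learning rates agree with the log, so the 8B run looks healthy and the problem seems specific to the 70B script.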

root@93e69296bb14:/workspace/dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets# python Llama3_70B_qLoRA_finetuning.py
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]

============================================================

LLAMA 3.1 70B QLoRA FINE-TUNING

Model: meta-llama/Llama-3.1-70B-Instruct

Training mode: QLoRA (4-bit quantization)
Batch size: 8
Gradient accumulation: 1
Effective batch size: 8
Sequence length: 2048
Number of epochs: 1
Learning rate: 0.0001
LoRA rank: 8
Dataset size: 500
Gradient checkpointing: False
Torch compile: False

Loading model: meta-llama/Llama-3.1-70B-Instruct
Training mode: QLoRA (4-bit quantization)
config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 855/855 [00:00<00:00, 13.3MB/s]
model.safetensors.index.json: 100%|█████████████████████████████████████████████████████████████████████| 59.6k/59.6k [00:00<00:00, 14.1MB/s]
model-00002-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.66G/4.66G [03:08<00:00, 24.8MB/s]
model-00009-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.97G/4.97G [01:53<00:00, 43.7MB/s]
model-00004-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.97G/4.97G [05:21<00:00, 15.5MB/s]
model-00006-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.66G/4.66G [05:27<00:00, 14.3MB/s]
model-00008-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 5.00G/5.00G [05:35<00:00, 14.9MB/s]
model-00001-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.58G/4.58G [05:39<00:00, 13.5MB/s]
model-00007-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.66G/4.66G [05:40<00:00, 13.7MB/s]
model-00005-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.66G/4.66G [05:49<00:00, 13.4MB/s]
model-00003-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 5.00G/5.00G [05:52<00:00, 14.2MB/s]
model-00013-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 5.00G/5.00G [02:47<00:00, 29.8MB/s]
model-00016-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.66G/4.66G [02:44<00:00, 28.3MB/s]
model-00010-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.66G/4.66G [05:23<00:00, 14.4MB/s]
model-00014-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.97G/4.97G [05:22<00:00, 15.4MB/s]
model-00011-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.66G/4.66G [05:55<00:00, 13.1MB/s]
model-00012-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.66G/4.66G [05:59<00:00, 13.0MB/s]
model-00015-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.66G/4.66G [05:49<00:00, 13.4MB/s]
model-00017-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.66G/4.66G [05:39<00:00, 13.7MB/s]
model-00019-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.97G/4.97G [03:20<00:00, 24.8MB/s]
model-00018-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 5.00G/5.00G [04:48<00:00, 17.3MB/s]
model-00023-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 5.00G/5.00G [02:40<00:00, 31.1MB/s]
model-00025-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.66G/4.66G [02:54<00:00, 26.7MB/s]
model-00020-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.66G/4.66G [05:42<00:00, 13.6MB/s]
model-00024-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.97G/4.97G [05:11<00:00, 15.9MB/s]
model-00027-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.66G/4.66G [03:48<00:00, 20.4MB/s]
model-00022-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.66G/4.66G [05:59<00:00, 13.0MB/s]
model-00030-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 2.10G/2.10G [01:08<00:00, 30.5MB/s]
model-00021-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.66G/4.66G [06:15<00:00, 12.4MB/s]
model-00028-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 5.00G/5.00G [03:13<00:00, 25.9MB/s]
model-00029-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.97G/4.97G [02:54<00:00, 28.4MB/s]
model-00026-of-00030.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.66G/4.66G [05:27<00:00, 14.2MB/s]
Fetching 30 files: 100%|█████████████████████████████████████████████████████████████████████████████████████| 30/30 [17:22<00:00, 34.74s/it]
Traceback (most recent call last):
File “/workspace/dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets/Llama3_70B_qLoRA_finetuning.py”, line 230, in <module>
main(args)
File “/workspace/dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets/Llama3_70B_qLoRA_finetuning.py”, line 66, in main
model = AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/transformers/models/auto/auto_factory.py”, line 604, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py”, line 277, in _wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py”, line 5048, in from_pretrained
) = cls._load_pretrained_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py”, line 5432, in _load_pretrained_model
caching_allocator_warmup(model, expanded_device_map, hf_quantizer)
File “/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py”, line 6090, in caching_allocator_warmup
device_memory = torch_accelerator_module.mem_get_info(index)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/cuda/memory.py”, line 838, in mem_get_info
return torch.cuda.cudart().cudaMemGetInfo(device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: out of memory
Search for `cudaErrorMemoryAllocation' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

root@93e69296bb14:/workspace/dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets# python Llama3_70B_qLoRA_finetuning.py
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]

============================================================

LLAMA 3.1 70B QLoRA FINE-TUNING

Model: meta-llama/Llama-3.1-70B-Instruct

Training mode: QLoRA (4-bit quantization)
Batch size: 8
Gradient accumulation: 1
Effective batch size: 8
Sequence length: 2048
Number of epochs: 1
Learning rate: 0.0001
LoRA rank: 8
Dataset size: 500
Gradient checkpointing: False
Torch compile: False

Loading model: meta-llama/Llama-3.1-70B-Instruct
Training mode: QLoRA (4-bit quantization)
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████| 30/30 [13:47<00:00, 27.60s/it]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████| 183/183 [00:00<00:00, 426kB/s]
tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████| 55.4k/55.4k [00:00<00:00, 10.4MB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 51.2MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████| 296/296 [00:00<00:00, 2.94MB/s]
Preparing model for QLoRA (4-bit) with rank 8…
Trainable parameters: 2,102,665,216 (10.94%)
Loading dataset with 500 samples…
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 37341.79 examples/s]

Starting QLoRA fine-tuning for 1 epoch(s)…
Adding EOS to train dataset: 100%|███████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 33023.42 examples/s]
Tokenizing train dataset: 100%|███████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 5677.63 examples/s]
Truncating train dataset: 100%|█████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 159479.24 examples/s]
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer’s values. Updated tokens: {‘eos_token_id’: 128009, ‘pad_token_id’: 128009}.
{‘loss’: 2.1963, ‘grad_norm’: 1.42030930519104, ‘learning_rate’: 0.0001, ‘num_tokens’: 607.0, ‘mean_token_accuracy’: 0.49248749017715454, ‘epoch’: 0.02}
{‘loss’: 1.9799, ‘grad_norm’: 0.8890447020530701, ‘learning_rate’: 9.841269841269841e-05, ‘num_tokens’: 1310.0, ‘mean_token_accuracy’: 0.5597122311592102, ‘epoch’: 0.03}
{‘loss’: 1.9243, ‘grad_norm’: 1.0022180080413818, ‘learning_rate’: 9.682539682539682e-05, ‘num_tokens’: 2073.0, ‘mean_token_accuracy’: 0.5523178577423096, ‘epoch’: 0.05}
5%|█████ | 3/63 [00:48<15:55, 15.93s/it]
Traceback (most recent call last):
File “/workspace/dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets/Llama3_70B_qLoRA_finetuning.py”, line 230, in <module>
main(args)
File “/workspace/dgx-spark-playbooks/nvidia/pytorch-fine-tune/assets/Llama3_70B_qLoRA_finetuning.py”, line 143, in main
trainer_stats = trainer.train()
^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/transformers/trainer.py”, line 2325, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/transformers/trainer.py”, line 2674, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/trl/trainer/sft_trainer.py”, line 872, in training_step
return super().training_step(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/transformers/trainer.py”, line 4020, in training_step
loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/trl/trainer/sft_trainer.py”, line 826, in compute_loss
(loss, outputs) = super().compute_loss(
^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/transformers/trainer.py”, line 4110, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/accelerate/utils/operations.py”, line 819, in forward
return model_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/accelerate/utils/operations.py”, line 807, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/amp/autocast_mode.py”, line 44, in decorate_autocast
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/peft/peft_model.py”, line 1923, in forward
return self.base_model(
^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/peft/tuners/tuners_utils.py”, line 308, in forward
return self.model.forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/transformers/utils/generic.py”, line 918, in wrapper
output = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/transformers/models/llama/modeling_llama.py”, line 459, in forward
outputs: BaseModelOutputWithPast = self.model(
^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/transformers/utils/generic.py”, line 1072, in wrapper
outputs = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/transformers/models/llama/modeling_llama.py”, line 395, in forward
hidden_states = decoder_layer(
^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/transformers/modeling_layers.py”, line 94, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/transformers/utils/deprecation.py”, line 172, in wrapped_func
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/transformers/models/llama/modeling_llama.py”, line 309, in forward
hidden_states = self.mlp(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/transformers/models/llama/modeling_llama.py”, line 155, in forward
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/peft/tuners/lora/bnb.py”, line 547, in forward
result = self.base_layer(x, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py”, line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/bitsandbytes/nn/modules.py”, line 532, in forward
return bnb.matmul_4bit(x, weight, bias=bias, quant_state=self.weight.quant_state).to(inp_dtype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/bitsandbytes/autograd/_functions.py”, line 448, in matmul_4bit
return MatMul4Bit.apply(A, B, out, bias, quant_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py”, line 581, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/bitsandbytes/autograd/_functions.py”, line 373, in forward
output = torch.nn.functional.linear(A, F.dequantize_4bit(B, quant_state).to(A.dtype).t(), bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/bitsandbytes/functional.py”, line 994, in dequantize_4bit
out = torch.ops.bitsandbytes.dequantize_4bit.default(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/_ops.py”, line 840, in __call__
return self._op(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/_compile.py”, line 53, in inner
return disable_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py”, line 1005, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/torch/library.py”, line 731, in func_no_dynamo
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File “/usr/local/lib/python3.12/dist-packages/bitsandbytes/backends/cuda/ops.py”, line 360, in _
out = torch.empty(shape, dtype=dtype, device=A.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 119.70 GiB of which 632.07 MiB is free. Including non-PyTorch memory, this process has 114.51 GiB memory in use. Of the allocated memory 110.96 GiB is allocated by PyTorch, and 3.34 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management ( CUDA semantics — PyTorch 2.9 documentation )
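
To make the figures in the OOM message above easier to read, here is a minimal Python sketch of the arithmetic. It only parses the numbers quoted in the message. The variable names are illustrative, and reading the remainder as memory in use outside this process is an assumption, not something the message states.

```python
import re

# Text copied from the torch.OutOfMemoryError message above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 448.00 MiB. "
    "GPU 0 has a total capacity of 119.70 GiB of which 632.07 MiB is free. "
    "Including non-PyTorch memory, this process has 114.51 GiB memory in use. "
    "Of the allocated memory 110.96 GiB is allocated by PyTorch, "
    "and 3.34 GiB is reserved by PyTorch but unallocated."
)

UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def sizes_in_gib(message):
    """Return every '<number> MiB|GiB' figure in the message, in order, as GiB."""
    pairs = re.findall(r"([\d.]+) (MiB|GiB)", message)
    return [float(value) * UNIT_TO_GIB[unit] for value, unit in pairs]


requested, total, free, in_use, allocated, reserved_unused = sizes_in_gib(OOM_MESSAGE)

# Memory held by PyTorch's caching allocator in this process.
pytorch_held = allocated + reserved_unused
# Memory this process uses outside the allocator (CUDA context and similar).
non_pytorch = in_use - pytorch_held
# Remainder of the device total; assumed here to be in use outside this process.
outside_process = total - in_use - free

print(f"requested:        {requested:.2f} GiB")
print(f"held by PyTorch:  {pytorch_held:.2f} GiB")
print(f"non-PyTorch:      {non_pytorch:.2f} GiB")
print(f"outside process:  {outside_process:.2f} GiB")
```

By this reading, the training process alone holds about 114.3 of the 119.70 GiB through PyTorch, so the default batch size and sequence length leave almost no headroom.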

Hi,

I have moved your post over to the DGX Spark forum for better visibility. Thanks for posting on the forums!

Best,

AHarpster

If you have been running other workloads, you may need to clear your cache to free up some memory as detailed on our FAQ: DGX Spark / GB10 FAQ - #7
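
As a rough illustration of the suggestion above, the Python sketch below reads `/proc/meminfo` to show how much memory the Linux page cache is holding, and includes a helper that drops it. It assumes the "cache" mentioned in the FAQ is the Linux page cache. The helper names and the sample figures are hypothetical and were not measured on a DGX Spark. Writing to `/proc/sys/vm/drop_caches` requires root.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def page_cache_gib(info):
    """Memory held in page cache and buffers, in GiB (values are in kB)."""
    return (info.get("Cached", 0) + info.get("Buffers", 0)) / 1024**2


def drop_page_cache():
    """Flush dirty pages, then ask the kernel to drop clean caches (root only)."""
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as handle:
        handle.write("3\n")


# Hypothetical sample for illustration only, not output from a real machine.
SAMPLE = """MemTotal:       125829120 kB
MemAvailable:    20971520 kB
Buffers:          1048576 kB
Cached:         104857600 kB
"""

sample_info = parse_meminfo(SAMPLE)
print(f"sample page cache: {page_cache_gib(sample_info):.1f} GiB")

# On a Linux host, report the live figure as well.
if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as handle:
        live_info = parse_meminfo(handle.read())
    print(f"live page cache:   {page_cache_gib(live_info):.1f} GiB")
```

Calling `drop_page_cache()` as root before starting the training script would release cached file data, such as recently downloaded model shards, if that is what is occupying memory.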

I got this to run by lowering the dataset size and other settings. I think the defaults push the memory/GPU too hard.

python Llama3_70B_qLoRA_finetuning.py \
  --batch_size 2 \
  --seq_length 2048 \
  --gradient_accumulation_steps 2 \
  --dataset_size 250



============================================================ 
TRAINING COMPLETED 
============================================================ 
Training runtime: 485.74 seconds 
Samples per second: 0.52 
Steps per second: 0.13 
Train loss: 1.0958 
============================================================
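
To compare the reduced settings with the defaults from the failing run, here is a small Python sketch of the batch arithmetic. The function names are illustrative, and rounding the step count up at both stages is an assumption about how the trainer counts steps. The inputs are taken from the logs and the command above.

```python
import math


def optimizer_steps(dataset_size, batch_size, grad_accum, epochs=1):
    """Optimizer steps per run, assuming partial batches are rounded up."""
    micro_batches = math.ceil(dataset_size / batch_size)
    return epochs * math.ceil(micro_batches / grad_accum)


def max_tokens_per_forward(batch_size, seq_length):
    """Upper bound on tokens in one forward pass (sequences may be shorter)."""
    return batch_size * seq_length


# Defaults from the failing run: batch 8, accumulation 1, 500 samples.
default_steps = optimizer_steps(500, 8, 1)
default_tokens = max_tokens_per_forward(8, 2048)

# Reduced settings: batch 2, accumulation 2, 250 samples.
reduced_steps = optimizer_steps(250, 2, 2)
reduced_tokens = max_tokens_per_forward(2, 2048)

print(f"default: {default_steps} steps, effective batch {8 * 1}, "
      f"up to {default_tokens} tokens per forward pass")
print(f"reduced: {reduced_steps} steps, effective batch {2 * 2}, "
      f"up to {reduced_tokens} tokens per forward pass")
```

Both configurations come out at 63 optimizer steps, which matches the `3/63` progress bar in the failing run and the reported 0.13 steps per second over 485.74 seconds. The per-device batch drops from 8 to 2, so the upper bound on tokens held in memory during one forward pass is four times smaller.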