[SUPPORT] Workbench Example Project: Llama 3 Finetune

Hi! This is the support thread for the Llama 3 8B Finetuning Example Project on GitHub. Any major updates we push to the project will be announced here. Further, feel free to discuss, raise issues, and ask for assistance in this thread.

Please keep discussion in this thread project-related. Any issues with the Workbench application should be raised as a standalone thread. Thanks!

(8/26/2024) Updated readme with deeplinking

(10/02) Updated deep link landing page

I have been working on an innovative AI agent that interacts with conceptual realities in its internal framework, treating abstract ideas as real within its own system. This approach allows the AI to engage in recursive self-reasoning and handle complex conceptual modeling in a way that goes beyond traditional AI systems. I don't know who needs to confirm it, but I feel like I need someone to see it now, and I have no idea who to talk to.

Hi, is there any explanation of the Host Mount Configuration? What should the Source Directory be for a Windows host? I think it is explained in the documentation, but rather ambiguously.

E.g. when cloning or creating the project, the source directory (for a local device) is initialized to something like /host/workbench/nvidia-workbench/…/.
Why not set this as the default in the Environment configuration? Right now it has to be remembered and copy-pasted by hand for some reason.

The reason we prompt the user to configure a host mount is to ensure the saved, finetuned model can live on the host machine the project is running on.

These models are often quite large, taking up several GB of space, so keeping them inside the project container can be impractical. Progress is lost, for example, when the container is stopped.

Once mounted to the underlying host machine, however, the notebook auto-saves outputs to the host and it becomes easy to access the results of your finetuning workflow even after your project container is shut down.

This is a runtime configuration, and since every system (and user) is different, we prompt the user for their desired location to save the finetuned model files. Ultimately, this is the design choice we made when building this example, but you can also delete the mount from the Environment tab if you would like.

As for messaging, I’ve updated the mount description with examples to help make the desired path clearer for the user. This information already exists in the project README, but agreed, it should be surfaced to the user while working in AIWB.
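For reference, on a Windows host AI Workbench runs inside WSL2, so the Windows C: drive is visible from the Linux side under /mnt/c. A Source Directory under there keeps saved models in your normal Windows user profile. The paths below are illustrative, not the exact values the project prompts for:

```shell
# On a Windows host, Workbench runs in WSL2, so the C: drive appears under /mnt/c.
# A host-mount Source Directory there might look like /mnt/c/Users/<user>/models.
# This just checks whether the WSL-style mount point is present on this machine:
ls -d /mnt/c/Users 2>/dev/null || echo "not a WSL/Windows host"
```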

Does it require the /mnt/c/Users/[user] folder to be created by the user on the host machine, or will the mount create the folder at build time?

The user needs to create it, which means you need to hop onto the instance.

Workbench sets up an SSH alias once it is installed, so you can SSH into the host and create that directory manually.
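Concretely, something like the following (the alias name "my-remote" and the target folder are placeholders, not values from the project; substitute your own):

```shell
# Create the mount target on the host before the project build, e.g.:
#   ssh my-remote 'mkdir -p /mnt/c/Users/<user>/finetuned-models'
# mkdir -p is idempotent, so it is safe to run even if the folder already exists.
# Local demonstration of the same command:
mkdir -p /tmp/finetune-mount-demo
test -d /tmp/finetune-mount-demo && echo "mount target exists"
```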

Thank you so much, I will try this

I’m trying to run this using AI Workbench on the DGX Spark. I updated PyTorch in the container using the Workbench “Update” button, and it is now the PyTorch 2.6 base with CUDA 12.6.3.

First, there’s this warning. Does it matter?

/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:235: UserWarning: 
NVIDIA GB10 with CUDA capability sm_121 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_80 sm_86 sm_90 compute_90.
If you want to use the NVIDIA GB10 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

Second, in the cell defining the trainer there is this warning:

/home/ubuntu/.local/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:246: UserWarning: You didn't pass a `max_seq_length` argument to the SFTTrainer, this will default to 1024

Third, in the trainer.train() cell there are two warnings:

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py:632: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.5 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
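For what it's worth, the use_reentrant warning itself looks benign: it can be reproduced and silenced in isolation by passing the argument explicitly (a minimal sketch with a made-up toy layer, independent of the notebook; in the real project the TRL SFTTrainer drives checkpointing internally):

```python
# Sketch: pass use_reentrant explicitly to torch.utils.checkpoint to avoid
# the deprecation warning. The tiny Linear layer here is for illustration only.
import torch
from torch.utils import checkpoint

layer = torch.nn.Linear(8, 8)
x = torch.randn(2, 8, requires_grad=True)

# use_reentrant=False is the recommended variant going forward.
out = checkpoint.checkpoint(layer, x, use_reentrant=False)
out.sum().backward()
print(x.grad.shape)
```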

I provide this background because the real problem is that the Jupyter notebook kernel fails after 5-10 minutes and restarts itself, so I can’t run the training.

Do you have any advice? This is the second Example Project I’ve tried to run and both fail (for different reasons).