I’ve been trying to work within the framework of AI Workbench, but it’s getting frustrating. The latest annoyance: something about how AI Workbench manages containers on the server causes them to be shut down when the AI Workbench client disconnects. This most often happens when I’ve connected to the Spark over a VPN and the VPN drops.
This is very, very annoying if the Spark was in the middle of a multi-day training or fine-tuning run!
I now believe it’s time to start working directly with Docker containers, ideally based on the latest NGC PyTorch 2.9 container, and just build a persistent environment in there that I launch from the command line inside a tmux session or similar.
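For what it’s worth, here’s a rough sketch of how I imagine that would look. The image tag, container name, and mount paths are all assumptions (check the NGC catalog for the tag that actually ships PyTorch 2.9):

```shell
# Pull the NGC PyTorch image (tag is an assumption; verify on the NGC catalog).
docker pull nvcr.io/nvidia/pytorch:25.09-py3

# Launch a long-lived, named container, detached from any client session.
# --restart unless-stopped keeps it alive across daemon restarts;
# -v mounts a host work directory so state survives container recreation.
docker run -d --gpus all \
  --name spark-dev \
  --restart unless-stopped \
  --ipc=host \
  -v "$HOME/work:/workspace/work" \
  -p 8888:8888 \
  nvcr.io/nvidia/pytorch:25.09-py3 \
  sleep infinity

# Attach a shell from inside tmux; the container keeps running after the
# tmux client (or the whole SSH/VPN connection) disconnects.
tmux new -s dev
docker exec -it spark-dev bash
```

The key point is that the container’s lifetime is owned by the Docker daemon on the Spark, not by any client connection.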
Ideally, I would still have a simple way to run JupyterLab and VS Code against the container. Is there any prior art for getting this running, maybe still using a Custom App in the NVIDIA Sync app or something like that, to port-forward to more persistent container-exposed ports on the Spark?
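Failing a Sync-based answer, plain SSH local forwarding should work against any container-exposed port. A sketch, assuming the Spark is reachable as `spark` in your SSH config and the container published port 8888 when it was created:

```shell
# Inside the container: start JupyterLab bound to all interfaces so the
# published port (-p 8888:8888 at container creation) can reach it.
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser

# From the laptop: forward localhost:8888 to the Spark's port 8888.
# -N means no remote command, just the tunnel; leave it running in tmux.
ssh -N -L 8888:localhost:8888 spark
```

For VS Code, the Remote - SSH extension can connect to the Spark itself, and the Dev Containers extension can then attach to the already-running container, so neither editor session owns the container’s lifetime.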
Then maybe I could actually start recording epoch stats in Weights & Biases, for example.
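A minimal logging sketch for that last part, assuming `wandb` is installed in the container and `wandb login` has been run; the project name and metric keys below are made up:

```python
import math

def epoch_stats(epoch: int, train_loss: float, val_loss: float) -> dict:
    """Assemble per-epoch metrics. Keys here are placeholders; use your own."""
    return {
        "epoch": epoch,
        "train/loss": train_loss,
        "val/loss": val_loss,
        # Perplexity only makes sense if val_loss is a cross-entropy loss.
        "val/ppl": math.exp(val_loss),
    }

def log_epoch(run, epoch: int, train_loss: float, val_loss: float) -> dict:
    """Log one epoch; `run` is the object returned by wandb.init()."""
    stats = epoch_stats(epoch, train_loss, val_loss)
    run.log(stats, step=epoch)
    return stats

# In the training script itself, something like:
#   run = wandb.init(project="spark-finetune", resume="allow")
#   for epoch in ...:
#       ...train...
#       log_epoch(run, epoch, train_loss, val_loss)
# resume="allow" lets a restarted process pick up the same W&B run (when
# WANDB_RUN_ID is set), which pairs well with a long-lived container.
```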
Any guidance appreciated!