Our research team has a new DGX A100 system installed with 8 GPUs.
We can SSH into the server and run training jobs.
However, we don't want every user to be able to access all GPUs at any time.
Our plan is to split the DGX into N separate VMs/compute nodes, e.g. 1 GPU + 2 GPU + 2 GPU + 3 GPU.
Set user permissions so that each user can only access certain VMs (e.g. a junior user can only use the 1-GPU VM).
All user information is synchronized with the host (SSH to the host, then SSH to a VM with the same credentials).
Configure the other resources for each VM: CPU, RAM, disk.
Make the VMs easy to manage (e.g. delete, create new) without affecting user data.
We don't have much system administration experience,
so could you point us to some documents/tutorials/best practices on how to set up such a system?
Or, if you have a better idea that fits the above requirements, please let me know.
I’d strongly discourage you from going the VM route. Not only is it not supported (by NVIDIA), but it’s overkill if you just need resource sharing. What about something like https://developer.nvidia.com/blog/deploying-rich-cluster-api-on-dgx-for-multi-user-sharing/ instead? That gives you the ability to control resources used by each user, batch scheduling, and other goodies without having to deal with the VM overhead and headache.
There are a number of other ways to deploy Slurm or similar tools, but the more complete deployments include additional (non-GPU) servers to act as login nodes, etc. The approach described in that blog post doesn't require any additional servers, so it's "easy" for small configurations of a single system like the one you're describing.
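To give a rough feel for what the Slurm side of that setup looks like, here's a sketch of how GPUs get exposed to the scheduler as generic resources (GRES) and how partitions can gate who may request how much. This is illustrative only: the node name, group names, CPU/memory values, and partition layout are assumptions, and DeepOps generates most of this configuration for you.

```ini
# gres.conf — single-node DGX A100 example (hypothetical values)
NodeName=dgx01 Name=gpu Type=a100 File=/dev/nvidia[0-7]

# slurm.conf — relevant fragments only
GresTypes=gpu
NodeName=dgx01 Gres=gpu:a100:8 CPUs=256 RealMemory=1000000 State=UNKNOWN
# Separate partitions can limit which users/groups get access to larger jobs;
# "senior" is an assumed Unix group name.
PartitionName=small Nodes=dgx01 MaxTime=24:00:00 Default=YES
PartitionName=large Nodes=dgx01 MaxTime=72:00:00 AllowGroups=senior
```

A user then requests GPUs explicitly, e.g. `srun --partition=small --gres=gpu:1 python train.py`, and Slurm restricts the job to the allocated devices, so a junior user asking for one GPU never touches the other seven.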
Post again or open a GitHub issue against DeepOps if you hit any problems!