Our research team has a new DGX A100 system with 8 GPUs installed.
We can ssh into the server and run training jobs.
However, we don’t want every user to be able to access all GPUs at any time.
- Our plan is to split the DGX into N separate VMs/compute nodes, e.g. 1 GPU + 2 GPUs + 2 GPUs + 3 GPUs.
- Set user permissions so that each user can only use some of these VMs (e.g. a junior user can only use the 1-GPU VM).
- All user information is synchronized with the host (ssh to the host, then ssh to a VM with the same credentials).
- Configure the other resources for each VM: CPU, RAM, disk.
- VMs should be easy to manage (e.g. delete, create new) without affecting user data.
We don’t have much system administration experience.
So could you point us to some documents/tutorials/best practices on how to set up such a system?
Or, if you have a better idea that fits the above requirements, please tell me.
Unless your users really need ssh access, you could consider a Docker-based setup with nvidia-docker and JupyterHub.
Thank you, but I think this does not really match our requirements.
Hi @ductm ,
I’d strongly discourage you from going the VM route. Not only is it not supported (by NVIDIA), but it’s overkill if you just need resource sharing. What about something like https://developer.nvidia.com/blog/deploying-rich-cluster-api-on-dgx-for-multi-user-sharing/ instead? That gives you the ability to control resources used by each user, batch scheduling, and other goodies without having to deal with the VM overhead and headache.
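As a rough illustration of what per-user resource control looks like under Slurm, here is a minimal Python sketch that caps the number of GPUs a user may request and builds the matching `srun` command line. The tier names, the GPU caps, and the `build_srun_command` helper are all hypothetical and not part of DeepOps or Slurm; `--gres`, `--cpus-per-task`, and `--mem` are standard Slurm options. In a real deployment you would most likely enforce such limits in Slurm's own configuration rather than in a wrapper like this.

```python
import shlex

# Hypothetical per-tier GPU caps (assumption, not from the thread's setup).
MAX_GPUS_PER_TIER = {"junior": 1, "senior": 3}
TOTAL_GPUS = 8  # the DGX A100 discussed here has 8 GPUs


def build_srun_command(tier, gpus, cpus, mem_gb, command):
    """Return an srun argument list, or raise ValueError if over the cap."""
    if tier not in MAX_GPUS_PER_TIER:
        raise ValueError(f"unknown tier: {tier!r}")
    cap = min(MAX_GPUS_PER_TIER[tier], TOTAL_GPUS)
    if not 1 <= gpus <= cap:
        raise ValueError(f"tier {tier!r} may request 1..{cap} GPUs, got {gpus}")
    return [
        "srun",
        f"--gres=gpu:{gpus}",       # number of GPUs for the job
        f"--cpus-per-task={cpus}",  # CPU cores for the job
        f"--mem={mem_gb}G",         # host memory for the job
        *command,
    ]


if __name__ == "__main__":
    # train.py is a placeholder for the user's own training script.
    args = build_srun_command("junior", 1, 8, 64, ["python", "train.py"])
    print(shlex.join(args))
```

The point of the sketch is only that GPU, CPU, and memory limits become per-job requests that the scheduler grants or refuses, so no VM boundaries are needed to keep users apart.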
I had read about that in the DeepOps repo, but I was not sure it was the best solution, so I created this post.
Now I’m confident about it, and I’ll put this solution into action.
Thanks a lot for your help.
There are a number of other ways to deploy Slurm or similar tools, but the more complete deployments include additional (non-GPU) servers to act as login nodes, etc. What is described in that blog post doesn’t require any additional servers, so it is “easy” for small configurations of a single system like the one you’re describing.
Post again or open a GitHub issue against DeepOps if you hit any problems!