Our research team has a new DGX A100 system with 8 GPUs installed.
We can SSH into the server and run training jobs.
However, we don't want every user to be able to access all GPUs at any time.
- Our plan is to split the DGX into N separate VMs/compute nodes, e.g. 1 GPU + 2 GPUs + 2 GPUs + 3 GPUs.
- Set user permissions so that each user can only use some of these VMs (e.g. a junior user can only use the 1-GPU VM).
- All user information is synchronized with the host (SSH to the host, then SSH to a VM with the same credentials).
- Configure the other resources for each VM: CPU, RAM, disk.
- VMs should be easy to manage (e.g. delete, create new) without affecting user data.
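To make the plan more concrete, here is a rough sketch of the layout we have in mind, written as Python. It only models the split and the access rules; it does not create any VM. The VM names, the group names, and the helpers `assign_gpus` and `can_use` are made up for illustration.

```python
# Hypothetical model of the split we want; nothing here talks to real hardware.
TOTAL_GPUS = 8

# VM name -> number of GPUs (the 1 + 2 + 2 + 3 example from the list above)
PLAN = {"vm1": 1, "vm2": 2, "vm3": 2, "vm4": 3}

# User group -> VMs that group may log in to (group names are assumptions)
ALLOWED = {
    "junior": {"vm1"},
    "senior": {"vm1", "vm2", "vm3", "vm4"},
}


def assign_gpus(plan, total=TOTAL_GPUS):
    """Give each VM a disjoint, contiguous range of GPU indices."""
    if sum(plan.values()) != total:
        raise ValueError("GPU counts must add up to the total")
    layout, start = {}, 0
    for vm, count in plan.items():
        layout[vm] = list(range(start, start + count))
        start += count
    return layout


def can_use(group, vm, allowed=ALLOWED):
    """Return True if users in `group` may use `vm`."""
    return vm in allowed.get(group, set())


if __name__ == "__main__":
    for vm, gpus in assign_gpus(PLAN).items():
        print(vm, gpus)
```

Whatever tool we end up using would need to enforce something equivalent to `assign_gpus` (no GPU shared between VMs) and `can_use` (per-user access).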
We don't have much system administration knowledge.
Could you point us to some documents, tutorials, or best practices on how to set up such a system?
Or, if you have a better idea that fits the above requirements, please tell us.