I currently have an Alienware Area-51 machine running Ubuntu 16.04 LTS with four GTX 1080 GPUs connected via two 2-way SLI bridges. Multiple users connect to this machine over SSH. My question is: can multiple users use the GPUs at the same time to run deep learning models? Is it even possible? If so, can you please suggest best practices for doing this?
So far, what I have noticed is that when one user is running a deep learning model, the GPUs appear to be locked and no other user is able to create a session to run their models. Please advise.
Generally, it should be possible.
I wouldn’t use SLI in this case (i.e. I would physically remove the SLI bridges). If you want to know why, search for “SLI” in the CUDA C Programming Guide and read the relevant section. I’m not saying SLI is definitely the cause of all your problems; I’m just saying I wouldn’t use it here.
Various DL frameworks may “greedily” use GPUs. For example, by default TensorFlow allocates nearly all of the memory on every GPU it can see, whether it needs it or not (this behavior can be changed, but it is the default). You should be aware of greedy GPU usage when sharing a machine.
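As an illustration, here is a minimal sketch of how a user could relax that default, assuming TensorFlow 1.x: the allow_growth option makes TF allocate GPU memory on demand instead of reserving almost everything up front.

```python
import tensorflow as tf  # assuming TensorFlow 1.x

# Ask TF to allocate GPU memory on demand rather than
# reserving (almost) all memory on every visible GPU.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    # build and run the model as usual
    a = tf.constant([1.0, 2.0])
    b = tf.constant([3.0, 4.0])
    print(sess.run(a + b))
```

Note that this only limits how much memory TF grabs; it does not, by itself, keep two users off the same GPU.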
My recommendation would be:
- users launching DL jobs should know which GPUs they intend to use.
- users launching DL jobs should issue an export CUDA_VISIBLE_DEVICES="X,Y,Z" command in their session before launching the DL framework, where X,Y,Z are replaced with the indices of the GPUs they intend/expect to use (see the sketch after this list).
- users should not attempt to use GPUs that are in use by someone else
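Equivalently, the restriction can be applied from inside a Python script by setting the variable before the DL framework initializes CUDA. A minimal sketch, assuming the user has agreed to take GPUs 0 and 1 (the particular indices here are just an example):

```python
import os

# Restrict this process to GPUs 0 and 1 (hypothetical choice for this user);
# must be set before the DL framework creates its CUDA context.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import tensorflow as tf  # assuming TensorFlow 1.x

# TF will now enumerate only the two visible GPUs,
# renumbered internally as device 0 and device 1.
with tf.Session() as sess:
    print(sess.run(tf.constant("running on the GPUs selected above")))
```

Other frameworks honor CUDA_VISIBLE_DEVICES the same way, since it is enforced by the CUDA runtime, not by the framework.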
Obviously there are corner cases here, and this requires a lot of explicit cooperation between users. If that becomes a problem, this is exactly what job schedulers such as Slurm were invented for - to sort out who gets which GPUs.
Please google for help with basic questions like “how does CUDA_VISIBLE_DEVICES work”. That topic is covered in many places, including the CUDA C Programming Guide.