When using cloud vendor machines with GPUs that expect the driver directories and files to be mounted into the container, I am having trouble getting the management library installed so that I can query the GPU. If I build the image with the NVIDIA drivers forced in, along with the standard CUDA software to get a copy of the shared library, then when the machine starts on Azure or AWS I get errors from Kubernetes (which is being used to start these containers) saying that those directories are getting in the way of the nvidia-docker style mounts.
So, if I cannot install the NVIDIA drivers in the image, how on earth do I get the shared library and tools like nvidia-smi into the containers?
Use the NVIDIA container runtime plugin (previously called nvidia-docker 2.0):
[url]https://github.com/NVIDIA/nvidia-container-runtime[/url]
When you do that, you don’t install the driver bits into the container (the runtime injects them for you).
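In case it helps, here’s a rough sketch of the host-side setup, assuming the nvidia-container-runtime package is already installed on the host (package names and paths may differ by distribution):

```shell
# Register the nvidia runtime with Docker (host-side, one-time setup).
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo systemctl restart docker

# The image itself carries no driver bits; the runtime injects
# libnvidia-ml.so, nvidia-smi, etc. from the host at container start.
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
```

The key point is that only the host needs the driver installed; the container image stays driver-free.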
With respect to Kubernetes, this may be useful:
[url]https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/[/url]
Note that this situation is changing pretty rapidly (as indicated on the Kubernetes page), so the “recipe” may be different 6 months from now.
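For reference, a pod that requests a GPU through the device-plugin mechanism looks roughly like this (a sketch, assuming the NVIDIA device plugin DaemonSet is already deployed on the cluster):

```yaml
# Minimal pod requesting one GPU via the device-plugin resource name.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-test
      image: nvidia/cuda
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # GPUs are requested in whole units
```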
Do you know of an Azure recipe for the plugin approach on k8s?
Thanks
A Google search on “gpu kubernetes azure” seemed to turn up several promising hits.
I was not able to find any via Google etc. that mention the plugin approach. Many older articles and blog posts abound for the older-style alpha.kubernetes.io/nvidia-gpu: 1 approach.
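For anyone comparing, the visible difference in the pod spec is mostly the resource name (a sketch; the surrounding container spec is elided):

```yaml
# Older alpha-style request seen in those blog posts (deprecated):
resources:
  limits:
    alpha.kubernetes.io/nvidia-gpu: 1

# Device-plugin style that goes with the nvidia container runtime approach:
resources:
  limits:
    nvidia.com/gpu: 1
```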
I will try Microsoft and see if I can find anything out about the ‘nvidia container runtime plugin’ approach, but it does not seem to have been on the MSDN radar, at least.