What happens if I install the nvidia drivers separately again so that the other users on the system can use the GPUs? Will there be a conflict?
Thank you. But we currently don’t have the bandwidth to revert all of this and then follow the other process to install. What if I were to just install the nvidia drivers on top of the bare-metal setup? Would I then still be able to access the GPUs via nvidia-smi? Would it interfere with the TAO-API?
If at deployment time you specified that you want the driver that comes from GPU Operator, and you later install a host driver, it will cause GPU Operator failures.
For TAO 5.0, there is a new parameter in quickstart_api_bare_metal/gpu-operator-values.yml:
install_driver: true: The deployment script will remove host drivers and let GPU Operator start its own driver pod.
It is highly recommended to let the deployment script remove the host drivers and install the GPU Operator driver pod. That way, you get the correct driver version.
install_driver: false: Use this when the host already has a driver installed and you don’t want to remove it. If you keep existing host drivers, make sure they are the latest version.
But please note that this setting comes with a big caveat: any use of the GPUs outside of K8s can cause conflicts, because K8s GPU jobs may get scheduled onto GPUs you are already using outside of K8s.
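For reference, here is a minimal sketch of the relevant setting in quickstart_api_bare_metal/gpu-operator-values.yml; the comments are illustrative and the surrounding keys may differ in your TAO 5.0 quickstart checkout:

```yaml
# quickstart_api_bare_metal/gpu-operator-values.yml (excerpt, illustrative)
# true  -> deployment script removes host drivers; GPU Operator runs its own driver pod (recommended)
# false -> keep the existing host driver; make sure it is the latest version
install_driver: true
```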
If you change your mind later on, you can re-install GPU Operator manually via its Helm chart. You can specify whether GPU Operator should include its own driver pod with --set driver.enabled=true. If you want to use the host driver instead, pass --set driver.enabled=false.
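As a rough example (the release name and namespace below are placeholders, not something the deployment script requires), re-installing GPU Operator with or without its driver pod via Helm might look like this:

```bash
# Add the NVIDIA Helm repository and refresh the index
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator with its own driver pod (no host driver expected)
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=true

# Or, if you want to keep using the host driver, disable the driver pod:
# helm install gpu-operator nvidia/gpu-operator \
#   -n gpu-operator --create-namespace \
#   --set driver.enabled=false
```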