What happens if I install the nvidia drivers separately again so that the other users on the system can use the GPUs? Will there be a conflict?
Thank you. But we currently don’t have the bandwidth to revert all of this and then follow the other process to install. What if I were to just install the nvidia drivers on top of the bare-metal setup? Would I then still be able to access the GPUs via nvidia-smi? Would it interfere with the TAO-API?
If at deployment time you specified that you want the driver that comes from GPU Operator, and you later install a host driver, it will cause GPU Operator failures.
For TAO 5.0, there is a new parameter in quickstart_api_bare_metal/gpu-operator-values.yml:
install_driver: true: The deployment script will remove host drivers and let GPU Operator start its own driver pod.
It is highly recommended to let the deployment script remove the host drivers and install the GPU Operator driver pod. That way, you get the correct driver version.
install_driver: false: Use this when the host already has a driver installed and you don’t want to remove it. If you keep existing host drivers, make sure they are the latest version.
But please note that this setting comes with a big caveat: any use of the GPUs outside of K8s can cause conflicts, because K8s GPU jobs may get scheduled onto GPUs you are already using outside of K8s.
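For reference, here is a minimal sketch of the relevant setting in quickstart_api_bare_metal/gpu-operator-values.yml; the comments are illustrative and the surrounding keys may differ in your TAO 5.0 quickstart checkout:

```yaml
# quickstart_api_bare_metal/gpu-operator-values.yml (excerpt, illustrative)
# true  -> deployment script removes host drivers; GPU Operator runs its own driver pod (recommended)
# false -> keep the existing host driver; make sure it is the latest version
install_driver: true
```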
If you change your mind later on, you can re-install GPU Operator manually via its Helm chart. You can specify whether GPU Operator should include its own driver pod with --set driver.enabled=true. If you want to use the host driver instead, pass --set driver.enabled=false.
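As a rough example (the release name and namespace below are placeholders, not something the deployment script requires), re-installing GPU Operator with or without its driver pod via Helm might look like this:

```bash
# Add the NVIDIA Helm repository and refresh the index
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator with its own driver pod (no host driver expected)
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=true

# Or, if you want to keep using the host driver, disable the driver pod:
# helm install gpu-operator nvidia/gpu-operator \
#   -n gpu-operator --create-namespace \
#   --set driver.enabled=false
```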