NVIDIA Driver Installation skipped during bare-metal install

Also, you can get a hint from TAO AutoML - TAO Toolkit Setup - #11 by jay.duff and set the IP from the output of $ hostname -i
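For illustration, a minimal sketch of what that looks like (the address 10.0.0.5 and the inventory layout below are only illustrative assumptions, not values from this thread):
$ hostname -i
10.0.0.5
# use that address (rather than the 127.0.1.1 loopback entry) for the host entry in the quickstart hosts file, e.g.
# [master]
# 10.0.0.5 ansible_ssh_user='<user>' ansible_ssh_pass='<password>'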

Sure, I will try that.
Just curious - where is this v4.0.2 mentioned? I can see only v4.0.1 on the documentation pages.

The notebook link is in TAO Toolkit Getting Started | NVIDIA NGC
The link can be found in TAO Toolkit Quick Start Guide - NVIDIA Docs

All computer vision samples are included in the getting started resource on NGC.

The tao-api link can be found in TAO Toolkit | NVIDIA NGC

Thank you.
I got confused because TAO Toolkit v4.0.1 (Latest Release) - NVIDIA Docs says v4.0.1 is the latest release.
I’ll try this out and update here.

I did follow this and used v4.0.2 for everything, and I actually get stuck at the same place as that person.

TASK [Waiting for the Cluster to become available] takes forever; I waited 15 minutes and tried it twice. Rebooting the system did push it past the place where I was stuck, but the NVIDIA driver installation is still being skipped and 2 validations are failing fatally.

Tried with 127.0.1.1 as the IP instead of the actual address of the server.

When I do nvidia-smi, it still shows not found

That is expected.

When it gets stuck at this step, please open a new terminal and run
$ kubectl delete crd clusterpolicies.nvidia.com
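A quick sketch, using standard kubectl commands (not part of the original instructions), to confirm the CRD is actually removed:
$ kubectl get crd | grep clusterpolicies    # lists clusterpolicies.nvidia.com if it is present
$ kubectl delete crd clusterpolicies.nvidia.com
$ kubectl get crd | grep clusterpolicies    # should print nothing once the CRD is gone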

So I’ll have to install nvidia drivers from scratch again? And everything else that was uninstalled?

It is not needed to install the NVIDIA driver.
Please run $ bash setup.sh install

This is the output of running bash setup.sh install. At the end of the run, there was still no driver found for nvidia-smi.

Can you share the output by uploading a .txt file via the upload button?

To get the info from nvidia-smi, please run the commands below.
$ kubectl get pods
Then you can find the pod named nvidia-smi-xxxx and run
$ kubectl exec nvidia-smi-xxxx -- nvidia-smi
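As a small convenience, here is a sketch that looks up the pod name automatically (it assumes the pod runs in the current namespace and that its name starts with nvidia-smi, which may differ in your deployment):
$ POD=$(kubectl get pods --no-headers | awk '/^nvidia-smi/ {print $1; exit}')
$ kubectl exec "$POD" -- nvidia-smi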

This works. So to access the GPU on my server, I have to use kubectl every time? Just nvidia-smi won’t work? Sorry, I’m not familiar with Kubernetes.

To get the info from nvidia-smi, yes, please run the command above.
Usually the nvidia-smi pod’s name will not change, so you just need to run
$ kubectl exec nvidia-smi-xxxx -- nvidia-smi

I get that this command works, but I would also like to access the GPU via just nvidia-smi.
What should I do for that?
I ran $ bash setup.sh install; is there anything else that needs to be done?

It will not work now since you have already installed tao-api. When you are using tao-api, please use the method above to get the info.

What happens if I install the NVIDIA drivers separately again so that the other users on the system can use them? Will there be a conflict?

Refer to TAO Toolkit 4.0.0 API bare metal setup causing gpu driver and kube utilities to uninstall (lots of confusing things happening at the same time) - #29 by Morganh

Thank you. But we currently don’t have the bandwidth to revert all of this and then follow the other process to install. What if I were to just install the NVIDIA drivers on top of the bare-metal setup? Would I be able to access the GPUs via nvidia-smi as well then? Would it interfere with the TAO-API?

If at deployment time you specified that you want the driver that comes from the GPU Operator, and then later install a host driver, it will cause GPU Operator failures.

For TAO 5.0, there is a new parameter in quickstart_api_bare_metal/gpu-operator-values.yml.
install_driver: true - The deployment script will remove host drivers and let the GPU Operator start its own driver pod.
It is highly suggested to let the deployment script remove the host drivers and install the GPU Operator driver pod. That way, you get the correct driver version.

install_driver: false - Use this when you already have the latest host driver and you don’t want to remove it. If you want to keep the existing host driver, make sure you are using the latest version.
But please note that this setting comes with a big warning: any use of the GPU outside of K8s will cause conflicts, as K8s GPU jobs will get scheduled on the GPUs you are currently using outside of K8s.
If you change your mind later on, you can re-install the GPU Operator manually via its Helm chart; you can specify whether the GPU Operator should include its own driver pod with --set driver.enabled=true, or use the host driver instead with --set driver.enabled=false.
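For reference, a minimal sketch of that manual re-install (the chart repo and release name follow the public GPU Operator Helm instructions, not the TAO deployment script; adjust the namespace and versions for your cluster):
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
# keep an existing host driver - the GPU Operator will not deploy its own driver pod
$ helm install --wait gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator --set driver.enabled=false
# or let the GPU Operator manage the driver itself
$ helm install --wait gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator --set driver.enabled=true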

