NVIDIA Driver Installation skipped during bare-metal install

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) 4.0.1-tf2.9.1
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I want to run the AutoML experiments, for which I was installing the TAO REST API.

Current System: SSH into server with Ubuntu 20.04.

I set up my hosts as follows:

[master] ansible_ssh_user='username' ansible_ssh_pass='password' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'

Setup my tao-toolkit-api-ansible-values.yml as follows:

ngc_api_key: amx2cDV........
ngc_email: myname@mycorp.com
api_chart: https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.0.tgz
api_values: ./tao-toolkit-api-helm-values.yml
cluster_name: tao-toolkit-api-demo

I have passwordless sudo access as well.

Now I run bash setup.sh check-inventory.yml and it gives no output.

Finally I run bash setup.sh install
The playbook runs through most tasks but skips many of them, including the NVIDIA driver installation.

Considering the fact that it first uninstalls all NVIDIA drivers, now I am left with no nvidia driver on my system because this install task is somehow skipped.

Report installed versions shows none installed

And then it fails at TASK [Validate GPU Operator Version for Cloud Native Core 6.1].

I have the following questions:
1. Is the way of specifying the IP address, and details in the host file correct?
2. Why is it skipping the driver installation process?
3. How do I fix the final error for the TASK [Validate GPU Operator Version for Cloud Native Core 6.1]

Yes, the host file is OK.

Can you upload the full logs?

Just to clarify the IP part: I am already logged in to the server, inside a conda environment, and that’s the same machine I want as master, so I have entered its IP details. That’s a plausible way of doing this, right?

Yes, for a single node cluster, you can list only the master node.

Please try the latest notebook:
ngc registry resource download-version "nvidia/tao/tao-getting-started:4.0.2"

Also use the latest tao-api:
api_chart: https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.2.tgz

You can also take the hint from TAO AutoML - TAO Toolkit Setup - #11 by jay.duff and set the IP from the output of $ hostname -i
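For a single-node setup, the hosts file can be generated from the `hostname -i` output rather than typed by hand. The following is only a sketch: the helper names `make_hosts` and `first_host_ip` are hypothetical, and the layout (a `[master]` header followed by one line starting with the IP) is an assumption based on the host entry quoted earlier in this thread. Note that `hostname -i` may print several addresses, or a loopback address on some systems, so check the result before using it.

```shell
# Print a single-node inventory for the Ansible-based setup.
# Usage: make_hosts <ip> <ssh_user> <ssh_pass>
# Assumption: the inventory is a [master] header plus one host line.
make_hosts() {
    printf '[master]\n'
    printf "%s ansible_ssh_user='%s' ansible_ssh_pass='%s' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'\n" \
        "$1" "$2" "$3"
}

# First address reported by `hostname -i` (it may list more than one).
first_host_ip() {
    hostname -i | awk '{ print $1 }'
}

# Example (not executed here; 'username' and 'password' are placeholders):
#   make_hosts "$(first_host_ip)" username password > hosts
```

Review the generated file before running `bash setup.sh install`, since it contains the SSH password in plain text.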

Sure I will try that,
Just curious - where is this v4.0.2 mentioned? I can see only v4.0.1 on the documentation pages

The notebook link is in TAO Toolkit Getting Started | NVIDIA NGC
The link can be found in TAO Toolkit Quick Start Guide - NVIDIA Docs

All computer vision samples are included in the getting started resource on NGC.

The tao-api link can be found in TAO Toolkit | NVIDIA NGC

Got confused because TAO Toolkit v4.0.1 (Latest Release) - NVIDIA Docs says v4.0.1 is the latest release
I’ll try this out and update here

I followed this and used v4.0.2 for everything, and I get stuck at the same place as that person.

TASK [Waiting for the Cluster to become available] takes forever; I waited 15 minutes and tried it twice. Rebooting the system did push it past the point where I was stuck, but the NVIDIA driver installation is still being skipped and 2 validations are failing fatally.

Tried with as IP instead of the actual address of the server.

When I do nvidia-smi, it still shows not found

That is expected.

When running here, please open a new terminal and run
$ kubectl delete crd clusterpolicies.nvidia.com

So I’ll have to install nvidia drivers from scratch again? And everything else that was uninstalled?

You do not need to install the NVIDIA driver.
Please run $ bash setup.sh install

This is the output of running bash setup.sh install. At the end of running it, there was no driver found for nvidia-smi

Can you share the output by uploading it as a .txt file via the upload button?

To get the info from nvidia-smi, please run the command below.
$ kubectl get pods
Then find the pod named nvidia-smi-xxxx and run
$ kubectl exec nvidia-smi-xxxx -- nvidia-smi

This works. So to access the GPU on my server, I have to use kubectl every time? Just nvidia-smi won’t work? Sorry, I’m not familiar with Kubernetes.

Yes, to get the info from nvidia-smi, please run the above command.
The nvidia-smi pod’s name usually does not change, so you just need to run
$ kubectl exec nvidia-smi-xxxx -- nvidia-smi

I get that this command works, but I would also like to access it via just nvidia-smi.
What should I do for that?
I ran $ bash setup.sh install. Is there anything else that needs to be done?

It will not work now, since you have already installed tao-api. When you are using tao-api, please use the above method to get the info.
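Since plain nvidia-smi is not available on the host in this setup, a small shell wrapper can save looking up the pod name each time. This is a minimal sketch, not part of the TAO tooling: the function names `find_smi_pod` and `gpu_smi` are hypothetical, and it assumes the pod lives in the current default namespace and that its name starts with `nvidia-smi`, as described above.

```shell
# Read `kubectl get pods` output on stdin and print the first pod
# whose name starts with "nvidia-smi".
find_smi_pod() {
    awk '$1 ~ /^nvidia-smi/ { print $1; exit }'
}

# Run nvidia-smi inside that pod, passing any extra arguments through.
gpu_smi() {
    pod="$(kubectl get pods --no-headers | find_smi_pod)"
    if [ -z "$pod" ]; then
        echo "no nvidia-smi pod found" >&2
        return 1
    fi
    kubectl exec "$pod" -- nvidia-smi "$@"
}

# Example (not executed here):
#   gpu_smi
#   gpu_smi -L
```

Adding these functions to your shell startup file keeps `gpu_smi` available in new terminals; the command still goes through kubectl underneath.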