NVIDIA Driver Installation skipped during bare-metal install

amogh.dabholkar · July 17, 2023, 5:28pm

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here) 4.0.1-tf2.9.1
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I want to run the AutoML experiments, for which I was installing the TAO REST API.

Current System: SSH into 10.103.48.174 server with Ubuntu 20.04.

I set up my hosts as follows:

[master]
10.103.48.174 ansible_ssh_user='username' ansible_ssh_pass='password' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'

Setup my tao-toolkit-api-ansible-values.yml as follows:

ngc_api_key: amx2cDV........
ngc_email: myname@mycorp.com
api_chart: https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.0.tgz
api_values: ./tao-toolkit-api-helm-values.yml
cluster_name: tao-toolkit-api-demo

I have passwordless sudo access as well.

Now I run bash setup.sh check-inventory.yml and it gives no output.

Finally I run bash setup.sh install
The playbook runs through most TASKS but skips many of them, for example the NVIDIA driver installation.

Since the playbook first uninstalls all NVIDIA drivers, I am now left with no NVIDIA driver on my system, because the install task is skipped.

Report installed versions shows none installed

And then it FAILS at TASK [Validate GPU Operator Version for Cloud Native Core 6.1].

I have the following questions:
1. Is the way I specified the IP address and the other details in the hosts file correct?
2. Why is it skipping the driver installation process?
3. How do I fix the final error for the TASK [Validate GPU Operator Version for Cloud Native Core 6.1]?

Morganh · July 18, 2023, 8:47am

It is ok.

Can you upload the full logs?

amogh.dabholkar · July 18, 2023, 2:02pm

amogh.dabholkar · July 18, 2023, 2:06pm

Just to clarify the IP part: I am already logged in to 10.103.48.174, inside a conda environment, and that is the same machine I want as master, so I entered its own IP details. That is a valid way of doing this, right?

Morganh · July 18, 2023, 4:46pm

Yes, for a single node cluster, you can list only the master node.

Please try the latest notebook:
ngc registry resource download-version "nvidia/tao/tao-getting-started:4.0.2"

Also use the latest tao-api:
api_chart: https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.2.tgz
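For readers following along, the chart version change above can be sketched as a small shell helper. This is only an illustration: set_api_chart_version is a hypothetical function, not part of the TAO setup scripts, and it assumes the values file keeps api_chart on a single line, as in the original post.

```shell
# Sketch: point api_chart at a different chart version in the Ansible values file.
# set_api_chart_version is a hypothetical helper, not part of the TAO setup scripts.
set_api_chart_version() {
  values_file="$1"   # e.g. tao-toolkit-api-ansible-values.yml
  version="$2"       # e.g. 4.0.2
  tmp_file="$(mktemp)" || return 1
  # Rewrite only the api_chart line; every other line is copied unchanged.
  sed "s|^api_chart:.*|api_chart: https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-${version}.tgz|" \
    "$values_file" > "$tmp_file" || { rm -f "$tmp_file"; return 1; }
  # Write back in place so the original file keeps its permissions.
  cat "$tmp_file" > "$values_file"
  rm -f "$tmp_file"
}

# Usage (run from the directory that holds the values file):
#   set_api_chart_version tao-toolkit-api-ansible-values.yml 4.0.2
```

Going through a temporary file avoids relying on sed's in-place flag, which behaves differently across sed implementations.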

Morganh · July 18, 2023, 4:50pm

You can also take the hint from TAO AutoML - TAO Toolkit Setup - #11 by jay.duff and set the IP in the hosts file from the output of $ hostname -i
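As a rough sketch of that hint, the [master] entry can be generated from an address passed in as an argument. make_master_entry is a hypothetical helper name, and the entry format simply mirrors the hosts file shown in the first post. The password field is left out here on purpose; add ansible_ssh_pass as in the original post if key-based login is not set up.

```shell
# Sketch: print a single-node [master] inventory entry for a given IP and user.
# make_master_entry is a hypothetical helper, not part of the TAO setup scripts.
make_master_entry() {
  ip="$1"     # address of the master node, e.g. taken from `hostname -i`
  user="$2"   # SSH user on that node
  printf "[master]\n%s ansible_ssh_user='%s' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'\n" \
    "$ip" "$user"
}

# Usage (hostname -i may print several addresses; this takes the first one):
#   make_master_entry "$(hostname -i | awk '{print $1}')" "$USER" > hosts
```

On some Ubuntu systems hostname -i reports a loopback address such as 127.0.1.1, so it is worth checking the printed value before relying on it.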

amogh.dabholkar · July 18, 2023, 4:59pm

Sure, I will try that.
Just curious: where is v4.0.2 mentioned? I can see only v4.0.1 on the documentation pages.

Morganh · July 18, 2023, 5:04pm

The notebook link is in TAO Toolkit Getting Started | NVIDIA NGC
The link can be found in TAO Toolkit Quick Start Guide - NVIDIA Docs

All computer vision samples are included in the getting started resource on NGC.

The tao-api link can be found in TAO Toolkit | NVIDIA NGC

amogh.dabholkar · July 18, 2023, 5:11pm

Thank you.
I got confused because TAO Toolkit v4.0.1 (Latest Release) - NVIDIA Docs says v4.0.1 is the latest release.
I'll try this out and update here.

amogh.dabholkar · July 18, 2023, 6:11pm

I followed this and used v4.0.2 for everything, and I get stuck at the same place as that person.

TASK [Waiting for the Cluster to become available] never finishes. I waited 15 minutes and tried it twice. Rebooting the system did push it past the point where I was stuck, but the NVIDIA driver installation is still being skipped and 2 validations fail fatally.

I also tried 127.0.1.1 as the IP instead of the server's actual address.

When I run nvidia-smi, it still reports that the command is not found.

Morganh · July 19, 2023, 1:44am

That is expected.

When the install reaches this step, please open a new terminal and run
$ kubectl delete crd clusterpolicies.nvidia.com

amogh.dabholkar · July 19, 2023, 1:48am

So I’ll have to install nvidia drivers from scratch again? And everything else that was uninstalled?

Morganh · July 19, 2023, 1:50am

You do not need to install the NVIDIA driver.
Please run $ bash setup.sh install

amogh.dabholkar · July 19, 2023, 2:00am

This is the output of running bash setup.sh install. After it finished, nvidia-smi still found no driver.

Morganh · July 19, 2023, 2:01am

Can you share the output by uploading it as a .txt file via the upload button?

Morganh · July 19, 2023, 2:03am

To get the info from nvidia-smi, please run the command below.
$ kubectl get pods
Then find the pod named nvidia-smi-xxxx and run
$ kubectl exec nvidia-smi-xxxx -- nvidia-smi

amogh.dabholkar · July 19, 2023, 2:12am

This works. So to access the GPU on my server, I have to use kubectl every time? Plain nvidia-smi won't work? Sorry, I'm not familiar with Kubernetes.

Morganh · July 19, 2023, 2:13am

Yes, to get the info from nvidia-smi, please run the command above.
Usually the name of the nvidia-smi pod does not change, so you just need to run
$ kubectl exec nvidia-smi-xxxx -- nvidia-smi
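The two steps above (look up the pod, then exec into it) can be combined in a small wrapper. This is a minimal sketch, not an official tool: find_nvidia_smi_pod and gpu_smi are hypothetical names, and it assumes the pod lives in the current namespace with a name starting with nvidia-smi, as described in this thread.

```shell
# Sketch: run nvidia-smi through the nvidia-smi pod without typing its full name.
# find_nvidia_smi_pod and gpu_smi are hypothetical helper names.

# Read `kubectl get pods` output on stdin and print the first pod
# whose name starts with "nvidia-smi".
find_nvidia_smi_pod() {
  awk '$1 ~ /^nvidia-smi/ { print $1; exit }'
}

# Look up the pod in the current namespace and run nvidia-smi inside it.
# Any arguments are passed through to nvidia-smi.
gpu_smi() {
  pod="$(kubectl get pods --no-headers | find_nvidia_smi_pod)"
  if [ -z "$pod" ]; then
    echo "no nvidia-smi pod found" >&2
    return 1
  fi
  kubectl exec "$pod" -- nvidia-smi "$@"
}

# Usage:
#   gpu_smi
```

Keeping the name lookup separate from the kubectl calls means it can be checked against saved kubectl get pods output without a running cluster.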

amogh.dabholkar · July 19, 2023, 2:17am

I understand that this command works, but I would also like to run plain nvidia-smi on the host.
What should I do for that?
I ran $ bash setup.sh install. Is there anything else that needs to be done?

Morganh · July 19, 2023, 2:20am

It will not work now, since you have already installed tao-api. While you are using tao-api, please use the method above to get the info.
