Unable to install TAO Toolkit 5.2.0 API on bare metal

Hi! I'm having some issues installing TAO Toolkit API 5.2.0 on bare metal (a single machine) using the provided scripts.
Starting from a fresh Ubuntu 20.04.6 install, these are the steps:

  1. Define a “ubuntu” user with password “password”
  2. Update the system with apt upgrade
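(Steps 1 and 2 are just the standard commands, something like the following; adjust to your environment.)
> sudo adduser ubuntu
> sudo apt update && sudo apt upgrade -y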
  3. Install the NVIDIA NGC CLI:
> wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.39.0/files/ngccli_linux.zip -O ngccli_linux.zip && unzip ngccli_linux.zip
> chmod u+x ngc-cli/ngc
> echo "export PATH=\"\$PATH:$(pwd)/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
> ngc config set
  4. Download TAO Toolkit:
> ngc registry resource download-version "nvidia/tao/tao-getting-started:5.2.0"
> cd tao-getting-started_v5.2.0/setup/quickstart_api_bare_metal
> echo "ubuntu ALL=(ALL) NOPASSWD:ALL" | sudo tee -a /etc/sudoers
  5. Install openssh-server and get the host IP via:
> hostname -i
  6. Generate an SSH key pair, use ssh-copy-id, and check that ssh ubuntu@127.0.1.1 'sudo whoami' returns “root”
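(For reference, this step boils down to roughly the following commands; the key type is just what I picked.)
> ssh-keygen -t rsa
> ssh-copy-id ubuntu@127.0.1.1
> ssh ubuntu@127.0.1.1 'sudo whoami'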

  7. Update the hosts file, roughly of this form (the actual IP and password are placeholders here):
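[master]
<host-ip> ansible_ssh_user='ubuntu' ansible_ssh_pass='<password>' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'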

  8. Set parameters in tao-toolkit-api-ansible-values.yml

Then, running bash setup.sh install, this is the result:
first_run_log.txt (2.2 KB)
To get past the issue, I changed the value of “check gpu per node” to False. Restarting the installation gives this log:
second_run_log.txt (2.2 KB)
My GPUs are detected as VGA controllers rather than 3D controllers, so changing the grep condition to “VGA” fixed the problem. The new log is:
third_log_run.txt (15.7 KB)
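(For reference, how the GPUs are classified on the PCI bus can be checked with lspci; I'm assuming the setup script greps this output.)
> lspci | grep -i nvidia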
The site packages.google seems to be down, so I changed it to google.com and restarted the script. The system reboots at [Waiting for the Cluster to become available].
Executing the script after reboot gives:
4_run_log.txt (124.1 KB)
and the installation does not proceed.
Output of the command kubectl get pods --all-namespaces:

Executing command kubectl delete crd clusterpolicies.nvidia.com gives this log:
output.txt (167.2 KB)
The TAO API installation remains stuck. The kubectl describe pod tao command gives this info:
kubect_describe.txt (4.9 KB)
with connection-refused errors during the liveness checks.
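(To dig further into the probe failures, generic kubectl commands like these help; the pod name is whatever kubectl get pods reports for the tao-toolkit-api app pod.)
> kubectl get events --sort-by=.lastTimestamp | grep -i -E 'liveness|unhealthy'
> kubectl logs <tao-toolkit-api-app-pod-name>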

PC specs:

  • CPU: Intel(R) Xeon(R) w5-2445
  • RAM: 64 GB
  • GPU: 2x RTX A2000 (12 GB)

Thanks!

Thanks for the info. We will take a look further.

This is not expected. The system should not reboot at “Waiting for the Cluster to become available”. Do you have the full log from before “Waiting for the Cluster to become available”?

Hi Morganh, and thanks for your answer. Unfortunately not, because the system rebooted before the logs could be saved. I don’t think there were errors, though; it looked something like the file 4_run_log.txt.

Please try to uninstall the NVIDIA driver:

$ sudo apt purge nvidia-driver-xxx (please enter xxx after checking $ nvidia-smi)
$ sudo apt autoremove
$ sudo apt autoclean

Then reboot and run setup.sh again to check if it works. Thanks.

There’s no driver on the system, even after running setup.sh.

Yes, that is expected as well, since this is bare metal. You can run the install to check if it works.

For the other case: if the user has already installed the driver (so nvidia-smi is available), then modify gpu-operator-values.yml to set install_driver: false. This is the 2nd case.

You can first check if the 1st case works.

After running setup.sh uninstall, rebooting, and running setup.sh install, the log is this one:
after_reboot.txt (167.4 KB)
Some strange things that happen are:

The error on the tao-toolkit-api pod is due to the liveness probe:
kubectl_describe_1.txt (4.8 KB)

Just to confirm: you have already uninstalled the driver, and currently there is no output from $ nvidia-smi, right?

Yes, neither the driver nor nvidia-smi is installed.
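(For completeness, this is roughly how I checked; both are standard commands, and the grep pattern is just an example.)
> nvidia-smi
> dpkg -l | grep -i nvidia-driver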

OK, got it.

As above, there is an error when installing tao-toolkit-api. That is not expected.
Could you run:
$ kubectl logs -f tao-toolkit-api-app-pod-fdb966967-4ltvv

Also, can you share gpu-operator-values.yml ?

Reinstalling TAO with:

> helm upgrade --install --reset-values --cleanup-on-fail --create-namespace --namespace default --atomic --wait tao-toolkit-api https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-5.2.0.tgz --values /tmp/tao-toolkit-api-helm-values.yml --username='$oauthtoken' --password=ZXJhdWU5ZWluNG1uNHB2bWZldW51dm5mdWY6YzM1YWM2YjAtYmNmNC00N2QwLTk4ZGMtODA1ODBhM2FhNmUw --debug

This is the log:
helm_upgrade.txt (19.5 KB)

Meanwhile, kubectl logs -f tao-toolkit-api-app-pod-fdb966967-k72jx gives only “NGC CLI 3.23.0”.

And here is the content of gpu-operator-values.yml:

enable_mig: no
mig_profile: all-disabled
mig_strategy: single
nvidia_driver_version: "535.104.12"
install_driver: true

Please try the following format for the hosts file. Assuming your local machine has an IP of 10.34.4.222 and its password is xxx, set it as:

[master]
10.34.4.222 ansible_ssh_user='ubuntu' ansible_ssh_pass='xxx' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'

Then run setup.sh uninstall and setup.sh install again.

Running uninstall and then install gives the same errors:
log.txt (197.8 KB)
with the TAO pod failing:

Could you run the commands below and share the results? Thanks.

$ helm ls
$ helm upgrade --install --reset-values --cleanup-on-fail --create-namespace --namespace default --atomic --wait tao-toolkit-api https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-5.0.0.tgz --values /tmp/tao-toolkit-api-helm-values.yml --username='$oauthtoken' --password=ZXJhdWU5ZWluNG1uNHB2bWZldW51dm5mdWY6YzM1YWM2YjAtYmNmNC00N2QwLTk4ZGMtODA1ODBhM2FhNmUw --debug

Result of helm ls:

NAME                           	NAMESPACE	REVISION	UPDATED                                	STATUS  	CHART                                 	APP VERSION
ingress-nginx                  	default  	1       	2024-02-23 10:38:20.834726606 +0100 CET	deployed	ingress-nginx-4.9.1                   	1.9.6      
nfs-subdir-external-provisioner	default  	1       	2024-02-23 10:38:33.750200965 +0100 CET	deployed	nfs-subdir-external-provisioner-4.0.18	4.0.2 

Result of helm upgrade:

history.go:56: [debug] getting history for release tao-toolkit-api
Release "tao-toolkit-api" does not exist. Installing it now.
install.go:178: [debug] Original chart version: ""
install.go:195: [debug] CHART PATH: /home/ubuntu/.cache/helm/repository/tao-toolkit-api-5.0.0.tgz

client.go:128: [debug] creating 1 resource(s)
client.go:128: [debug] creating 15 resource(s)
wait.go:48: [debug] beginning wait for 15 resources with timeout of 5m0s
ready.go:277: [debug] Deployment is not ready: default/tao-toolkit-api-app-pod. 0 out of 1 expected pods are ready
ready.go:277: [debug] Deployment is not ready: default/tao-toolkit-api-app-pod. 0 out of 1 expected pods are ready
...
ready.go:277: [debug] Deployment is not ready: default/tao-toolkit-api-app-pod. 0 out of 1 expected pods are ready
install.go:441: [debug] Install failed and atomic is set, uninstalling release
uninstall.go:95: [debug] uninstall: Deleting tao-toolkit-api
client.go:299: [debug] Starting delete for "tao-toolkit-api-ingress-auth" Ingress
client.go:299: [debug] Starting delete for "tao-toolkit-api-ingress-openapi-yaml" Ingress
client.go:299: [debug] Starting delete for "tao-toolkit-api-ingress-login" Ingress
client.go:299: [debug] Starting delete for "tao-toolkit-api-ingress-openapi-json" Ingress
client.go:299: [debug] Starting delete for "tao-toolkit-api-ingress-redoc" Ingress
client.go:299: [debug] Starting delete for "tao-toolkit-api-ingress-swagger" Ingress
client.go:299: [debug] Starting delete for "tao-toolkit-api-jupyterlab-service" Service
client.go:299: [debug] Starting delete for "tao-toolkit-api-service" Service
client.go:299: [debug] Starting delete for "tao-toolkit-api-workflow-pod" Deployment
client.go:299: [debug] Starting delete for "tao-toolkit-api-jupyterlab-pod" Deployment
client.go:299: [debug] Starting delete for "tao-toolkit-api-app-pod" Deployment
client.go:299: [debug] Starting delete for "e7f6a9f345004c800bf5eae8b81c3b1a20760348-rbac-crb" ClusterRoleBinding
client.go:299: [debug] Starting delete for "e7f6a9f345004c800bf5eae8b81c3b1a20760348-rbac-cr" ClusterRole
client.go:299: [debug] Starting delete for "tao-toolkit-api-pvc" PersistentVolumeClaim
client.go:299: [debug] Starting delete for "tao-toolkit-api-tutorials-configmap" ConfigMap
uninstall.go:144: [debug] purge requested for tao-toolkit-api
Error: release tao-toolkit-api failed, and has been uninstalled due to atomic being set: timed out waiting for the condition
helm.go:84: [debug] timed out waiting for the condition
release tao-toolkit-api failed, and has been uninstalled due to atomic being set
helm.sh/helm/v3/pkg/action.(*Install).failRelease
	helm.sh/helm/v3/pkg/action/install.go:449
helm.sh/helm/v3/pkg/action.(*Install).reportToRun
	helm.sh/helm/v3/pkg/action/install.go:433
helm.sh/helm/v3/pkg/action.(*Install).performInstall
	helm.sh/helm/v3/pkg/action/install.go:389
runtime.goexit
	runtime/asm_amd64.s:1581

Could you run the command below to uninstall first? Please let me know whether the uninstallation completes successfully.
$ bash setup.sh uninstall

Do you mean uninstalling and then running helm ls and helm upgrade? If so, I think that won’t work, because helm itself will be uninstalled.

After $ bash setup.sh uninstall, it will auto-reboot; then please run $ bash setup.sh install.

TASK [capture user intent to override driver] *********************************************************************************************************************************************************************
[capture user intent to override driver]
One or more hosts has NVIDIA driver installed. Do you want to override it (y/n)?:

Should I type y or n?