Unable to install TAO Toolkit 5.2.0 API on bare metal

Hi! I'm having some issues installing TAO Toolkit API 5.2.0 on bare metal (a single machine) using the provided scripts.
Starting from a fresh Ubuntu 20.04.6 install, these are the steps:

  1. Define a “ubuntu” user with password “password”
  2. Update the system with apt upgrade
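(Steps 1 and 2 are just the standard commands, something like the following; adjust to your environment.)
> sudo adduser ubuntu
> sudo apt update && sudo apt upgrade -y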
  3. Install the NVIDIA NGC CLI:
> wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.39.0/files/ngccli_linux.zip -O ngccli_linux.zip && unzip ngccli_linux.zip
> chmod u+x ngc-cli/ngc
> echo "export PATH=\"\$PATH:$(pwd)/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
> ngc config set
  4. Download TAO Toolkit:
> ngc registry resource download-version "nvidia/tao/tao-getting-started:5.2.0"
> cd tao-getting-started_v5.2.0/setup/quickstart_api_bare_metal
> echo "ubuntu ALL=(ALL) NOPASSWD:ALL" | sudo tee -a /etc/sudoers
  5. Install openssh-server and get the host IP via:
> hostname -i
  6. Generate an SSH key pair, use ssh-copy-id, and check that ssh ubuntu@127.0.1.1 'sudo whoami' returns “root”
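(For reference, this step boils down to roughly the following commands; the key type is just what I picked.)
> ssh-keygen -t rsa
> ssh-copy-id ubuntu@127.0.1.1
> ssh ubuntu@127.0.1.1 'sudo whoami'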

  7. Update the hosts file, roughly of this form (the actual IP and password are placeholders here):
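[master]
<host-ip> ansible_ssh_user='ubuntu' ansible_ssh_pass='<password>' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'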

  8. Set parameters in tao-toolkit-api-ansible-values.yml

Then, running bash setup.sh install, this is the result:
first_run_log.txt (2.2 KB)
To get past the issue, I changed the value of “check gpu per node” to False. Restarting the installation gives this log:
second_run_log.txt (2.2 KB)
My GPUs are detected as VGA controllers rather than 3D controllers, so changing the grep condition to “VGA” fixed the problem. The new log is:
third_log_run.txt (15.7 KB)
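(For reference, how the GPUs are classified on the PCI bus can be checked with lspci; I'm assuming the setup script greps this output.)
> lspci | grep -i nvidia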
The site packages.google seems to be down, so I changed it to google.com and restarted the script. The system reboots at [Waiting for the Cluster to become available].
Executing the script after reboot gives:
4_run_log.txt (124.1 KB)
and the installation does not proceed.
Output of the command kubectl get pods --all-namespaces:

Executing command kubectl delete crd clusterpolicies.nvidia.com gives this log:
output.txt (167.2 KB)
The TAO API installation remains stuck. The kubectl describe pod tao command gives this info:
kubect_describe.txt (4.9 KB)
with connection-refused errors during the liveness checks.
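(To dig further into the probe failures, generic kubectl commands like these help; the pod name is whatever kubectl get pods reports for the tao-toolkit-api app pod.)
> kubectl get events --sort-by=.lastTimestamp | grep -i -E 'liveness|unhealthy'
> kubectl logs <tao-toolkit-api-app-pod-name>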

PC specs:

  • CPU: Intel(R) Xeon(R) w5-2445
  • RAM: 64 GB
  • GPU: 2x RTX A2000 (12 GB)

Thanks!

Thanks for the info. We will take a look further.

This is not expected. The system should not reboot at “Waiting for the Cluster to become available”. Do you have the full log from before “Waiting for the Cluster to become available”?

Hi Morganh, and thanks for your answer. Unfortunately not, because the system rebooted before the logs could be saved. I don’t think there were errors, though; it looked something like the file 4_run_log.txt.

Please try to uninstall the NVIDIA driver:

$ sudo apt purge nvidia-driver-xxx (please enter xxx after checking $ nvidia-smi)
$ sudo apt autoremove
$ sudo apt autoclean

Then reboot and run setup.sh again to check if it works. Thanks.

There’s no driver on the system, even after running setup.sh.

Yes, that is expected as well, since this is bare metal. You can run the install to check if it works.

For the other case: if the user has already installed the driver (so nvidia-smi is available), then modify gpu-operator-values.yml to set install_driver: false. This is the 2nd case.

You can first check if the 1st case works.

After running setup.sh uninstall, rebooting, and running setup.sh install, the log is this one:
after_reboot.txt (167.4 KB)
Some strange things that happen are:

The error on the tao-toolkit-api pod is due to the liveness probe:
kubectl_describe_1.txt (4.8 KB)

Just to confirm: you have already uninstalled the driver, and currently there is no output from $ nvidia-smi, right?

Yes, neither the driver nor nvidia-smi is installed.
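(For completeness, this is roughly how I checked; both are standard commands, and the grep pattern is just an example.)
> nvidia-smi
> dpkg -l | grep -i nvidia-driver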

OK, got it.

As above, there is an error when installing tao-toolkit-api. That is not expected.
Could you run:
$ kubectl logs -f tao-toolkit-api-app-pod-fdb966967-4ltvv

Also, can you share gpu-operator-values.yml ?

Reinstalling TAO with:

> helm upgrade --install --reset-values --cleanup-on-fail --create-namespace --namespace default --atomic --wait tao-toolkit-api https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-5.2.0.tgz --values /tmp/tao-toolkit-api-helm-values.yml --username='$oauthtoken' --password=ZXJhdWU5ZWluNG1uNHB2bWZldW51dm5mdWY6YzM1YWM2YjAtYmNmNC00N2QwLTk4ZGMtODA1ODBhM2FhNmUw --debug

This is the log:
helm_upgrade.txt (19.5 KB)

Meanwhile, kubectl logs -f tao-toolkit-api-app-pod-fdb966967-k72jx gives only “NGC CLI 3.23.0”.

And here is the content of gpu-operator-values.yml:

enable_mig: no
mig_profile: all-disabled
mig_strategy: single
nvidia_driver_version: "535.104.12"
install_driver: true

Please try the following format for the hosts file. Assuming your local machine has an IP of 10.34.4.222 and its password is xxx, set it as:

[master]
10.34.4.222 ansible_ssh_user='ubuntu' ansible_ssh_pass='xxx' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'

Then run setup.sh uninstall and setup.sh install again.

Running uninstall and then install gives the same errors:
log.txt (197.8 KB)
with the TAO pod failing:

Could you run the commands below and share the results? Thanks.

$ helm ls
$ helm upgrade --install --reset-values --cleanup-on-fail --create-namespace --namespace default --atomic --wait tao-toolkit-api https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-5.0.0.tgz --values /tmp/tao-toolkit-api-helm-values.yml --username='$oauthtoken' --password=ZXJhdWU5ZWluNG1uNHB2bWZldW51dm5mdWY6YzM1YWM2YjAtYmNmNC00N2QwLTk4ZGMtODA1ODBhM2FhNmUw --debug

Result of helm ls:

NAME                           	NAMESPACE	REVISION	UPDATED                                	STATUS  	CHART                                 	APP VERSION
ingress-nginx                  	default  	1       	2024-02-23 10:38:20.834726606 +0100 CET	deployed	ingress-nginx-4.9.1                   	1.9.6      
nfs-subdir-external-provisioner	default  	1       	2024-02-23 10:38:33.750200965 +0100 CET	deployed	nfs-subdir-external-provisioner-4.0.18	4.0.2 

Result of helm upgrade:

history.go:56: [debug] getting history for release tao-toolkit-api
Release "tao-toolkit-api" does not exist. Installing it now.
install.go:178: [debug] Original chart version: ""
install.go:195: [debug] CHART PATH: /home/ubuntu/.cache/helm/repository/tao-toolkit-api-5.0.0.tgz

client.go:128: [debug] creating 1 resource(s)
client.go:128: [debug] creating 15 resource(s)
wait.go:48: [debug] beginning wait for 15 resources with timeout of 5m0s
ready.go:277: [debug] Deployment is not ready: default/tao-toolkit-api-app-pod. 0 out of 1 expected pods are ready
ready.go:277: [debug] Deployment is not ready: default/tao-toolkit-api-app-pod. 0 out of 1 expected pods are ready
...
ready.go:277: [debug] Deployment is not ready: default/tao-toolkit-api-app-pod. 0 out of 1 expected pods are ready
install.go:441: [debug] Install failed and atomic is set, uninstalling release
uninstall.go:95: [debug] uninstall: Deleting tao-toolkit-api
client.go:299: [debug] Starting delete for "tao-toolkit-api-ingress-auth" Ingress
client.go:299: [debug] Starting delete for "tao-toolkit-api-ingress-openapi-yaml" Ingress
client.go:299: [debug] Starting delete for "tao-toolkit-api-ingress-login" Ingress
client.go:299: [debug] Starting delete for "tao-toolkit-api-ingress-openapi-json" Ingress
client.go:299: [debug] Starting delete for "tao-toolkit-api-ingress-redoc" Ingress
client.go:299: [debug] Starting delete for "tao-toolkit-api-ingress-swagger" Ingress
client.go:299: [debug] Starting delete for "tao-toolkit-api-jupyterlab-service" Service
client.go:299: [debug] Starting delete for "tao-toolkit-api-service" Service
client.go:299: [debug] Starting delete for "tao-toolkit-api-workflow-pod" Deployment
client.go:299: [debug] Starting delete for "tao-toolkit-api-jupyterlab-pod" Deployment
client.go:299: [debug] Starting delete for "tao-toolkit-api-app-pod" Deployment
client.go:299: [debug] Starting delete for "e7f6a9f345004c800bf5eae8b81c3b1a20760348-rbac-crb" ClusterRoleBinding
client.go:299: [debug] Starting delete for "e7f6a9f345004c800bf5eae8b81c3b1a20760348-rbac-cr" ClusterRole
client.go:299: [debug] Starting delete for "tao-toolkit-api-pvc" PersistentVolumeClaim
client.go:299: [debug] Starting delete for "tao-toolkit-api-tutorials-configmap" ConfigMap
uninstall.go:144: [debug] purge requested for tao-toolkit-api
Error: release tao-toolkit-api failed, and has been uninstalled due to atomic being set: timed out waiting for the condition
helm.go:84: [debug] timed out waiting for the condition
release tao-toolkit-api failed, and has been uninstalled due to atomic being set
helm.sh/helm/v3/pkg/action.(*Install).failRelease
	helm.sh/helm/v3/pkg/action/install.go:449
helm.sh/helm/v3/pkg/action.(*Install).reportToRun
	helm.sh/helm/v3/pkg/action/install.go:433
helm.sh/helm/v3/pkg/action.(*Install).performInstall
	helm.sh/helm/v3/pkg/action/install.go:389
runtime.goexit
	runtime/asm_amd64.s:1581

Could you run the command below to uninstall first? Please let me know whether the uninstallation completes successfully.
$ bash setup.sh uninstall

Do you mean uninstalling and then running helm ls and helm upgrade? If so, I think that won’t work, because helm itself will be uninstalled.

After $ bash setup.sh uninstall, it will auto-reboot; then please run $ bash setup.sh install.

TASK [capture user intent to override driver] *********************************************************************************************************************************************************************
[capture user intent to override driver]
One or more hosts has NVIDIA driver installed. Do you want to override it (y/n)?:

Should I type y or n?