TAO AutoML setup/installation issue on bare metal (single node/local deployment)

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) : 2 * A40
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) : First need to setup
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I need to use TAO AutoML.

I am following the steps mentioned here: tao api setup

I am using my local node/server to set up the TAO API services.

I faced a lot of issues and resolved some of them after looking at earlier NVIDIA forum answers on this topic, but I am still running into problems.

I am attaching the log file for "bash setup.sh install":

log_setup_sh_install.txt (94.8 KB)

If you look at the log file, my understanding is that I am mainly getting one error, and the GPU validation is also failing:

TASK [Validating the CUDA with GPU] ********************************************
ASYNC POLL on 127.0.1.1: jid=j70943425518.1881829 started=1 finished=0
ASYNC POLL on 127.0.1.1: jid=j70943425518.1881829 started=1 finished=0
ASYNC POLL on 127.0.1.1: jid=j70943425518.1881829 started=1 finished=0
ASYNC FAILED on 127.0.1.1: jid=j70943425518.1881829
fatal: [127.0.1.1]: FAILED! => {"ansible_job_id": "j70943425518.1881829", "changed": true, "cmd": "kubectl run cuda-vector-add --rm -t -i --restart=Never --image=k8s.gcr.io/cuda-vector-add:v0.1", "delta": "0:01:00.159991", "end": "2025-03-21 12:21:13.013696", "finished": 1, "msg": "non-zero return code", "rc": 1, "results_file": "/home/administrator/.ansible_async/j70943425518.1881829", "start": "2025-03-21 12:20:12.853705", "started": 1, "stderr": "error: timed out waiting for the condition", "stderr_lines": ["error: timed out waiting for the condition"], "stdout": "pod \"cuda-vector-add\" deleted", "stdout_lines": ["pod \"cuda-vector-add\" deleted"]}
...ignoring
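If it helps with debugging, these are the kinds of checks I can run next to see whether the GPU is actually visible to Kubernetes (standard kubectl commands; the exact namespace of the GPU Operator / device plugin pods on my install may differ):

# Check whether the node advertises GPU resources to Kubernetes
kubectl describe node | grep -i "nvidia.com/gpu"

# Check the NVIDIA device plugin / GPU Operator pods (namespace may vary)
kubectl get pods -A | grep -i nvidia

# Re-run the validation pod without --rm so the failure reason stays visible
kubectl run cuda-vector-add --restart=Never --image=k8s.gcr.io/cuda-vector-add:v0.1
kubectl describe pod cuda-vector-add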

TASK [Validating the nvidia-smi on NVIDIA Cloud Native Stack] ******************
ASYNC POLL on 127.0.1.1: jid=j933072673456.1883729 started=1 finished=0
ASYNC POLL on 127.0.1.1: jid=j933072673456.1883729 started=1 finished=0
ASYNC FAILED on 127.0.1.1: jid=j933072673456.1883729
fatal: [127.0.1.1]: FAILED! => {"ansible_job_id": "j933072673456.1883729", "changed": true, "cmd": "kubectl delete -f nvidia-smi.yaml; sleep 10; kubectl apply -f nvidia-smi.yaml; sleep 25; kubectl logs nvidia-smi", "delta": "0:00:36.267508", "end": "2025-03-21 12:21:51.836028", "finished": 1, "msg": "non-zero return code", "rc": 1, "results_file": "/home/administrator/.ansible_async/j933072673456.1883729", "start": "2025-03-21 12:21:15.568520", "started": 1, "stderr": "Error from server (NotFound): error when deleting \"nvidia-smi.yaml\": pods \"nvidia-smi\" not found\nError from server (BadRequest): container \"nvidia-smi\" in pod \"nvidia-smi\" is waiting to start: ContainerCreating", "stderr_lines": ["Error from server (NotFound): error when deleting \"nvidia-smi.yaml\": pods \"nvidia-smi\" not found", "Error from server (BadRequest): container \"nvidia-smi\" in pod \"nvidia-smi\" is waiting to start: ContainerCreating"], "stdout": "pod/nvidia-smi created", "stdout_lines": ["pod/nvidia-smi created"]}
...ignoring
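Similarly, since the nvidia-smi pod is stuck in ContainerCreating, I can capture its events if that is useful (again standard kubectl; the pod name comes from the nvidia-smi.yaml used by the installer):

# Show why the pod is not starting (image pull, missing GPU runtime, etc.)
kubectl describe pod nvidia-smi

# List recent cluster events in order
kubectl get events --sort-by=.metadata.creationTimestamp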

and

TASK [install tao-toolkit-api] *************************************************
fatal: [127.0.1.1]: FAILED! => {"changed": true, "cmd": "helm upgrade --install --reset-values --cleanup-on-fail --create-namespace --namespace default --atomic --wait tao-api https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-5.5.0.tgz --values /tmp/tao-toolkit-api-helm-values.yml --username='$oauthtoken' --password=YmR0a2Fqa2lmNXRqMW8xMWkyOGY4ZDhhcnQ6YWFiZTMyYzctNDVmZi00NTMwLTk5ZTgtZjE0ODBmZjRlYzk5", "delta": "0:05:02.379604", "end": "2025-03-21 12:28:14.084631", "msg": "non-zero return code", "rc": 1, "start": "2025-03-21 12:23:11.705027", "stderr": "W0321 12:23:13.683970 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW0321 12:23:13.684015 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW0321 12:23:13.684142 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW0321 12:23:13.684144 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW0321 12:23:13.683968 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW0321 12:23:13.684339 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW0321 12:23:13.699830 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW0321 12:23:13.699842 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW0321 12:23:13.699977 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW0321 12:23:13.701518 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nError: release tao-api failed, and has been uninstalled due to atomic being set: context deadline exceeded", "stderr_lines": ["W0321 12:23:13.683970 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W0321 12:23:13.684015 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W0321 12:23:13.684142 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W0321 12:23:13.684144 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W0321 12:23:13.683968 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W0321 12:23:13.684339 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W0321 12:23:13.699830 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W0321 12:23:13.699842 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W0321 12:23:13.699977 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W0321 12:23:13.701518 
1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "Error: release tao-api failed, and has been uninstalled due to atomic being set: context deadline exceeded"], "stdout": "Release \"tao-api\" does not exist. Installing it now.", "stdout_lines": ["Release \"tao-api\" does not exist. Installing it now."]}
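For the helm failure, "context deadline exceeded" with --atomic means the release was rolled back before all pods became ready, so the underlying reason is hidden. If needed, I can watch the pods while the chart installs and describe whichever one stays Pending or in CrashLoopBackOff (assuming the chart installs into the default namespace, as in the command above):

# In a second terminal while helm is installing
kubectl get pods -n default -w

# Once a pod is stuck, inspect it and any unbound persistent volume claims
kubectl describe pod <stuck-pod-name> -n default
kubectl get pvc -n default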

Please take a look and suggest possible solutions.

Looking at the forum answers on this topic, my impression is that it is not straightforward to set up TAO AutoML on a bare metal machine.

I tried to find a solution in the NVIDIA topics below:

What is the recommended way to use TAO AutoML (which uses the TAO API services on Kubernetes) on a local server, i.e. a cluster with only one node, the master node?

Thanks.

Yes, installing on a bare metal machine is expected to work.

It can work with only one node.
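As a rough sanity check for a single-node cluster (not part of the official setup steps; standard kubectl commands, and the taint key may be node-role.kubernetes.io/master- on older Kubernetes versions), you can confirm the node is Ready, schedulable, and advertising GPUs:

# Node should be Ready
kubectl get nodes

# Capacity/Allocatable should list nvidia.com/gpu
kubectl describe node | grep -A 8 -i "Capacity"

# If the control-plane taint blocks scheduling on the single node, remove it
kubectl taint nodes --all node-role.kubernetes.io/control-plane-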