TAO AutoML setup/installation issue on bare metal (single node/local deployment)

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) : 2 * A40
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) : First need to setup
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I need to use TAO AutoML.

I am following the steps mentioned here: tao api setup

I am using my local node/server to set up the TAO API services.

I faced a lot of issues and resolved some of them after looking at earlier NVIDIA forum answers on this topic, but I am still running into problems.

I am attaching the log file for "bash setup.sh install":

log_setup_sh_install.txt (94.8 KB)

If you look at the log file, my understanding is that I am mainly getting one error, and the GPU validation is also failing:

TASK [Validating the CUDA with GPU] ********************************************
ASYNC POLL on 127.0.1.1: jid=j70943425518.1881829 started=1 finished=0
ASYNC POLL on 127.0.1.1: jid=j70943425518.1881829 started=1 finished=0
ASYNC POLL on 127.0.1.1: jid=j70943425518.1881829 started=1 finished=0
ASYNC FAILED on 127.0.1.1: jid=j70943425518.1881829
fatal: [127.0.1.1]: FAILED! => {"ansible_job_id": "j70943425518.1881829", "changed": true, "cmd": "kubectl run cuda-vector-add --rm -t -i --restart=Never --image=k8s.gcr.io/cuda-vector-add:v0.1", "delta": "0:01:00.159991", "end": "2025-03-21 12:21:13.013696", "finished": 1, "msg": "non-zero return code", "rc": 1, "results_file": "/home/administrator/.ansible_async/j70943425518.1881829", "start": "2025-03-21 12:20:12.853705", "started": 1, "stderr": "error: timed out waiting for the condition", "stderr_lines": ["error: timed out waiting for the condition"], "stdout": "pod \"cuda-vector-add\" deleted", "stdout_lines": ["pod \"cuda-vector-add\" deleted"]}
...ignoring
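If it helps with debugging, these are the kinds of checks I can run next to see whether the GPU is actually visible to Kubernetes (standard kubectl commands; the exact namespace of the GPU Operator / device plugin pods on my install may differ):

# Check whether the node advertises GPU resources to Kubernetes
kubectl describe node | grep -i "nvidia.com/gpu"

# Check the NVIDIA device plugin / GPU Operator pods (namespace may vary)
kubectl get pods -A | grep -i nvidia

# Re-run the validation pod without --rm so the failure reason stays visible
kubectl run cuda-vector-add --restart=Never --image=k8s.gcr.io/cuda-vector-add:v0.1
kubectl describe pod cuda-vector-add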

TASK [Validating the nvidia-smi on NVIDIA Cloud Native Stack] ******************
ASYNC POLL on 127.0.1.1: jid=j933072673456.1883729 started=1 finished=0
ASYNC POLL on 127.0.1.1: jid=j933072673456.1883729 started=1 finished=0
ASYNC FAILED on 127.0.1.1: jid=j933072673456.1883729
fatal: [127.0.1.1]: FAILED! => {"ansible_job_id": "j933072673456.1883729", "changed": true, "cmd": "kubectl delete -f nvidia-smi.yaml; sleep 10; kubectl apply -f nvidia-smi.yaml; sleep 25; kubectl logs nvidia-smi", "delta": "0:00:36.267508", "end": "2025-03-21 12:21:51.836028", "finished": 1, "msg": "non-zero return code", "rc": 1, "results_file": "/home/administrator/.ansible_async/j933072673456.1883729", "start": "2025-03-21 12:21:15.568520", "started": 1, "stderr": "Error from server (NotFound): error when deleting \"nvidia-smi.yaml\": pods \"nvidia-smi\" not found\nError from server (BadRequest): container \"nvidia-smi\" in pod \"nvidia-smi\" is waiting to start: ContainerCreating", "stderr_lines": ["Error from server (NotFound): error when deleting \"nvidia-smi.yaml\": pods \"nvidia-smi\" not found", "Error from server (BadRequest): container \"nvidia-smi\" in pod \"nvidia-smi\" is waiting to start: ContainerCreating"], "stdout": "pod/nvidia-smi created", "stdout_lines": ["pod/nvidia-smi created"]}
...ignoring
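Similarly, since the nvidia-smi pod is stuck in ContainerCreating, I can capture its events if that is useful (again standard kubectl; the pod name comes from the nvidia-smi.yaml used by the installer):

# Show why the pod is not starting (image pull, missing GPU runtime, etc.)
kubectl describe pod nvidia-smi

# List recent cluster events in order
kubectl get events --sort-by=.metadata.creationTimestamp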

and

TASK [install tao-toolkit-api] *************************************************
fatal: [127.0.1.1]: FAILED! => {"changed": true, "cmd": "helm upgrade --install --reset-values --cleanup-on-fail --create-namespace --namespace default --atomic --wait tao-api https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-5.5.0.tgz --values /tmp/tao-toolkit-api-helm-values.yml --username='$oauthtoken' --password=YmR0a2Fqa2lmNXRqMW8xMWkyOGY4ZDhhcnQ6YWFiZTMyYzctNDVmZi00NTMwLTk5ZTgtZjE0ODBmZjRlYzk5", "delta": "0:05:02.379604", "end": "2025-03-21 12:28:14.084631", "msg": "non-zero return code", "rc": 1, "start": "2025-03-21 12:23:11.705027", "stderr": "W0321 12:23:13.683970 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW0321 12:23:13.684015 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW0321 12:23:13.684142 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW0321 12:23:13.684144 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW0321 12:23:13.683968 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW0321 12:23:13.684339 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW0321 12:23:13.699830 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW0321 12:23:13.699842 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW0321 12:23:13.699977 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nW0321 12:23:13.701518 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead\nError: release tao-api failed, and has been uninstalled due to atomic being set: context deadline exceeded", "stderr_lines": ["W0321 12:23:13.683970 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W0321 12:23:13.684015 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W0321 12:23:13.684142 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W0321 12:23:13.684144 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W0321 12:23:13.683968 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W0321 12:23:13.684339 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W0321 12:23:13.699830 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W0321 12:23:13.699842 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W0321 12:23:13.699977 1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "W0321 12:23:13.701518 
1892289 warnings.go:70] annotation \"kubernetes.io/ingress.class\" is deprecated, please use 'spec.ingressClassName' instead", "Error: release tao-api failed, and has been uninstalled due to atomic being set: context deadline exceeded"], "stdout": "Release \"tao-api\" does not exist. Installing it now.", "stdout_lines": ["Release \"tao-api\" does not exist. Installing it now."]}
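For the helm failure, "context deadline exceeded" with --atomic means the release was rolled back before all pods became ready, so the underlying reason is hidden. If needed, I can watch the pods while the chart installs and describe whichever one stays Pending or in CrashLoopBackOff (assuming the chart installs into the default namespace, as in the command above):

# In a second terminal while helm is installing
kubectl get pods -n default -w

# Once a pod is stuck, inspect it and any unbound persistent volume claims
kubectl describe pod <stuck-pod-name> -n default
kubectl get pvc -n default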

Please take a look and suggest possible solutions.

Looking at the forum answers on this topic, my impression is that it is not straightforward to set up TAO AutoML on a bare metal machine.

I tried to find a solution in the NVIDIA topics below:

What is the recommended way to use TAO AutoML (which uses the TAO API services on Kubernetes) on a local server, i.e. a cluster with only one node, the master node?

Thanks.

Yes, installing on a bare metal machine is expected to work.

It can work with only one node.
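As a rough sanity check for a single-node cluster (not part of the official setup steps; standard kubectl commands, and the taint key may be node-role.kubernetes.io/master- on older Kubernetes versions), you can confirm the node is Ready, schedulable, and advertising GPUs:

# Node should be Ready
kubectl get nodes

# Capacity/Allocatable should list nvidia.com/gpu
kubectl describe node | grep -A 8 -i "Capacity"

# If the control-plane taint blocks scheduling on the single node, remove it
kubectl taint nodes --all node-role.kubernetes.io/control-plane-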