TAO AutoML - TAO Toolkit Setup

jay.duff · May 12, 2023, 5:31pm

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) - AMD64-based Ubuntu 20.04 machine with two RTX 3080 Ti GPUs. This is my own server, so I’m assuming the “bare metal” setup applies. ref: Setup - NVIDIA Docs

• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
haven’t gotten that far - but working towards AutoML
ref: AutoML - NVIDIA Docs
i.e. trying to get a working TAO server first

• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
Not applicable - haven’t gotten that far
• Training spec file(If have, please share here)
Not applicable - haven’t gotten that far
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

AMD64 Ubuntu 20.04 desktop with 2 RTX 3080 Ti GPUs.
I followed the “bare metal” installation directions.
When executing: bash setup.sh install
I got an error: cat: target-os: No such file or directory

Since I know I’m on Ubuntu 20.04, I hard-coded the values in the script:
function prepare_cnc() {
    local os
    os="Ubuntu"
    local os_version
    os_version="20.04"
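The error text shows that `cat` was asked to read a file named `target-os` that did not exist in the working directory. As an alternative to hard-coding, a guard along the following lines could fall back to a known value only when the file is missing. This is a minimal sketch: `read_or_default` is a hypothetical helper, not part of setup.sh, and how setup.sh actually uses `target-os` is not shown in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of setup.sh): print the contents of a file
# if it is readable, otherwise print the supplied default value.
read_or_default() {
  local file="$1" default="$2"
  if [ -r "$file" ]; then
    cat "$file"
  else
    printf '%s\n' "$default"
  fi
}

# Usage (assumed file name taken from the error message):
#   os="$(read_or_default target-os Ubuntu)"
```

A fallback like this hides the missing file rather than explaining it, so it is only a stopgap while the cause is tracked down.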

The script then appears to run okay, until a new error:
TASK [Report Versions] ****************************************************************************************************************************************************************************
ok: [127.0.0.1] => {
"msg": [
"===========================================================================================",
" Components Matrix Version || Installed Version ",
"===========================================================================================",
"GPU Operator Version v1.10.1 || ",
"Nvidia Container Driver Version 510.47.03 || ",
"GPU Operator NV Toolkit Driver v1.9.0 || ",
"K8sDevice Plugin Version v0.11.0 || ",
"Data Center GPU Manager(DCGM) Version 2.3.4-2.6.4 || ",
"Node Feature Discovery Version v0.10.1 || ",
"GPU Feature Discovery Version v0.5.0 || ",
"Nvidia validator version v1.10.1 || ",
"Nvidia MIG Manager version 0.3.0 || ",
"",
"Note: NVIDIA Mig Manager is valid for only Amphere GPU's like A100, A30",
"",
"Please validate between Matrix Version and Installed Version listed above"
]
}

TASK [Componenets Matrix Versions Vs Installed Versions] ******************************************************************************************************************************************
skipping: [127.0.0.1]

TASK [Report Versions] ****************************************************************************************************************************************************************************
skipping: [127.0.0.1]

TASK [Componenets Matrix Versions Vs Installed Versions] ******************************************************************************************************************************************
skipping: [127.0.0.1]

TASK [Report Versions] ****************************************************************************************************************************************************************************
skipping: [127.0.0.1]

TASK [Validate the GPU Operator pods State] *******************************************************************************************************************************************************
fatal: [127.0.0.1]: FAILED! => {"changed": true, "cmd": "kubectl get pods --all-namespaces | egrep -v 'kube-system|NAME'", "delta": "0:00:00.042849", "end": "2023-05-12 13:04:11.768863", "failed_when_result": true, "msg": "non-zero return code", "rc": 1, "start": "2023-05-12 13:04:11.726014", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring

TASK [Report GPU Operator Pods] *******************************************************************************************************************************************************************
ok: [127.0.0.1] => {
"msg": []
}

TASK [Validate GPU Operator Version for Cloud Native Core 6.2 and 7.0] ****************************************************************************************************************************
changed: [127.0.0.1]

TASK [Validate GPU Operator Version for Cloud Native Core 6.1] ************************************************************************************************************************************
fatal: [127.0.0.1]: FAILED! => {"changed": true, "cmd": "helm ls -A | grep gpu-operator | awk '{print $NF}' | grep -v VERSION | sed 's/v//g'", "delta": "0:00:00.028749", "end": "2023-05-12 13:04:12.236032", "failed_when_result": true, "msg": "", "rc": 0, "start": "2023-05-12 13:04:12.207283", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

PLAY RECAP ****************************************************************************************************************************************************************************************
127.0.0.1 : ok=40 changed=23 unreachable=0 failed=1 skipped=19 rescued=0 ignored=1

How can I correct this GPU Operator for Cloud Native Core error?
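A note on the failing task: its "cmd" field is a filter chain over `helm ls -A`, and the log shows "rc": 0 with an empty "stdout". That combination is what the chain produces when no line of the helm listing mentions gpu-operator. The sketch below reproduces the same chain on saved text so the behaviour can be checked offline. The sample listing is invented for illustration, and the exact condition the playbook uses to mark the task as failed is not shown in the log.

```shell
#!/usr/bin/env bash
# Same filter chain as the "cmd" field in the failing task, but reading
# `helm ls -A` style output from stdin so it can be tried on saved text.
gpu_operator_version() {
  grep gpu-operator | awk '{print $NF}' | grep -v VERSION | sed 's/v//g'
}

# Hypothetical sample of `helm ls -A` output (values invented for illustration).
sample='NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
gpu-operator-1 nvidia-gpu-operator 1 2023-05-12 deployed gpu-operator-v1.10.1 v1.10.1'

# A release is listed: the last column is printed with the "v" stripped.
printf '%s\n' "$sample" | gpu_operator_version

# Nothing is listed: no output, yet the chain still exits 0 because the
# exit status of a pipeline is that of its last command (sed).
printf '' | gpu_operator_version

# Live usage: helm ls -A | gpu_operator_version
```

An empty result therefore points at the GPU Operator release not being installed at all, which matches the empty `kubectl get pods` output earlier in the same log.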

Morganh · May 13, 2023, 2:13pm

Can you share the result of
$ kubectl get pods

Then, please run the command below against each pod:
$ kubectl describe pod <pod_name>

For example,
$ kubectl describe pod tao-toolkit-api-app-pod-<str>

Also, please uninstall the NVIDIA driver and re-run the installation.
$ sudo apt purge nvidia-driver-*
$ sudo apt autoremove
$ sudo apt autoclean

jay.duff · May 15, 2023, 7:14pm

I checked my NVIDIA driver. X.Org X server being used.
I selected 525 and rebooted.

ubuntu@5950X:~/tao-getting-started_v4.0.1$ nvidia-smi
Mon May 15 14:53:46 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:0B:00.0  On |                  N/A |
|  0%   30C    P8    28W / 350W |    589MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:0C:00.0 Off |                  N/A |
|  0%   29C    P8     5W / 350W |     10MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1863      G   /usr/lib/xorg/Xorg                102MiB |
|    0   N/A  N/A      4852      G   /usr/lib/xorg/Xorg                135MiB |
|    0   N/A  N/A      5562      G   /usr/bin/gnome-shell              132MiB |
|    0   N/A  N/A      6547      G   ...404034420474974049,262144      205MiB |
|    1   N/A  N/A      1863      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      4852      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

I re-ran the installation and got the same result.
Note that I also ran kubectl get pods:

TASK [Validate GPU Operator Version for Cloud Native Core 6.1] *************************************************************************************************************************************************************************************************************************************************************************
fatal: [127.0.0.1]: FAILED! => {"changed": true, "cmd": "helm ls -A | grep gpu-operator |  awk '{print $NF}' | grep -v VERSION | sed 's/v//g'", "delta": "0:00:00.027629", "end": "2023-05-15 14:58:24.533876", "failed_when_result": true, "msg": "", "rc": 0, "start": "2023-05-15 14:58:24.506247", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

PLAY RECAP *****************************************************************************************************************************************************************************************************************************************************************************************************************************
127.0.0.1                  : ok=40   changed=23   unreachable=0    failed=1    skipped=19   rescued=0    ignored=1   

ubuntu@5950X:~/tao-getting-started_v4.0.1/setup/quickstart_api_bare_metal$ kubectl get pods
No resources found in default namespace.

I also ran the apt purge/autoremove/autoclean commands,
re-ran setup.sh,
and got the same results:

TASK [Validate GPU Operator Version for Cloud Native Core 6.1] ***********************************************************************************************************************************
fatal: [127.0.0.1]: FAILED! => {"changed": true, "cmd": "helm ls -A | grep gpu-operator |  awk '{print $NF}' | grep -v VERSION | sed 's/v//g'", "delta": "0:00:00.029867", "end": "2023-05-15 15:03:21.542528", "failed_when_result": true, "msg": "", "rc": 0, "start": "2023-05-15 15:03:21.512661", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

PLAY RECAP ***************************************************************************************************************************************************************************************
127.0.0.1                  : ok=40   changed=23   unreachable=0    failed=1    skipped=19   rescued=0    ignored=1   

ubuntu@5950X:~/tao-getting-started_v4.0.1/setup/quickstart_api_bare_metal$ kubectl get pods
No resources found in default namespace.
ubuntu@5950X:~/tao-getting-started_v4.0.1/setup/quickstart_api_bare_metal$ nvidia-smi
bash: /usr/bin/nvidia-smi: No such file or directory

I noticed I’m back to driver: X.Org X server

Morganh · May 16, 2023, 6:40am

Could you run the commands below?
$ sudo apt purge nvidia-driver-525
$ sudo apt autoremove
$ sudo apt autoclean

Then reboot and run again?
$ bash setup.sh install

jay.duff · May 18, 2023, 12:35pm

I removed drivers per your instructions, no problem.
rebooted.
however, it looks like the install generated the same results:

TASK [Report Versions] ********************************************************************************************************************************************************************************************************
ok: [127.0.0.1] => {
"msg": [
"===========================================================================================",
" Components Matrix Version || Installed Version ",
"===========================================================================================",
"GPU Operator Version v1.10.1 || ",
"Nvidia Container Driver Version 510.47.03 || ",
"GPU Operator NV Toolkit Driver v1.9.0 || ",
"K8sDevice Plugin Version v0.11.0 || ",
"Data Center GPU Manager(DCGM) Version 2.3.4-2.6.4 || ",
"Node Feature Discovery Version v0.10.1 || ",
"GPU Feature Discovery Version v0.5.0 || ",
"Nvidia validator version v1.10.1 || ",
"Nvidia MIG Manager version 0.3.0 || ",
"",
"Note: NVIDIA Mig Manager is valid for only Amphere GPU's like A100, A30",
"",
"Please validate between Matrix Version and Installed Version listed above"
]
}

TASK [Componenets Matrix Versions Vs Installed Versions] **********************************************************************************************************************************************************************
skipping: [127.0.0.1]

TASK [Report Versions] ********************************************************************************************************************************************************************************************************
skipping: [127.0.0.1]

TASK [Componenets Matrix Versions Vs Installed Versions] **********************************************************************************************************************************************************************
skipping: [127.0.0.1]

TASK [Report Versions] ********************************************************************************************************************************************************************************************************
skipping: [127.0.0.1]

TASK [Validate the GPU Operator pods State] ***********************************************************************************************************************************************************************************
fatal: [127.0.0.1]: FAILED! => {"changed": true, "cmd": "kubectl get pods --all-namespaces | egrep -v 'kube-system|NAME'", "delta": "0:00:00.057455", "end": "2023-05-18 08:33:58.809700", "failed_when_result": true, "msg": "non-zero return code", "rc": 1, "start": "2023-05-18 08:33:58.752245", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring

TASK [Report GPU Operator Pods] ***********************************************************************************************************************************************************************************************
ok: [127.0.0.1] => {
"msg": []
}

TASK [Validate GPU Operator Version for Cloud Native Core 6.2 and 7.0] ********************************************************************************************************************************************************
changed: [127.0.0.1]

TASK [Validate GPU Operator Version for Cloud Native Core 6.1] ****************************************************************************************************************************************************************
fatal: [127.0.0.1]: FAILED! => {"changed": true, "cmd": "helm ls -A | grep gpu-operator | awk '{print $NF}' | grep -v VERSION | sed 's/v//g'", "delta": "0:00:00.026909", "end": "2023-05-18 08:33:59.245347", "failed_when_result": true, "msg": "", "rc": 0, "start": "2023-05-18 08:33:59.218438", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

PLAY RECAP ********************************************************************************************************************************************************************************************************************
127.0.0.1 : ok=40 changed=23 unreachable=0 failed=1 skipped=19 rescued=0 ignored=1

Morganh · May 18, 2023, 4:24pm

Could you blacklist nouveau?

Check whether it is currently loaded:

$ lsmod | grep nouveau

Open the file below.

sudo vim /etc/modprobe.d/blacklist.conf

At the end of the file, add the line below.

blacklist nouveau

Run the command below for the change to take effect.

sudo update-initramfs -u

Then,

reboot

Check that nouveau is no longer loaded.

$ lsmod | grep nouveau
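Because `grep` prints nothing when there is no match, an empty result from `lsmod | grep nouveau` means the module is not loaded. The sketch below makes that outcome explicit instead of silent. `module_loaded` is a hypothetical helper written for this note, and the sample listing is invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exit 0 if the named kernel module appears in the
# first column of lsmod-style text on stdin, exit 1 otherwise.
module_loaded() {
  awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

# Invented sample of lsmod output with no nouveau entry.
sample='Module Size Used by
nvidia 56000000 100'

if printf '%s\n' "$sample" | module_loaded nouveau; then
  echo "nouveau is loaded"
else
  echo "nouveau is not loaded"
fi

# Live usage: lsmod | module_loaded nouveau && echo "nouveau is loaded"
```

Matching on the first column avoids false positives from other modules that merely list nouveau in their "Used by" column.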

jay.duff · May 18, 2023, 7:06pm

thanks for your continued help. I followed your instructions but I’m still loading the nouveau driver:

ubuntu@5950X:~$ lsmod | grep nouveau

ubuntu@5950X:~$ sudo nano /etc/modprobe.d/blacklist.conf

ubuntu@5950X:~/tao-getting-started_v4.0.1/setup/quickstart_api_bare_metal$ grep nouv /etc/modprobe.d/blacklist.conf

blacklist nouveau

ubuntu@5950X:~$ sudo update-initramfs -u

update-initramfs: Generating /boot/initrd.img-5.15.0-71-generic

ubuntu@5950X:~$ cd tao-getting-started_v4.0.1/setup/quickstart_api_bare_metal/

ubuntu@5950X:~/tao-getting-started_v4.0.1/setup/quickstart_api_bare_metal$ lsmod | grep nouveau

I checked Software and Updates,

Using X.Org X server - Nouveau display driver …

Morganh · May 19, 2023, 2:35am

Could you share the latest result of $ lsmod | grep nouveau ?

jay.duff · May 19, 2023, 1:32pm

I did. (it was easy to miss, it returned nothing)
the line before “I checked Software …”

I made the change,
grep to show I made the change
updated, rebooted
I am not familiar with lsmod but it looks like it didn’t do anything

Morganh · May 19, 2023, 4:33pm

Can you download 4.0.2 notebook from TAO Toolkit Getting Started | NVIDIA NGC and run again?

ngc registry resource download-version "nvidia/tao/tao-getting-started:4.0.2"
cd tao-getting-started_v4.0.2/setup/quickstart_api_bare_metal/
vim hosts
cat hosts
…
bash setup.sh check-inventory
bash setup.sh install

Please share the result and logs. Thanks.

jay.duff · May 20, 2023, 3:44pm

good progress and I think I identified my original mistake.

I installed latest version of ngc:

echo "export PATH=\"\$PATH:$(pwd)/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
ubuntu@5950X:~$ ngc --version
NGC CLI 3.21.1

version 4.0.2

ubuntu@5950X:~$ ngc registry resource download-version "nvidia/tao/tao-getting-started:4.0.2"
{
"download_end": "2023-05-20 10:56:49.118397",
"download_start": "2023-05-20 10:56:47.115539",
"download_time": "2s",
"files_downloaded": 378,
"local_path": "/home/ubuntu/tao-getting-started_v4.0.2",
"size_downloaded": "2.43 MB",
"status": "Completed",
"transfer_id": "tao-getting-started_v4.0.2"
}

edit hosts

I think this was my original problem. I have never used ansible and didn’t understand what I was doing. I am installing this on a single computer (master) with no worker nodes.
I assumed I wanted my master to be local host and used:
127.0.0.1
This time I used 127.0.1.1, and it seems to work.

ubuntu@5950X:~/tao-getting-started_v4.0.2/setup/quickstart_api_bare_metal$ cat hosts
# List all hosts below.
# For single node deployment, listing the master is enough.
[master]
# Example of host accessible using ssh private key
# 127.0.0.1 ansible_ssh_user='ubuntu' ansible_ssh_private_key_file='/home/ubuntu/.ssh/id_rsa'
127.0.1.1 ansible_ssh_user='ubuntu' ansible_ssh_pass=mypassword ansible_ssh_extra_args='-o StrictHostKeyChecking=no'
[nodes]
# Example of host accessible using ssh password
# 1.1.1.2 ansible_ssh_user='ubuntu' ansible_ssh_pass='some-password' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'
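Since the commented example lines and the active entry look very similar, it can help to confirm which address the inventory really declares for the master. The following is a minimal sketch that lists the active entries of the [master] group. `master_hosts` is a hypothetical helper written for this note, not part of the TAO setup scripts, and it assumes the simple INI layout shown in the listing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the address (first field) of every active,
# non-comment entry in the [master] group of an INI-style inventory file.
master_hosts() {
  awk '
    /^\[/ { in_master = ($0 == "[master]"); next }   # track the current group
    in_master && NF && $1 !~ /^#/ { print $1 }        # skip blanks and comments
  ' "$1"
}

# Usage: master_hosts ./hosts
```

For the inventory in this post it would print the single active master address; more than one line of output would indicate a leftover uncommented example.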

check inventory

ubuntu@5950X:~/tao-getting-started_v4.0.2/setup/quickstart_api_bare_metal$ bash setup.sh check-inventory
TASK [print host details] *************************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 → localhost] => {
"host_details": [
{
"host": "5950X",
"os": "Ubuntu",
"os_version": "20.04"
}
]
}

TASK [check all instances have single os] *********************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 → localhost]

TASK [check all instances have single os version] *************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 → localhost]

TASK [capture os] *********************************************************************************************************************************************************************************************************************************************
changed: [127.0.1.1 → localhost]

TASK [capture os version] *************************************************************************************************************************************************************************************************************************************
changed: [127.0.1.1 → localhost]

PLAY RECAP ****************************************************************************************************************************************************************************************************************************************************
127.0.1.1 : ok=16 changed=3 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0

PLAY RECAP ****************************************************************************************************************************************************************************************************************************************************
127.0.1.1 : ok=70 changed=41 unreachable=0 failed=0 skipped=453 rescued=0 ignored=0

install

ubuntu@5950X:~/tao-getting-started_v4.0.2/setup/quickstart_api_bare_metal$ bash setup.sh install

Provide the path to the hosts file [./hosts]:

Provide the ngc-api-key: cHJoO–edited–4MzAx

Provide the ngc-email: jay.duff@cfacorp.com

Provide the api-chart [https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.2.tgz]:

Provide the api-values [./tao-toolkit-api-helm-values.yml]:

Provide the cluster-name: mycluster

Provide the value for enable_mig (no/yes) [no]:

Provide the value for mig_profile [all-disabled]:

Provide the value for mig_strategy (single/mixed) [single]:

Provide the value for nvidia_driver_version ["510.47.03"]:

skipped many lines here…

PLAY [master] *************************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ****************************************************************************************************************************************************************************************************************************************

ok: [127.0.1.1]

TASK [Waiting for the Cluster to become available] ************************************************************************************************************************************************************************************************************

(Waited 15 minutes, then pressed Ctrl-C.)
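When a step can hang like this, a bounded wait gives a clear pass or fail instead of an open-ended stall. The sketch below retries a command until it succeeds or a timeout expires. `wait_for` is a hypothetical helper, and using `kubectl get nodes` as the readiness probe is an assumption; the check the playbook itself performs is not shown in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command every <interval> seconds until it
# succeeds, giving up with exit status 1 after roughly <timeout> seconds.
wait_for() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( SECONDS + timeout ))   # SECONDS is a bash builtin counter
  until "$@" >/dev/null 2>&1; do
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "$interval"
  done
}

# Usage (assumed probe): wait up to 15 minutes, polling every 10 seconds.
#   wait_for 900 10 kubectl get nodes || echo "cluster did not become available"
```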

validate

similar result, waited 20 minutes for cluster

ubuntu@5950X:~/tao-getting-started_v4.0.2/setup/quickstart_api_bare_metal$ bash setup.sh validate
Provide the path to the hosts file [./hosts]:
Provide the value for enable_mig (no/yes) [no]:
Provide the value for mig_profile [all-disabled]:
Provide the value for mig_strategy (single/mixed) [single]:
Provide the value for nvidia_driver_version ["510.47.03"]:

PLAY [all] ****************************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ****************************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1]

TASK [check os] ***********************************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1]

TASK [check os version] ***************************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1]

TASK [check disk size sufficient] *****************************************************************************************************************************************************************************************************************************
skipping: [127.0.1.1]

TASK [check sufficient memory] ********************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1]

TASK [check sufficient number of cpu cores] *******************************************************************************************************************************************************************************************************************
ok: [127.0.1.1]

TASK [check sudo privileges] **********************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1]

TASK [capture gpus per node] **********************************************************************************************************************************************************************************************************************************
changed: [127.0.1.1]

TASK [check not more than 1 gpu per node] *********************************************************************************************************************************************************************************************************************
ok: [127.0.1.1]

TASK [check exactly 1 master] *********************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 → localhost]

TASK [capture host details] ***********************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 → localhost]

TASK [print host details] *************************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 → localhost] => {
"host_details": [
{
"host": "5950X",
"os": "Ubuntu",
"os_version": "20.04"
}
]
}

TASK [check all instances have single os] *********************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 → localhost]

TASK [check all instances have single os version] *************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 → localhost]

TASK [capture os] *********************************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 → localhost]

TASK [capture os version] *************************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 → localhost]

PLAY RECAP ****************************************************************************************************************************************************************************************************************************************************
127.0.1.1 : ok=15 changed=1 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0

PLAY [master] *************************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ****************************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1]

TASK [Waiting for the Cluster to become available] ************************************************************************************************************************************************************************************************************

reboot & retry

validate

TASK [Report Cuda Validation] *********************************************************************************************************************************************************************************
ok: [127.0.1.1] => {
"msg": [
"[Vector addition of 50000 elements]",
"Copy input data from the host memory to the CUDA device",
"CUDA kernel launch with 196 blocks of 256 threads",
"Copy output data from the CUDA device to the host memory",
"Test PASSED",
"Done",
"pod \"cuda-vector-add\" deleted"
]
}

TASK [Report Network Operator version] ************************************************************************************************************************************************************************
skipping: [127.0.1.1]

TASK [Report Mellanox MOFED Driver Version] *******************************************************************************************************************************************************************
skipping: [127.0.1.1]

TASK [Report RDMA Shared Device Plugin Version] ***************************************************************************************************************************************************************
skipping: [127.0.1.1]

TASK [Report SRIOV Device Plugin Version] *********************************************************************************************************************************************************************
skipping: [127.0.1.1]

TASK [Report Container Networking Plugin Version] *************************************************************************************************************************************************************
skipping: [127.0.1.1]

TASK [Report Multus Version] **********************************************************************************************************************************************************************************
skipping: [127.0.1.1]

TASK [Report Whereabouts Version] *****************************************************************************************************************************************************************************
skipping: [127.0.1.1]

TASK [Status Check] *******************************************************************************************************************************************************************************************
changed: [127.0.1.1]

TASK [debug] **************************************************************************************************************************************************************************************************
ok: [127.0.1.1] => {
"msg": "All tasks should be changed or ok, if it's failed or ignoring means that validation task failed."
}

PLAY RECAP ****************************************************************************************************************************************************************************************************
127.0.1.1 : ok=47 changed=27 unreachable=0 failed=0 skipped=30 rescued=0 ignored=0

install (after reboot)

cluster started after a few minutes

looks like success (skipped many lines):

TASK [create kube directory] ************************************************************************************************************************************************************************************************************************************************************
ok: [localhost]

TASK [ensure kubeconfig file exists] ****************************************************************************************************************************************************************************************************************************************************
changed: [localhost]

TASK [merge kubeconfig to existing] *****************************************************************************************************************************************************************************************************************************************************
changed: [localhost]

TASK [make merged-kubeconfig default] ***************************************************************************************************************************************************************************************************************************************************
changed: [localhost]

PLAY RECAP ******************************************************************************************************************************************************************************************************************************************************************************
127.0.1.1 : ok=25 changed=17 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0
localhost : ok=10 changed=5 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0

ubuntu@5950X:~/tao-getting-started_v4.0.2/setup/quickstart_api_bare_metal$

final steps

ubuntu@5950X:~/tao-getting-started_v4.0.2/setup/quickstart_api_bare_metal$ hostname -i
127.0.1.1

ubuntu@5950X:~/tao-getting-started_v4.0.2/setup/quickstart_api_bare_metal$ kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'
32080
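The two values above (node address and node port) are the pieces of the service endpoint. A minimal sketch of joining them is shown below. `endpoint_url` is a hypothetical helper, and the `http` scheme is an assumption not confirmed in this thread. Note that a 127.x address is only reachable from the machine itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join a node address and a node port into a URL.
# The http scheme is an assumption; adjust it if the ingress serves TLS.
endpoint_url() {
  local host="$1" port="$2"
  printf 'http://%s:%s\n' "$host" "$port"
}

# Values taken from the output in this post.
endpoint_url 127.0.1.1 32080

# Live usage:
#   endpoint_url "$(hostname -i)" \
#     "$(kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}')"
```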

I think I’m good now. Thanks for your patience and help.
For someone else reading this, my main lessons were:

use password authentication (ansible_ssh_pass) in the Ansible hosts file
use 127.0.1.1 as the master address (in a setup with just one master computer)

Morganh · May 22, 2023, 2:57am

Glad to know it is working now. Thanks for the info!

system · June 5, 2023, 2:57am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
