TAO AutoML - TAO Toolkit Setup

Good progress, and I think I have identified my original mistake.

I installed the latest version of the NGC CLI:

echo "export PATH=\"\$PATH:$(pwd)/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
ubuntu@5950X:~$ ngc --version
NGC CLI 3.21.1

download tao-getting-started version 4.0.2

ubuntu@5950X:~$ ngc registry resource download-version "nvidia/tao/tao-getting-started:4.0.2"
{
    "download_end": "2023-05-20 10:56:49.118397",
    "download_start": "2023-05-20 10:56:47.115539",
    "download_time": "2s",
    "files_downloaded": 378,
    "local_path": "/home/ubuntu/tao-getting-started_v4.0.2",
    "size_downloaded": "2.43 MB",
    "status": "Completed",
    "transfer_id": "tao-getting-started_v4.0.2"
}

edit hosts

I think this was my original problem. I had never used Ansible and didn’t understand what I was doing. I am installing this on a single computer (the master) with no worker nodes.
I assumed the master should be localhost, so I originally used:
127.0.0.1
This time I used 127.0.1.1, the address that hostname -i reports on this machine, and it seems to work. (In the listing below I put a - in front of each # to avoid a forum formatting problem.)

ubuntu@5950X:~/tao-getting-started_v4.0.2/setup/quickstart_api_bare_metal$ cat hosts
-# List all hosts below.
-# For single node deployment, listing the master is enough.
[master]
-# Example of host accessible using ssh private key
-# 127.0.0.1 ansible_ssh_user='ubuntu' ansible_ssh_private_key_file='/home/ubuntu/.ssh/id_rsa'
127.0.1.1 ansible_ssh_user='ubuntu' ansible_ssh_pass=mypassword ansible_ssh_extra_args='-o StrictHostKeyChecking=no'
[nodes]
-# Example of host accessible using ssh password
-# 1.1.1.2 ansible_ssh_user='ubuntu' ansible_ssh_pass='some-password' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'
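
A likely reason 127.0.1.1 shows up here: Ubuntu installers usually map the machine's own host name to 127.0.1.1 in /etc/hosts, while 127.0.0.1 is mapped to localhost. A minimal sketch for checking which address a name maps to is below. host_ip_for is a hypothetical helper name, not part of the TAO scripts, and it only reads an /etc/hosts-style file (it does not query DNS).

```shell
# host_ip_for NAME [FILE]: print the first address that an /etc/hosts-style
# file maps NAME to. Hypothetical helper, not part of the TAO setup scripts.
host_ip_for() {
    name="$1"
    file="${2:-/etc/hosts}"
    awk -v name="$name" '
        /^[[:space:]]*#/ { next }              # skip full-line comments
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break           # rest of the line is a comment
                if ($i == name) { print $1; exit }
            }
        }
    ' "$file"
}

# Example (not run here): look up this machine's own host name.
# host_ip_for "$(hostname)"
```

If the printed address matches what hostname -i reports, that is the address to try on the [master] line for a single-machine install.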

check inventory

ubuntu@5950X:~/tao-getting-started_v4.0.2/setup/quickstart_api_bare_metal$ bash setup.sh check-inventory
TASK [print host details] *************************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 -> localhost] => {
    "host_details": [
        {
            "host": "5950X",
            "os": "Ubuntu",
            "os_version": "20.04"
        }
    ]
}

TASK [check all instances have single os] *********************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 -> localhost]

TASK [check all instances have single os version] *************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 -> localhost]

TASK [capture os] *********************************************************************************************************************************************************************************************************************************************
changed: [127.0.1.1 -> localhost]

TASK [capture os version] *************************************************************************************************************************************************************************************************************************************
changed: [127.0.1.1 -> localhost]

PLAY RECAP ****************************************************************************************************************************************************************************************************************************************************
127.0.1.1 : ok=16 changed=3 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0

PLAY RECAP ****************************************************************************************************************************************************************************************************************************************************
127.0.1.1 : ok=70 changed=41 unreachable=0 failed=0 skipped=453 rescued=0 ignored=0

install

ubuntu@5950X:~/tao-getting-started_v4.0.2/setup/quickstart_api_bare_metal$ bash setup.sh install

Provide the path to the hosts file [./hosts]:

Provide the ngc-api-key: cHJoO–edited–4MzAx

Provide the ngc-email: jay.duff@cfacorp.com

Provide the api-chart [https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.2.tgz]:

Provide the api-values [./tao-toolkit-api-helm-values.yml]:

Provide the cluster-name: mycluster

Provide the value for enable_mig (no/yes) [no]:

Provide the value for mig_profile [all-disabled]:

Provide the value for mig_strategy (single/mixed) [single]:

Provide the value for nvidia_driver_version ["510.47.03"]:

skipped many lines here…

PLAY [master] *************************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ****************************************************************************************************************************************************************************************************************************************

ok: [127.0.1.1]

TASK [Waiting for the Cluster to become available] ************************************************************************************************************************************************************************************************************

I waited 15 minutes, then pressed Ctrl-C.
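
Rather than watching the task and interrupting it by hand, the cluster can also be polled separately with a time limit. The sketch below is only an illustration and says nothing about how the TAO playbook implements its wait. wait_until is a hypothetical helper name, and it assumes that a successful kubectl get nodes is a good enough sign that the cluster is reachable.

```shell
# wait_until TIMEOUT INTERVAL CMD...: rerun CMD every INTERVAL seconds until it
# succeeds (return 0) or about TIMEOUT seconds have elapsed (return 1).
# Hypothetical helper, not part of the TAO setup scripts.
wait_until() {
    timeout="$1"; interval="$2"; shift 2
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Example (not run here): give the cluster 10 minutes, checking every 15 s.
# wait_until 600 15 kubectl get nodes || echo "cluster still not reachable"
```

The elapsed time only counts the sleeps, so a slow command makes the real limit somewhat longer than TIMEOUT.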

validate

Similar result: I waited 20 minutes for the cluster.

ubuntu@5950X:~/tao-getting-started_v4.0.2/setup/quickstart_api_bare_metal$ bash setup.sh validate
Provide the path to the hosts file [./hosts]:
Provide the value for enable_mig (no/yes) [no]:
Provide the value for mig_profile [all-disabled]:
Provide the value for mig_strategy (single/mixed) [single]:
Provide the value for nvidia_driver_version ["510.47.03"]:

PLAY [all] ****************************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ****************************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1]

TASK [check os] ***********************************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1]

TASK [check os version] ***************************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1]

TASK [check disk size sufficient] *****************************************************************************************************************************************************************************************************************************
skipping: [127.0.1.1]

TASK [check sufficient memory] ********************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1]

TASK [check sufficient number of cpu cores] *******************************************************************************************************************************************************************************************************************
ok: [127.0.1.1]

TASK [check sudo privileges] **********************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1]

TASK [capture gpus per node] **********************************************************************************************************************************************************************************************************************************
changed: [127.0.1.1]

TASK [check not more than 1 gpu per node] *********************************************************************************************************************************************************************************************************************
ok: [127.0.1.1]

TASK [check exactly 1 master] *********************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 -> localhost]

TASK [capture host details] ***********************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 -> localhost]

TASK [print host details] *************************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 -> localhost] => {
    "host_details": [
        {
            "host": "5950X",
            "os": "Ubuntu",
            "os_version": "20.04"
        }
    ]
}

TASK [check all instances have single os] *********************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 -> localhost]

TASK [check all instances have single os version] *************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 -> localhost]

TASK [capture os] *********************************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 -> localhost]

TASK [capture os version] *************************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1 -> localhost]

PLAY RECAP ****************************************************************************************************************************************************************************************************************************************************
127.0.1.1 : ok=15 changed=1 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0

PLAY [master] *************************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ****************************************************************************************************************************************************************************************************************************************
ok: [127.0.1.1]

TASK [Waiting for the Cluster to become available] ************************************************************************************************************************************************************************************************************

reboot & retry

validate

TASK [Report Cuda Validation] *********************************************************************************************************************************************************************************
ok: [127.0.1.1] => {
    "msg": [
        "[Vector addition of 50000 elements]",
        "Copy input data from the host memory to the CUDA device",
        "CUDA kernel launch with 196 blocks of 256 threads",
        "Copy output data from the CUDA device to the host memory",
        "Test PASSED",
        "Done",
        "pod \"cuda-vector-add\" deleted"
    ]
}

TASK [Report Network Operator version] ************************************************************************************************************************************************************************
skipping: [127.0.1.1]

TASK [Report Mellanox MOFED Driver Version] *******************************************************************************************************************************************************************
skipping: [127.0.1.1]

TASK [Report RDMA Shared Device Plugin Version] ***************************************************************************************************************************************************************
skipping: [127.0.1.1]

TASK [Report SRIOV Device Plugin Version] *********************************************************************************************************************************************************************
skipping: [127.0.1.1]

TASK [Report Container Networking Plugin Version] *************************************************************************************************************************************************************
skipping: [127.0.1.1]

TASK [Report Multus Version] **********************************************************************************************************************************************************************************
skipping: [127.0.1.1]

TASK [Report Whereabouts Version] *****************************************************************************************************************************************************************************
skipping: [127.0.1.1]

TASK [Status Check] *******************************************************************************************************************************************************************************************
changed: [127.0.1.1]

TASK [debug] **************************************************************************************************************************************************************************************************
ok: [127.0.1.1] => {
    "msg": "All tasks should be changed or ok, if it's failed or ignoring means that validation task failed."
}

PLAY RECAP ****************************************************************************************************************************************************************************************************
127.0.1.1 : ok=47 changed=27 unreachable=0 failed=0 skipped=30 rescued=0 ignored=0
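
With output this long, it is easy to miss a failure, so the check that matters is whether failed and unreachable are both 0 on every PLAY RECAP host line. A minimal sketch of that check follows. recap_failed is a hypothetical helper name, and it assumes the output was saved to a file first (validate.log in the example is also a made-up name).

```shell
# recap_failed: read Ansible output on stdin and print every PLAY RECAP host
# line whose failed= or unreachable= counter is greater than zero.
# Exit status is 0 when no such line is found, 1 otherwise.
# Hypothetical helper, not part of the TAO setup scripts.
recap_failed() {
    awk '
        /failed=[0-9]+/ && /unreachable=[0-9]+/ {
            bad = 0
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^(failed|unreachable)=/) {
                    split($i, kv, "=")
                    if (kv[2] + 0 > 0) bad = 1
                }
            }
            if (bad) { print; found = 1 }
        }
        END { exit (found ? 1 : 0) }
    '
}

# Example (not run here), with the validate output saved to validate.log:
# recap_failed < validate.log && echo "no failed or unreachable hosts"
```

This only looks at the recap counters. It does not notice a run that was interrupted before any recap was printed, as happened with the first install attempt.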

install (after reboot)

The cluster started after a few minutes.

It looks like success (many lines skipped):

TASK [create kube directory] ************************************************************************************************************************************************************************************************************************************************************
ok: [localhost]

TASK [ensure kubeconfig file exists] ****************************************************************************************************************************************************************************************************************************************************
changed: [localhost]

TASK [merge kubeconfig to existing] *****************************************************************************************************************************************************************************************************************************************************
changed: [localhost]

TASK [make merged-kubeconfig default] ***************************************************************************************************************************************************************************************************************************************************
changed: [localhost]

PLAY RECAP ******************************************************************************************************************************************************************************************************************************************************************************
127.0.1.1 : ok=25 changed=17 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0
localhost : ok=10 changed=5 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0

ubuntu@5950X:~/tao-getting-started_v4.0.2/setup/quickstart_api_bare_metal$

final steps

ubuntu@5950X:~/tao-getting-started_v4.0.2/setup/quickstart_api_bare_metal$ hostname -i
127.0.1.1

ubuntu@5950X:~/tao-getting-started_v4.0.2/setup/quickstart_api_bare_metal$ kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'
32080
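
These two values are the node address and the ingress port, and together they give the host part of the URL that an API client would be pointed at. A minimal sketch of putting them together is below. api_host_url is a hypothetical helper name, and any path after the port depends on the TAO version and is deliberately left out because it is not shown in this post.

```shell
# api_host_url IP PORT: print the http://IP:PORT prefix for the ingress.
# Hypothetical helper; the path after the port is version-specific and
# intentionally not included.
api_host_url() {
    printf 'http://%s:%s\n' "$1" "$2"
}

# Example (not run here), reusing the two commands above:
# node_ip=$(hostname -i)
# node_port=$(kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}')
# api_host_url "$node_ip" "$node_port"
```

Note that 127.0.1.1 is a loopback address, so a URL built from it can only be reached from the same machine.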

I think I’m good now. Thanks for your patience and help.
For anyone else reading this, my main lessons were:

  • use a password (ansible_ssh_pass) in the hosts file for the Ansible setup
  • use 127.0.1.1 as the master address (in a setup with just one master computer)