TAO Toolkit 4.0.0 API bare-metal setup is uninstalling the GPU driver and Kubernetes utilities (lots of confusing things happening at the same time)

• Hardware: DGX Station A100

  • Enterprise Support case reference: 00568165

• How to reproduce the issue? (Command lines and detailed logs are shared below.)

Note: I'm not an experienced Kubernetes user; this is my first dabble in it, so apologies if any of this is obvious.

Our network topology:

I tried to do this in several ways.

1. [Attempt 1]: Use the CPU server as the master and the DGX as a worker node

As outlined here step by step.

Then I downloaded tao-getting-started_v4.0.0 to the PC (shown in the network topology diagram) and tried to run setup.sh.

My hosts file:

# List all hosts below.
# For single node deployment, listing the master is enough.
[master]
ip_addr_of_the_cpu_server ansible_ssh_user='serveruser' ansible_ssh_private_key_file='path/to/key.file' ansible_ssh_extra_args='-o StrictHostKeyChecking=No'
[nodes]
ip_addr_of_the_dgx ansible_ssh_user='dgxuser' ansible_ssh_private_key_file='path/to/key.file' ansible_ssh_extra_args='-o StrictHostKeyChecking=No'

My tao-toolkit-api-ansible-values.yml file:

ngc_api_key: my_ngc_api_key
ngc_email: my_email
api_chart: https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.0.tgz
api_values: ./tao-toolkit-api-helm-values.yml
cluster_name: my_cluster_name

Then I realised (after attempting to run the script) that tao-getting-started_v4.0.0/setup/quickstart_api_bare_metal/setup.sh calls check-inventory.yml, which contains the following assertion:

- name: check os version
  assert:
    that: "ansible_distribution_version in ['18.04', '20.04']"

which fails, because when I run ansible localhost -m setup -a 'filter=ansible_distribution_version' on the master node (hostname gsrv) I get

[WARNING]: No inventory was parsed, only implicit localhost is available
localhost | SUCCESS => {
  "ansible_facts": {
    "ansible_distribution_version": "22.04"
  },
  "changed": false
}

I believe this is because the server's OS version (Ubuntu 22.04, as the network topology diagram shows) is not compatible.

So Attempt 1 failed! :(

2. [Attempt 2]: Use the DGX as a single-node cluster (because I thought downgrading the CPU server to 20.04 was not worth the hassle)

So I went on to install Kubernetes on the DGX following this guide, with the blessing of Enterprise Support (assuming I was able to successfully convey my motives to them through the horrible comment-box interface).

All went well! And from there I had two options.

2.1 [Attempt 2.1]: From my PC (because I already had the tao-getting-started_v4.0.0 directory from my previous attempt)

  • I modified the hosts file to contain only a master, changed the master IP to the DGX Station, and set the correct keys
  • I did not change the tao-toolkit-api-ansible-values.yml file

Then I ran the setup script again.

This is what I got:

PLAY [all] ***********************************************************************************************************************************************

TASK [Gathering Facts] ***********************************************************************************************************************************
ok: [172.16.3.2]

TASK [check os] ******************************************************************************************************************************************
ok: [172.16.3.2]

TASK [check os version] **********************************************************************************************************************************
ok: [172.16.3.2] => {
    "changed": false,
    "msg": "All assertions passed"
}

TASK [check disk size sufficient] ************************************************************************************************************************
ok: [172.16.3.2]

TASK [check sufficient memory] ***************************************************************************************************************************
ok: [172.16.3.2]

TASK [check sufficient number of cpu cores] **************************************************************************************************************
ok: [172.16.3.2]

TASK [check sudo privileges] *****************************************************************************************************************************
ok: [172.16.3.2]

TASK [capture gpus per node] *****************************************************************************************************************************
changed: [172.16.3.2]

TASK [check not more than 1 gpu per node] ****************************************************************************************************************
ok: [172.16.3.2]

TASK [check exactly 1 master] ****************************************************************************************************************************
ok: [172.16.3.2 -> localhost]

TASK [capture host details] ******************************************************************************************************************************
ok: [172.16.3.2 -> localhost]

TASK [print host details] ********************************************************************************************************************************
ok: [172.16.3.2 -> localhost] => {
    "host_details": [
        {
            "host": "dgx",
            "os": "Ubuntu",
            "os_version": "20.04"
        }
    ]
}

TASK [check all instances have single os] ****************************************************************************************************************
ok: [172.16.3.2 -> localhost]

TASK [check all instances have single os version] ********************************************************************************************************
ok: [172.16.3.2 -> localhost]

TASK [capture os] ****************************************************************************************************************************************
changed: [172.16.3.2 -> localhost]

TASK [capture os version] ********************************************************************************************************************************
changed: [172.16.3.2 -> localhost]

PLAY RECAP ***********************************************************************************************************************************************
172.16.3.2                 : ok=16   changed=3    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   


PLAY [all] ***********************************************************************************************************************************************

TASK [Gathering Facts] ***********************************************************************************************************************************
ok: [172.16.3.2]

TASK [uninstall nvidia using installer] ******************************************************************************************************************
fatal: [172.16.3.2]: FAILED! => {"changed": false, "cmd": "nvidia-installer --uninstall --silent", "msg": "[Errno 2] No such file or directory: b'nvidia-installer'", "rc": 2, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring

TASK [uninstall nvidia and cuda drivers] *****************************************************************************************************************
changed: [172.16.3.2]

PLAY RECAP ***********************************************************************************************************************************************
172.16.3.2                 : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=1   


PLAY [master] ****************************************************************************************************************************************************************************************************

TASK [Gathering Facts] *******************************************************************************************************************************************************************************************
ok: [172.16.3.2]

TASK [Uninstall the GPU Operator with MIG] ***********************************************************************************************************************************************************************
skipping: [172.16.3.2]

PLAY [all] *******************************************************************************************************************************************************************************************************

TASK [Gathering Facts] *******************************************************************************************************************************************************************************************
ok: [172.16.3.2]

TASK [Reset Kubernetes component] ********************************************************************************************************************************************************************************
changed: [172.16.3.2]

TASK [IPTables Cleanup] ******************************************************************************************************************************************************************************************
changed: [172.16.3.2]

TASK [Remove Conatinerd and Kubernetes packages for Ubuntu] ******************************************************************************************************************************************************
changed: [172.16.3.2]

TASK [Remove Docker and Kubernetes packages for Ubuntu] **********************************************************************************************************************************************************
skipping: [172.16.3.2]

TASK [Remove NVIDIA Docker for Cloud Native Core Developers] *****************************************************************************************************************************************************
skipping: [172.16.3.2]

TASK [Remove dependencies that are no longer required] ***********************************************************************************************************************************************************
changed: [172.16.3.2]

TASK [Remove installed packages for RHEL/CentOS] *****************************************************************************************************************************************************************
skipping: [172.16.3.2]

TASK [Cleanup Containerd Process] ********************************************************************************************************************************************************************************
changed: [172.16.3.2]

TASK [Cleanup Directories for Cloud Native Core Developers] ******************************************************************************************************************************************************
skipping: [172.16.3.2] => (item=/etc/docker) 
skipping: [172.16.3.2] => (item=/var/lib/docker) 
skipping: [172.16.3.2] => (item=/var/run/docker) 
skipping: [172.16.3.2] => (item=/run/docker.sock) 
skipping: [172.16.3.2] => (item=/run/docker) 

TASK [Cleanup Directories] ***************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=/var/lib/etcd)
changed: [172.16.3.2] => (item=/etc/kubernetes)
ok: [172.16.3.2] => (item=/usr/local/bin/helm)
ok: [172.16.3.2] => (item=/var/lib/crio)
ok: [172.16.3.2] => (item=/etc/crio)
ok: [172.16.3.2] => (item=/usr/local/bin/crio)
changed: [172.16.3.2] => (item=/var/log/containers)
ok: [172.16.3.2] => (item=/etc/apt/sources.list.d/devel*)
ok: [172.16.3.2] => (item=/etc/sysctl.d/99-kubernetes-cri.conf)
ok: [172.16.3.2] => (item=/etc/modules-load.d/containerd.conf)
ok: [172.16.3.2] => (item=/etc/modules-load.d/crio.conf)
ok: [172.16.3.2] => (item=/etc/apt/trusted.gpg.d/libcontainers*)
ok: [172.16.3.2] => (item=/etc/default/kubelet)
changed: [172.16.3.2] => (item=/etc/cni/net.d)

TASK [Reboot the system] *****************************************************************************************************************************************************************************************
skipping: [172.16.3.2]

PLAY RECAP *******************************************************************************************************************************************************************************************************
172.16.3.2                 : ok=8    changed=6    unreachable=0    failed=0    skipped=6    rescued=0    ignored=0   


PLAY [all] *******************************************************************************************************************************************************************************************************

TASK [Gathering Facts] *******************************************************************************************************************************************************************************************
ok: [172.16.3.2]

TASK [set_fact] **************************************************************************************************************************************************************************************************
ok: [172.16.3.2]

TASK [Checking Nouveau is disabled] ******************************************************************************************************************************************************************************
changed: [172.16.3.2]

TASK [unload nouveau] ********************************************************************************************************************************************************************************************
ok: [172.16.3.2]

TASK [blacklist nouveau] *****************************************************************************************************************************************************************************************
ok: [172.16.3.2]

TASK [Test Internet Connection] **********************************************************************************************************************************************************************************
ok: [172.16.3.2]

TASK [Report Internet Connection status] *************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
    "msg": "Internet Connection status 200"
}

TASK [Install Internet Speed dependencies] ***********************************************************************************************************************************************************************
fatal: [172.16.3.2]: FAILED! => {"changed": false, "msg": "Failed to update apt cache: unknown reason"}

PLAY RECAP *******************************************************************************************************************************************************************************************************
172.16.3.2                 : ok=7    changed=1    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   

Apart from having to witness my GPU drivers getting deleted, the script failed with "Failed to update apt cache: unknown reason".

Then I SSH'd into the DGX
→ ran rm /etc/apt/sources.list.d/kubernetes.list to clear that
→ ran apt update

and I got this

Hit:1 http://us.archive.ubuntu.com/ubuntu focal InRelease                                                                                                                              
Get:2 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]                                                                                                              
Get:3 http://us.archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]                                                                              
Hit:4 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal InRelease                              
Err:4 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal InRelease                              
  The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 208CE844D9F220AD
Hit:6 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates InRelease                                                          
Err:6 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates InRelease                 
  The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 208CE844D9F220AD
Get:7 http://security.ubuntu.com/ubuntu focal-security/main amd64 DEP-11 Metadata [60.0 kB]               
Get:8 http://us.archive.ubuntu.com/ubuntu focal-updates/main amd64 DEP-11 Metadata [275 kB]                                 
Get:9 http://security.ubuntu.com/ubuntu focal-security/universe amd64 DEP-11 Metadata [95.0 kB]                                                          
Get:10 http://security.ubuntu.com/ubuntu focal-security/multiverse amd64 DEP-11 Metadata [940 B]                                            
Get:11 http://us.archive.ubuntu.com/ubuntu focal-updates/universe amd64 DEP-11 Metadata [409 kB]                                                                         
Get:12 http://us.archive.ubuntu.com/ubuntu focal-updates/multiverse amd64 DEP-11 Metadata [944 B]    
Hit:5 https://packages.cloud.google.com/apt kubernetes-xenial InRelease                       
Fetched 1,068 kB in 5s (230 kB/s)
Reading package lists... Done
Building dependency tree       
Reading state information... Done
51 packages can be upgraded. Run 'apt list --upgradable' to see them.
W: An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 208CE844D9F220AD
W: An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 208CE844D9F220AD
W: Failed to fetch https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/dists/focal/InRelease  The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 208CE844D9F220AD
W: Failed to fetch https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/dists/focal-updates/InRelease  The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 208CE844D9F220AD
W: Some index files failed to download. They have been ignored, or old ones used instead.

I think this happened because the install script deleted some of the apt configuration, so:
Problem 1: How do I correct this?
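
(My guess, and I could well be wrong, is that the setup removed the repository signing key along with everything else, so re-importing the key apt complains about (NO_PUBKEY 208CE844D9F220AD) and refreshing the index should quiet it again, something along these lines:)

# My assumption of a fix (not verified): pull the missing NVIDIA repo key from a public keyserver
# (or grab the key file referenced by the DGX OS docs), then refresh the package index.
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 208CE844D9F220AD
sudo apt update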

On top of that, this has taken down my GPU drivers.
When I run nvsm show health I get:

sudo nvsm show health
Running standard show health
Info
----
Timestamp      :   Tue Mar 14 10:43:18 GMT 2023
Version       :   22.09.03
Checks
------
Verify installed DIMM memory sticks.................................. Healthy
Verify Network Adapters.............................................. Healthy
Verify installed GPU's............................................... Healthy
Verify installed VGA Controllers..................................... Healthy
Verify PCIe switches................................................. Healthy
Verify DIMM part number.............................................. Healthy
Verify DIMM vendors.................................................. Healthy
Verify DIMM vendors consistency...................................... Healthy
Verify chassis [Serial:1560422011502] fan presence................... Healthy
Quick health check of GPU using DCGM................................. Unknown
  No GPU resources found. No response or error while connecting to DCGM.
  Could not load NVML library
Status of volumes.................................................... Healthy
Check SMART status of NVMe devices................................... Healthy
Verify installed NVMe devices........................................ Healthy
Verify chassis [Serial:1560422011502] power supply presence.......... Healthy
NVMe link speed [0000:03:00.0][16GT/s]............................... Healthy
NVMe link width [0000:03:00.0][x4]................................... Healthy
NVMe link speed [0000:41:00.0][8GT/s]................................ Healthy
NVMe link width [0000:41:00.0][x4]................................... Healthy
NetworkAdapter link speed [0000:42:00.0][8GT/s]...................... Healthy
NetworkAdapter link width [0000:42:00.0][x4]......................... Healthy
NetworkAdapter link speed [0000:42:00.1][8GT/s]...................... Healthy
NetworkAdapter link width [0000:42:00.1][x4]......................... Healthy
Check BMC sensor thresholds for chassis [Serial: 1560422011502] ..... Healthy
  Checked 65 sensor values against BMC thresholds.
BMC Firmware Revision [1.24.00]
BaseOS Version [5.4.2]
BIOS Version [L10.16]
Linux kernel version [5.4.0-144-generic]
System Uptime [up 1 hour, 22 minutes]
Serial Number [SERIAL_NUMBER]
Chassis [...] Power Supply Info
  [PSU1] Vendor[Delta] Model[ECD15050001] Serial[...] Firmware[3.8]
System Summary
--------------
  Product Name:DGX Station A100 920-23487-2530-0R0
  Manufacturer:NVIDIA
  Serial Number:1[..]
  Uptime:up 1 hour, 22 minutes
MotherBoard:
  BIOS Version:L10.16
  Serial Number:
BMC:
  Firmware Version:1.24.00
  IPMI Version:2.0
Software:
  BaseOS Version:5.4.2
  Kernel Version:5.4.0-144-generic
  OS Version:Ubuntu 20.04.5 LTS (Focal Fossa)
Health Summary
--------------
22 out of 23 checks are healthy
0 out of 23 checks are unhealthy
1 out of 23 checks are unknown
0 out of 23 checks are informational
0 out of 23 checks are disabled
0 out of 23 checks are skipped
Overall system status is unhealthy
Problem detected.
1. Please run 'sudo nvsm dump health'
2. Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/dashboard/
3. Attach the log file from /tmp/nvsm-health-<hostname>-<timestamp>.tar.xz
100.0% [=========================================]
Status: Unhealthy

Then I thought, since Attempt 2.1 had failed, I had better run the setup from within the DGX rather than from my PC. This brings us to the next attempt.

2.2 [Attempt 2.2]

Basically the same as 2.1, but run from the DGX:
I changed the hosts file IP to localhost (same effect if I gave it the IP assigned to the network interface)
and ran the setup again.

I got

PLAY [all] *****************************************************************************************************************************************************************************

TASK [Gathering Facts] *****************************************************************************************************************************************************************
ok: [127.0.0.1]

TASK [check os] ************************************************************************************************************************************************************************
ok: [127.0.0.1]

TASK [check os version] ****************************************************************************************************************************************************************
ok: [127.0.0.1]

TASK [check disk size sufficient] ******************************************************************************************************************************************************
ok: [127.0.0.1]

TASK [check sufficient memory] *********************************************************************************************************************************************************
ok: [127.0.0.1]

TASK [check sufficient number of cpu cores] ********************************************************************************************************************************************
ok: [127.0.0.1]

TASK [check sudo privileges] ***********************************************************************************************************************************************************
ok: [127.0.0.1]

TASK [capture gpus per node] ***********************************************************************************************************************************************************
changed: [127.0.0.1]

TASK [check not more than 1 gpu per node] **********************************************************************************************************************************************
ok: [127.0.0.1]

TASK [check exactly 1 master] **********************************************************************************************************************************************************
ok: [127.0.0.1 -> localhost]

TASK [capture host details] ************************************************************************************************************************************************************
ok: [127.0.0.1 -> localhost]

TASK [print host details] **************************************************************************************************************************************************************
ok: [127.0.0.1 -> localhost] => {
    "host_details": [
        {
            "host": "dgx",
            "os": "Ubuntu",
            "os_version": "20.04"
        }
    ]
}

TASK [check all instances have single os] **********************************************************************************************************************************************
ok: [127.0.0.1 -> localhost]

TASK [check all instances have single os version] **************************************************************************************************************************************
ok: [127.0.0.1 -> localhost]

TASK [capture os] **********************************************************************************************************************************************************************
changed: [127.0.0.1 -> localhost]

TASK [capture os version] **************************************************************************************************************************************************************
changed: [127.0.0.1 -> localhost]

PLAY RECAP *****************************************************************************************************************************************************************************
127.0.0.1                  : ok=16   changed=3    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

cat: target-os: No such file or directory

Basically, lots of problems. Can you please:

  • point out where things have gone wrong
  • make some suggestions to help fix the DGX (I can reinstall drivers, but I'd like to get the TAO API running as well)

Cheers,
Ganindu.

Several cases:

  1. If you are running in bare metal without anything in system, please
    $ bash setup.sh check-inventory.yml
    $ bash setup.sh install
  2. If nouveau driver is installed and in use
    $ bash setup.sh uninstall
    $ kubectl delete crd clusterpolicies.nvidia.com
    $ sudo reboot
    $ bash setup.sh check-inventory.yml
    $ bash setup.sh install
  3. If nvidia gpu driver is pre-installed,
    run below commands to uninstall gpu driver.
    $ sudo apt purge nvidia-driver-*
    $ sudo apt autoremove
    $ sudo apt autoclean
    then,
    $ bash setup.sh uninstall
    $ kubectl delete crd clusterpolicies.nvidia.com
    $ bash setup.sh check-inventory.yml
    $ bash setup.sh install
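
After "bash setup.sh install" finishes, a rough way to verify the cluster and GPU Operator came up is below (pod names and namespaces may differ slightly on your setup):

$ kubectl get nodes                          # the node should report Ready
$ kubectl get pods -A                        # everything should reach Running/Completed
$ kubectl get pods -n nvidia-gpu-operator    # GPU Operator pods, including the driver daemonset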

Since you are running on only one machine, it is the master; there are no other nodes. For example, you can set the "hosts" file as below.

# List all hosts below.
# For single node deployment, listing the master is enough.
[master]
# Example of host accessible using ssh private key
# 1.1.1.1 ansible_ssh_user='ubuntu' ansible_ssh_private_key_file='/path/to/key.pem'
10.176.11.112 ansible_ssh_user='local-morganh' ansible_ssh_pass='passwd'

[nodes]
# Example of host accessible using ssh password
# 1.1.1.2 ansible_ssh_user='ubuntu' ansible_ssh_pass='some-password' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'

More info can be found in https://developer.nvidia.com/blog/training-like-an-ai-pro-using-tao-automl/

Hi @Morganh,

Thanks for getting back to me. The only reason I thought of using a single-node multi-role (master node) cluster is because of this requirement.
As the picture above shows, my CPU-only server is running Ubuntu Server 22.04.

So let's step back a second to the initial motive that led to all this.

Primarily, I just want my local CPU server (as shown in the diagram) to be the master node, and my PC (or other computers/laptops visible on the network) to be able to make API calls to the master, which are then delegated to the GPU node (the DGX Station from the diagram) via Kubernetes magic.

Alternatively, if my primary method can't be achieved for some reason (e.g. my k8s master node not having the right Ubuntu version), I am happy to compromise with a single-node multi-role setup where the DGX Station A100 serves as the sole node, as long as I can still use it in the conventional manner as well (this actually applies to both cases, as I don't want to lose the ability to SSH into the DGX and run my experiments with scripts or notebooks).

If I expand a little on my thinking:

Method 1 (ideally preferred by me)

Under my current requirements (already having a k8s master node on Ubuntu 22.04):

  1. Install Helm on the DGX?

  2. Register the DGX as a GPU node with the join command (SSH into the DGX and run it)?

  3. Run the commands below to set up the TAO stuff (this is the bit I don't understand) on the DGX (SSH into the DGX and run them)?

helm fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.0.tgz --username='$oauthtoken' --password=<YOUR API KEY>
mkdir tao-toolkit-api && tar -zxvf tao-toolkit-api-4.0.0.tgz -C tao-toolkit-api

The Deployment - NVIDIA Docs page talks about customisation and the chart's tao-toolkit-api/values.yaml. Where do I place this file (let's say I run the commands above in an empty directory on the DGX)? My guess at how this works is sketched just after this list.

  4. Then run the command below on the DGX to deploy the API service?
helm install tao-toolkit-api tao-toolkit-api/ --namespace default
  5. Run the command below on the master node (CPU server) to check if things work?
kubectl describe pods tao-toolkit-api -n default
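
Regarding the customisation question above, my assumption (happy to be corrected) is that you just edit the values file inside the extracted chart directory and pass it to Helm explicitly, roughly:

# Assumed workflow (paths illustrative, following the docs' tao-toolkit-api/values.yaml):
# edit the bundled values file, then point helm at it with -f/--values
nano tao-toolkit-api/values.yaml
helm install tao-toolkit-api tao-toolkit-api/ --namespace default -f tao-toolkit-api/values.yaml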

Method 2 (single-node multi-role, all on the DGX)

From your answer, I don't understand the need to purge the driver.

Does this mean I can't use the DGX as usual when I am not using it as a TAO GPU node?
Does it then reinstall some different, on-demand GPU driver? (I totally don't understand this.)

I want to keep using the DGX as normal while it is also being used as a TAO API GPU node on the side. Is that not a possibility?

Cheers,
Ganindu.

For Attempt 2.2, can you share the "hosts" file?

Do you mean you prefer running the CPU server (master, Ubuntu 22.04) and the DGX (worker node)?

Answer to question 1 (on attempt 2.2)

I basically modified the one from the previous attempt

[master]
127.0.0.1 ansible_ssh_user='dgxuser' ansible_ssh_private_key_file='path/to/key.file' ansible_ssh_extra_args='-o StrictHostKeyChecking=No'

[nodes]

I ran bash setup.sh install immediately after the first attempt, again after reinstalling the GPU driver, and once again after following the installing-cuda-drivers section of the DGX software stack installation guide.

(On a side note, is that enough to restore everything the earlier setup.sh run got rid of?
I only checked with sudo nvsm show health as a system state check.)

With my limited k8s knowledge I wasn't sure if it worked, because the k8s tools were not there. (However, the second run from within the DGX, after manually reinstalling the drivers as mentioned above, didn't remove them; I can still run nvidia-smi.)

Answer to question 2 (on the ideally preferred method)

Yes! (that would be awesome!!!)


Hope this information is useful; please let me know if you want more details on any of it.

Let's target installing on the DGX only. Try to set up the DGX as a single node.

Set hosts file.

[master]
# xxx.xxx.xxx.xxx is the ip of dgx. 
xxx.xxx.xxx.xxx  ansible_ssh_user='dgxuser' ansible_ssh_pass='dgx passwd'

Then, can you upload the full log when running the two commands below?
$ bash setup.sh check-inventory.yml
$ bash setup.sh install

Hi I made the edit:

but then I had to edit /etc/ansible/ansible.cfg and uncomment a couple of lines to get:

[defaults]
host_key_checking = false

The command

(K8PY) g@dgx:~/Workspace/sandbox/TAO/getting_started_v4.0.0/setup/quickstart_api_bare_metal$ bash setup.sh check-inventory.yml

gave the output shown below (basically nothing; I also tried the default by not typing in the filename (./hosts), but got the same result).

Then for bash setup.sh install

I got the following

output.txt (189.0 KB)

When I now run nvidia-smi

I get

nvidia-smi

Command 'nvidia-smi' not found, but can be installed with:

sudo apt install nvidia-340               # version 340.108-0ubuntu5.20.04.2, or
sudo apt install nvidia-utils-390         # version 390.157-0ubuntu0.20.04.1
sudo apt install nvidia-utils-450-server  # version 450.216.04-0ubuntu0.20.04.1
sudo apt install nvidia-utils-470         # version 470.161.03-0ubuntu0.20.04.1
sudo apt install nvidia-utils-470-server  # version 470.161.03-0ubuntu0.20.04.1
sudo apt install nvidia-utils-510         # version 510.108.03-0ubuntu0.20.04.1
sudo apt install nvidia-utils-515         # version 515.86.01-0ubuntu0.20.04.1
sudo apt install nvidia-utils-515-server  # version 515.86.01-0ubuntu0.20.04.3
sudo apt install nvidia-utils-525         # version 525.85.05-0ubuntu0.20.04.1
sudo apt install nvidia-utils-525-server  # version 525.85.12-0ubuntu0.20.04.1
sudo apt install nvidia-utils-435         # version 435.21-0ubuntu7
sudo apt install nvidia-utils-440         # version 440.82+really.440.64-0ubuntu6
sudo apt install nvidia-utils-418-server  # version 418.226.00-0ubuntu0.20.04.2

When I run sudo nvsm show health I get:

g@dgx:~$ sudo nvsm show health 
Running standard show health

Info
----
Timestamp           :     Thu Mar 16 14:45:24 GMT 2023
Version             :     22.09.03

Checks
------
Verify installed DIMM memory sticks.................................. Healthy
Verify Network Adapters.............................................. Healthy
Verify installed GPU's............................................... Healthy
Verify installed VGA Controllers..................................... Healthy
Verify PCIe switches................................................. Healthy
Verify DIMM part number.............................................. Healthy
Verify DIMM vendors.................................................. Healthy
Verify DIMM vendors consistency...................................... Healthy
Verify chassis [Serial:SN] fan presence................... Healthy
Quick health check of GPU using DCGM................................. Unknown
    No GPU resources found. No response or error while connecting to DCGM.
    Could not load NVML library
Status of volumes.................................................... Healthy
Check SMART status of NVMe devices................................... Healthy
Verify installed NVMe devices........................................ Healthy
Verify chassis [Serial:SN] power supply presence.......... Healthy
NVMe link speed [0000:03:00.0][16GT/s]............................... Healthy
NVMe link width [0000:03:00.0][x4]................................... Healthy
NVMe link speed [0000:41:00.0][8GT/s]................................ Healthy
NVMe link width [0000:41:00.0][x4]................................... Healthy
NetworkAdapter link speed [0000:42:00.0][8GT/s]...................... Healthy
NetworkAdapter link width [0000:42:00.0][x4]......................... Healthy
NetworkAdapter link speed [0000:42:00.1][8GT/s]...................... Healthy
NetworkAdapter link width [0000:42:00.1][x4]......................... Healthy
Check BMC sensor thresholds for chassis [Serial: SN] ..... Healthy
    Checked 65 sensor values against BMC thresholds.
BMC Firmware Revision [1.24.00]
BaseOS Version [5.4.2]
BIOS Version [L10.16]
Linux kernel version [5.4.0-144-generic]
System Uptime [up 2 hours, 6 minutes]
Serial Number [SN]
Chassis [SN] Power Supply Info
    [PSU1] Vendor[Delta] Model[ECD15050001] Serial[SN] Firmware[3.8]

System Summary
--------------
    Product Name:DGX Station A100 920-23487-2530-0R0
    Manufacturer:NVIDIA
    Serial Number: SN
    Uptime:up 2 hours, 6 minutes

MotherBoard:
    BIOS Version:L10.16
    Serial Number:SN

BMC:
    Firmware Version:1.24.00
    IPMI Version:2.0

Software:
    BaseOS Version:5.4.2
    Kernel Version:5.4.0-144-generic
    OS Version:Ubuntu 20.04.5 LTS (Focal Fossa)

Health Summary
--------------

22 out of 23 checks are healthy
0 out of 23 checks are unhealthy
1 out of 23 checks are unknown
0 out of 23 checks are informational
0 out of 23 checks are disabled
0 out of 23 checks are skipped
Overall system status is unhealthy

Problem detected.


1. Please run 'sudo nvsm dump health'
2. Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/dashboard/
3. Attach the log file from /tmp/nvsm-health-<hostname>-<timestamp>.tar.xz
100.0% [=========================================]
Status: Unhealthy

:(

You are correct. I forgot to mention it.

For your latest log,

TASK [Waiting for the Cluster to become available] ***

Can you run below?
$ kubectl delete crd clusterpolicies.nvidia.com
$ bash setup.sh install

In the same window or a separate window? (The last window where I ran the install command is still active, apparently still "Waiting for the Cluster to become available" as it says, and the terminal has not been released.)
Peek 2023-03-16 15-04

Do you want me to (Ctrl + C) that?

Yes, just ctrl+c to cancel.

I got

output.txt (190.1 KB)

The nvidia-smi and nvsm show health results are still the same (Unhealthy).

Could you add the below? Appreciate your patience.
$ bash setup.sh uninstall
$ kubectl delete crd clusterpolicies.nvidia.com
$ bash setup.sh check-inventory.yml
$ bash setup.sh install

No worries! thanks a lot for the support!!

uninstall

bash setup.sh uninstall
Provide the path to the hosts file [./hosts]: 

PLAY [all] ********************************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]

TASK [check os] ***************************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]

TASK [check os version] *******************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]

TASK [check disk size sufficient] *********************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]

TASK [check sufficient memory] ************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]

TASK [check sufficient number of cpu cores] ***********************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]

TASK [check sudo privileges] **************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]

TASK [capture gpus per node] **************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]

TASK [check not more than 1 gpu per node] *************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]

TASK [check exactly 1 master] *************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]

TASK [capture host details] ***************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]

TASK [print host details] *****************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost] => {
    "host_details": [
        {
            "host": "dgx",
            "os": "Ubuntu",
            "os_version": "20.04"
        }
    ]
}

TASK [check all instances have single os] *************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]

TASK [check all instances have single os version] *****************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]

TASK [capture os] *************************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]

TASK [capture os version] *****************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]

PLAY RECAP ********************************************************************************************************************************************************************************************************************************************************
172.16.3.2                 : ok=16   changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   


PLAY [all] ********************************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]

TASK [uninstall nvidia using installer] ***************************************************************************************************************************************************************************************************************************
fatal: [172.16.3.2]: FAILED! => {"changed": false, "cmd": "nvidia-installer --uninstall --silent", "msg": "[Errno 2] No such file or directory: b'nvidia-installer'", "rc": 2, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring

TASK [uninstall nvidia and cuda drivers] **************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]

PLAY RECAP ********************************************************************************************************************************************************************************************************************************************************
172.16.3.2                 : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=1   


PLAY [master] *****************************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]

TASK [Uninstall the GPU Operator with MIG] ************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]

PLAY [all] ********************************************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]

TASK [Reset Kubernetes component] *********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]

TASK [IPTables Cleanup] *******************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]

TASK [Remove Conatinerd and Kubernetes packages for Ubuntu] *******************************************************************************************************************************************************************************************************
changed: [172.16.3.2]

TASK [Remove Docker and Kubernetes packages for Ubuntu] ***********************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]

TASK [Remove NVIDIA Docker for Cloud Native Core Developers] ******************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]

TASK [Remove dependencies that are no longer required] ************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]

TASK [Remove installed packages for RHEL/CentOS] ******************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]

TASK [Cleanup Containerd Process] *********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]

TASK [Cleanup Directories for Cloud Native Core Developers] *******************************************************************************************************************************************************************************************************
skipping: [172.16.3.2] => (item=/etc/docker) 
skipping: [172.16.3.2] => (item=/var/lib/docker) 
skipping: [172.16.3.2] => (item=/var/run/docker) 
skipping: [172.16.3.2] => (item=/run/docker.sock) 
skipping: [172.16.3.2] => (item=/run/docker) 

TASK [Cleanup Directories] ****************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=/var/lib/etcd)
changed: [172.16.3.2] => (item=/etc/kubernetes)
changed: [172.16.3.2] => (item=/usr/local/bin/helm)
ok: [172.16.3.2] => (item=/var/lib/crio)
ok: [172.16.3.2] => (item=/etc/crio)
ok: [172.16.3.2] => (item=/usr/local/bin/crio)
changed: [172.16.3.2] => (item=/var/log/containers)
ok: [172.16.3.2] => (item=/etc/apt/sources.list.d/devel*)
ok: [172.16.3.2] => (item=/etc/sysctl.d/99-kubernetes-cri.conf)
changed: [172.16.3.2] => (item=/etc/modules-load.d/containerd.conf)
ok: [172.16.3.2] => (item=/etc/modules-load.d/crio.conf)
ok: [172.16.3.2] => (item=/etc/apt/trusted.gpg.d/libcontainers*)
changed: [172.16.3.2] => (item=/etc/default/kubelet)
changed: [172.16.3.2] => (item=/etc/cni/net.d)

TASK [Reboot the system] ******************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]

PLAY RECAP ********************************************************************************************************************************************************************************************************************************************************
172.16.3.2                 : ok=8    changed=6    unreachable=0    failed=0    skipped=6    rescued=0    ignored=0   


delete cluster policy

 kubectl delete crd clusterpolicies.nvidia.com
-bash: /usr/bin/kubectl: No such file or directory

check inventory

bash setup.sh check-inventory.yml
Provide the path to the hosts file [./hosts]: 

reinstall

bash setup.sh install

output.txt (190.0 KB)

If you want to jump on a Teams call, please let me know.

P.S.

nvidia-smi is still not present, and nvsm show health still reports Unhealthy.

Previously, there were similar topics about getting stuck at "TASK [Waiting for the Cluster to become available]".
See TAO Toolkit 4.0 setup issue - #19 by Morganh
AutoML installation problem [Waiting for the Cluster to become available] - #7 by Morganh
They were solved with the commands above.

To debug, could you open another terminal and check the logs via the commands below? The marks (****) depend on the real pod names.

$ kubectl get pods
$ kubectl get pod -n nvidia-gpu-operator
$ kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-*****
$ kubectl get pod -n nvidia-gpu-operator nvidia-cuda-validator-****

Apologies @Morganh, how do I fill in the ****'s? I'm not exactly sure what you meant by "real name".

You can get the name after running
$ kubectl get pod -n nvidia-gpu-operator

I get this

get pods

kubectl get pods
No resources found in default namespace.

kubectl get pod -n nvidia-gpu-operator

kubectl get pod -n nvidia-gpu-operator
NAME                                                              READY   STATUS                  RESTARTS       AGE
gpu-feature-discovery-lmrkw                                       0/1     Init:0/1                0              2m6s
gpu-operator-1678981498-node-feature-discovery-master-79ddmqcr6   1/1     Running                 0              2m13s
gpu-operator-1678981498-node-feature-discovery-worker-chxdx       1/1     Running                 2 (117s ago)   24m
gpu-operator-7bfc5f55-cgrxl                                       1/1     Running                 0              2m13s
nvidia-container-toolkit-daemonset-pb4dq                          0/1     Init:0/1                0              2m6s
nvidia-dcgm-exporter-jp6wr                                        0/1     Init:0/1                0              2m7s
nvidia-device-plugin-daemonset-pd925                              0/1     Init:0/1                0              2m7s
nvidia-driver-daemonset-gwmlj                                     0/1     Init:CrashLoopBackOff   9 (2m6s ago)   25m
nvidia-operator-validator-5qz4c                                   0/1     Init:0/4                0              2m6s

I don't seem to get nvidia-gpu-operator nvidia-driver-daemonset-**** or gpu-operator-operator nvidia-cuda-validator-****.

This one: nvidia-driver-daemonset-gwmlj

Sorry for the delay

kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-gwmlj
Error from server (BadRequest): container "nvidia-driver-ctr" in pod "nvidia-driver-daemonset-gwmlj" is waiting to start: PodInitializing
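
In case it helps, I can also pull the pod events and container logs; my guess at the useful commands would be:

# Events usually say why the init container is crash-looping:
kubectl describe pod -n nvidia-gpu-operator nvidia-driver-daemonset-gwmlj
# Logs from the pod's containers (add -c <init-container-name> from the describe output if needed):
kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-gwmlj --all-containers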

Can you open a new terminal to run
$ kubectl delete crd clusterpolicies.nvidia.com