• Hardware DGX Station A100
- Enterprise Support case reference: 00568165
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
Note: I’m not an experienced Kubernetes user; this is my first dabble in it, so apologies if this is very obvious.
Our network topology:
I tried to do this in several ways.
1. [Attempt 1]: Use the cpu server as master and the DGX as a worker node
As outlined here step by step.
Then I downloaded tao-getting-started_v4.0.0 to the PC (shown in the network topology diagram) and tried to run setup.sh.
my hosts file:
# List all hosts below.
# For single node deployment, listing the master is enough.
[master]
ip_addr_of_the_cpu_server ansible_ssh_user='serveruser' ansible_ssh_private_key_file='path/to/key.file' ansible_ssh_extra_args='-o StrictHostKeyChecking=No'
[nodes]
ip_addr_of_the_dgx ansible_ssh_user='dgxuser' ansible_ssh_private_key_file='path/to/key.file' ansible_ssh_extra_args='-o StrictHostKeyChecking=No'
my tao-toolkit-api-ansible-values.yml file:
ngc_api_key: my_ngc_api_key
ngc_email: my_email
api_chart: https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.0.tgz
api_values: ./tao-toolkit-api-helm-values.yml
cluster_name: my_cluster_name
Then I realised (after attempting to run the script) that tao-getting-started_v4.0.0/setup/quickstart_api_bare_metal/setup.sh calls check-inventory.yml, which has the following assertion:
- name: check os version
  assert:
    that: "ansible_distribution_version in ['18.04', '20.04']"
This fails because, when I run ansible localhost -m setup -a 'filter=ansible_distribution_version' on the master node (hostname gsrv), I get:
[WARNING]: No inventory was parsed, only implicit localhost is available
localhost | SUCCESS => {
"ansible_facts": {
"ansible_distribution_version": "22.04"
},
"changed": false
}
I believe this is because the OS version of the CPU server (Ubuntu 22.04, as the network topology diagram shows) is not in the list of supported versions.
So Attempt 1 failed! :(
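To make the failure condition explicit, here is a minimal Python sketch of the same membership test. It assumes the supported list is exactly the one in the assertion quoted above ('18.04', '20.04'); the function name is hypothetical and not part of the TAO scripts.

```python
# Supported versions copied from the "check os version" assertion
# in check-inventory.yml (assumption: the list has not changed).
SUPPORTED_VERSIONS = ("18.04", "20.04")


def os_version_supported(version: str) -> bool:
    """Return True if this version would pass the assertion."""
    return version in SUPPORTED_VERSIONS


if __name__ == "__main__":
    # Versions reported by Ansible for the two machines in this post.
    for host, version in (("gsrv", "22.04"), ("dgx", "20.04")):
        status = "passes" if os_version_supported(version) else "fails"
        print(f"{host}: Ubuntu {version} {status} the check")
```

Running a check like this against every host in the inventory before starting setup.sh would have flagged the CPU server up front.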
2. [Attempt 2]: Use the DGX as a single-node cluster (because I thought downgrading the CPU server to 20.04 was not worth the hassle)
So I went on to install k8s on the DGX following this guide, with the blessing of Enterprise Support (assuming I was able to successfully convey my motives to them through the horrible comment box interface).
All went well, and then I had two options from there.
2.1 [Attempt 2.1] From my PC (because I had the tao-getting-started_v4.0.0 directory from my previous attempt)
- I modified the hosts file to only have a master, changed the master IP to the dgx station and the correct keys
- I did not change the tao-toolkit-api-ansible-values.yml file
Then I ran the setup script again.
This is what I got
PLAY [all] ***********************************************************************************************************************************************
TASK [Gathering Facts] ***********************************************************************************************************************************
ok: [172.16.3.2]
TASK [check os] ******************************************************************************************************************************************
ok: [172.16.3.2]
TASK [check os version] **********************************************************************************************************************************
ok: [172.16.3.2] => {
"changed": false,
"msg": "All assertions passed"
}
TASK [check disk size sufficient] ************************************************************************************************************************
ok: [172.16.3.2]
TASK [check sufficient memory] ***************************************************************************************************************************
ok: [172.16.3.2]
TASK [check sufficient number of cpu cores] **************************************************************************************************************
ok: [172.16.3.2]
TASK [check sudo privileges] *****************************************************************************************************************************
ok: [172.16.3.2]
TASK [capture gpus per node] *****************************************************************************************************************************
changed: [172.16.3.2]
TASK [check not more than 1 gpu per node] ****************************************************************************************************************
ok: [172.16.3.2]
TASK [check exactly 1 master] ****************************************************************************************************************************
ok: [172.16.3.2 -> localhost]
TASK [capture host details] ******************************************************************************************************************************
ok: [172.16.3.2 -> localhost]
TASK [print host details] ********************************************************************************************************************************
ok: [172.16.3.2 -> localhost] => {
"host_details": [
{
"host": "dgx",
"os": "Ubuntu",
"os_version": "20.04"
}
]
}
TASK [check all instances have single os] ****************************************************************************************************************
ok: [172.16.3.2 -> localhost]
TASK [check all instances have single os version] ********************************************************************************************************
ok: [172.16.3.2 -> localhost]
TASK [capture os] ****************************************************************************************************************************************
changed: [172.16.3.2 -> localhost]
TASK [capture os version] ********************************************************************************************************************************
changed: [172.16.3.2 -> localhost]
PLAY RECAP ***********************************************************************************************************************************************
172.16.3.2 : ok=16 changed=3 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
PLAY [all] ***********************************************************************************************************************************************
TASK [Gathering Facts] ***********************************************************************************************************************************
ok: [172.16.3.2]
TASK [uninstall nvidia using installer] ******************************************************************************************************************
fatal: [172.16.3.2]: FAILED! => {"changed": false, "cmd": "nvidia-installer --uninstall --silent", "msg": "[Errno 2] No such file or directory: b'nvidia-installer'", "rc": 2, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
TASK [uninstall nvidia and cuda drivers] *****************************************************************************************************************
changed: [172.16.3.2]
PLAY RECAP ***********************************************************************************************************************************************
172.16.3.2 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=1
PLAY [master] ****************************************************************************************************************************************************************************************************
TASK [Gathering Facts] *******************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [Uninstall the GPU Operator with MIG] ***********************************************************************************************************************************************************************
skipping: [172.16.3.2]
PLAY [all] *******************************************************************************************************************************************************************************************************
TASK [Gathering Facts] *******************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [Reset Kubernetes component] ********************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [IPTables Cleanup] ******************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Remove Conatinerd and Kubernetes packages for Ubuntu] ******************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Remove Docker and Kubernetes packages for Ubuntu] **********************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Remove NVIDIA Docker for Cloud Native Core Developers] *****************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Remove dependencies that are no longer required] ***********************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Remove installed packages for RHEL/CentOS] *****************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Cleanup Containerd Process] ********************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Cleanup Directories for Cloud Native Core Developers] ******************************************************************************************************************************************************
skipping: [172.16.3.2] => (item=/etc/docker)
skipping: [172.16.3.2] => (item=/var/lib/docker)
skipping: [172.16.3.2] => (item=/var/run/docker)
skipping: [172.16.3.2] => (item=/run/docker.sock)
skipping: [172.16.3.2] => (item=/run/docker)
TASK [Cleanup Directories] ***************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=/var/lib/etcd)
changed: [172.16.3.2] => (item=/etc/kubernetes)
ok: [172.16.3.2] => (item=/usr/local/bin/helm)
ok: [172.16.3.2] => (item=/var/lib/crio)
ok: [172.16.3.2] => (item=/etc/crio)
ok: [172.16.3.2] => (item=/usr/local/bin/crio)
changed: [172.16.3.2] => (item=/var/log/containers)
ok: [172.16.3.2] => (item=/etc/apt/sources.list.d/devel*)
ok: [172.16.3.2] => (item=/etc/sysctl.d/99-kubernetes-cri.conf)
ok: [172.16.3.2] => (item=/etc/modules-load.d/containerd.conf)
ok: [172.16.3.2] => (item=/etc/modules-load.d/crio.conf)
ok: [172.16.3.2] => (item=/etc/apt/trusted.gpg.d/libcontainers*)
ok: [172.16.3.2] => (item=/etc/default/kubelet)
changed: [172.16.3.2] => (item=/etc/cni/net.d)
TASK [Reboot the system] *****************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
PLAY RECAP *******************************************************************************************************************************************************************************************************
172.16.3.2 : ok=8 changed=6 unreachable=0 failed=0 skipped=6 rescued=0 ignored=0
PLAY [all] *******************************************************************************************************************************************************************************************************
TASK [Gathering Facts] *******************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [set_fact] **************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [Checking Nouveau is disabled] ******************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [unload nouveau] ********************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [blacklist nouveau] *****************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [Test Internet Connection] **********************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [Report Internet Connection status] *************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Internet Connection status 200"
}
TASK [Install Internet Speed dependencies] ***********************************************************************************************************************************************************************
fatal: [172.16.3.2]: FAILED! => {"changed": false, "msg": "Failed to update apt cache: unknown reason"}
PLAY RECAP *******************************************************************************************************************************************************************************************************
172.16.3.2 : ok=7 changed=1 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
Apart from having to witness my GPU drivers being deleted, the script failed with "Failed to update apt cache: unknown reason".
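Since the log is long, the PLAY RECAP lines are the quickest way to see which play stopped the run. Below is a small Python sketch that parses one recap line into counters. The regular expression is my own guess at the format based only on the lines pasted above, and the function name is hypothetical.

```python
import re
from typing import Dict, Optional, Tuple

# Matches lines such as:
# "172.16.3.2 : ok=7 changed=1 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0"
RECAP_LINE = re.compile(r"^(\S+)\s*:\s*((?:\w+=\d+\s*)+)$")


def parse_recap(line: str) -> Optional[Tuple[str, Dict[str, int]]]:
    """Return (host, counters) for a PLAY RECAP line, or None if it is not one."""
    match = RECAP_LINE.match(line.strip())
    if match is None:
        return None
    host, counters = match.groups()
    parsed = {}
    for pair in counters.split():
        key, value = pair.split("=")
        parsed[key] = int(value)
    return host, parsed


if __name__ == "__main__":
    line = ("172.16.3.2 : ok=7 changed=1 unreachable=0 failed=1 "
            "skipped=0 rescued=0 ignored=0")
    host, counters = parse_recap(line)
    if counters["failed"] or counters["unreachable"]:
        print(f"{host}: play stopped with failed={counters['failed']}")
```

Note that the earlier nvidia-installer error shows up under ignored=1 rather than failed, so only the last recap (failed=1) marks the point where the run actually stopped.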
Then I ssh’d into dgx
→ ran rm /etc/apt/sources.list.d/kubernetes.list
to clear that
→ ran apt update
and I got this
Hit:1 http://us.archive.ubuntu.com/ubuntu focal InRelease
Get:2 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Get:3 http://us.archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Hit:4 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal InRelease
Err:4 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal InRelease
The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 208CE844D9F220AD
Hit:6 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates InRelease
Err:6 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates InRelease
The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 208CE844D9F220AD
Get:7 http://security.ubuntu.com/ubuntu focal-security/main amd64 DEP-11 Metadata [60.0 kB]
Get:8 http://us.archive.ubuntu.com/ubuntu focal-updates/main amd64 DEP-11 Metadata [275 kB]
Get:9 http://security.ubuntu.com/ubuntu focal-security/universe amd64 DEP-11 Metadata [95.0 kB]
Get:10 http://security.ubuntu.com/ubuntu focal-security/multiverse amd64 DEP-11 Metadata [940 B]
Get:11 http://us.archive.ubuntu.com/ubuntu focal-updates/universe amd64 DEP-11 Metadata [409 kB]
Get:12 http://us.archive.ubuntu.com/ubuntu focal-updates/multiverse amd64 DEP-11 Metadata [944 B]
Hit:5 https://packages.cloud.google.com/apt kubernetes-xenial InRelease
Fetched 1,068 kB in 5s (230 kB/s)
Reading package lists... Done
Building dependency tree
Reading state information... Done
51 packages can be upgraded. Run 'apt list --upgradable' to see them.
W: An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 208CE844D9F220AD
W: An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 208CE844D9F220AD
W: Failed to fetch https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/dists/focal/InRelease The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 208CE844D9F220AD
W: Failed to fetch https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/dists/focal-updates/InRelease The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 208CE844D9F220AD
W: Some index files failed to download. They have been ignored, or old ones used instead.
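To summarise which repositories are affected and by which key, the GPG warnings above can be reduced with a short Python sketch. It only reports what apt printed and does not repair anything; the pattern is an assumption based on the warning lines pasted here, and the function name is hypothetical.

```python
import re
from typing import Dict, Set

# Matches apt warnings of the form:
# "GPG error: <repo url> <suite> InRelease: ... NO_PUBKEY <key id>"
GPG_ERROR = re.compile(
    r"GPG error: (\S+) (\S+) InRelease: .*?NO_PUBKEY ([0-9A-F]+)"
)


def missing_keys(apt_output: str) -> Dict[str, Set[str]]:
    """Map each missing public key ID to the repo/suite pairs it blocks."""
    result: Dict[str, Set[str]] = {}
    for url, suite, key_id in GPG_ERROR.findall(apt_output):
        result.setdefault(key_id, set()).add(f"{url} {suite}")
    return result


if __name__ == "__main__":
    sample = (
        "W: An error occurred during the signature verification. "
        "GPG error: https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 "
        "focal InRelease: The following signatures couldn't be verified "
        "because the public key is not available: NO_PUBKEY 208CE844D9F220AD"
    )
    for key_id, repos in missing_keys(sample).items():
        print(key_id, sorted(repos))
```

For the output above this gives a single key, 208CE844D9F220AD, affecting both the focal and focal-updates suites of the NVIDIA BaseOS repository.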
I think this happened because the install script deleted some things, so
Problem 1: How do I correct this?
On top of that, this has taken down my GPU drivers.
When I run nvsm show health
I get
sudo nvsm show health
Running standard show health
Info
----
Timestamp : Tue Mar 14 10:43:18 GMT 2023
Version : 22.09.03
Checks
------
Verify installed DIMM memory sticks.................................. Healthy
Verify Network Adapters.............................................. Healthy
Verify installed GPU's............................................... Healthy
Verify installed VGA Controllers..................................... Healthy
Verify PCIe switches................................................. Healthy
Verify DIMM part number.............................................. Healthy
Verify DIMM vendors.................................................. Healthy
Verify DIMM vendors consistency...................................... Healthy
Verify chassis [Serial:1560422011502] fan presence................... Healthy
Quick health check of GPU using DCGM................................. Unknown
No GPU resources found. No response or error while connecting to DCGM.
Could not load NVML library
Status of volumes.................................................... Healthy
Check SMART status of NVMe devices................................... Healthy
Verify installed NVMe devices........................................ Healthy
Verify chassis [Serial:1560422011502] power supply presence.......... Healthy
NVMe link speed [0000:03:00.0][16GT/s]............................... Healthy
NVMe link width [0000:03:00.0][x4]................................... Healthy
NVMe link speed [0000:41:00.0][8GT/s]................................ Healthy
NVMe link width [0000:41:00.0][x4]................................... Healthy
NetworkAdapter link speed [0000:42:00.0][8GT/s]...................... Healthy
NetworkAdapter link width [0000:42:00.0][x4]......................... Healthy
NetworkAdapter link speed [0000:42:00.1][8GT/s]...................... Healthy
NetworkAdapter link width [0000:42:00.1][x4]......................... Healthy
Check BMC sensor thresholds for chassis [Serial: 1560422011502] ..... Healthy
Checked 65 sensor values against BMC thresholds.
BMC Firmware Revision [1.24.00]
BaseOS Version [5.4.2]
BIOS Version [L10.16]
Linux kernel version [5.4.0-144-generic]
System Uptime [up 1 hour, 22 minutes]
Serial Number [SERIAL_NUMBER]
Chassis [...] Power Supply Info
[PSU1] Vendor[Delta] Model[ECD15050001] Serial[...] Firmware[3.8]
System Summary
--------------
Product Name:DGX Station A100 920-23487-2530-0R0
Manufacturer:NVIDIA
Serial Number:1[..]
Uptime:up 1 hour, 22 minutes
MotherBoard:
BIOS Version:L10.16
Serial Number:
BMC:
Firmware Version:1.24.00
IPMI Version:2.0
Software:
BaseOS Version:5.4.2
Kernel Version:5.4.0-144-generic
OS Version:Ubuntu 20.04.5 LTS (Focal Fossa)
Health Summary
--------------
22 out of 23 checks are healthy
0 out of 23 checks are unhealthy
1 out of 23 checks are unknown
0 out of 23 checks are informational
0 out of 23 checks are disabled
0 out of 23 checks are skipped
Overall system status is unhealthy
Problem detected.
1. Please run 'sudo nvsm dump health'
2. Please open a case with NVIDIA Enterprise Support at this address https://nvid.nvidia.com/dashboard/
3. Attach the log file from /tmp/nvsm-health-<hostname>-<timestamp>.tar.xz
100.0% [=========================================]
Status: Unhealthy
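To pick out the one check that is not healthy from the report above, here is a minimal Python sketch. It assumes each check is printed as a name, a run of dots, and a one-word status, which is all I can infer from this output; the function name is hypothetical and not part of nvsm.

```python
import re
from typing import List, Tuple

# A check line looks like "<check name>......... <Status>".
CHECK_LINE = re.compile(r"^(.+?)\s*\.{3,}\s*(\w+)\s*$")


def non_healthy_checks(report: str) -> List[Tuple[str, str]]:
    """Return (check name, status) for every check whose status is not Healthy."""
    problems = []
    for line in report.splitlines():
        match = CHECK_LINE.match(line)
        if match and match.group(2) != "Healthy":
            problems.append((match.group(1), match.group(2)))
    return problems


if __name__ == "__main__":
    sample = "\n".join([
        "Verify installed GPU's............................................... Healthy",
        "Quick health check of GPU using DCGM................................. Unknown",
        "No GPU resources found. No response or error while connecting to DCGM.",
    ])
    for name, status in non_healthy_checks(sample):
        print(f"{status}: {name}")
```

On this report the only hit is the DCGM quick health check, which matches the "Could not load NVML library" message and the missing driver.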
Then I thought that, since Attempt 2.1 had failed, I had better run the setup from within the DGX rather than from my PC. This brings us to the next attempt.
2.2 [Attempt 2.2]
Basically the same as 2.1, but run from the DGX itself.
I changed the hosts file IPs to localhost (I get the same result if I give it the IP assigned to the network interface)
and ran the setup again.
I got
PLAY [all] *****************************************************************************************************************************************************************************
TASK [Gathering Facts] *****************************************************************************************************************************************************************
ok: [127.0.0.1]
TASK [check os] ************************************************************************************************************************************************************************
ok: [127.0.0.1]
TASK [check os version] ****************************************************************************************************************************************************************
ok: [127.0.0.1]
TASK [check disk size sufficient] ******************************************************************************************************************************************************
ok: [127.0.0.1]
TASK [check sufficient memory] *********************************************************************************************************************************************************
ok: [127.0.0.1]
TASK [check sufficient number of cpu cores] ********************************************************************************************************************************************
ok: [127.0.0.1]
TASK [check sudo privileges] ***********************************************************************************************************************************************************
ok: [127.0.0.1]
TASK [capture gpus per node] ***********************************************************************************************************************************************************
changed: [127.0.0.1]
TASK [check not more than 1 gpu per node] **********************************************************************************************************************************************
ok: [127.0.0.1]
TASK [check exactly 1 master] **********************************************************************************************************************************************************
ok: [127.0.0.1 -> localhost]
TASK [capture host details] ************************************************************************************************************************************************************
ok: [127.0.0.1 -> localhost]
TASK [print host details] **************************************************************************************************************************************************************
ok: [127.0.0.1 -> localhost] => {
"host_details": [
{
"host": "dgx",
"os": "Ubuntu",
"os_version": "20.04"
}
]
}
TASK [check all instances have single os] **********************************************************************************************************************************************
ok: [127.0.0.1 -> localhost]
TASK [check all instances have single os version] **************************************************************************************************************************************
ok: [127.0.0.1 -> localhost]
TASK [capture os] **********************************************************************************************************************************************************************
changed: [127.0.0.1 -> localhost]
TASK [capture os version] **************************************************************************************************************************************************************
changed: [127.0.0.1 -> localhost]
PLAY RECAP *****************************************************************************************************************************************************************************
127.0.0.1 : ok=16 changed=3 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
cat: target-os: No such file or directory
Basically, lots of problems. Can you please
- point out where things have gone wrong
- make some suggestions to help fix the DGX (I can reinstall the drivers, but I’d like to get the TAO API running as well)
Cheers,
Ganindu.