In the same window or a separate window? (The last window where I ran the install command is still active, apparently still on "Waiting for the Cluster to become available", and the terminal is not released.)
Do you want me to Ctrl+C that?
Yes, just Ctrl+C to cancel.
Could you run the commands below? Thanks for your patience.
$ bash setup.sh uninstall
$ kubectl delete crd clusterpolicies.nvidia.com
$ bash setup.sh check-inventory.yml
$ bash setup.sh install
No worries! thanks a lot for the support!!
uninstall
bash setup.sh uninstall
Provide the path to the hosts file [./hosts]:
PLAY [all] ********************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [check os] ***************************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [check os version] *******************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [check disk size sufficient] *********************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [check sufficient memory] ************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [check sufficient number of cpu cores] ***********************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [check sudo privileges] **************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [capture gpus per node] **************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [check not more than 1 gpu per node] *************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [check exactly 1 master] *************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]
TASK [capture host details] ***************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]
TASK [print host details] *****************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost] => {
"host_details": [
{
"host": "dgx",
"os": "Ubuntu",
"os_version": "20.04"
}
]
}
TASK [check all instances have single os] *************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]
TASK [check all instances have single os version] *****************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]
TASK [capture os] *************************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]
TASK [capture os version] *****************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]
PLAY RECAP ********************************************************************************************************************************************************************************************************************************************************
172.16.3.2 : ok=16 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
PLAY [all] ********************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [uninstall nvidia using installer] ***************************************************************************************************************************************************************************************************************************
fatal: [172.16.3.2]: FAILED! => {"changed": false, "cmd": "nvidia-installer --uninstall --silent", "msg": "[Errno 2] No such file or directory: b'nvidia-installer'", "rc": 2, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
TASK [uninstall nvidia and cuda drivers] **************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
PLAY RECAP ********************************************************************************************************************************************************************************************************************************************************
172.16.3.2 : ok=3 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=1
PLAY [master] *****************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [Uninstall the GPU Operator with MIG] ************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
PLAY [all] ********************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [Reset Kubernetes component] *********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [IPTables Cleanup] *******************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Remove Conatinerd and Kubernetes packages for Ubuntu] *******************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Remove Docker and Kubernetes packages for Ubuntu] ***********************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Remove NVIDIA Docker for Cloud Native Core Developers] ******************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Remove dependencies that are no longer required] ************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Remove installed packages for RHEL/CentOS] ******************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Cleanup Containerd Process] *********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Cleanup Directories for Cloud Native Core Developers] *******************************************************************************************************************************************************************************************************
skipping: [172.16.3.2] => (item=/etc/docker)
skipping: [172.16.3.2] => (item=/var/lib/docker)
skipping: [172.16.3.2] => (item=/var/run/docker)
skipping: [172.16.3.2] => (item=/run/docker.sock)
skipping: [172.16.3.2] => (item=/run/docker)
TASK [Cleanup Directories] ****************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=/var/lib/etcd)
changed: [172.16.3.2] => (item=/etc/kubernetes)
changed: [172.16.3.2] => (item=/usr/local/bin/helm)
ok: [172.16.3.2] => (item=/var/lib/crio)
ok: [172.16.3.2] => (item=/etc/crio)
ok: [172.16.3.2] => (item=/usr/local/bin/crio)
changed: [172.16.3.2] => (item=/var/log/containers)
ok: [172.16.3.2] => (item=/etc/apt/sources.list.d/devel*)
ok: [172.16.3.2] => (item=/etc/sysctl.d/99-kubernetes-cri.conf)
changed: [172.16.3.2] => (item=/etc/modules-load.d/containerd.conf)
ok: [172.16.3.2] => (item=/etc/modules-load.d/crio.conf)
ok: [172.16.3.2] => (item=/etc/apt/trusted.gpg.d/libcontainers*)
changed: [172.16.3.2] => (item=/etc/default/kubelet)
changed: [172.16.3.2] => (item=/etc/cni/net.d)
TASK [Reboot the system] ******************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
PLAY RECAP ********************************************************************************************************************************************************************************************************************************************************
172.16.3.2 : ok=8 changed=6 unreachable=0 failed=0 skipped=6 rescued=0 ignored=0
delete cluster policy
kubectl delete crd clusterpolicies.nvidia.com
-bash: /usr/bin/kubectl: No such file or directory
check inventory
bash setup.sh check-inventory.yml
Provide the path to the hosts file [./hosts]:
reinstall
bash setup.sh install
output.txt (190.0 KB)
If you want to jump in a teams call let me know please.
P.S. nvidia-smi is not present, and nvsm health is still Unhealthy.
There are similar earlier topics about being stuck at "TASK [Waiting for the Cluster to become available]".
See TAO Toolkit 4.0 setup issue - #19 by Morganh
AutoML installation problem [Waiting for the Cluster to become available] - #7 by Morganh
They were solved with the commands above.
To debug, could you open another terminal and check the logs via the commands below? The asterisks (****) depend on the actual pod names.
$ kubectl get pods
$ kubectl get pod -n nvidia-gpu-operator
$ kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-*****
$ kubectl get pod -n nvidia-gpu-operator nvidia-cuda-validator-****
You can get the name after running
$ kubectl get pod -n nvidia-gpu-operator
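Since the suffix in each pod name is random, the `****` part has to be read off the listing. A minimal sketch of resolving it from a saved copy of the output (the `pods.txt` file and its sample rows are hypothetical stand-ins for the real `kubectl get pod -n nvidia-gpu-operator` output):

```shell
# pods.txt stands in for the real command:
#   kubectl get pod -n nvidia-gpu-operator > pods.txt
cat > pods.txt <<'EOF'
nvidia-container-toolkit-daemonset-pb4dq   0/1   Init:0/1                0   2m6s
nvidia-driver-daemonset-gwmlj              0/1   Init:CrashLoopBackOff   9   25m
EOF

# First column of the first row whose name starts with the daemonset prefix
POD=$(awk '$1 ~ /^nvidia-driver-daemonset-/ {print $1; exit}' pods.txt)
echo "$POD"
```

The resolved name is then what gets passed to `kubectl logs -n nvidia-gpu-operator`.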
I get this
get pods
kubectl get pods
No resources found in default namespace.
kubectl get pod -n nvidia-gpu-operator
kubectl get pod -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-lmrkw 0/1 Init:0/1 0 2m6s
gpu-operator-1678981498-node-feature-discovery-master-79ddmqcr6 1/1 Running 0 2m13s
gpu-operator-1678981498-node-feature-discovery-worker-chxdx 1/1 Running 2 (117s ago) 24m
gpu-operator-7bfc5f55-cgrxl 1/1 Running 0 2m13s
nvidia-container-toolkit-daemonset-pb4dq 0/1 Init:0/1 0 2m6s
nvidia-dcgm-exporter-jp6wr 0/1 Init:0/1 0 2m7s
nvidia-device-plugin-daemonset-pd925 0/1 Init:0/1 0 2m7s
nvidia-driver-daemonset-gwmlj 0/1 Init:CrashLoopBackOff 9 (2m6s ago) 25m
nvidia-operator-validator-5qz4c 0/1 Init:0/4 0 2m6s
I don't seem to get nvidia-gpu-operator nvidia-driver-daemonset-**
or gpu-operator-operator nvidia-cuda-validator-***
This one: nvidia-driver-daemonset-gwmlj
Sorry for the delay
kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-gwmlj
Error from server (BadRequest): container "nvidia-driver-ctr" in pod "nvidia-driver-daemonset-gwmlj" is waiting to start: PodInitializing
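As a side note, a quick way to see which pods in a listing like the one above are not fully up is to filter on the STATUS column. A sketch over a saved copy of the listing (`pods.txt` is a hypothetical stand-in; the sample rows are taken from the output above):

```shell
# pods.txt stands in for: kubectl get pod -n nvidia-gpu-operator > pods.txt
cat > pods.txt <<'EOF'
gpu-operator-7bfc5f55-cgrxl       1/1   Running                 0   2m13s
nvidia-driver-daemonset-gwmlj     0/1   Init:CrashLoopBackOff   9   25m
nvidia-operator-validator-5qz4c   0/1   Init:0/4                0   2m6s
EOF

# Print name and status for every pod whose STATUS is not "Running"
awk '$3 != "Running" {print $1, $3}' pods.txt
```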
Should I run this in the other terminal while TASK [Waiting for the Cluster to become available] is still going, or SIGINT (Ctrl+C) it first?
Yes, keep it as is. Do not Ctrl+C.
kubectl delete crd clusterpolicies.nvidia.com
customresourcedefinition.apiextensions.k8s.io "clusterpolicies.nvidia.com" deleted
That seems to be a change… hmm…
Then monitor the original terminal to check whether it is still stuck.
yeah doing stuff!!!
TASK [Waiting for the Cluster to become available] ****************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Validate kubernetes cluster is up] **************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Operating System Version] *****************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Docker Version] ***************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check Containerd Version] ***********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Kubernetes Version] ***********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Helm Version] *****************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Nvidia GPU Operator Toolkit versions] *****************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Nvidia K8s Device versions] ***************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check GPU Operator Nvidia Container Driver versions] ********************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check GPU Operator Nvidia DGCM Versions] ********************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check GPU Operator Node Feature Discovery Versions] *********************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check GPU Operator GPU Feature Discovery Versions] **********************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Nvidia GPU Operator versions] *************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Nvidia MIG Maanger versions] **************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Nvidia validator versions] ****************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Nvidia DCGM Exporter versions] ************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check NVIDIA Driver Version] ********************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check NVIDIA Container ToolKit Version] *********************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check Mellanox Network Operator version] ********************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check Mellanox MOFED Driver Version] ************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check RDMA Shared Device Plugin Version] ********************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check SRIOV Device Plugin Version] **************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check Container Networking Plugins Version] *****************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check Multus Version] ***************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check Whereabouts Version] **********************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [check master node is up and running] ************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check all pods are running for Kubernetes] ******************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [validate helm installed] ************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Collecting Number of GPU's] *********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Create NVIDIA-SMI yaml] *************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Operating System Version of RHEL/CentOS] *************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Operating System Version of Ubuntu] ******************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Ubuntu Operating System version 20.04.5 LTS (Focal Fossa)"
}
TASK [Report Docker Version] **************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Containerd Version] **********************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": " Containerd Version v1.6.2"
}
TASK [Report Kubernetes Version] **********************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Kubernetes Version v1.23.5"
}
TASK [Report Helm Version] ****************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Helm Version 3.8.1"
}
TASK [Report Nvidia GPU Operator version] *************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Nvidia GPU Operator versions v1.10.1"
}
TASK [Report Nvidia Container Driver Version] *********************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Nvidia Container Driver Version "
}
TASK [Report GPU Operator NV Toolkit Driver] **********************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "NV Container Toolkit Version "
}
TASK [Report Nvidia Container Driver Version] *********************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report GPU Operator NV Toolkit Driver] **********************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report K8sDevice Plugin Version] ****************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Nvidia K8s Device Plugin Version "
}
TASK [Report Data Center GPU Manager (DCGM) Version] **************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Data Center GPU Manager (DCGM) Version "
}
TASK [Report Node Feature Discovery Version] **********************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Node Feature Discovery Version v0.10.1"
}
TASK [Report GPU Feature Discovery Version] ***********************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "GPU Feature Discovery Version "
}
TASK [Report Nvidia validator version] ****************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Nvidia validator version "
}
TASK [Report Nvidia DCGM Exporter version] ************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Nvidia DCGM Exporter version "
}
TASK [Report Nvidia MIG Maanger version] **************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Nvidia MIG Maanger version "
}
TASK [Componenets Matrix Versions Vs Installed Versions] **********************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Report Versions] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": [
"===========================================================================================",
" Components Matrix Version || Installed Version ",
"===========================================================================================",
"GPU Operator Version v1.10.1 || v1.10.1",
"Nvidia Container Driver Version 510.47.03 || ",
"GPU Operator NV Toolkit Driver v1.9.0 || ",
"K8sDevice Plugin Version v0.11.0 || ",
"Data Center GPU Manager(DCGM) Version 2.3.4-2.6.4 || ",
"Node Feature Discovery Version v0.10.1 || v0.10.1",
"GPU Feature Discovery Version v0.5.0 || ",
"Nvidia validator version v1.10.1 || ",
"Nvidia MIG Manager version 0.3.0 || ",
"",
"Note: NVIDIA Mig Manager is valid for only Amphere GPU's like A100, A30",
"",
"Please validate between Matrix Version and Installed Version listed above"
]
}
TASK [Componenets Matrix Versions Vs Installed Versions] **********************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Versions] ********************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Componenets Matrix Versions Vs Installed Versions] **********************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Versions] ********************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Validate the GPU Operator pods State] ***********************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Report GPU Operator Pods] ***********************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": [
"nvidia-gpu-operator gpu-operator-1678981498-node-feature-discovery-master-79ddznwhj 1/1 Running 0 5m35s",
"nvidia-gpu-operator gpu-operator-1678981498-node-feature-discovery-worker-chxdx 1/1 Running 4 (5m17s ago) 43m",
"nvidia-gpu-operator gpu-operator-7bfc5f55-ghhvs 1/1 Running 0 5m35s"
]
}
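Worth noting: the pod list above shows only the node-feature-discovery pods and the operator itself. The driver, container-toolkit, device-plugin, and DCGM daemonset pods are absent, which is consistent with the blank "Installed Version" entries in the version matrix earlier. A hedged follow-up check (commands assume the default `nvidia-gpu-operator` namespace used in this log):

```shell
# List everything the GPU operator has deployed; on a healthy install you would
# also expect nvidia-driver-daemonset, nvidia-container-toolkit-daemonset,
# nvidia-device-plugin-daemonset, and dcgm-exporter pods in Running state.
kubectl get pods -n nvidia-gpu-operator -o wide
kubectl get daemonsets -n nvidia-gpu-operator
```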
TASK [Validate GPU Operator Version for Cloud Native Core 6.2 and 7.0] ********************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Validate GPU Operator Version for Cloud Native Core 6.1] ****************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Validating the nvidia-smi on NVIDIA Cloud Native Core] ******************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Validating the nvidia-smi on NVIDIA Cloud Native Core] ******************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Validating the nvidia-smi on NVIDIA Cloud Native Core] ******************************************************************************************************************************************************************************************************
ASYNC FAILED on 172.16.3.2: jid=184764697792.434153
fatal: [172.16.3.2]: FAILED! => {"ansible_job_id": "184764697792.434153", "changed": true, "cmd": ["kubectl", "run", "gpu-test", "--rm", "-t", "-i", "--restart=Never", "--image=nvidia/cuda:11.6.0-base-ubuntu20.04", "--limits=nvidia.com/gpu=", "--", "nvidia-smi"], "delta": "0:00:11.738078", "end": "2023-03-16 16:29:29.685476", "finished": 1, "msg": "non-zero return code", "rc": 128, "results_file": "/home/g/.ansible_async/184764697792.434153", "start": "2023-03-16 16:29:17.947398", "started": 1, "stderr": "Flag --limits has been deprecated, has no effect and will be removed in 1.24.\npod default/gpu-test terminated (StartError)\nfailed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: \"nvidia-smi\": executable file not found in $PATH: unknown", "stderr_lines": ["Flag --limits has been deprecated, has no effect and will be removed in 1.24.", "pod default/gpu-test terminated (StartError)", "failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: \"nvidia-smi\": executable file not found in $PATH: unknown"], "stdout": "pod \"gpu-test\" deleted", "stdout_lines": ["pod \"gpu-test\" deleted"]}
...ignoring
TASK [Report Nvidia SMI Validation] *******************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": [
"pod \"gpu-test\" deleted"
]
}
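Two things stand out in the failed validation above: the logged command used `--limits=nvidia.com/gpu=` with an empty count, so the pod may never have requested a GPU, and "`nvidia-smi`: executable file not found in $PATH" is also the classic symptom of the driver/toolkit daemonsets not being in place yet (they were missing from the pod list earlier). A hedged manual retry, explicitly requesting one GPU:

```shell
# Manual re-run of the validation pod (sketch; --limits is deprecated and was
# removed in kubectl 1.24 -- on newer clusters use an overrides/manifest instead).
kubectl run gpu-test --rm -t -i --restart=Never \
  --image=nvidia/cuda:11.6.0-base-ubuntu20.04 \
  --limits=nvidia.com/gpu=1 -- nvidia-smi
```

If this still fails with the same error even with a GPU requested, the driver stack on the node is the more likely culprit than the pod spec.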
TASK [Validating the CUDA with GPU] *******************************************************************************************************************************************************************************************************************************
ASYNC POLL on 172.16.3.2: jid=754237448857.435110 started=1 finished=0
ASYNC POLL on 172.16.3.2: jid=754237448857.435110 started=1 finished=0
ASYNC POLL on 172.16.3.2: jid=754237448857.435110 started=1 finished=0
ASYNC FAILED on 172.16.3.2: jid=754237448857.435110
fatal: [172.16.3.2]: FAILED! => {"ansible_job_id": "754237448857.435110", "changed": true, "cmd": "kubectl run cuda-vector-add --rm -t -i --restart=Never --image=k8s.gcr.io/cuda-vector-add:v0.1", "delta": "0:01:00.060847", "end": "2023-03-16 16:30:33.716430", "finished": 1, "msg": "non-zero return code", "rc": 1, "results_file": "/home/g/.ansible_async/754237448857.435110", "start": "2023-03-16 16:29:33.655583", "started": 1, "stderr": "error: timed out waiting for the condition", "stderr_lines": ["error: timed out waiting for the condition"], "stdout": "pod \"cuda-vector-add\" deleted", "stdout_lines": ["pod \"cuda-vector-add\" deleted"]}
...ignoring
TASK [Validating the nvidia-smi on NVIDIA Cloud Native Core 7.0] **************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Nvidia SMI Validation] *******************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Cuda Validation] *************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": [
"pod \"cuda-vector-add\" deleted"
]
}
TASK [Report Network Operator version] ****************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Mellanox MOFED Driver Version] ***********************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report RDMA Shared Device Plugin Version] *******************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report SRIOV Device Plugin Version] *************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Container Networking Plugin Version] *****************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Multus Version] **************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Whereabouts Version] *********************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Status Check] ***********************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [debug] ******************************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "All tasks should be changed or ok, if it's failed or ignoring means that validation task failed."
}
PLAY RECAP ********************************************************************************************************************************************************************************************************************************************************
172.16.3.2 : ok=47 changed=27 unreachable=0 failed=0 skipped=30 rescued=0 ignored=2
PLAY [all] ********************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [install nfs-common] *****************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
PLAY [master[0]] **************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [install nginx ingress controller] ***************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx --force-update)
changed: [172.16.3.2] => (item=helm repo update)
changed: [172.16.3.2] => (item=helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx --set controller.service.type=NodePort --set controller.service.nodePorts.http=32080 --set controller.service.nodePorts.https=32443)
TASK [install nfs-kernel-server] **********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [create export directory] ************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [create export config] ***************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [export filesystem] ******************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [restart nfs-kernel-server] **********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [install storage provisioner] ********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/ --force-update)
changed: [172.16.3.2] => (item=helm repo update)
changed: [172.16.3.2] => (item=helm upgrade --install --reset-values --cleanup-on-fail --create-namespace --atomic --wait nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner --set nfs.server=172.16.3.2 --set nfs.path=/mnt/nfs_share)
TASK [setup imagepullsecret] **************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=kubectl delete secret 'imagepullsecret' --ignore-not-found)
changed: [172.16.3.2] => (item=kubectl create secret docker-registry 'imagepullsecret' --docker-server='nvcr.io' --docker-username='$oauthtoken' --docker-password=a3NvZ2xvcnUxOWNpMjcxM201YzdnMjdtN3Y6ZTZiZDIzNWItNTc3Mi00OTY3LWI3YTQtMmFiYzIzMDNjMjEx --docker-email=ganindu@gmail.com --namespace='default')
TASK [capture node names] *****************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [setup nvidia-smi on each node] ******************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=dgx)
TASK [label nodes with accelerator] *******************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=dgx)
TASK [copy tao-toolkit-api helm values to master] *****************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [install tao-toolkit-api] ************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
PLAY [all] ********************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [capture cluster ips] ****************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]
TASK [store cluster ips] ******************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [copy ~/.kube/config to /tmp/kube-config] ********************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [ensure api endpoint uses publicly accessible ip] ************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [generify context name] **************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [replace generic cluster user and context name in kubeconfig with provided cluster name] *********************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [fetch file /tmp/cluster_ips from remote to local] ***********************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [fetch /tmp/kube-config from remote to local] ****************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
PLAY [localhost] **************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [localhost]
TASK [register cluster ips] ***************************************************************************************************************************************************************************************************************************************
changed: [localhost]
TASK [add an kubernetes apt signing key for ubuntu] ***************************************************************************************************************************************************************************************************************
ok: [localhost]
TASK [adding kubernetes apt repository for ubuntu] ****************************************************************************************************************************************************************************************************************
ok: [localhost]
TASK [install kubectl] ********************************************************************************************************************************************************************************************************************************************
ok: [localhost]
TASK [check if helm is installed] *********************************************************************************************************************************************************************************************************************************
changed: [localhost]
TASK [install helm on ubuntu 18.04] *******************************************************************************************************************************************************************************************************************************
skipping: [localhost] => (item=curl -O https://get.helm.sh/helm-v3.3.3-linux-amd64.tar.gz)
skipping: [localhost] => (item=tar -xvzf helm-v3.3.3-linux-amd64.tar.gz)
skipping: [localhost] => (item=cp linux-amd64/helm /usr/local/bin/)
skipping: [localhost] => (item=chmod 755 /usr/local/bin/helm)
skipping: [localhost] => (item=rm -rf helm-v3.3.3-linux-amd64.tar.gz linux-amd64)
TASK [install helm on ubuntu 20.04] *******************************************************************************************************************************************************************************************************************************
skipping: [localhost] => (item=curl -O https://get.helm.sh/helm-v3.8.1-linux-amd64.tar.gz)
skipping: [localhost] => (item=tar -xvzf helm-v3.8.1-linux-amd64.tar.gz)
skipping: [localhost] => (item=cp linux-amd64/helm /usr/local/bin/)
skipping: [localhost] => (item=chmod 755 /usr/local/bin/helm)
skipping: [localhost] => (item=rm -rf helm-v3.8.1-linux-amd64.tar.gz linux-amd64)
TASK [create kube directory] **************************************************************************************************************************************************************************************************************************************
ok: [localhost]
TASK [ensure kubeconfig file exists] ******************************************************************************************************************************************************************************************************************************
changed: [localhost]
TASK [merge kubeconfig to existing] *******************************************************************************************************************************************************************************************************************************
changed: [localhost]
TASK [make merged-kubeconfig default] *****************************************************************************************************************************************************************************************************************************
changed: [localhost]
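After the kubeconfig merge above, a quick sanity check (a sketch, assuming the merged config became the default as the task name suggests) confirms the new context exists and the cluster answers:

```shell
# Verify the merged kubeconfig: the new cluster's context should be listed
# and selectable, and the node(s) should report Ready.
kubectl config get-contexts
kubectl get nodes -o wide
```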
PLAY RECAP ********************************************************************************************************************************************************************************************************************************************************
172.16.3.2 : ok=22 changed=15 unreachable=0 failed=0 skipped=3 rescued=0 ignored=0
localhost : ok=10 changed=5 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0
(K8PY) g@dgx:~/Workspace/sandbox/TAO/getting_started_v4.0.0/setup/quickstart_api_bare_metal$
I think the block is gone now! Great stuff. (Can I now register this with my kube master on the CPU server?)
Great. It is working now.
Then, to get started, you can download the notebooks (see Remote Client - NVIDIA Docs) or refer to the blog: https://developer.nvidia.com/blog/training-like-an-ai-pro-using-tao-automl/
Thanks a lot! I think the trick was effectively to delete the CRD.
However, I still don't have nvidia-smi, and nvsm show health still reports Unhealthy.
We want to be able to keep using the DGX as we did before as well. (Is that a deal breaker? If so, we might sadly have to revert all of this.)
Officially, the current method installs the TAO API with the GPU driver uninstalled.
That said, there are ways to install the TAO API on non-bare-metal machines, for users who do not want to uninstall the GPU driver and other software.
We propose installing Kubernetes first and then deploying the TAO API. It is a bit involved and not yet mentioned in the user guide. You can first follow this guide to install a Kubernetes cluster with GPU support: https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#option-2-installing-kubernetes-using-kubeadm . Then follow https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_deployment.html to install tao-toolkit-api.
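The alternative flow described above can be outlined roughly as follows. This is a sketch only: the chart URL, version, and flags are assumptions based on the TAO 4.0 era of the linked api_deployment guide, which remains the authoritative reference.

```shell
# Hypothetical outline (chart name/version assumed; consult the linked guide):
# 1. Stand up Kubernetes with GPU support per the NVIDIA Cloud Native docs.
# 2. Fetch and install the TAO Toolkit API chart with your NGC credentials.
helm fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.0.tgz \
  --username='$oauthtoken' --password="$NGC_API_KEY"
helm install tao-toolkit-api tao-toolkit-api-4.0.0.tgz --namespace default
```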