TAO Toolkit 4.0 setup issue


I am using an RTX 3090 GPU and trying to set up TAO Toolkit 4.0 in the api_baremetal environment.

After running bash setup.sh install, the playbook hangs indefinitely at:
TASK [Waiting for the Cluster to become available]

The gpu-operator pod in the nvidia-gpu-operator namespace is still stuck in Init. This is the gpu-operator pod event log:

Warning  FailedCreatePodSandBox  2m3s (x141 over 32m)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

The calico-node pod has the same problem. This is the calico-node event log:

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  7m35s                  default-scheduler  Successfully assigned kube-system/calico-node-759mk to mykim
  Normal   Started    7m32s                  kubelet            Started container install-cni
  Normal   Pulled     7m32s                  kubelet            Container image "docker.io/calico/cni:v3.21.6" already present on machine
  Normal   Created    7m32s                  kubelet            Created container upgrade-ipam
  Normal   Started    7m32s                  kubelet            Started container upgrade-ipam
  Normal   Pulled     7m32s                  kubelet            Container image "docker.io/calico/cni:v3.21.6" already present on machine
  Normal   Created    7m32s                  kubelet            Created container install-cni
  Normal   Pulled     7m31s                  kubelet            Container image "docker.io/calico/pod2daemon-flexvol:v3.21.6" already present on machine
  Normal   Created    7m31s                  kubelet            Created container flexvol-driver
  Normal   Started    7m31s                  kubelet            Started container flexvol-driver
  Normal   Started    7m30s                  kubelet            Started container calico-node
  Normal   Pulled     7m30s                  kubelet            Container image "docker.io/calico/node:v3.21.6" already present on machine
  Normal   Created    7m30s                  kubelet            Created container calico-node
  Warning  Unhealthy  7m28s (x2 over 7m29s)  kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
  Warning  Unhealthy  7m20s                  kubelet            Readiness probe failed: 2022-12-23 02:37:56.054 [INFO][794] confd/health.go 180: Number of node(s) with BGP peering established = 0
                                                                calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  7m10s                  kubelet            Readiness probe failed: 2022-12-23 02:38:06.049 [INFO][1519] confd/health.go 180: Number of node(s) with BGP peering established = 0
                                                                calico/node is not ready: BIRD is not ready: BGP not established with 192.168.2.118
  Warning  Unhealthy  7m                     kubelet            Readiness probe failed: 2022-12-23 02:38:16.050 [INFO][2208] confd/health.go 180: Number of node(s) with BGP peering established = 1
                                                                calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  6m50s                  kubelet            Readiness probe failed: 2022-12-23 02:38:26.045 [INFO][2880] confd/health.go 180: Number of node(s) with BGP peering established = 1
                                                                calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  6m30s                  kubelet            Readiness probe failed: 2022-12-23 02:38:46.036 [INFO][4290] confd/health.go 180: Number of node(s) with BGP peering established = 1
                                                                calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  6m20s                  kubelet            Readiness probe failed: 2022-12-23 02:38:56.057 [INFO][4970] confd/health.go 180: Number of node(s) with BGP peering established = 1
                                                                calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  6m10s                  kubelet            Readiness probe failed: 2022-12-23 02:39:06.058 [INFO][5659] confd/health.go 180: Number of node(s) with BGP peering established = 1
                                                                calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  6m10s                  kubelet            Readiness probe failed: 2022-12-23 02:39:06.144 [INFO][5684] confd/health.go 180: Number of node(s) with BGP peering established = 1
                                                                calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  2m31s (x26 over 6m)    kubelet            (combined from similar events): Readiness probe failed: 2022-12-23 02:42:45.505 [INFO][21279] confd/health.go 180: Number of node(s) with BGP peering established = 1
                                                                calico/node is not ready: felix is not ready: readiness probe reporting 503

How do I solve this?


I’m also having the same issue in the following topic: How to Deploy TAO 4.0 (with AutoML) Support without Kubernetes?

I strongly believe there should be a standalone Dockerfile/Docker image deployment for the whole set of TAO Toolkit API services. Having both Ansible and Kubernetes in the stack makes troubleshooting this unnecessarily complex deployment process very painful.


I will sync with the internal team about your request. That said, users can use the provided one-click deploy script to deploy either on a bare-metal setup or on a managed Kubernetes service like Amazon EKS. Jupyter notebooks for training with the APIs directly or via the client app are provided under notebooks/api_starter_kit.
See more info in the TAO Toolkit Quick Start Guide β€” TAO Toolkit 4.0 documentation and https://developer.nvidia.com/blog/training-like-an-ai-pro-using-tao-automl/

For the "Waiting for the Cluster to become available" issue, to narrow it down, could you try a single-node deployment? For a single-node deployment, listing only the master is enough. See the "hosts" file for more details.
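For reference, a minimal single-node hosts file could look like the following (the IP address, user name, and password here are placeholders, not values from this thread):

[master]
<master_ip> ansible_ssh_user='<user>' ansible_ssh_pass='<password>' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'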

I just followed the blog to set up the TAO API on two machines (one master and one node), and the installation worked well.
Could you check your hosts file?

I am using Ubuntu 20.04 after a fresh format.

I'm also following this topic: the documentation and the blog.

I used the command ngc registry resource download-version "nvidia/tao/tao-getting-started:4.0.0"

   Transfer id: tao-getting-started_v4.0.0
   Download status: Completed
   Downloaded local path: /home/ubuntu/tao-getting-started_v4.0.0
   Total files downloaded: 375
   Total downloaded size: 2.43 MB
   Started at: 2022-12-26 15:37:06.390305
   Completed at: 2022-12-26 15:37:21.413422
   Duration taken: 15s
-----------------------------------------

In accordance with the guidelines, I tried to enter the path cd tao-getting-started_v4.0.0/cv/resource/setup/quickstart_api_bare_metal,
but my path is different:
cd tao-getting-started_v4.0.0/setup/quickstart_api_bare_metal
It contains part of that path, but I think the version is a little different.

Also, to answer your question, I tried:

[master]
127.0.0.2 ansible_ssh_user='ubuntu' ansible_ssh_pass='password' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'

[master]
[192.168.1.XX<IP Address>] ansible_ssh_user='ubuntu' ansible_ssh_pass='password'

kubectl get pods -n nvidia-gpu-operator


NAME                                                              READY   STATUS                  RESTARTS       AGE
gpu-feature-discovery-85dvf                                       0/1     Init:0/1                0              15m
gpu-operator-1672039233-node-feature-discovery-master-5479ktvt8   1/1     Running                 0              15m
gpu-operator-1672039233-node-feature-discovery-worker-2txp6       1/1     Running                 0              14m
gpu-operator-7bfc5f55-tn7r5                                       1/1     Running                 0              15m
nvidia-container-toolkit-daemonset-8jf4h                          0/1     Init:0/1                0              15m
nvidia-dcgm-exporter-jd8cz                                        0/1     Init:0/1                0              15m
nvidia-device-plugin-daemonset-wr8mr                              0/1     Init:0/1                0              15m
nvidia-driver-daemonset-zt9cg                                     0/1     Init:CrashLoopBackOff   7 (4m3s ago)   15m
nvidia-operator-validator-6x4d9                                   0/1     Init:0/4                0              15m

This is my second attempt, with the same result; it's still stuck in this state.

In the blog, there is a small mismatch in the path. Your path is correct.

Can you open a new terminal to run
$ kubectl get pods

Originally, how did you set the hosts file? Can you share the content?

Can you set something similar to below and retry?

[master]
master_ip ansible_ssh_user='master_name' ansible_ssh_pass='master_passwd'

[nodes]
user_ip ansible_ssh_user='node_name' ansible_ssh_pass='node_passwd'
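Before rerunning setup.sh, a quick Ansible ad-hoc ping can confirm that the machines are reachable with these credentials (this assumes Ansible is already installed and the inventory file is named hosts):

$ ansible -i hosts all -m ping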

$ kubectl get pods
No resources found in default namespace.

$ kubectl get pods -n nvidia-gpu-operator

NAME                                                              READY   STATUS                  RESTARTS          AGE
gpu-feature-discovery-85dvf                                       0/1     Init:0/1                0                 18h
gpu-operator-1672039233-node-feature-discovery-master-5479ktvt8   1/1     Running                 0                 18h
gpu-operator-1672039233-node-feature-discovery-worker-2txp6       1/1     Running                 0                 18h
gpu-operator-7bfc5f55-tn7r5                                       1/1     Running                 0                 18h
nvidia-container-toolkit-daemonset-8jf4h                          0/1     Init:0/1                0                 18h
nvidia-dcgm-exporter-jd8cz                                        0/1     Init:0/1                0                 18h
nvidia-device-plugin-daemonset-wr8mr                              0/1     Init:0/1                0                 18h
nvidia-driver-daemonset-zt9cg                                     0/1     Init:CrashLoopBackOff   221 (3m37s ago)   18h
nvidia-operator-validator-6x4d9                                   0/1     Init:0/4                0                 18h

$ kubectl get pods -n kube-system

calico-kube-controllers-7f76d48f74-nsph5   1/1     Running   0          18h
calico-node-xbsc7                          1/1     Running   0          18h
coredns-64897985d-dlknr                    1/1     Running   0          18h
coredns-64897985d-wt2jq                    1/1     Running   0          18h
etcd-mykim118                              1/1     Running   0          18h
kube-apiserver-mykim118                    1/1     Running   0          18h
kube-controller-manager-mykim118           1/1     Running   0          18h
kube-proxy-q8zkw                           1/1     Running   0          18h
kube-scheduler-mykim118                    1/1     Running   0          18h

I think I typed it incorrectly in the web editor; each of those entries was actually just a representative example.

This is the last one:

[master]
127.0.0.2 ansible_ssh_user='ubuntu' ansible_ssh_pass='user1'

For your information, the ubuntu account has sudo privileges:
echo "ubuntu ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers

I also tried ansible_ssh_user='root', but the result is the same.

Can you check the logs for the failed pod?
For example,
$ kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-zt9cg
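Since the main nvidia-driver-ctr container is still in PodInitializing, the logs of the failing init container are usually more informative; the -c flag selects it (pod name taken from your listing above):

$ kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-zt9cg -c k8s-driver-manager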

The result is:

stream logs failed container "nvidia-driver-ctr" in pod "nvidia-driver-daemonset-xnmnz" is waiting to start: PodInitializing for nvidia-gpu-operator/nvidia-driver-daemonset-xnmnz (nvidia-driver-ctr)

To add more information, this is the output of the describe command:

Name:                 nvidia-driver-daemonset-xnmnz
Namespace:            nvidia-gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 mykim/192.168.2.118
Start Time:           Tue, 03 Jan 2023 11:21:19 +0900
Labels:               app=nvidia-driver-daemonset
                      controller-revision-hash=589ff6c946
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: 119b07b7509f1ddf4335f907512a69f645e328c16cd88f2dd4f1ac4b401279c2
                      cni.projectcalico.org/podIP: 192.168.34.132/32
                      cni.projectcalico.org/podIPs: 192.168.34.132/32
Status:               Pending
IP:                   192.168.34.132
IPs:
  IP:           192.168.34.132
Controlled By:  DaemonSet/nvidia-driver-daemonset
Init Containers:
  k8s-driver-manager:
    Container ID:  containerd://21a3bec1df5b73394219bbb699cf9323d01367963fe43957ef56c88329b8afda
    Image:         nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.3.0
    Image ID:      nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5b16056257acc51b517d9cdb1da3218693cefc214af93789e6e214fd2b4cacf1
    Port:          <none>
    Host Port:     <none>
    Command:
      driver-manager
    Args:
      uninstall_driver
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 03 Jan 2023 11:32:35 +0900
      Finished:     Tue, 03 Jan 2023 11:32:36 +0900
    Ready:          False
    Restart Count:  7
    Environment:
      NODE_NAME:                    (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:      void
      ENABLE_AUTO_DRAIN:           true
      DRAIN_USE_FORCE:             false
      DRAIN_POD_SELECTOR_LABEL:
      DRAIN_TIMEOUT_SECONDS:       0s
      DRAIN_DELETE_EMPTYDIR_DATA:  false
      OPERATOR_NAMESPACE:          nvidia-gpu-operator (v1:metadata.namespace)
    Mounts:
      /run/nvidia from run-nvidia (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l2mph (ro)
Containers:
  nvidia-driver-ctr:
    Container ID:
    Image:         nvcr.io/nvidia/driver:510.47.03-ubuntu20.04
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-driver
    Args:
      init
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /dev/log from dev-log (rw)
      /host-etc/os-release from host-os-release (ro)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /run/nvidia from run-nvidia (rw)
      /run/nvidia-topologyd from run-nvidia-topologyd (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l2mph (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  var-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:
  dev-log:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:
  mlnx-ofed-usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers/usr/src
    HostPathType:  DirectoryOrCreate
  run-mellanox-drivers:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  kube-api-access-l2mph:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.driver=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  14m                  default-scheduler  Successfully assigned nvidia-gpu-operator/nvidia-driver-daemonset-xnmnz to mykim
  Normal   Pulling    14m                  kubelet            Pulling image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.3.0"
  Normal   Pulled     14m                  kubelet            Successfully pulled image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.3.0" in 8.932693332s
  Normal   Created    12m (x5 over 14m)    kubelet            Created container k8s-driver-manager
  Normal   Started    12m (x5 over 14m)    kubelet            Started container k8s-driver-manager
  Normal   Pulled     12m (x4 over 13m)    kubelet            Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.3.0" already present on machine
  Warning  BackOff    4m7s (x46 over 13m)  kubelet            Back-off restarting failed container

I restarted the pod and collected the logs immediately:

stream logs failed container "nvidia-driver-ctr" in pod "nvidia-driver-daemonset-thjj8" is waiting to start: PodInitializing for nvidia-gpu-operator/nvidia-driver-daemonset-thjj8 (nvidia-driver-ctr) 

k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.dcgm=true'
k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.mig-manager='
k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.nvsm='
k8s-driver-manager Uncordoning node mykim...
k8s-driver-manager node/mykim already uncordoned
k8s-driver-manager Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
k8s-driver-manager node/mykim not labeled
k8s-driver-manager Unloading nouveau driver...
k8s-driver-manager rmmod: ERROR: Module nouveau is in use
k8s-driver-manager Failed to unload nouveau driver
Stream closed EOF for nvidia-gpu-operator/nvidia-driver-daemonset-thjj8 (k8s-driver-manager)
stream logs failed container "nvidia-driver-ctr" in pod "nvidia-driver-daemonset-thjj8" is waitin

Could you try the following?

Blacklist the nouveau driver (the echo is wrapped in sudo bash -c so the redirection also runs with root privileges):
sudo bash -c 'echo -e "\nblacklist nouveau\noptions nouveau modeset=0\n" >> /etc/modprobe.d/blacklist-nouveau.conf'

Regenerate the kernel initramfs:
sudo update-initramfs -u

Reboot:
sudo reboot
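After the reboot, it is worth confirming that the nouveau module is no longer loaded; this check is an extra suggestion, not part of the original steps (no output means nouveau is unloaded):

$ lsmod | grep nouveau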

Refer to the Setup β€” TAO Toolkit 3.22.05 documentation.

Please run the workaround below when the issue happens.
$ kubectl delete crd clusterpolicies.nvidia.com
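As a suggested sanity check before deleting, you can confirm the CRD is actually present:

$ kubectl get crd clusterpolicies.nvidia.com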

I followed the procedure for the "k8s-driver-manager Failed to unload nouveau driver" issue and ran the bash setup.sh install process again.

When rebooting the OS, Ubuntu 20.04 got stuck on the loading screen, and I had to use grub rescue to fix the Linux boot failure.

But the pod gpu-operator-7bfc5f55-8jx8r keeps restarting:

$ kubectl get pod -n nvidia-gpu-operator


NAME                                                              READY   STATUS             RESTARTS      AGE
gpu-operator-1672814931-node-feature-discovery-master-b6f69h5fd   1/1     Running            8 (21m ago)   18h
gpu-operator-1672814931-node-feature-discovery-worker-fqx67       1/1     Running            8 (21m ago)   18h
gpu-operator-7bfc5f55-8jx8r                                       0/1     CrashLoopBackOff   5 (85s ago)   18m

I have attached the pod log:

1.672879965216436e+09    INFO    controller-runtime.metrics    Metrics server is starting to listen
1.672879965216878e+09    INFO    setup    starting manager
1.6728799652173266e+09    INFO    Starting server    {"kind": "health probe", "addr": ":8081"}
1.6728799652173545e+09    INFO    Starting server    {"path": "/metrics", "kind": "metrics", "addr":
I0105 00:52:45.217405       1 leaderelection.go:248] attempting to acquire leader lease nvidia-gpu-o
I0105 00:53:03.509532       1 leaderelection.go:258] successfully acquired lease nvidia-gpu-operator
1.672879983509869e+09    INFO    controller.clusterpolicy-controller    Starting EventSource    {"so
1.6728799835099726e+09    INFO    controller.clusterpolicy-controller    Starting EventSource    {"s
1.6728799835099897e+09    INFO    controller.clusterpolicy-controller    Starting EventSource    {"s
1.6728799835099986e+09    INFO    controller.clusterpolicy-controller    Starting Controller
1.6728799835095918e+09    DEBUG    events    Normal    {"object": {"kind":"ConfigMap","namespace":"n
1.6728799835100212e+09    DEBUG    events    Normal    {"object": {"kind":"Lease","namespace":"nvidi
1.6728799837144682e+09    ERROR    controller-runtime.source    if kind is a CRD, it should be insta
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.poll
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:580
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
I0105 00:53:04.861065       1 request.go:665] Waited for 1.040331054s due to client-side throttling,
1.6728799854137976e+09    ERROR    controllers.ClusterPolicy    Unable to list ClusterPolicies    {"
sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).mapAndEnqueue

Can you run
$ bash setup.sh uninstall
$ kubectl delete crd clusterpolicies.nvidia.com
$ bash setup.sh install
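Once the reinstall has finished, a convenient way to watch the GPU operator pods come up is the -w (watch) flag; this is just an extra suggestion on top of the steps above:

$ kubectl get pods -n nvidia-gpu-operator -w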

$ kubectl delete crd clusterpolicies.nvidia.com works!
Thank you very much!
