TAO Toolkit 4.0 setup issue

Please provide the following information when requesting support.

I am using a 3090 GPU, and I want to use TAO Toolkit 4.0 in the api_baremetal environment.

After bash setup.sh install, it hangs at

TASK [Waiting for the Cluster to become available]

and waits endlessly.

The gpu-operator pod in the nvidia-gpu-operator namespace is still in Init. This is the gpu-operator pod event log:

Warning FailedCreatePodSandBox 2m3s (x141 over 32m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
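
(For context: this error usually means containerd has no runtime named "nvidia" registered. On a healthy node, the NVIDIA Container Toolkit adds an entry to /etc/containerd/config.toml roughly like the sketch below; treat it as illustrative only, since the exact keys and the BinaryName path can vary with the containerd and toolkit versions.)

   version = 2
   [plugins."io.containerd.grpc.v1.cri".containerd]
     default_runtime_name = "nvidia"
     [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
       runtime_type = "io.containerd.runc.v2"
       [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
         BinaryName = "/usr/bin/nvidia-container-runtime"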

The calico-node pod has the same problem. This is the calico-node event log:

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  7m35s                  default-scheduler  Successfully assigned kube-system/calico-node-759mk to mykim
  Normal   Started    7m32s                  kubelet            Started container install-cni
  Normal   Pulled     7m32s                  kubelet            Container image "docker.io/calico/cni:v3.21.6" already present on machine
  Normal   Created    7m32s                  kubelet            Created container upgrade-ipam
  Normal   Started    7m32s                  kubelet            Started container upgrade-ipam
  Normal   Pulled     7m32s                  kubelet            Container image "docker.io/calico/cni:v3.21.6" already present on machine
  Normal   Created    7m32s                  kubelet            Created container install-cni
  Normal   Pulled     7m31s                  kubelet            Container image "docker.io/calico/pod2daemon-flexvol:v3.21.6" already present on machine
  Normal   Created    7m31s                  kubelet            Created container flexvol-driver
  Normal   Started    7m31s                  kubelet            Started container flexvol-driver
  Normal   Started    7m30s                  kubelet            Started container calico-node
  Normal   Pulled     7m30s                  kubelet            Container image "docker.io/calico/node:v3.21.6" already present on machine
  Normal   Created    7m30s                  kubelet            Created container calico-node
  Warning  Unhealthy  7m28s (x2 over 7m29s)  kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
  Warning  Unhealthy  7m20s                  kubelet            Readiness probe failed: 2022-12-23 02:37:56.054 [INFO][794] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  7m10s                  kubelet            Readiness probe failed: 2022-12-23 02:38:06.049 [INFO][1519] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.2.118
  Warning  Unhealthy  7m                     kubelet            Readiness probe failed: 2022-12-23 02:38:16.050 [INFO][2208] confd/health.go 180: Number of node(s) with BGP peering established = 1
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  6m50s                  kubelet            Readiness probe failed: 2022-12-23 02:38:26.045 [INFO][2880] confd/health.go 180: Number of node(s) with BGP peering established = 1
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  6m30s                  kubelet            Readiness probe failed: 2022-12-23 02:38:46.036 [INFO][4290] confd/health.go 180: Number of node(s) with BGP peering established = 1
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  6m20s                  kubelet            Readiness probe failed: 2022-12-23 02:38:56.057 [INFO][4970] confd/health.go 180: Number of node(s) with BGP peering established = 1
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  6m10s                  kubelet            Readiness probe failed: 2022-12-23 02:39:06.058 [INFO][5659] confd/health.go 180: Number of node(s) with BGP peering established = 1
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  6m10s                  kubelet            Readiness probe failed: 2022-12-23 02:39:06.144 [INFO][5684] confd/health.go 180: Number of node(s) with BGP peering established = 1
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  2m31s (x26 over 6m)    kubelet            (combined from similar events): Readiness probe failed: 2022-12-23 02:42:45.505 [INFO][21279] confd/health.go 180: Number of node(s) with BGP peering established = 1
calico/node is not ready: felix is not ready: readiness probe reporting 503
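
(For reference: on a fresh single-node cluster, these BIRD/BGP readiness warnings are often transient while pod networking settles. Whether calico-node eventually recovers can be checked with plain kubectl, nothing TAO-specific:)

$ kubectl get pods -n kube-system -o wide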

How do I solve this?


I’m also having the same issue in the following topic: How to Deploy TAO 4.0 (with AutoML) Support without Kubernetes?

I strongly believe there should be a standalone Dockerfile/Docker image deployment for the whole set of TAO Toolkit API services. Having both Ansible and Kubernetes in the stack causes a lot of pain when troubleshooting this unnecessarily complex deployment process.


I will sync with the internal team about your request. In the meantime, users can use the provided one-click deploy script to deploy either on a bare-metal setup or on a managed Kubernetes service like Amazon EKS. Jupyter notebooks to train using the APIs directly, or via the client app, are provided under notebooks/api_starter_kit.
See more info in the TAO Toolkit Quick Start Guide — TAO Toolkit 4.0 documentation and https://developer.nvidia.com/blog/training-like-an-ai-pro-using-tao-automl/

For "Waiting for the Cluster to become available": to narrow this down, could you try a single-node deployment? Listing only the master is enough; see the "hosts" file for details.

I just followed the blog to set up the TAO API on two machines (one master and one node), and the installation works well.
Could you check your hosts file?

I am using Ubuntu 20.04 after a fresh format.

I'm also following that topic's document and blog.

I used the command ngc registry resource download-version "nvidia/tao/tao-getting-started:4.0.0"

   Transfer id: tao-getting-started_v4.0.0
   Download status: Completed
   Downloaded local path: /home/ubuntu/tao-getting-started_v4.0.0
   Total files downloaded: 375
   Total downloaded size: 2.43 MB
   Started at: 2022-12-26 15:37:06.390305
   Completed at: 2022-12-26 15:37:21.413422
   Duration taken: 15s
-----------------------------------------

In accordance with the guidelines, I entered the cd tao-getting-started_v4.0.0/cv/resource/setup/quickstart_api_bare_metal path,

but that path is different on my machine:
cd tao-getting-started_v4.0.0/setup/quickstart_api_bare_metal
Part of it matches, but I think the version is a little different.

Also, to answer your question, I tried:

[master]
127.0.0.2 ansible_ssh_user='ubuntu' ansible_ssh_pass='password' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'

[master]
[192.168.1.XX<IP Address>] ansible_ssh_user='ubuntu' ansible_ssh_pass='password'

kubectl get pods -n nvidia-gpu-operator


NAME                                                              READY   STATUS                  RESTARTS       AGE
gpu-feature-discovery-85dvf                                       0/1     Init:0/1                0              15m
gpu-operator-1672039233-node-feature-discovery-master-5479ktvt8   1/1     Running                 0              15m
gpu-operator-1672039233-node-feature-discovery-worker-2txp6       1/1     Running                 0              14m
gpu-operator-7bfc5f55-tn7r5                                       1/1     Running                 0              15m
nvidia-container-toolkit-daemonset-8jf4h                          0/1     Init:0/1                0              15m
nvidia-dcgm-exporter-jd8cz                                        0/1     Init:0/1                0              15m
nvidia-device-plugin-daemonset-wr8mr                              0/1     Init:0/1                0              15m
nvidia-driver-daemonset-zt9cg                                     0/1     Init:CrashLoopBackOff   7 (4m3s ago)   15m
nvidia-operator-validator-6x4d9                                   0/1     Init:0/4                0              15m

This is my second attempt, with the same result; it's still in this state.

In the blog, there is a small mismatch in the path.
Your path is correct.

Can you open a new terminal to run
$ kubectl get pods

Originally, how did you set the hosts file? Can you share the content?

Can you set something similar to below and retry?

[master]
master_ip ansible_ssh_user='master_name' ansible_ssh_pass='master_passwd'

[nodes]
user_ip ansible_ssh_user='node_name' ansible_ssh_pass='node_passwd'
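
To sanity-check the inventory before rerunning setup, an ad-hoc Ansible ping against it should succeed on every host (a quick sketch — it assumes the inventory file is named hosts, as in the quickstart, and that sshpass is installed for password-based SSH):

$ ansible all -i hosts -m ping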

$ kubectl get pods
No resources found in default namespace.

$ kubectl get pods -n nvidia-gpu-operator

NAME                                                              READY   STATUS                  RESTARTS          AGE
gpu-feature-discovery-85dvf                                       0/1     Init:0/1                0                 18h
gpu-operator-1672039233-node-feature-discovery-master-5479ktvt8   1/1     Running                 0                 18h
gpu-operator-1672039233-node-feature-discovery-worker-2txp6       1/1     Running                 0                 18h
gpu-operator-7bfc5f55-tn7r5                                       1/1     Running                 0                 18h
nvidia-container-toolkit-daemonset-8jf4h                          0/1     Init:0/1                0                 18h
nvidia-dcgm-exporter-jd8cz                                        0/1     Init:0/1                0                 18h
nvidia-device-plugin-daemonset-wr8mr                              0/1     Init:0/1                0                 18h
nvidia-driver-daemonset-zt9cg                                     0/1     Init:CrashLoopBackOff   221 (3m37s ago)   18h
nvidia-operator-validator-6x4d9                                   0/1     Init:0/4                0                 18h

$ kubectl get pods -n kube-system

calico-kube-controllers-7f76d48f74-nsph5   1/1     Running   0          18h
calico-node-xbsc7                          1/1     Running   0          18h
coredns-64897985d-dlknr                    1/1     Running   0          18h
coredns-64897985d-wt2jq                    1/1     Running   0          18h
etcd-mykim118                              1/1     Running   0          18h
kube-apiserver-mykim118                    1/1     Running   0          18h
kube-controller-manager-mykim118           1/1     Running   0          18h
kube-proxy-q8zkw                           1/1     Running   0          18h
kube-scheduler-mykim118                    1/1     Running   0          18h

I think the web editor mangled what I typed; each of those entries was just a placeholder.

This is the last one:

[master]
127.0.0.2 ansible_ssh_user='ubuntu' ansible_ssh_pass='user1'

For your information, the ubuntu account has sudo privileges:
echo "ubuntu ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers
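
(A quick way to confirm that passwordless sudo actually works for the Ansible user — sudo -n fails instead of prompting if a password would be required:)

$ sudo -n true && echo "passwordless sudo OK"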

I also tried ansible_ssh_user='root', but got the same result.

Can you check the logs for the failed pod?
For example,
$ kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-zt9cg
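
If the main container is still in PodInitializing, the logs of interest are those of the init container; the -c and --previous flags (standard kubectl options) select a specific container and its previous crashed run:

$ kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-zt9cg -c k8s-driver-manager
$ kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-zt9cg -c k8s-driver-manager --previous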

The result is:

stream logs failed container "nvidia-driver-ctr" in pod "nvidia-driver-daemonset-xnmnz" is waiting to start: PodInitializing for nvidia-gpu-operator/nvidia-driver-daemonset-xnmnz (nvidia-driver-ctr)

To add more, this is the output of the describe command:

Name:                 nvidia-driver-daemonset-xnmnz
Namespace:            nvidia-gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 mykim/192.168.2.118
Start Time:           Tue, 03 Jan 2023 11:21:19 +0900
Labels:               app=nvidia-driver-daemonset
                      controller-revision-hash=589ff6c946
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: 119b07b7509f1ddf4335f907512a69f645e328c16cd88f2dd4f1ac4b401279c2
                      cni.projectcalico.org/podIP: 192.168.34.132/32
                      cni.projectcalico.org/podIPs: 192.168.34.132/32
Status:               Pending
IP:                   192.168.34.132
IPs:
  IP:           192.168.34.132
Controlled By:  DaemonSet/nvidia-driver-daemonset
Init Containers:
  k8s-driver-manager:
    Container ID:  containerd://21a3bec1df5b73394219bbb699cf9323d01367963fe43957ef56c88329b8afda
    Image:         nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.3.0
    Image ID:      nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5b16056257acc51b517d9cdb1da3218693cefc214af93789e6e214fd2b4cacf1
    Port:          <none>
    Host Port:     <none>
    Command:
      driver-manager
    Args:
      uninstall_driver
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 03 Jan 2023 11:32:35 +0900
      Finished:     Tue, 03 Jan 2023 11:32:36 +0900
    Ready:          False
    Restart Count:  7
    Environment:
      NODE_NAME:                   (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:      void
      ENABLE_AUTO_DRAIN:           true
      DRAIN_USE_FORCE:             false
      DRAIN_POD_SELECTOR_LABEL:
      DRAIN_TIMEOUT_SECONDS:       0s
      DRAIN_DELETE_EMPTYDIR_DATA:  false
      OPERATOR_NAMESPACE:          nvidia-gpu-operator (v1:metadata.namespace)
    Mounts:
      /run/nvidia from run-nvidia (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l2mph (ro)
Containers:
  nvidia-driver-ctr:
    Container ID:
    Image:         nvcr.io/nvidia/driver:510.47.03-ubuntu20.04
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-driver
    Args:
      init
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /dev/log from dev-log (rw)
      /host-etc/os-release from host-os-release (ro)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /run/nvidia from run-nvidia (rw)
      /run/nvidia-topologyd from run-nvidia-topologyd (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l2mph (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  var-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:
  dev-log:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:
  mlnx-ofed-usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers/usr/src
    HostPathType:  DirectoryOrCreate
  run-mellanox-drivers:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  kube-api-access-l2mph:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.driver=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  14m                  default-scheduler  Successfully assigned nvidia-gpu-operator/nvidia-driver-daemonset-xnmnz to mykim
  Normal   Pulling    14m                  kubelet            Pulling image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.3.0"
  Normal   Pulled     14m                  kubelet            Successfully pulled image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.3.0" in 8.932693332s
  Normal   Created    12m (x5 over 14m)    kubelet            Created container k8s-driver-manager
  Normal   Started    12m (x5 over 14m)    kubelet            Started container k8s-driver-manager
  Normal   Pulled     12m (x4 over 13m)    kubelet            Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.3.0" already present on machine
  Warning  BackOff    4m7s (x46 over 13m)  kubelet            Back-off restarting failed container

I restarted the pod and collected the logs immediately:

stream logs failed container "nvidia-driver-ctr" in pod "nvidia-driver-daemonset-thjj8" is waiting to start: PodInitializing for nvidia-gpu-operator/nvidia-driver-daemonset-thjj8 (nvidia-driver-ctr) 

k8s-driver-manager  Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
k8s-driver-manager  Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
k8s-driver-manager  Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
k8s-driver-manager  Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
k8s-driver-manager  Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
k8s-driver-manager  Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
k8s-driver-manager  Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
k8s-driver-manager  Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
k8s-driver-manager  Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
k8s-driver-manager  Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
k8s-driver-manager  Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
k8s-driver-manager  Current value of 'nvidia.com/gpu.deploy.dcgm=true'
k8s-driver-manager  Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
k8s-driver-manager  Current value of 'nvidia.com/gpu.deploy.mig-manager='
k8s-driver-manager  Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
k8s-driver-manager  Current value of 'nvidia.com/gpu.deploy.nvsm='
k8s-driver-manager  Uncordoning node mykim...
k8s-driver-manager  node/mykim already uncordoned
k8s-driver-manager  Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
k8s-driver-manager  node/mykim not labeled
k8s-driver-manager  Unloading nouveau driver...
k8s-driver-manager  rmmod: ERROR: Module nouveau is in use
k8s-driver-manager  Failed to unload nouveau driver
Stream closed EOF for nvidia-gpu-operator/nvidia-driver-daemonset-thjj8 (k8s-driver-manager)
stream logs failed container "nvidia-driver-ctr" in pod "nvidia-driver-daemonset-thjj8" is waitin

Could you try the following?

Blacklist the nouveau driver.
echo -e "\nblacklist nouveau\noptions nouveau modeset=0\n" | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf

Regenerate the kernel initramfs.
sudo update-initramfs -u

Reboot.
sudo reboot
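
After the reboot, a quick check that the blacklist took effect is to verify nouveau is no longer loaded (an empty result is what you want):

$ lsmod | grep nouveau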

Refer to Setup — TAO Toolkit 3.22.05 documentation

Please run the workaround below when the issue happens.
$ kubectl delete crd clusterpolicies.nvidia.com
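
(Before deleting, you can confirm the stale CRD is actually present — this is plain kubectl, nothing TAO-specific:)

$ kubectl get crd clusterpolicies.nvidia.com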

I followed the procedure for the "k8s-driver-manager Failed to unload nouveau driver" issue and ran the bash setup.sh install process again.

When rebooting the OS, Ubuntu 20.04 got stuck at the loading screen; I used grub rescue to fix the Linux boot failure.

But the pod gpu-operator-7bfc5f55-8jx8r keeps restarting.

$ kubectl get pod -n nvidia-gpu-operator


NAME                                                              READY   STATUS             RESTARTS      AGE
gpu-operator-1672814931-node-feature-discovery-master-b6f69h5fd   1/1     Running            8 (21m ago)   18h
gpu-operator-1672814931-node-feature-discovery-worker-fqx67       1/1     Running            8 (21m ago)   18h
gpu-operator-7bfc5f55-8jx8r                                       0/1     CrashLoopBackOff   5 (85s ago)   18m

I have attached the pod log:

1.672879965216436e+09    INFO    controller-runtime.metrics    Metrics server is starting to listen
1.672879965216878e+09    INFO    setup    starting manager
1.6728799652173266e+09    INFO    Starting server    {"kind": "health probe", "addr": ":8081"}
1.6728799652173545e+09    INFO    Starting server    {"path": "/metrics", "kind": "metrics", "addr":
I0105 00:52:45.217405       1 leaderelection.go:248] attempting to acquire leader lease nvidia-gpu-o
I0105 00:53:03.509532       1 leaderelection.go:258] successfully acquired lease nvidia-gpu-operator
1.672879983509869e+09    INFO    controller.clusterpolicy-controller    Starting EventSource    {"so
1.6728799835099726e+09    INFO    controller.clusterpolicy-controller    Starting EventSource    {"s
1.6728799835099897e+09    INFO    controller.clusterpolicy-controller    Starting EventSource    {"s
1.6728799835099986e+09    INFO    controller.clusterpolicy-controller    Starting Controller
1.6728799835095918e+09    DEBUG    events    Normal    {"object": {"kind":"ConfigMap","namespace":"n
1.6728799835100212e+09    DEBUG    events    Normal    {"object": {"kind":"Lease","namespace":"nvidi
1.6728799837144682e+09    ERROR    controller-runtime.source    if kind is a CRD, it should be insta
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1.1
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:137
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:233
k8s.io/apimachinery/pkg/util/wait.poll
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:580
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext
    /workspace/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:545
sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start.func1
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:131
I0105 00:53:04.861065       1 request.go:665] Waited for 1.040331054s due to client-side throttling,
1.6728799854137976e+09    ERROR    controllers.ClusterPolicy    Unable to list ClusterPolicies    {"
sigs.k8s.io/controller-runtime/pkg/handler.(*enqueueRequestsFromMapFunc).mapAndEnqueue

Can you run
$ bash setup.sh uninstall
$ kubectl delete crd clusterpolicies.nvidia.com
$ bash setup.sh install
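
After the reinstall, it can help to watch the GPU Operator pods until everything reports Running (watch is a standard Linux utility; the namespace is the one from your earlier output):

$ watch kubectl get pods -n nvidia-gpu-operator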

$ kubectl delete crd clusterpolicies.nvidia.com works!
Thank you very much!
