Please provide the following information when requesting support.
I am using a 3090 GPU.
I want to use TAO Toolkit 4.0 in the API bare-metal environment.
After running bash setup.sh install, the installation hangs at
TASK [Waiting for the Cluster to become available]
and waits endlessly.
The gpu-operator pod in the nvidia-gpu-operator namespace is still in Init.
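While the installer waits, the cluster state can be inspected from the master node with standard kubectl (the pod name below is a placeholder; use the name reported by the first command):

```bash
# List every pod and its phase; stuck pods show Init/Pending here.
kubectl get pods --all-namespaces -o wide

# Show detailed events for the stuck pod (replace the placeholder name).
kubectl describe pod <gpu-operator-pod-name> -n nvidia-gpu-operator
```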
This is the gpu-operator pod event log:
Warning FailedCreatePodSandBox 2m3s (x141 over 32m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
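The "no runtime for nvidia is configured" message means the node's container runtime has no runtime handler named nvidia yet; that handler is normally registered by the GPU Operator's container-toolkit component. Assuming containerd is the CRI, as in a typical kubeadm-based bare-metal install (adjust if yours differs), this can be checked with:

```bash
# Confirm which container runtime the node actually reports
# (see the CONTAINER-RUNTIME column).
kubectl get nodes -o wide

# Check whether an "nvidia" runtime entry exists in the containerd
# config; it is normally written by the nvidia-container-toolkit pod.
sudo grep -n 'runtimes.nvidia' /etc/containerd/config.toml
```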
The calico-node pod has the same problem.
This is the calico-node event log:
Events:
Type     Reason     Age                    From               Message
----     ------     ----                   ----               -------
Normal   Scheduled  7m35s                  default-scheduler  Successfully assigned kube-system/calico-node-759mk to mykim
Normal   Started    7m32s                  kubelet            Started container install-cni
Normal   Pulled     7m32s                  kubelet            Container image "docker.io/calico/cni:v3.21.6" already present on machine
Normal   Created    7m32s                  kubelet            Created container upgrade-ipam
Normal   Started    7m32s                  kubelet            Started container upgrade-ipam
Normal   Pulled     7m32s                  kubelet            Container image "docker.io/calico/cni:v3.21.6" already present on machine
Normal   Created    7m32s                  kubelet            Created container install-cni
Normal   Pulled     7m31s                  kubelet            Container image "docker.io/calico/pod2daemon-flexvol:v3.21.6" already present on machine
Normal   Created    7m31s                  kubelet            Created container flexvol-driver
Normal   Started    7m31s                  kubelet            Started container flexvol-driver
Normal   Started    7m30s                  kubelet            Started container calico-node
Normal   Pulled     7m30s                  kubelet            Container image "docker.io/calico/node:v3.21.6" already present on machine
Normal   Created    7m30s                  kubelet            Created container calico-node
Warning  Unhealthy  7m28s (x2 over 7m29s)  kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
Warning  Unhealthy  7m20s                  kubelet            Readiness probe failed: 2022-12-23 02:37:56.054 [INFO][794] confd/health.go 180: Number of node(s) with BGP peering established = 0
                                                              calico/node is not ready: felix is not ready: readiness probe reporting 503
Warning  Unhealthy  7m10s                  kubelet            Readiness probe failed: 2022-12-23 02:38:06.049 [INFO][1519] confd/health.go 180: Number of node(s) with BGP peering established = 0
                                                              calico/node is not ready: BIRD is not ready: BGP not established with 192.168.2.118
Warning  Unhealthy  7m                     kubelet            Readiness probe failed: 2022-12-23 02:38:16.050 [INFO][2208] confd/health.go 180: Number of node(s) with BGP peering established = 1
                                                              calico/node is not ready: felix is not ready: readiness probe reporting 503
Warning  Unhealthy  6m50s                  kubelet            Readiness probe failed: 2022-12-23 02:38:26.045 [INFO][2880] confd/health.go 180: Number of node(s) with BGP peering established = 1
                                                              calico/node is not ready: felix is not ready: readiness probe reporting 503
Warning  Unhealthy  6m30s                  kubelet            Readiness probe failed: 2022-12-23 02:38:46.036 [INFO][4290] confd/health.go 180: Number of node(s) with BGP peering established = 1
                                                              calico/node is not ready: felix is not ready: readiness probe reporting 503
Warning  Unhealthy  6m20s                  kubelet            Readiness probe failed: 2022-12-23 02:38:56.057 [INFO][4970] confd/health.go 180: Number of node(s) with BGP peering established = 1
                                                              calico/node is not ready: felix is not ready: readiness probe reporting 503
Warning  Unhealthy  6m10s                  kubelet            Readiness probe failed: 2022-12-23 02:39:06.058 [INFO][5659] confd/health.go 180: Number of node(s) with BGP peering established = 1
                                                              calico/node is not ready: felix is not ready: readiness probe reporting 503
Warning  Unhealthy  6m10s                  kubelet            Readiness probe failed: 2022-12-23 02:39:06.144 [INFO][5684] confd/health.go 180: Number of node(s) with BGP peering established = 1
                                                              calico/node is not ready: felix is not ready: readiness probe reporting 503
Warning  Unhealthy  2m31s (x26 over 6m)    kubelet            (combined from similar events): Readiness probe failed: 2022-12-23 02:42:45.505 [INFO][21279] confd/health.go 180: Number of node(s) with BGP peering established = 1
                                                              calico/node is not ready: felix is not ready: readiness probe reporting 503
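Since the failures revolve around BGP peering ("BGP not established with 192.168.2.118"), a quick connectivity check against that peer can help rule out a networking problem. This is a generic Calico check, not TAO-specific; calicoctl is assumed to be installed on the node, and the IP is taken from the log above:

```bash
# Show Calico's view of BGP peering on this node (requires calicoctl).
sudo calicoctl node status

# Verify the peer from the log is reachable on TCP 179 (the BGP port).
nc -zv 192.168.2.118 179
```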
I strongly believe there should be a standalone Dockerfile/Docker image deployment for the whole set of TAO Toolkit API services. Having both Ansible and Kubernetes causes so much pain while troubleshooting this unnecessarily complex deployment process.
For "Waiting for the Cluster to become available", could you try a single-node deployment to narrow this down?
For a single-node deployment, listing only the master is enough. See the "hosts" file for details.
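A minimal sketch of such a hosts file, assuming the standard Ansible inventory layout shipped with the quickstart (the IP, user, and password are placeholders; variable names may differ slightly in your copy):

```
[master]
192.0.2.10 ansible_ssh_user='ubuntu' ansible_ssh_pass='<password>' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'

[nodes]
# left empty for a single-node deployment
```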
I used the command ngc registry resource download-version "nvidia/tao/tao-getting-started:4.0.0"
Transfer id: tao-getting-started_v4.0.0
Download status: Completed
Downloaded local path: /home/ubuntu/tao-getting-started_v4.0.0
Total files downloaded: 375
Total downloaded size: 2.43 MB
Started at: 2022-12-26 15:37:06.390305
Completed at: 2022-12-26 15:37:21.413422
Duration taken: 15s
-----------------------------------------
In accordance with the guidelines, I tried to cd into tao-getting-started_v4.0.0/cv/resource/setup/quickstart_api_bare_metal,
but the path in my download is different:
cd tao-getting-started_v4.0.0/setup/quickstart_api_bare_metal
It contains part of the expected content, but I think the version is a little different.
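When the layout differs between releases, the quickstart directory can be located directly inside the download (plain find, using the download path reported by the NGC output above):

```bash
# Find the bare-metal quickstart folder and its setup script, whatever
# the intermediate directory structure looks like in this release.
find ~/tao-getting-started_v4.0.0 -type d -name quickstart_api_bare_metal
find ~/tao-getting-started_v4.0.0 -name setup.sh
```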
stream logs failed container "nvidia-driver-ctr" in pod "nvidia-driver-daemonset-xnmnz" is waiting to start: PodInitializing for nvidia-gpu-operator/nvidia-driver-daemonset-xnmnz (nvidia-driver-ctr)
I restarted the pod and collected the logs immediately:
stream logs failed container "nvidia-driver-ctr" in pod "nvidia-driver-daemonset-thjj8" is waiting to start: PodInitializing for nvidia-gpu-operator/nvidia-driver-daemonset-thjj8 (nvidia-driver-ctr)
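The nvidia-driver-ctr container cannot stream logs while the pod is still in PodInitializing, so the relevant output is the k8s-driver-manager init container's (the pod name is taken from the message above):

```bash
# Read the init container's log; nvidia-driver-ctr stays in
# PodInitializing until k8s-driver-manager completes successfully.
kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-thjj8 -c k8s-driver-manager
```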
k8s-driver-manager  Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
k8s-driver-manager  Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
k8s-driver-manager  Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
k8s-driver-manager  Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
k8s-driver-manager  Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
k8s-driver-manager  Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
k8s-driver-manager  Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
k8s-driver-manager  Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
k8s-driver-manager  Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
k8s-driver-manager  Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
k8s-driver-manager  Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
k8s-driver-manager  Current value of 'nvidia.com/gpu.deploy.dcgm=true'
k8s-driver-manager  Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
k8s-driver-manager  Current value of 'nvidia.com/gpu.deploy.mig-manager='
k8s-driver-manager  Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
k8s-driver-manager  Current value of 'nvidia.com/gpu.deploy.nvsm='
k8s-driver-manager  Uncordoning node mykim...
k8s-driver-manager  node/mykim already uncordoned
k8s-driver-manager  Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
k8s-driver-manager  node/mykim not labeled
k8s-driver-manager  Unloading nouveau driver...
k8s-driver-manager  rmmod: ERROR: Module nouveau is in use
k8s-driver-manager  Failed to unload nouveau driver
Stream closed EOF for nvidia-gpu-operator/nvidia-driver-daemonset-thjj8 (k8s-driver-manager)
stream logs failed container "nvidia-driver-ctr" in pod "nvidia-driver-daemonset-thjj8" is waitin
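"rmmod: ERROR: Module nouveau is in use" looks like the root cause: the driver daemonset cannot load the NVIDIA kernel modules while the in-use nouveau driver is loaded. A common workaround (a general host-side fix, not taken from the TAO documentation) is to blacklist nouveau on the host and reboot the node, for example:

```bash
# Blacklist the open-source nouveau driver so the GPU Operator's driver
# container can load the NVIDIA kernel modules instead.
cat <<'EOF' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF

# Rebuild the initramfs and reboot so nouveau is not loaded at boot.
sudo update-initramfs -u
sudo reboot
```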