AutoML installation problem [Waiting for the Cluster to become available]

I followed this blog "https://developer.nvidia.com/blog/training-like-an-ai-pro-using-tao-automl/" to set up my environment.
I am stuck at [Waiting for the Cluster to become available].

My hosts file looks like this.

My pod logs look like this.

Thank you for your help!

You are going to set up a single-node deployment, right?
In your hosts file, the [nodes] section is not configured.
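For reference, a minimal single-node inventory can look like the sketch below. The IP address and SSH user are placeholders, not values from this thread; the [master]/[nodes] section names follow the CNC hosts file discussed here.

```ini
# Minimal single-node inventory sketch (IP and SSH user are placeholders).
[master]
10.0.0.10 ansible_ssh_user=ubuntu

# For a single-node deployment, [nodes] can stay empty.
[nodes]
```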

Yes, I want to set up a single-node deployment.
The hosts file says "For single node deployment, listing the master is enough", so I didn't configure [nodes].

Looking at the file cnc/cnc-validation.yaml, the "Waiting for the Cluster to become available" step is stuck in the task below.

  - name: Waiting for the Cluster to become available
    shell: |
      state=$(kubectl get pods -n nvidia-gpu-operator | egrep -v 'Running|Completed|NAME' | wc -l)
      while [ $state != 0 ]
      do
        sleep 10
        state=$(kubectl get pods -n nvidia-gpu-operator | egrep -v 'Running|Completed|NAME' | wc -l)
      done
    args:
      executable: /bin/bash
    register: status
    when: "cnc_version > 4.1 and ansible_architecture == 'x86_64'"
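To see what that loop is counting, here is the same egrep/wc pipeline run against simulated `kubectl get pods` output (the pod names and statuses below are made up for illustration). The loop only exits when this count reaches zero, so a single pod stuck outside Running/Completed keeps it waiting forever.

```shell
# Simulated 'kubectl get pods -n nvidia-gpu-operator' output.
# egrep -v drops the header line plus healthy pods; wc -l counts what is left.
printf 'NAME READY STATUS\npod-a 1/1 Running\npod-b 0/1 CrashLoopBackOff\n' \
  | egrep -v 'Running|Completed|NAME' | wc -l
# One pod (pod-b) is unhealthy, so the count is 1 and the loop keeps waiting.
```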

You already checked it via "kubectl get pods -n nvidia-gpu-operator".
Can you open a new terminal and check the logs of the failed pod?
For example, for your failed pod,
$ kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-s8tqq

Pod logs look like this.

Can you check the logs for more pods?
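As a sketch of how to find the pods whose logs to pull, the pipeline below runs awk over simulated `kubectl get pods --no-headers` output to extract the pod names (the second pod name is a placeholder); each name would then be passed to `kubectl logs -n nvidia-gpu-operator <pod>`.

```shell
# Simulated 'kubectl get pods -n nvidia-gpu-operator --no-headers' output;
# awk prints the first column (the pod name) for use with 'kubectl logs'.
printf 'nvidia-driver-daemonset-s8tqq 0/1 Init:CrashLoopBackOff\ngpu-operator-xxxxx 1/1 Running\n' \
  | awk '{print $1}'
```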

Also, can you upload the full logs from running the commands below?
bash setup.sh check-inventory.yml
bash setup.sh install

Please run the workaround below when the issue happens.
$ kubectl delete crd clusterpolicies.nvidia.com


"kubectl delete crd clusterpolicies.nvidia.com" works!
Thank you very much!

This is a new topic. Please create a new one. Thanks.
