AutoML installation problem [Waiting for the Cluster to become available]

I followed this blog "https://developer.nvidia.com/blog/training-like-an-ai-pro-using-tao-automl/" to set up my environment.
I am stuck at [Waiting for the Cluster to become available].

My hosts file looks like this.

My pod logs look like this.

Thank you for your help!

You are going to set up a single-node deployment, right?
In your hosts file, the [nodes] section is not configured.
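For reference, a minimal single-node inventory can look like the sketch below. The IP address and SSH user are placeholders, not values from this thread; the [master]/[nodes] section names follow the CNC hosts file discussed here.

```ini
# Minimal single-node inventory sketch (IP and SSH user are placeholders).
[master]
10.0.0.10 ansible_ssh_user=ubuntu

# For a single-node deployment, [nodes] can stay empty.
[nodes]
```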

Yes, I want to set up a single-node deployment.
The hosts file says "For single node deployment, listing the master is enough", so I didn't configure [nodes].

Looking at the file cnc/cnc-validation.yaml, the "Waiting for the Cluster to become available" step is stuck in the task below.

  - name: Waiting for the Cluster to become available
    shell: |
      state=$(kubectl get pods -n nvidia-gpu-operator | egrep -v 'Running|Completed|NAME' | wc -l)
      while [ $state != 0 ]
      do
        sleep 10
        state=$(kubectl get pods -n nvidia-gpu-operator | egrep -v 'Running|Completed|NAME' | wc -l)
      done
    args:
      executable: /bin/bash
    register: status
    when: "cnc_version > 4.1 and ansible_architecture == 'x86_64'"
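To see what that loop is counting, here is the same egrep/wc pipeline run against simulated `kubectl get pods` output (the pod names and statuses below are made up for illustration). The loop only exits when this count reaches zero, so a single pod stuck outside Running/Completed keeps it waiting forever.

```shell
# Simulated 'kubectl get pods -n nvidia-gpu-operator' output.
# egrep -v drops the header line plus healthy pods; wc -l counts what is left.
printf 'NAME READY STATUS\npod-a 1/1 Running\npod-b 0/1 CrashLoopBackOff\n' \
  | egrep -v 'Running|Completed|NAME' | wc -l
# One pod (pod-b) is unhealthy, so the count is 1 and the loop keeps waiting.
```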

You already checked it via "kubectl get pods -n nvidia-gpu-operator".
Can you open a new terminal and check the logs of the failed pod?
For example, for your failed pod,
$ kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-s8tqq

Pod logs look like this.

Can you check the logs for more pods?
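As a sketch of how to find the pods whose logs to pull, the pipeline below runs awk over simulated `kubectl get pods --no-headers` output to extract the pod names (the second pod name is a placeholder); each name would then be passed to `kubectl logs -n nvidia-gpu-operator <pod>`.

```shell
# Simulated 'kubectl get pods -n nvidia-gpu-operator --no-headers' output;
# awk prints the first column (the pod name) for use with 'kubectl logs'.
printf 'nvidia-driver-daemonset-s8tqq 0/1 Init:CrashLoopBackOff\ngpu-operator-xxxxx 1/1 Running\n' \
  | awk '{print $1}'
```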

Also, can you upload the full logs from running the commands below?
bash setup.sh check-inventory.yml
bash setup.sh install

Please run the workaround below when the issue happens.
$ kubectl delete crd clusterpolicies.nvidia.com


"kubectl delete crd clusterpolicies.nvidia.com" works!
Thank you very much!

This is a new topic. Please create a new one. Thanks.
