Release tao-toolkit-api failed, and has been uninstalled due to atomic being set: timed out waiting for the condition

nasserha · February 13, 2024, 10:56am

I am trying to install the TAO autoML kubernetes cluster on Azure.

After running bash setup.sh install

I am getting this error:

│ Error: release tao-toolkit-api failed, and has been uninstalled due to atomic being set: timed out waiting for the condition
│ 
│   with helm_release.tao_toolkit_api,
│   on api-config.tf line 145, in resource "helm_release" "tao_toolkit_api":
│  145: resource "helm_release" "tao_toolkit_api" {
│ 
╵
╷
│ Error: StatefulSet default/nvidia-smi is not finished rolling out
│ 
│   with kubernetes_stateful_set_v1.nvidia_smi,
│   on gpu-operator.tf line 147, in resource "kubernetes_stateful_set_v1" "nvidia_smi":
│  147: resource "kubernetes_stateful_set_v1" "nvidia_smi" {

• Hardware (T4c)
• TLT Version: toolkit_version: 5.0.0

I have a long log from the terminal but I do not know what is useful to paste here for help. But let me know what can help and I will paste it.

Thanks

Morganh · February 14, 2024, 4:18pm

Could you copy the full log and upload it as a .txt file? Thanks.
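
One way to capture the full terminal output for upload is to pipe it through `tee`. This is a minimal sketch; the helper name `run_logged` and the log file name are illustrative and not part of setup.sh.

```shell
#!/bin/sh
# run_logged LOGFILE COMMAND [ARGS...]
# Runs COMMAND, showing stdout and stderr on the terminal while also
# saving both to LOGFILE. (Hypothetical helper, not part of setup.sh.)
run_logged() {
    log_file=$1
    shift
    # 2>&1 merges stderr into stdout so Terraform errors land in the log too.
    # Note: the pipeline's exit status is that of tee, not of COMMAND.
    "$@" 2>&1 | tee "$log_file"
}

# Example (not executed here):
#   run_logged install_log.txt bash setup.sh install
```

The resulting install_log.txt can then be attached to the thread as is.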

nasserha · February 15, 2024, 1:21pm

I am now not even able to get anything running. I am not sure why, since nothing has changed since I last tried to install it.

│ Error: Cycle: kubernetes_stateful_set_v1.nvidia_smi (destroy), module.aks.azurerm_kubernetes_cluster.this, module.aks.output.kube_config (expand), provider["registry.terraform.io/hashicorp/kubernetes"]

Morganh · February 15, 2024, 4:47pm

Please retry on Azure or try a new machine.

nasserha · February 16, 2024, 10:02am

I created a completely new machine, installed TAO, and did nothing else except set up Kubernetes. I get the error below.

Morganh · February 17, 2024, 3:00pm

Could you search for the terraform binary and check where it is installed?

nasserha · February 18, 2024, 8:16am

user@machine:~$ which terraform
user@machine:~$ sudo su - root
user@machine:~# which terraform
user@machine:~# terraform --version
Terraform v1.2.4
on linux_amd64

Your version of Terraform is out of date! The latest version
is 1.7.3. You can update by downloading from https://www.terraform.io/downloads.html

So there are two issues with the setup.sh script:

It’s installing an old version of Terraform. I am not sure whether this is intentional.
${HOME}/bin/terraform is not reachable through $PATH.

user@machine:~# echo $PATH
/root/.virtualenvs/launcher/bin:/opt/miniconda/condabin:/opt/ngc-cli:/opt/miniconda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

Given this configuration, which is better?

Adding /root/bin to the PATH,
or changing the setup.sh script to call plain terraform instead of ${HOME}/bin/terraform?

Thanks

Morganh · February 19, 2024, 3:53am

Could you check whether /root/bin/terraform exists?
If not, you can use find to locate terraform.

Yes, you can do it that way. Also, if you can find the exact path of terraform, you can use that path as well.
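
The lookup described above can be sketched as a small shell function. The name `locate_tool` and the extra directories passed to it are assumptions for illustration; setup.sh does not provide this helper.

```shell
#!/bin/sh
# locate_tool NAME [DIR...]
# Prints the full path of the first executable called NAME, checking PATH
# first and then each extra DIR. Returns 1 if nothing is found.
# (Hypothetical helper, not part of setup.sh.)
locate_tool() {
    name=$1
    shift
    if found=$(command -v "$name" 2>/dev/null); then
        printf '%s\n' "$found"
        return 0
    fi
    for dir in "$@"; do
        if [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}

# Example (not executed here): check the location setup.sh expects, then
# expose that directory on PATH for the current shell.
#   tf=$(locate_tool terraform "$HOME/bin" /usr/local/bin) &&
#       PATH="$(dirname "$tf"):$PATH"
```

If neither PATH nor the listed directories contain the binary, a broader search such as `find / -name terraform -type f 2>/dev/null` is the fallback.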

nasserha · February 19, 2024, 9:21am

After changing the path of terraform, I am getting the error that I first mentioned when I opened this post.

I tried to reinstall (uninstall, then install), but the uninstallation gets stuck.

Plan: 0 to add, 0 to change, 27 to destroy.
Do you want to proceed to uninstall [y/n] ?y
╷
│ Warning: "use_microsoft_graph": [DEPRECATED] This field now defaults to `true` and will be removed in v1.3 of Terraform Core due to the deprecation of ADAL by Microsoft.
│ 
**Some more logs ...**
╵
kubernetes_namespace_v1.gpu_operator: Destroying... [id=nvidia-gpu-operator]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 10s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 20s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 30s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 40s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 50s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 4m20s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 4m30s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 4m40s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 4m50s elapsed]
╷
│ Error: context deadline exceeded

The operation stays in "Still destroying" for about 5 minutes and then gives up.

nasserha · February 19, 2024, 9:36am

Now, when I run the installation again, it fails at:

module.aks.azurerm_kubernetes_cluster.this: Still modifying... [id=/subscriptions/78b4d5f1-fca5-4af5-b686-...inerService/managedClusters/tao-automl, 10m50s elapsed]
module.aks.azurerm_kubernetes_cluster.this: Still modifying... [id=/subscriptions/78b4d5f1-fca5-4af5-b686-...inerService/managedClusters/tao-automl, 11m0s elapsed]
module.aks.azurerm_kubernetes_cluster.this: Still modifying... [id=/subscriptions/78b4d5f1-fca5-4af5-b686-...inerService/managedClusters/tao-automl, 11m10s elapsed]
module.aks.azurerm_kubernetes_cluster.this: Modifications complete after 11m16s [id=/subscriptions/78b4d5f1-fca5-4af5-b686-34747c61c20f/resourceGroups/tao-automl/providers/Microsoft.ContainerService/managedClusters/tao-automl]
kubernetes_config_map_v1.install_nfs_common: Creating...
kubernetes_secret_v1.imagepullsecret: Creating...
kubernetes_config_map_v1.upgrade_gpu_driver: Creating...
kubernetes_config_map_v1.install_nfs_common: Creation complete after 0s [id=default/install-nfs-common]
kubernetes_secret_v1.imagepullsecret: Creation complete after 0s [id=default/imagepullsecret]
kubernetes_daemon_set_v1.install_nfs_common: Creating...
kubernetes_daemon_set_v1.install_nfs_common: Creation complete after 0s [id=default/install-nfs-common]
helm_release.ingress_nginx: Creating...
helm_release.nfs_subdir_external_provisioner: Creating...
helm_release.ingress_nginx: Still creating... [10s elapsed]
helm_release.nfs_subdir_external_provisioner: Still creating... [10s elapsed]
helm_release.nfs_subdir_external_provisioner: Creation complete after 12s [id=nfs-subdir-external-provisioner]
helm_release.ingress_nginx: Still creating... [20s elapsed]
helm_release.ingress_nginx: Creation complete after 24s [id=ingress-nginx]
╷
│ Error: configmaps "upgrade-gpu-driver" is forbidden: unable to create new content in namespace nvidia-gpu-operator because it is being terminated
│ 
│   with kubernetes_config_map_v1.upgrade_gpu_driver,
│   on gpu-operator.tf line 11, in resource "kubernetes_config_map_v1" "upgrade_gpu_driver":
│   11: resource "kubernetes_config_map_v1" "upgrade_gpu_driver" {
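
The error above says the nvidia-gpu-operator namespace is still terminating, so nothing new can be created in it until the deletion finishes. A minimal sketch of waiting for that before reinstalling follows; the function name, timeout, and poll interval are assumptions, not part of setup.sh.

```shell
#!/bin/sh
# wait_for_namespace_gone NAMESPACE [TIMEOUT_SECONDS] [POLL_SECONDS]
# Polls `kubectl get namespace` until the namespace no longer exists.
# Returns 0 once it is gone, 1 if the timeout expires first.
# Caveat: any kubectl failure (including no connection to the cluster) ends
# the loop, so confirm that kubectl can reach the cluster first.
wait_for_namespace_gone() {
    ns=$1
    timeout=${2:-600}
    poll=${3:-10}
    waited=0
    while kubectl get namespace "$ns" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "namespace $ns still exists after ${waited}s" >&2
            return 1
        fi
        sleep "$poll"
        waited=$((waited + poll))
    done
    return 0
}

# Example (not executed here):
#   wait_for_namespace_gone nvidia-gpu-operator && bash setup.sh install
```

If the wait times out, `kubectl get namespace nvidia-gpu-operator -o yaml` shows the status conditions that explain what is still blocking the deletion.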

Morganh · February 23, 2024, 3:20am

Sorry for the late reply. Could you uninstall the driver as below?
$ sudo apt purge nvidia-driver-xxx
$ sudo apt autoremove
$ sudo apt autoclean

After rebooting, please do not install the nvidia-driver again.
Then run the command below to check whether the uninstallation works.
$ bash setup.sh uninstall

If it works, please run the installation.
$ bash setup.sh install

nasserha · February 23, 2024, 8:58am

dpkg -l | grep nvidia
ii  libnvidia-container-tools                1.10.0-1                          amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64               1.10.0-1                          amd64        NVIDIA container runtime library
ii  nvidia-container-runtime                 3.10.0-1                          all          NVIDIA container runtime
ii  nvidia-container-toolkit                 1.10.0-1                          amd64        NVIDIA container runtime hook
hi  nvidia-fabricmanager-515                 515.48.07-1                       amd64        Fabric Manager for NVSwitch based systems.

Which one should I uninstall?

Morganh · February 23, 2024, 8:59am

Please replace xxx with the driver version shown by $ nvidia-smi.
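
As a sketch of how xxx can be derived, the helper below extracts the major component of a driver version string. The name `driver_major` is hypothetical. The example assumes the apt package follows the nvidia-driver-<major> naming used in the purge command above.

```shell
#!/bin/sh
# driver_major VERSION
# Prints the part of a driver version string before the first dot.
# (Hypothetical helper for illustration.)
driver_major() {
    printf '%s\n' "${1%%.*}"
}

# Example (not executed here): query the running driver and build the
# package name. nvidia-smi prints one line per GPU, so keep only the first.
#   version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   echo "nvidia-driver-$(driver_major "$version")"
```

Running `dpkg -l | grep nvidia-driver` afterwards confirms whether a package with that name is actually installed.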

nasserha · February 23, 2024, 9:01am

$ nvidia-smi
Fri Feb 23 09:01:21 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000001:00:00.0 Off |                    0 |
| N/A   32C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000002:00:00.0 Off |                    0 |
| N/A   34C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000003:00:00.0 Off |                    0 |
| N/A   33C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000004:00:00.0 Off |                    0 |
| N/A   35C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+



$ sudo apt purge nvidia-driver-515
Reading package lists... Done
Building dependency tree       
Reading state information... Done
**Package 'nvidia-driver-515' is not installed, so not removed**
0 upgraded, 0 newly installed, 0 to remove and 232 not upgraded.

Morganh · February 23, 2024, 9:27am

What is the current output of $ kubectl get pods?

nasserha · February 23, 2024, 9:43am

The connection to the server localhost:8080 was refused - did you specify the right host or port?
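
The localhost:8080 message usually means kubectl found no kubeconfig and fell back to its default address. A minimal sketch for checking that follows. The function name `first_kubeconfig` is hypothetical, and where setup.sh writes its kubeconfig is not confirmed in this thread.

```shell
#!/bin/sh
# first_kubeconfig
# Prints the first existing kubeconfig file that kubectl would consider:
# each colon-separated entry of $KUBECONFIG if set, else ~/.kube/config.
# Returns 1 if none of them exists. (Hypothetical helper.)
first_kubeconfig() {
    candidates=${KUBECONFIG:-$HOME/.kube/config}
    old_ifs=$IFS
    IFS=:
    # Split the candidate list on ':' into the positional parameters.
    set -- $candidates
    IFS=$old_ifs
    for cfg in "$@"; do
        if [ -f "$cfg" ]; then
            printf '%s\n' "$cfg"
            return 0
        fi
    done
    return 1
}

# Example (not executed here):
#   first_kubeconfig || echo "no kubeconfig found for user $(id -un)"
```

Since $HOME differs between the regular user and root, the check should run as the same user that runs kubectl. For an AKS cluster, `az aks get-credentials --resource-group tao-automl --name tao-automl` (names taken from the resource ID in the log above) is one way to fetch credentials.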

Morganh · February 23, 2024, 9:45am

Please try the following format for the hosts file. Assuming your local machine has the IP 10.34.4.222 and its password is xxx, set it as:

[master]
10.34.4.222 ansible_ssh_user='ubuntu' ansible_ssh_pass='xxx' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'

Then run bash setup.sh install and share the full log with me. You can upload it as a .txt file. Thanks.

nasserha · February 23, 2024, 1:36pm

I added the host to the hosts file as you suggested and ran the installation again. The output file is attached:
output_file.txt (78.6 KB)

Morganh · February 25, 2024, 4:14pm

nasserha:

output_file.txt (78.6 KB)

time_sleep.wait_for_gpu_operator_up: Still creating… [9m51s elapsed]
time_sleep.wait_for_gpu_operator_up: Creation complete after 10m0s [id=2024-02-23T11:01:31Z]
kubernetes_service_v1.nvidia_smi: Creating…
kubernetes_service_v1.nvidia_smi: Creation complete after 1s [id=default/nvidia-smi]
kubernetes_stateful_set_v1.nvidia_smi: Creating…
helm_release.tao_toolkit_api: Creating…
kubernetes_stateful_set_v1.nvidia_smi: Still creating… [10s elapsed]
kubernetes_stateful_set_v1.nvidia_smi: Still creating… [20s elapsed]
kubernetes_stateful_set_v1.nvidia_smi: Still creating… [30s elapsed]
kubernetes_stateful_set_v1.nvidia_smi: Still creating… [40s elapsed]
kubernetes_stateful_set_v1.nvidia_smi: Still creating… [50s elapsed]
kubernetes_stateful_set_v1.nvidia_smi: Still creating… [1m0s elapsed]
kubernetes_stateful_set_v1.nvidia_smi: Still creating… [1m10s elapsed]
kubernetes_stateful_set_v1.nvidia_smi: Still creating… [1m20s elapsed]

From the log, it seems that kubernetes_stateful_set_v1.nvidia_smi is still being created. Did you get an error in the end?
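
While the StatefulSet is stuck in "Still creating", its state can be inspected directly with kubectl. The sketch below assumes kubectl can reach the cluster; the function name `diagnose_statefulset` is hypothetical.

```shell
#!/bin/sh
# diagnose_statefulset NAME [NAMESPACE]
# Prints rollout status, pod placement, and recent events for a StatefulSet.
# (Hypothetical helper for illustration.)
diagnose_statefulset() {
    name=$1
    ns=${2:-default}
    # Waits up to 60s and reports whether the rollout has finished.
    kubectl rollout status "statefulset/$name" -n "$ns" --timeout=60s
    # Shows each pod's phase and node; a Pending pod may indicate that no
    # node can satisfy its resource requests.
    kubectl get pods -n "$ns" -o wide
    # Lists events in time order, which often name the scheduling or
    # image-pull problem.
    kubectl get events -n "$ns" --sort-by=.lastTimestamp
}

# Example (not executed here):
#   diagnose_statefulset nvidia-smi default
```

`kubectl describe pod <pod-name>` on any pod that is not Running gives the same events scoped to that pod.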

Morganh · February 26, 2024, 4:31pm

BTW, may I know if the environment is on Azure AKS or Azure VMs?
More info is in Azure Containers Services: Pricing and Feature Comparison - CAST AI – Kubernetes Automation Platform.
Thanks.
