Release tao-toolkit-api failed, and has been uninstalled due to atomic being set: timed out waiting for the condition

I am trying to install the TAO AutoML Kubernetes cluster on Azure.

After running bash setup.sh install

I am getting this error:

│ Error: release tao-toolkit-api failed, and has been uninstalled due to atomic being set: timed out waiting for the condition
│ 
│   with helm_release.tao_toolkit_api,
│   on api-config.tf line 145, in resource "helm_release" "tao_toolkit_api":
│  145: resource "helm_release" "tao_toolkit_api" {
│ 
╵
╷
│ Error: StatefulSet default/nvidia-smi is not finished rolling out
│ 
│   with kubernetes_stateful_set_v1.nvidia_smi,
│   on gpu-operator.tf line 147, in resource "kubernetes_stateful_set_v1" "nvidia_smi":
│  147: resource "kubernetes_stateful_set_v1" "nvidia_smi" {
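
For reference, when a helm_release fails with "timed out waiting for the condition", the pods behind the release usually show why. This is a generic Kubernetes check rather than anything from the TAO scripts, and <pod-name>/<namespace> are placeholders:

$ kubectl get pods -A
$ kubectl describe pod <pod-name> -n <namespace>
$ kubectl logs <pod-name> -n <namespace>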

• Hardware (T4c)
• TLT Version: toolkit_version: 5.0.0

I have a long log from the terminal, but I am not sure which part would be useful to paste here. Let me know what would help and I will paste it.

Thanks

Could you copy the full log and upload it as a .txt file? Thanks.

I am now not even able to get anything running. I am not sure why, since nothing has changed since I last tried to install it.

│ Error: Cycle: kubernetes_stateful_set_v1.nvidia_smi (destroy), module.aks.azurerm_kubernetes_cluster.this, module.aks.output.kube_config (expand), provider["registry.terraform.io/hashicorp/kubernetes"]

Please retry on Azure or try a new machine.

I created a completely new machine, installed TAO, and did nothing but set up Kubernetes. I get the error below.

Could you search for the terraform binary and find where it is installed?

user@machine:~$ which terraform
user@machine:~$ sudo su - root
user@machine:~# which terraform
user@machine:~# terraform --version
Terraform v1.2.4
on linux_amd64

Your version of Terraform is out of date! The latest version
is 1.7.3. You can update by downloading from https://www.terraform.io/downloads.html
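
If a newer Terraform is actually wanted (the 1.2.4 pinned by setup.sh may be intentional, so treat this as optional), a manual install of a specific version looks roughly like this; 1.7.3 is just the version reported above:

$ wget https://releases.hashicorp.com/terraform/1.7.3/terraform_1.7.3_linux_amd64.zip
$ unzip terraform_1.7.3_linux_amd64.zip
$ sudo mv terraform /usr/local/bin/terraform
$ terraform --version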

So there are two issues with the setup.sh script:

  • It’s installing an old version. I am not sure whether this is intentional.
  • ${HOME}/bin/terraform is not on $PATH.
user@machine:~# echo $PATH
/root/.virtualenvs/launcher/bin:/opt/miniconda/condabin:/opt/ngc-cli:/opt/miniconda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

Given this configuration, which is the better fix?

  • Adding /root/bin to $PATH?
  • Or changing the setup.sh script to simply call terraform instead of ${HOME}/bin/terraform? (A sketch of both options follows below.)
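
A rough sketch of both options, assuming the binary really is under /root/bin (checked in the next reply) and that setup.sh literally references ${HOME}/bin/terraform:

$ # Option 1: put the directory holding terraform on PATH for the current shell
$ export PATH="/root/bin:${PATH}"
$ # Option 2: make the script call whatever terraform is first on PATH
$ sed -i 's|\${HOME}/bin/terraform|terraform|g' setup.sh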

Thanks

Could you check whether /root/bin/terraform is available?
If not, you can use find to locate the terraform binary.
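
A minimal sketch (locate may not be installed, so find is the safer bet):

$ find / -name terraform -type f 2>/dev/null
$ # or, narrower and faster:
$ find /root /usr /opt -name terraform -type f 2>/dev/null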

Yes, you can do it that way. Also, if you can find the exact path of terraform, you can use that full path as well.

After changing the path of terraform, I am getting the error I first mentioned when I opened this post.

I tried to reinstall (uninstall, then install), but something gets stuck at the uninstallation step.

Plan: 0 to add, 0 to change, 27 to destroy.
Do you want to proceed to uninstall [y/n] ?y
╷
│ Warning: "use_microsoft_graph": [DEPRECATED] This field now defaults to `true` and will be removed in v1.3 of Terraform Core due to the deprecation of ADAL by Microsoft.
│ 
**Some more logs ...**
╵
kubernetes_namespace_v1.gpu_operator: Destroying... [id=nvidia-gpu-operator]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 10s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 20s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 30s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 40s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 50s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 4m20s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 4m30s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 4m40s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 4m50s elapsed]
╷
│ Error: context deadline exceeded

The destroy stays in "Still destroying..." for a few minutes and then gives up with the "context deadline exceeded" error.

Now, when running the installation again, it is still stuck at:

module.aks.azurerm_kubernetes_cluster.this: Still modifying... [id=/subscriptions/78b4d5f1-fca5-4af5-b686-...inerService/managedClusters/tao-automl, 10m50s elapsed]
module.aks.azurerm_kubernetes_cluster.this: Still modifying... [id=/subscriptions/78b4d5f1-fca5-4af5-b686-...inerService/managedClusters/tao-automl, 11m0s elapsed]
module.aks.azurerm_kubernetes_cluster.this: Still modifying... [id=/subscriptions/78b4d5f1-fca5-4af5-b686-...inerService/managedClusters/tao-automl, 11m10s elapsed]
module.aks.azurerm_kubernetes_cluster.this: Modifications complete after 11m16s [id=/subscriptions/78b4d5f1-fca5-4af5-b686-34747c61c20f/resourceGroups/tao-automl/providers/Microsoft.ContainerService/managedClusters/tao-automl]
kubernetes_config_map_v1.install_nfs_common: Creating...
kubernetes_secret_v1.imagepullsecret: Creating...
kubernetes_config_map_v1.upgrade_gpu_driver: Creating...
kubernetes_config_map_v1.install_nfs_common: Creation complete after 0s [id=default/install-nfs-common]
kubernetes_secret_v1.imagepullsecret: Creation complete after 0s [id=default/imagepullsecret]
kubernetes_daemon_set_v1.install_nfs_common: Creating...
kubernetes_daemon_set_v1.install_nfs_common: Creation complete after 0s [id=default/install-nfs-common]
helm_release.ingress_nginx: Creating...
helm_release.nfs_subdir_external_provisioner: Creating...
helm_release.ingress_nginx: Still creating... [10s elapsed]
helm_release.nfs_subdir_external_provisioner: Still creating... [10s elapsed]
helm_release.nfs_subdir_external_provisioner: Creation complete after 12s [id=nfs-subdir-external-provisioner]
helm_release.ingress_nginx: Still creating... [20s elapsed]
helm_release.ingress_nginx: Creation complete after 24s [id=ingress-nginx]
╷
│ Error: configmaps "upgrade-gpu-driver" is forbidden: unable to create new content in namespace nvidia-gpu-operator because it is being terminated
│ 
│   with kubernetes_config_map_v1.upgrade_gpu_driver,
│   on gpu-operator.tf line 11, in resource "kubernetes_config_map_v1" "upgrade_gpu_driver":
│   11: resource "kubernetes_config_map_v1" "upgrade_gpu_driver" {
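
One way to see why the nvidia-gpu-operator namespace never finishes terminating is to look at its status conditions and at whatever resources are still left inside it. These are generic kubectl checks, not part of the TAO setup scripts:

$ kubectl get namespace nvidia-gpu-operator -o jsonpath='{.status.conditions}'
$ kubectl api-resources --verbs=list --namespaced -o name | xargs -n 1 kubectl get --show-kind --ignore-not-found -n nvidia-gpu-operator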

Sorry for the late reply. Could you uninstall the driver as below?
$sudo apt purge nvidia-driver-xxx
$sudo apt autoremove
$sudo apt autoclean

Then, after rebooting, please do not install the nvidia-driver again.
Then run the command below to check whether it works.
$ bash setup.sh uninstall

If it works, please run the installation.
$ bash setup.sh install

dpkg -l | grep nvidia
ii  libnvidia-container-tools                1.10.0-1                          amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64               1.10.0-1                          amd64        NVIDIA container runtime library
ii  nvidia-container-runtime                 3.10.0-1                          all          NVIDIA container runtime
ii  nvidia-container-toolkit                 1.10.0-1                          amd64        NVIDIA container runtime hook
hi  nvidia-fabricmanager-515                 515.48.07-1                       amd64        Fabric Manager for NVSwitch based systems.

Which one should I uninstall?

Please replace xxx with the driver version you see when you check with $ nvidia-smi.
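
For example, the installed driver version can be read directly with standard tools (neither command is specific to the TAO scripts):

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader
$ dpkg -l | grep nvidia-driver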

$nvidia-smi
Fri Feb 23 09:01:21 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000001:00:00.0 Off |                    0 |
| N/A   32C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000002:00:00.0 Off |                    0 |
| N/A   34C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000003:00:00.0 Off |                    0 |
| N/A   33C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000004:00:00.0 Off |                    0 |
| N/A   35C    P8    14W /  70W |      2MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+



$sudo apt purge nvidia-driver-515
Reading package lists... Done
Building dependency tree       
Reading state information... Done
**Package 'nvidia-driver-515' is not installed, so not removed**
0 upgraded, 0 newly installed, 0 to remove and 232 not upgraded.

What is the latest status of $kubectl get pods?

The connection to the server localhost:8080 was refused - did you specify the right host or port?
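
For what it's worth, this particular message usually means kubectl has no kubeconfig and is falling back to localhost:8080. A quick check, assuming the usual default location for the config, would be:

$ kubectl config view
$ ls -l ~/.kube/config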

Please try the following for the hosts file. Assuming your local machine has the IP 10.34.4.222 and its password is xxx, set it as:

[master]
10.34.4.222 ansible_ssh_user='ubuntu' ansible_ssh_pass='xxx' ansible_ssh_extra_args='-o StrictHostKeyChecking=no'

Then run bash setup.sh install. Please share the full log with me; you can upload it as a .txt file (a capture example is below). Thanks.
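
To capture everything into one file for upload, something like this works (install.log is just an example name):

$ bash setup.sh install 2>&1 | tee install.log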

I added the host to the hosts file as you suggested and installed again:

I have attached the output file:
output_file.txt (78.6 KB)

From the log, it seems that kubernetes_stateful_set_v1.nvidia_smi is still creating. Did you hit an error in the end?
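
To see whether that StatefulSet ever finishes rolling out, and why not, generic checks would be (the default namespace is taken from the earlier error message):

$ kubectl rollout status statefulset/nvidia-smi -n default
$ kubectl get pods -n default
$ kubectl describe statefulset nvidia-smi -n default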

BTW, may I know if the environment is on Azure AKS or Azure VMs?
More info is in Azure Containers Services: Pricing and Feature Comparison - CAST AI – Kubernetes Automation Platform.
Thanks.