I am trying to install the TAO AutoML Kubernetes cluster on Azure.
After running bash setup.sh install, I get this error:
│ Error: release tao-toolkit-api failed, and has been uninstalled due to atomic being set: timed out waiting for the condition
│
│ with helm_release.tao_toolkit_api,
│ on api-config.tf line 145, in resource "helm_release" "tao_toolkit_api":
│ 145: resource "helm_release" "tao_toolkit_api" {
│
╵
╷
│ Error: StatefulSet default/nvidia-smi is not finished rolling out
│
│ with kubernetes_stateful_set_v1.nvidia_smi,
│ on gpu-operator.tf line 147, in resource "kubernetes_stateful_set_v1" "nvidia_smi":
│ 147: resource "kubernetes_stateful_set_v1" "nvidia_smi" {
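To see why the nvidia-smi StatefulSet never finishes rolling out, a diagnostic sketch along these lines might help. It assumes kubectl is already configured for the AKS cluster; diagnose_statefulset is a hypothetical helper name, and nvidia-smi-0 is simply the standard StatefulSet pod name (name plus ordinal).

```shell
# Hypothetical helper: print rollout state, pod details and recent events
# for a StatefulSet. Assumes kubectl points at the affected cluster.
diagnose_statefulset() {
  ns="$1"
  name="$2"
  # Reports how many replicas are ready; gives up after 30s instead of hanging.
  kubectl rollout status "statefulset/${name}" -n "$ns" --timeout=30s
  # The Events section at the bottom usually says why the pod is not Ready.
  kubectl describe pod "${name}-0" -n "$ns"
  # Most recent events in the namespace, oldest first.
  kubectl get events -n "$ns" --sort-by=.lastTimestamp | tail -n 20
}

# Only run when kubectl is actually available.
if command -v kubectl >/dev/null 2>&1; then
  diagnose_statefulset default nvidia-smi
fi
```

The namespace and name come from the error above (StatefulSet default/nvidia-smi).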
user@machine:~$ which terraform
user@machine:~$ sudo su - root
user@machine:~# which terraform
user@machine:~# terraform --version
Terraform v1.2.4
on linux_amd64
Your version of Terraform is out of date! The latest version
is 1.7.3. You can update by downloading from https://www.terraform.io/downloads.html
So there are two issues with the setup.sh script:
It installs an old version of Terraform (v1.2.4, while the latest is 1.7.3). I am not sure whether this is intentional.
${HOME}/bin/terraform is not on $PATH.
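For the second issue, a minimal sketch of putting ${HOME}/bin on PATH for the current shell could look like this. It assumes setup.sh really leaves the binary at ${HOME}/bin/terraform; ensure_on_path is a made-up helper name, not something from setup.sh.

```shell
# Hypothetical helper: prepend a directory to PATH only if it is not there yet.
ensure_on_path() {
  dir="$1"
  case ":${PATH}:" in
    *":${dir}:"*) ;;                          # already present, nothing to do
    *) PATH="${dir}:${PATH}"; export PATH ;;  # prepend so it wins over other copies
  esac
}

# Assumption: terraform was installed to ${HOME}/bin by setup.sh.
ensure_on_path "${HOME}/bin"

# Confirm which terraform (if any) the shell now resolves.
command -v terraform || echo "terraform is still not on PATH"
```

This only affects the current shell. Note that sudo su - root starts a login shell with root's own HOME and PATH, so the same check has to be repeated there.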
After fixing the Terraform path, I get the same error that I mentioned at the start of this post.
I tried to reinstall (uninstall, then install), but the uninstall step gets stuck.
Plan: 0 to add, 0 to change, 27 to destroy.
Do you want to proceed to uninstall [y/n] ?y
╷
│ Warning: "use_microsoft_graph": [DEPRECATED] This field now defaults to `true` and will be removed in v1.3 of Terraform Core due to the deprecation of ADAL by Microsoft.
│
**Some more logs ...**
╵
kubernetes_namespace_v1.gpu_operator: Destroying... [id=nvidia-gpu-operator]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 10s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 20s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 30s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 40s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 50s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 4m20s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 4m30s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 4m40s elapsed]
kubernetes_namespace_v1.gpu_operator: Still destroying... [id=nvidia-gpu-operator, 4m50s elapsed]
╷
│ Error: context deadline exceeded
The gpu_operator namespace stays in the Still destroying state for about five minutes (4m50s in the log above) and then Terraform gives up.
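A namespace that hangs while being deleted usually still contains objects or finalizers. A sketch like the one below might show what is left in nvidia-gpu-operator. It assumes kubectl is configured for the cluster; list_ns_leftovers is a hypothetical helper name.

```shell
# Hypothetical helper: list every namespaced object still present in a namespace.
list_ns_leftovers() {
  ns="$1"
  # Enumerate all namespaced resource types that support "list" ...
  kubectl api-resources --verbs=list --namespaced -o name |
    while read -r kind; do
      # ... and print any remaining objects of that type.
      kubectl get "$kind" -n "$ns" --ignore-not-found -o name 2>/dev/null </dev/null
    done
}

if command -v kubectl >/dev/null 2>&1; then
  # The namespace status conditions explain what blocks the deletion.
  kubectl get namespace nvidia-gpu-operator \
    -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.message}{"\n"}{end}'
  list_ns_leftovers nvidia-gpu-operator
fi
```

If the conditions mention remaining finalizers, the objects printed by the helper are the ones to inspect first.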
Now, when I run the installation again, it fails with:
module.aks.azurerm_kubernetes_cluster.this: Still modifying... [id=/subscriptions/78b4d5f1-fca5-4af5-b686-...inerService/managedClusters/tao-automl, 10m50s elapsed]
module.aks.azurerm_kubernetes_cluster.this: Still modifying... [id=/subscriptions/78b4d5f1-fca5-4af5-b686-...inerService/managedClusters/tao-automl, 11m0s elapsed]
module.aks.azurerm_kubernetes_cluster.this: Still modifying... [id=/subscriptions/78b4d5f1-fca5-4af5-b686-...inerService/managedClusters/tao-automl, 11m10s elapsed]
module.aks.azurerm_kubernetes_cluster.this: Modifications complete after 11m16s [id=/subscriptions/78b4d5f1-fca5-4af5-b686-34747c61c20f/resourceGroups/tao-automl/providers/Microsoft.ContainerService/managedClusters/tao-automl]
kubernetes_config_map_v1.install_nfs_common: Creating...
kubernetes_secret_v1.imagepullsecret: Creating...
kubernetes_config_map_v1.upgrade_gpu_driver: Creating...
kubernetes_config_map_v1.install_nfs_common: Creation complete after 0s [id=default/install-nfs-common]
kubernetes_secret_v1.imagepullsecret: Creation complete after 0s [id=default/imagepullsecret]
kubernetes_daemon_set_v1.install_nfs_common: Creating...
kubernetes_daemon_set_v1.install_nfs_common: Creation complete after 0s [id=default/install-nfs-common]
helm_release.ingress_nginx: Creating...
helm_release.nfs_subdir_external_provisioner: Creating...
helm_release.ingress_nginx: Still creating... [10s elapsed]
helm_release.nfs_subdir_external_provisioner: Still creating... [10s elapsed]
helm_release.nfs_subdir_external_provisioner: Creation complete after 12s [id=nfs-subdir-external-provisioner]
helm_release.ingress_nginx: Still creating... [20s elapsed]
helm_release.ingress_nginx: Creation complete after 24s [id=ingress-nginx]
╷
│ Error: configmaps "upgrade-gpu-driver" is forbidden: unable to create new content in namespace nvidia-gpu-operator because it is being terminated
│
│ with kubernetes_config_map_v1.upgrade_gpu_driver,
│ on gpu-operator.tf line 11, in resource "kubernetes_config_map_v1" "upgrade_gpu_driver":
│ 11: resource "kubernetes_config_map_v1" "upgrade_gpu_driver" {
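This last error means the install started while the nvidia-gpu-operator namespace from the failed uninstall was still terminating. One possible workaround, sketched here under the assumption that kubectl is configured for the cluster, is to wait until the namespace is really gone before re-running the install. wait_for_ns_gone is a hypothetical helper, and the 30 x 10s budget is an arbitrary choice.

```shell
# Hypothetical helper: poll until a namespace no longer exists.
# Returns 0 once it is gone, 1 if it is still there after all attempts.
# Caveat: any kubectl failure (e.g. cluster unreachable) also counts as "gone".
wait_for_ns_gone() {
  ns="$1"
  tries="${2:-30}"   # number of attempts (arbitrary default)
  delay="${3:-10}"   # seconds between attempts (arbitrary default)
  i=0
  while [ "$i" -lt "$tries" ]; do
    if ! kubectl get namespace "$ns" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

if command -v kubectl >/dev/null 2>&1; then
  if wait_for_ns_gone nvidia-gpu-operator; then
    echo "namespace is gone; re-run: bash setup.sh install"
  else
    echo "namespace is still terminating; check what is blocking it first"
  fi
fi
```

kubectl wait --for=delete namespace/nvidia-gpu-operator --timeout=300s should do roughly the same in a single command.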