In the same window or a separate window? (The last window where I ran the install command is still active, apparently still on "Waiting for the Cluster to become available", and the terminal is not released.)
Do you want me to Ctrl+C that?
Yes, just Ctrl+C to cancel.
Could you run the commands below? Thanks for your patience.
$ bash setup.sh uninstall
$ kubectl delete crd clusterpolicies.nvidia.com
$ bash setup.sh check-inventory.yml
$ bash setup.sh install
No worries! thanks a lot for the support!!
uninstall
bash setup.sh uninstall
Provide the path to the hosts file [./hosts]:
PLAY [all] ********************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [check os] ***************************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [check os version] *******************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [check disk size sufficient] *********************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [check sufficient memory] ************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [check sufficient number of cpu cores] ***********************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [check sudo privileges] **************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [capture gpus per node] **************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [check not more than 1 gpu per node] *************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [check exactly 1 master] *************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]
TASK [capture host details] ***************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]
TASK [print host details] *****************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost] => {
"host_details": [
{
"host": "dgx",
"os": "Ubuntu",
"os_version": "20.04"
}
]
}
TASK [check all instances have single os] *************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]
TASK [check all instances have single os version] *****************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]
TASK [capture os] *************************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]
TASK [capture os version] *****************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]
PLAY RECAP ********************************************************************************************************************************************************************************************************************************************************
172.16.3.2 : ok=16 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
PLAY [all] ********************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [uninstall nvidia using installer] ***************************************************************************************************************************************************************************************************************************
fatal: [172.16.3.2]: FAILED! => {"changed": false, "cmd": "nvidia-installer --uninstall --silent", "msg": "[Errno 2] No such file or directory: b'nvidia-installer'", "rc": 2, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
...ignoring
TASK [uninstall nvidia and cuda drivers] **************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
PLAY RECAP ********************************************************************************************************************************************************************************************************************************************************
172.16.3.2 : ok=3 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=1
PLAY [master] *****************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [Uninstall the GPU Operator with MIG] ************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
PLAY [all] ********************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [Reset Kubernetes component] *********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [IPTables Cleanup] *******************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Remove Conatinerd and Kubernetes packages for Ubuntu] *******************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Remove Docker and Kubernetes packages for Ubuntu] ***********************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Remove NVIDIA Docker for Cloud Native Core Developers] ******************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Remove dependencies that are no longer required] ************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Remove installed packages for RHEL/CentOS] ******************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Cleanup Containerd Process] *********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Cleanup Directories for Cloud Native Core Developers] *******************************************************************************************************************************************************************************************************
skipping: [172.16.3.2] => (item=/etc/docker)
skipping: [172.16.3.2] => (item=/var/lib/docker)
skipping: [172.16.3.2] => (item=/var/run/docker)
skipping: [172.16.3.2] => (item=/run/docker.sock)
skipping: [172.16.3.2] => (item=/run/docker)
TASK [Cleanup Directories] ****************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=/var/lib/etcd)
changed: [172.16.3.2] => (item=/etc/kubernetes)
changed: [172.16.3.2] => (item=/usr/local/bin/helm)
ok: [172.16.3.2] => (item=/var/lib/crio)
ok: [172.16.3.2] => (item=/etc/crio)
ok: [172.16.3.2] => (item=/usr/local/bin/crio)
changed: [172.16.3.2] => (item=/var/log/containers)
ok: [172.16.3.2] => (item=/etc/apt/sources.list.d/devel*)
ok: [172.16.3.2] => (item=/etc/sysctl.d/99-kubernetes-cri.conf)
changed: [172.16.3.2] => (item=/etc/modules-load.d/containerd.conf)
ok: [172.16.3.2] => (item=/etc/modules-load.d/crio.conf)
ok: [172.16.3.2] => (item=/etc/apt/trusted.gpg.d/libcontainers*)
changed: [172.16.3.2] => (item=/etc/default/kubelet)
changed: [172.16.3.2] => (item=/etc/cni/net.d)
TASK [Reboot the system] ******************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
PLAY RECAP ********************************************************************************************************************************************************************************************************************************************************
172.16.3.2 : ok=8 changed=6 unreachable=0 failed=0 skipped=6 rescued=0 ignored=0
delete cluster policy
kubectl delete crd clusterpolicies.nvidia.com
-bash: /usr/bin/kubectl: No such file or directory
check inventory
bash setup.sh check-inventory.yml
Provide the path to the hosts file [./hosts]:
reinstall
bash setup.sh install
output.txt (190.0 KB)
If you want to jump in a teams call let me know please.
P.S. nvidia-smi is not present, and nvsm health is still Unhealthy.
There are similar earlier topics about being stuck at "TASK [Waiting for the Cluster to become available]".
See TAO Toolkit 4.0 setup issue - #19 by Morganh
AutoML installation problem [Waiting for the Cluster to become available] - #7 by Morganh
They were solved with the commands above.
To debug, could you open another terminal and check the logs via the commands below? The asterisks (****) depend on the actual pod names.
$ kubectl get pods
$ kubectl get pod -n nvidia-gpu-operator
$ kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-*****
$ kubectl get pod -n nvidia-gpu-operator nvidia-cuda-validator-****
You can get the name after running
$ kubectl get pod -n nvidia-gpu-operator
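Since the suffix in each pod name is random, the `****` part has to be read off the listing. A minimal sketch of resolving it from a saved copy of the output (the `pods.txt` file and its sample rows are hypothetical stand-ins for the real `kubectl get pod -n nvidia-gpu-operator` output):

```shell
# pods.txt stands in for the real command:
#   kubectl get pod -n nvidia-gpu-operator > pods.txt
cat > pods.txt <<'EOF'
nvidia-container-toolkit-daemonset-pb4dq   0/1   Init:0/1                0   2m6s
nvidia-driver-daemonset-gwmlj              0/1   Init:CrashLoopBackOff   9   25m
EOF

# First column of the first row whose name starts with the daemonset prefix
POD=$(awk '$1 ~ /^nvidia-driver-daemonset-/ {print $1; exit}' pods.txt)
echo "$POD"
```

The resolved name is then what gets passed to `kubectl logs -n nvidia-gpu-operator`.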
I get this
get pods
kubectl get pods
No resources found in default namespace.
kubectl get pod -n nvidia-gpu-operator
kubectl get pod -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-lmrkw 0/1 Init:0/1 0 2m6s
gpu-operator-1678981498-node-feature-discovery-master-79ddmqcr6 1/1 Running 0 2m13s
gpu-operator-1678981498-node-feature-discovery-worker-chxdx 1/1 Running 2 (117s ago) 24m
gpu-operator-7bfc5f55-cgrxl 1/1 Running 0 2m13s
nvidia-container-toolkit-daemonset-pb4dq 0/1 Init:0/1 0 2m6s
nvidia-dcgm-exporter-jp6wr 0/1 Init:0/1 0 2m7s
nvidia-device-plugin-daemonset-pd925 0/1 Init:0/1 0 2m7s
nvidia-driver-daemonset-gwmlj 0/1 Init:CrashLoopBackOff 9 (2m6s ago) 25m
nvidia-operator-validator-5qz4c 0/1 Init:0/4 0 2m6s
I don't seem to get nvidia-gpu-operator nvidia-driver-daemonset-**
or gpu-operator-operator nvidia-cuda-validator-***
This one: nvidia-driver-daemonset-gwmlj
Sorry for the delay
kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-gwmlj
Error from server (BadRequest): container "nvidia-driver-ctr" in pod "nvidia-driver-daemonset-gwmlj" is waiting to start: PodInitializing
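As a side note, a quick way to see which pods in a listing like the one above are not fully up is to filter on the STATUS column. A sketch over a saved copy of the listing (`pods.txt` is a hypothetical stand-in; the sample rows are taken from the output above):

```shell
# pods.txt stands in for: kubectl get pod -n nvidia-gpu-operator > pods.txt
cat > pods.txt <<'EOF'
gpu-operator-7bfc5f55-cgrxl       1/1   Running                 0   2m13s
nvidia-driver-daemonset-gwmlj     0/1   Init:CrashLoopBackOff   9   25m
nvidia-operator-validator-5qz4c   0/1   Init:0/4                0   2m6s
EOF

# Print name and status for every pod whose STATUS is not "Running"
awk '$3 != "Running" {print $1, $3}' pods.txt
```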
Should I run this in the other terminal while TASK [Waiting for the Cluster to become available] is still going, or SIGINT (Ctrl+C) it first?
Yes, keep it as is. Do not Ctrl+C.
kubectl delete crd clusterpolicies.nvidia.com
customresourcedefinition.apiextensions.k8s.io "clusterpolicies.nvidia.com" deleted
That seems to be a change… hmm…
Then monitor the original terminal to check whether it is still stuck.
yeah doing stuff!!!
TASK [Waiting for the Cluster to become available] ****************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Validate kubernetes cluster is up] **************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Operating System Version] *****************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Docker Version] ***************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check Containerd Version] ***********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Kubernetes Version] ***********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Helm Version] *****************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Nvidia GPU Operator Toolkit versions] *****************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Nvidia K8s Device versions] ***************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check GPU Operator Nvidia Container Driver versions] ********************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check GPU Operator Nvidia DGCM Versions] ********************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check GPU Operator Node Feature Discovery Versions] *********************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check GPU Operator GPU Feature Discovery Versions] **********************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Nvidia GPU Operator versions] *************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Nvidia MIG Maanger versions] **************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Nvidia validator versions] ****************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Nvidia DCGM Exporter versions] ************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check NVIDIA Driver Version] ********************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check NVIDIA Container ToolKit Version] *********************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check Mellanox Network Operator version] ********************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check Mellanox MOFED Driver Version] ************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check RDMA Shared Device Plugin Version] ********************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check SRIOV Device Plugin Version] **************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check Container Networking Plugins Version] *****************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check Multus Version] ***************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check Whereabouts Version] **********************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [check master node is up and running] ************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check all pods are running for Kubernetes] ******************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [validate helm installed] ************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Collecting Number of GPU's] *********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Create NVIDIA-SMI yaml] *************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Operating System Version of RHEL/CentOS] *************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Operating System Version of Ubuntu] ******************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Ubuntu Operating System version 20.04.5 LTS (Focal Fossa)"
}
TASK [Report Docker Version] **************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Containerd Version] **********************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": " Containerd Version v1.6.2"
}
TASK [Report Kubernetes Version] **********************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Kubernetes Version v1.23.5"
}
TASK [Report Helm Version] ****************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Helm Version 3.8.1"
}
TASK [Report Nvidia GPU Operator version] *************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Nvidia GPU Operator versions v1.10.1"
}
TASK [Report Nvidia Container Driver Version] *********************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Nvidia Container Driver Version "
}
TASK [Report GPU Operator NV Toolkit Driver] **********************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "NV Container Toolkit Version "
}
TASK [Report Nvidia Container Driver Version] *********************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report GPU Operator NV Toolkit Driver] **********************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report K8sDevice Plugin Version] ****************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Nvidia K8s Device Plugin Version "
}
TASK [Report Data Center GPU Manager (DCGM) Version] **************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Data Center GPU Manager (DCGM) Version "
}
TASK [Report Node Feature Discovery Version] **********************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Node Feature Discovery Version v0.10.1"
}
TASK [Report GPU Feature Discovery Version] ***********************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "GPU Feature Discovery Version "
}
TASK [Report Nvidia validator version] ****************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Nvidia validator version "
}
TASK [Report Nvidia DCGM Exporter version] ************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Nvidia DCGM Exporter version "
}
TASK [Report Nvidia MIG Maanger version] **************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Nvidia MIG Maanger version "
}
TASK [Componenets Matrix Versions Vs Installed Versions] **********************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Report Versions] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": [
"===========================================================================================",
" Components Matrix Version || Installed Version ",
"===========================================================================================",
"GPU Operator Version v1.10.1 || v1.10.1",
"Nvidia Container Driver Version 510.47.03 || ",
"GPU Operator NV Toolkit Driver v1.9.0 || ",
"K8sDevice Plugin Version v0.11.0 || ",
"Data Center GPU Manager(DCGM) Version 2.3.4-2.6.4 || ",
"Node Feature Discovery Version v0.10.1 || v0.10.1",
"GPU Feature Discovery Version v0.5.0 || ",
"Nvidia validator version v1.10.1 || ",
"Nvidia MIG Manager version 0.3.0 || ",
"",
"Note: NVIDIA Mig Manager is valid for only Amphere GPU's like A100, A30",
"",
"Please validate between Matrix Version and Installed Version listed above"
]
}
TASK [Componenets Matrix Versions Vs Installed Versions] **********************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Versions] ********************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Componenets Matrix Versions Vs Installed Versions] **********************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Versions] ********************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Validate the GPU Operator pods State] ***********************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Report GPU Operator Pods] ***********************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": [
"nvidia-gpu-operator gpu-operator-1678981498-node-feature-discovery-master-79ddznwhj 1/1 Running 0 5m35s",
"nvidia-gpu-operator gpu-operator-1678981498-node-feature-discovery-worker-chxdx 1/1 Running 4 (5m17s ago) 43m",
"nvidia-gpu-operator gpu-operator-7bfc5f55-ghhvs 1/1 Running 0 5m35s"
]
}
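Worth noting: the pod list above shows only the node-feature-discovery pods and the operator itself. The driver, container-toolkit, device-plugin, and DCGM daemonset pods are absent, which is consistent with the blank "Installed Version" entries in the version matrix earlier. A hedged follow-up check (commands assume the default `nvidia-gpu-operator` namespace used in this log):

```shell
# List everything the GPU operator has deployed; on a healthy install you would
# also expect nvidia-driver-daemonset, nvidia-container-toolkit-daemonset,
# nvidia-device-plugin-daemonset, and dcgm-exporter pods in Running state.
kubectl get pods -n nvidia-gpu-operator -o wide
kubectl get daemonsets -n nvidia-gpu-operator
```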
TASK [Validate GPU Operator Version for Cloud Native Core 6.2 and 7.0] ********************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Validate GPU Operator Version for Cloud Native Core 6.1] ****************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Validating the nvidia-smi on NVIDIA Cloud Native Core] ******************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Validating the nvidia-smi on NVIDIA Cloud Native Core] ******************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Validating the nvidia-smi on NVIDIA Cloud Native Core] ******************************************************************************************************************************************************************************************************
ASYNC FAILED on 172.16.3.2: jid=184764697792.434153
fatal: [172.16.3.2]: FAILED! => {"ansible_job_id": "184764697792.434153", "changed": true, "cmd": ["kubectl", "run", "gpu-test", "--rm", "-t", "-i", "--restart=Never", "--image=nvidia/cuda:11.6.0-base-ubuntu20.04", "--limits=nvidia.com/gpu=", "--", "nvidia-smi"], "delta": "0:00:11.738078", "end": "2023-03-16 16:29:29.685476", "finished": 1, "msg": "non-zero return code", "rc": 128, "results_file": "/home/g/.ansible_async/184764697792.434153", "start": "2023-03-16 16:29:17.947398", "started": 1, "stderr": "Flag --limits has been deprecated, has no effect and will be removed in 1.24.\npod default/gpu-test terminated (StartError)\nfailed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: \"nvidia-smi\": executable file not found in $PATH: unknown", "stderr_lines": ["Flag --limits has been deprecated, has no effect and will be removed in 1.24.", "pod default/gpu-test terminated (StartError)", "failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: \"nvidia-smi\": executable file not found in $PATH: unknown"], "stdout": "pod \"gpu-test\" deleted", "stdout_lines": ["pod \"gpu-test\" deleted"]}
...ignoring
TASK [Report Nvidia SMI Validation] *******************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": [
"pod \"gpu-test\" deleted"
]
}
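Two things stand out in the failed validation above: the logged command used `--limits=nvidia.com/gpu=` with an empty count, so the pod may never have requested a GPU, and "`nvidia-smi`: executable file not found in $PATH" is also the classic symptom of the driver/toolkit daemonsets not being in place yet (they were missing from the pod list earlier). A hedged manual retry, explicitly requesting one GPU:

```shell
# Manual re-run of the validation pod (sketch; --limits is deprecated and was
# removed in kubectl 1.24 -- on newer clusters use an overrides/manifest instead).
kubectl run gpu-test --rm -t -i --restart=Never \
  --image=nvidia/cuda:11.6.0-base-ubuntu20.04 \
  --limits=nvidia.com/gpu=1 -- nvidia-smi
```

If this still fails with the same error even with a GPU requested, the driver stack on the node is the more likely culprit than the pod spec.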
TASK [Validating the CUDA with GPU] *******************************************************************************************************************************************************************************************************************************
ASYNC POLL on 172.16.3.2: jid=754237448857.435110 started=1 finished=0
ASYNC POLL on 172.16.3.2: jid=754237448857.435110 started=1 finished=0
ASYNC POLL on 172.16.3.2: jid=754237448857.435110 started=1 finished=0
ASYNC FAILED on 172.16.3.2: jid=754237448857.435110
fatal: [172.16.3.2]: FAILED! => {"ansible_job_id": "754237448857.435110", "changed": true, "cmd": "kubectl run cuda-vector-add --rm -t -i --restart=Never --image=k8s.gcr.io/cuda-vector-add:v0.1", "delta": "0:01:00.060847", "end": "2023-03-16 16:30:33.716430", "finished": 1, "msg": "non-zero return code", "rc": 1, "results_file": "/home/g/.ansible_async/754237448857.435110", "start": "2023-03-16 16:29:33.655583", "started": 1, "stderr": "error: timed out waiting for the condition", "stderr_lines": ["error: timed out waiting for the condition"], "stdout": "pod \"cuda-vector-add\" deleted", "stdout_lines": ["pod \"cuda-vector-add\" deleted"]}
...ignoring
TASK [Validating the nvidia-smi on NVIDIA Cloud Native Core 7.0] **************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Nvidia SMI Validation] *******************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Cuda Validation] *************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": [
"pod \"cuda-vector-add\" deleted"
]
}
TASK [Report Network Operator version] ****************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Mellanox MOFED Driver Version] ***********************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report RDMA Shared Device Plugin Version] *******************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report SRIOV Device Plugin Version] *************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Container Networking Plugin Version] *****************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Multus Version] **************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Whereabouts Version] *********************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Status Check] ***********************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [debug] ******************************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "All tasks should be changed or ok, if it's failed or ignoring means that validation task failed."
}
PLAY RECAP ********************************************************************************************************************************************************************************************************************************************************
172.16.3.2 : ok=47 changed=27 unreachable=0 failed=0 skipped=30 rescued=0 ignored=2
PLAY [all] ********************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [install nfs-common] *****************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
PLAY [master[0]] **************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [install nginx ingress controller] ***************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx --force-update)
changed: [172.16.3.2] => (item=helm repo update)
changed: [172.16.3.2] => (item=helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx --set controller.service.type=NodePort --set controller.service.nodePorts.http=32080 --set controller.service.nodePorts.https=32443)
TASK [install nfs-kernel-server] **********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [create export directory] ************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [create export config] ***************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [export filesystem] ******************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [restart nfs-kernel-server] **********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [install storage provisioner] ********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/ --force-update)
changed: [172.16.3.2] => (item=helm repo update)
changed: [172.16.3.2] => (item=helm upgrade --install --reset-values --cleanup-on-fail --create-namespace --atomic --wait nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner --set nfs.server=172.16.3.2 --set nfs.path=/mnt/nfs_share)
TASK [setup imagepullsecret] **************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=kubectl delete secret 'imagepullsecret' --ignore-not-found)
changed: [172.16.3.2] => (item=kubectl create secret docker-registry 'imagepullsecret' --docker-server='nvcr.io' --docker-username='$oauthtoken' --docker-password=a3NvZ2xvcnUxOWNpMjcxM201YzdnMjdtN3Y6ZTZiZDIzNWItNTc3Mi00OTY3LWI3YTQtMmFiYzIzMDNjMjEx --docker-email=ganindu@gmail.com --namespace='default')
TASK [capture node names] *****************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [setup nvidia-smi on each node] ******************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=dgx)
TASK [label nodes with accelerator] *******************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=dgx)
TASK [copy tao-toolkit-api helm values to master] *****************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [install tao-toolkit-api] ************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
PLAY [all] ********************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [capture cluster ips] ****************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]
TASK [store cluster ips] ******************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [copy ~/.kube/config to /tmp/kube-config] ********************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [ensure api endpoint uses publicly accessible ip] ************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [generify context name] **************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [replace generic cluster user and context name in kubeconfig with provided cluster name] *********************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [fetch file /tmp/cluster_ips from remote to local] ***********************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [fetch /tmp/kube-config from remote to local] ****************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
PLAY [localhost] **************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [localhost]
TASK [register cluster ips] ***************************************************************************************************************************************************************************************************************************************
changed: [localhost]
TASK [add an kubernetes apt signing key for ubuntu] ***************************************************************************************************************************************************************************************************************
ok: [localhost]
TASK [adding kubernetes apt repository for ubuntu] ****************************************************************************************************************************************************************************************************************
ok: [localhost]
TASK [install kubectl] ********************************************************************************************************************************************************************************************************************************************
ok: [localhost]
TASK [check if helm is installed] *********************************************************************************************************************************************************************************************************************************
changed: [localhost]
TASK [install helm on ubuntu 18.04] *******************************************************************************************************************************************************************************************************************************
skipping: [localhost] => (item=curl -O https://get.helm.sh/helm-v3.3.3-linux-amd64.tar.gz)
skipping: [localhost] => (item=tar -xvzf helm-v3.3.3-linux-amd64.tar.gz)
skipping: [localhost] => (item=cp linux-amd64/helm /usr/local/bin/)
skipping: [localhost] => (item=chmod 755 /usr/local/bin/helm)
skipping: [localhost] => (item=rm -rf helm-v3.3.3-linux-amd64.tar.gz linux-amd64)
TASK [install helm on ubuntu 20.04] *******************************************************************************************************************************************************************************************************************************
skipping: [localhost] => (item=curl -O https://get.helm.sh/helm-v3.8.1-linux-amd64.tar.gz)
skipping: [localhost] => (item=tar -xvzf helm-v3.8.1-linux-amd64.tar.gz)
skipping: [localhost] => (item=cp linux-amd64/helm /usr/local/bin/)
skipping: [localhost] => (item=chmod 755 /usr/local/bin/helm)
skipping: [localhost] => (item=rm -rf helm-v3.8.1-linux-amd64.tar.gz linux-amd64)
TASK [create kube directory] **************************************************************************************************************************************************************************************************************************************
ok: [localhost]
TASK [ensure kubeconfig file exists] ******************************************************************************************************************************************************************************************************************************
changed: [localhost]
TASK [merge kubeconfig to existing] *******************************************************************************************************************************************************************************************************************************
changed: [localhost]
TASK [make merged-kubeconfig default] *****************************************************************************************************************************************************************************************************************************
changed: [localhost]
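After the kubeconfig merge above, a quick sanity check (a sketch, assuming the merged config became the default as the task name suggests) confirms the new context exists and the cluster answers:

```shell
# Verify the merged kubeconfig: the new cluster's context should be listed
# and selectable, and the node(s) should report Ready.
kubectl config get-contexts
kubectl get nodes -o wide
```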
PLAY RECAP ********************************************************************************************************************************************************************************************************************************************************
172.16.3.2 : ok=22 changed=15 unreachable=0 failed=0 skipped=3 rescued=0 ignored=0
localhost : ok=10 changed=5 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0
(K8PY) g@dgx:~/Workspace/sandbox/TAO/getting_started_v4.0.0/setup/quickstart_api_bare_metal$
I think the block is gone now! Great stuff. (Can I now register this with my kube master on the CPU server?)
Great. It is working now.
Then, to get started, you can download the notebooks (see Remote Client - NVIDIA Docs) or refer to the blog: https://developer.nvidia.com/blog/training-like-an-ai-pro-using-tao-automl/
Thanks a lot! I think the trick was effectively to delete the CRD.
However, I still don't have nvidia-smi, and nvsm show health still reports Unhealthy.
We want to be able to keep using the DGX as we did before as well. (Is that a deal breaker? If so, we might sadly have to revert all of this.)
Officially, the current method installs the TAO API with the GPU driver uninstalled.
That said, there are ways to install the TAO API on non-bare-metal machines, for users who do not want to uninstall the GPU driver and other software.
We propose installing Kubernetes first and then deploying the TAO API. It is a bit involved and not yet mentioned in the user guide. You can first follow this guide to install a Kubernetes cluster with GPU support: https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#option-2-installing-kubernetes-using-kubeadm . Then follow https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_deployment.html to install tao-toolkit-api.
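The alternative flow described above can be outlined roughly as follows. This is a sketch only: the chart URL, version, and flags are assumptions based on the TAO 4.0 era of the linked api_deployment guide, which remains the authoritative reference.

```shell
# Hypothetical outline (chart name/version assumed; consult the linked guide):
# 1. Stand up Kubernetes with GPU support per the NVIDIA Cloud Native docs.
# 2. Fetch and install the TAO Toolkit API chart with your NGC credentials.
helm fetch https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.0.tgz \
  --username='$oauthtoken' --password="$NGC_API_KEY"
helm install tao-toolkit-api tao-toolkit-api-4.0.0.tgz --namespace default
```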