TAO Toolkit API 5.3.0 - Installed with errors

Hi!

I was able to install TAO Toolkit API 5.3.0, but some problems arose during the installation. This is the complete log:

installation_log.txt (225.1 KB)

And these are all the pods:

NAMESPACE             NAME                                                              READY   STATUS     RESTARTS   AGE
default               dgx-job-controller-74f7f5ccb8-7pl4f                               1/1     Running    0          67s
default               ingress-nginx-controller-764899d9c6-q4lwn                         1/1     Running    0          82s
default               nfs-subdir-external-provisioner-8665b7df97-w6sfz                  1/1     Running    0          75s
default               nvidia-smi-gpuaasl40vs                                            1/1     Running    0          70s
default               nvtl-api-app-pod-784b78fb8b-zdpsw                                 1/1     Running    0          67s
default               nvtl-api-jupyterlab-pod-5cbbbb8f8-9sb8n                           1/1     Running    0          67s
default               nvtl-api-workflow-pod-6db787664c-dtfrm                            1/1     Running    0          67s
kube-system           calico-kube-controllers-6ff746f7c5-8zxbl                          1/1     Running    0          4m48s
kube-system           calico-node-z6njm                                                 1/1     Running    0          4m43s
kube-system           coredns-5d78c9869d-b97mb                                          1/1     Running    0          3m29s
kube-system           coredns-5d78c9869d-df2zq                                          1/1     Running    0          3m29s
kube-system           etcd-gpuaasl40vs                                                  1/1     Running    0          5m5s
kube-system           kube-apiserver-gpuaasl40vs                                        1/1     Running    0          5m5s
kube-system           kube-controller-manager-gpuaasl40vs                               1/1     Running    0          5m5s
kube-system           kube-proxy-6sq7g                                                  1/1     Running    0          4m50s
kube-system           kube-scheduler-gpuaasl40vs                                        1/1     Running    0          5m5s
nvidia-gpu-operator   gpu-feature-discovery-h68wf                                       0/1     Init:0/1   0          3m36s
nvidia-gpu-operator   gpu-operator-1713535811-node-feature-discovery-gc-7b46cd672vvxc   1/1     Running    0          3m22s
nvidia-gpu-operator   gpu-operator-1713535811-node-feature-discovery-master-5b9942t5h   1/1     Running    0          4m32s
nvidia-gpu-operator   gpu-operator-1713535811-node-feature-discovery-worker-2l9qk       1/1     Running    0          3m19s
nvidia-gpu-operator   gpu-operator-5587854f69-7pnrs                                     1/1     Running    0          4m32s
nvidia-gpu-operator   nvidia-container-toolkit-daemonset-tbvgv                          0/1     Init:0/1   0          3m36s
nvidia-gpu-operator   nvidia-dcgm-exporter-qqp2f                                        0/1     Init:0/1   0          3m36s
nvidia-gpu-operator   nvidia-device-plugin-daemonset-rmwhf                              0/1     Init:0/1   0          3m36s
nvidia-gpu-operator   nvidia-driver-daemonset-tb74g                                     0/1     Running    0          3m58s
nvidia-gpu-operator   nvidia-operator-validator-vkfx4                                   0/1     Init:0/4   0          3m36s

The “nvidia” ones are stuck in the initialization phase. Could this be a problem?
The result of kubectl logs -f nvtl-api-app-pod-784b78fb8b-zdpsw is:

api-logs.txt (16.8 KB)

I am working with an L40 GPU inside a VM with passthrough. Is there a way to monitor GPU usage inside the Kubernetes cluster?
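To check which init step those pods are waiting on, I assume a describe on one of the pending pods would show the init container status (pod name and namespace taken from the listing above):

$ kubectl describe pod nvidia-operator-validator-vkfx4 -n nvidia-gpu-operator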

Thanks!

Hi,
I can install TAO API 5.3.0 successfully on an A40 machine. Attaching my log.
20240421_install_TAO-API_5.3_for_forum.txt (397.1 KB)

Are you installing on bare metal?
You can try removing the driver and retrying:
$ bash setup.sh uninstall
$ sudo apt purge nvidia-driver-5xx (check the exact 5xx version via $ nvidia-smi)
$ sudo apt autoremove
$ sudo apt autoclean
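After the purge, you can confirm the driver is really gone and then re-run the installation (the reinstall command here is assumed to be the standard setup.sh install step, matching the uninstall step above):

$ nvidia-smi        # should now fail, since the driver package was removed
$ bash setup.sh install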

You can run the command below to check GPU usage:
$ kubectl exec -it nvidia-smi-gpuaasl40vs -- nvidia-smi
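If you want continuous monitoring, wrapping the same command in watch should also work (dropping -it, since no interactive terminal is needed):

$ watch -n 5 kubectl exec nvidia-smi-gpuaasl40vs -- nvidia-smi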

Hi Morganh!

Problem solved! It seems that some validation pods only go into the Running state after the CUDA and nvidia-smi tests complete, which is what caused the errors during installation. For example, these ones:

nvidia-gpu-operator   nvidia-container-toolkit-daemonset-tbvgv                          0/1     Init:0/1   0          3m36s
nvidia-gpu-operator   nvidia-dcgm-exporter-qqp2f                                        0/1     Init:0/1   0          3m36s
nvidia-gpu-operator   nvidia-device-plugin-daemonset-rmwhf                              0/1     Init:0/1   0          3m36s
nvidia-gpu-operator   nvidia-driver-daemonset-tb74g                                     0/1     Running    0          3m58s
nvidia-gpu-operator   nvidia-operator-validator-vkfx4                                   0/1     Init:0/4  

It just needed a little patience.
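For reference, the transition of these gpu-operator pods can be followed live; the -w flag streams status updates as each init container completes:

$ kubectl get pods -n nvidia-gpu-operator -w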

Ok, got it.

Thanks!

Mattia

