Hi!
I was able to install TAO Toolkit API 5.3.0, but some problems arose during installation. This is the complete log:
installation_log.txt (225.1 KB)
And these are all the pods:
NAMESPACE NAME READY STATUS RESTARTS AGE
default dgx-job-controller-74f7f5ccb8-7pl4f 1/1 Running 0 67s
default ingress-nginx-controller-764899d9c6-q4lwn 1/1 Running 0 82s
default nfs-subdir-external-provisioner-8665b7df97-w6sfz 1/1 Running 0 75s
default nvidia-smi-gpuaasl40vs 1/1 Running 0 70s
default nvtl-api-app-pod-784b78fb8b-zdpsw 1/1 Running 0 67s
default nvtl-api-jupyterlab-pod-5cbbbb8f8-9sb8n 1/1 Running 0 67s
default nvtl-api-workflow-pod-6db787664c-dtfrm 1/1 Running 0 67s
kube-system calico-kube-controllers-6ff746f7c5-8zxbl 1/1 Running 0 4m48s
kube-system calico-node-z6njm 1/1 Running 0 4m43s
kube-system coredns-5d78c9869d-b97mb 1/1 Running 0 3m29s
kube-system coredns-5d78c9869d-df2zq 1/1 Running 0 3m29s
kube-system etcd-gpuaasl40vs 1/1 Running 0 5m5s
kube-system kube-apiserver-gpuaasl40vs 1/1 Running 0 5m5s
kube-system kube-controller-manager-gpuaasl40vs 1/1 Running 0 5m5s
kube-system kube-proxy-6sq7g 1/1 Running 0 4m50s
kube-system kube-scheduler-gpuaasl40vs 1/1 Running 0 5m5s
nvidia-gpu-operator gpu-feature-discovery-h68wf 0/1 Init:0/1 0 3m36s
nvidia-gpu-operator gpu-operator-1713535811-node-feature-discovery-gc-7b46cd672vvxc 1/1 Running 0 3m22s
nvidia-gpu-operator gpu-operator-1713535811-node-feature-discovery-master-5b9942t5h 1/1 Running 0 4m32s
nvidia-gpu-operator gpu-operator-1713535811-node-feature-discovery-worker-2l9qk 1/1 Running 0 3m19s
nvidia-gpu-operator gpu-operator-5587854f69-7pnrs 1/1 Running 0 4m32s
nvidia-gpu-operator nvidia-container-toolkit-daemonset-tbvgv 0/1 Init:0/1 0 3m36s
nvidia-gpu-operator nvidia-dcgm-exporter-qqp2f 0/1 Init:0/1 0 3m36s
nvidia-gpu-operator nvidia-device-plugin-daemonset-rmwhf 0/1 Init:0/1 0 3m36s
nvidia-gpu-operator nvidia-driver-daemonset-tb74g 0/1 Running 0 3m58s
nvidia-gpu-operator nvidia-operator-validator-vkfx4 0/1 Init:0/4 0 3m36s
The pods in the nvidia-gpu-operator namespace are stuck in the initialization phase. Could this be a problem?
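For reference, this is roughly how I am inspecting the stuck pods, taking nvidia-operator-validator-vkfx4 as an example (the init container name has to be read from the describe output, so the last command is just a sketch):

# Show events and init container status for one of the stuck pods
kubectl describe pod nvidia-operator-validator-vkfx4 -n nvidia-gpu-operator

# Check the driver daemonset, since the other pods wait on it
kubectl logs nvidia-driver-daemonset-tb74g -n nvidia-gpu-operator

# Logs of a specific init container (name taken from the describe output above)
kubectl logs nvidia-operator-validator-vkfx4 -n nvidia-gpu-operator -c <init-container-name>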
The output of kubectl logs -f nvtl-api-app-pod-784b78fb8b-zdpsw is:
api-logs.txt (16.8 KB)
I am working with an L40 GPU inside a VM with passthrough. Is there a way to monitor GPU usage inside the Kubernetes cluster?
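What I had in mind is something like the following, once the driver pod is actually running (the DCGM exporter port 9400 is an assumption based on its defaults, so treat this as a sketch):

# Run nvidia-smi inside the driver daemonset pod
kubectl exec -it nvidia-driver-daemonset-tb74g -n nvidia-gpu-operator -- nvidia-smi

# Or scrape the DCGM exporter metrics (assuming its default port 9400)
kubectl port-forward -n nvidia-gpu-operator pod/nvidia-dcgm-exporter-qqp2f 9400:9400
curl http://localhost:9400/metrics

Is this the recommended approach, or is there a better way?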
Thanks!