TAO Toolkit 4.0 setup issue - similar to a previous issue

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
I am using an RTX 3080 GPU

I am having a similar issue to the one in this thread: TAO Toolkit 4.0 setup issue

I attempted the nouveau blacklist fix mentioned in the previous thread, but it didn't resolve the problem. There also appears to be some missing information on how I can collect logs from the pod that failed to init.
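
For reference, these are the commands I am aware of for pulling logs from the failing init container (the pod and container names are taken from the describe output that follows; they will differ on another cluster), so please let me know if something else is needed:

$ kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-4fl5k -c k8s-driver-manager
$ kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-4fl5k -c k8s-driver-manager --previous
$ kubectl get events -n nvidia-gpu-operator --sort-by=.lastTimestamp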

This is the output of the following command for the failed pod:

$ kubectl describe pod -n nvidia-gpu-operator nvidia-driver-daemonset-4fl5k

Name: nvidia-driver-daemonset-4fl5k
Namespace: nvidia-gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Node: pmsi-tao/10.1.40.26
Start Time: Tue, 31 Jan 2023 15:29:17 -0500
Labels: app=nvidia-driver-daemonset
controller-revision-hash=589ff6c946
pod-template-generation=1
Annotations: cni.projectcalico.org/containerID: 24b288f073ba68db36601cf9771b818b3d82270de043eca91cb9a5efbe5b6af8
cni.projectcalico.org/podIP: 192.168.35.187/32
cni.projectcalico.org/podIPs: 192.168.35.187/32
Status: Pending
IP: 192.168.35.187
IPs:
IP: 192.168.35.187
Controlled By: DaemonSet/nvidia-driver-daemonset
Init Containers:
k8s-driver-manager:
Container ID: containerd://8b13234a3a23b093776430e2175490a4faf44480d868050d7d69598ceb77d4b1
Image: nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.3.0
Image ID: nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5b16056257acc51b517d9cdb1da3218693cefc214af93789e6e214fd2b4cacf1
Port:
Host Port:
Command:
driver-manager
Args:
uninstall_driver
State: Running
Started: Tue, 31 Jan 2023 15:33:07 -0500
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 31 Jan 2023 15:31:31 -0500
Finished: Tue, 31 Jan 2023 15:31:41 -0500
Ready: False
Restart Count: 5
Environment:
NODE_NAME: (v1:spec.nodeName)
NVIDIA_VISIBLE_DEVICES: void
ENABLE_AUTO_DRAIN: true
DRAIN_USE_FORCE: false
DRAIN_POD_SELECTOR_LABEL:
DRAIN_TIMEOUT_SECONDS: 0s
DRAIN_DELETE_EMPTYDIR_DATA: false
OPERATOR_NAMESPACE: nvidia-gpu-operator (v1:metadata.namespace)
Mounts:
/run/nvidia from run-nvidia (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jgcjn (ro)
Containers:
nvidia-driver-ctr:
Container ID:
Image: nvcr.io/nvidia/driver:510.47.03-ubuntu20.04
Image ID:
Port:
Host Port:
Command:
nvidia-driver
Args:
init
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
Mounts:
/dev/log from dev-log (rw)
/host-etc/os-release from host-os-release (ro)
/run/mellanox/drivers from run-mellanox-drivers (rw)
/run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
/run/nvidia from run-nvidia (rw)
/run/nvidia-topologyd from run-nvidia-topologyd (rw)
/var/log from var-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jgcjn (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: DirectoryOrCreate
var-log:
Type: HostPath (bare host directory volume)
Path: /var/log
HostPathType:
dev-log:
Type: HostPath (bare host directory volume)
Path: /dev/log
HostPathType:
host-os-release:
Type: HostPath (bare host directory volume)
Path: /etc/os-release
HostPathType:
run-nvidia-topologyd:
Type: HostPath (bare host directory volume)
Path: /run/nvidia-topologyd
HostPathType: DirectoryOrCreate
mlnx-ofed-usr-src:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers/usr/src
HostPathType: DirectoryOrCreate
run-mellanox-drivers:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers
HostPathType: DirectoryOrCreate
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
kube-api-access-jgcjn:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.driver=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  3m52s                 default-scheduler  Successfully assigned nvidia-gpu-operator/nvidia-driver-daemonset-4fl5k to pmsi-tao
  Normal   Pulled     98s (x5 over 3m52s)   kubelet            Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.3.0" already present on machine
  Normal   Created    98s (x5 over 3m52s)   kubelet            Created container k8s-driver-manager
  Normal   Started    98s (x5 over 3m52s)   kubelet            Started container k8s-driver-manager
  Warning  BackOff    60s (x10 over 3m24s)  kubelet            Back-off restarting failed container

Please run
$ bash setup.sh uninstall
$ kubectl delete crd clusterpolicies.nvidia.com
$ bash setup.sh install
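
To confirm that the uninstall and the CRD deletion actually completed before running the install again, something like the following can be checked (namespace and CRD name as used earlier in this thread):

$ kubectl get pods -n nvidia-gpu-operator
$ kubectl get crd | grep -i nvidia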

I attempted those steps as well, same result.

Can you upload all the commands and logs via the method shown in the image?
[image]
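
If it helps, one way to capture the requested output into files for upload is sketched below (the file names are just examples):

$ kubectl get pods -A > pods.log 2>&1
$ kubectl describe pod -n nvidia-gpu-operator nvidia-driver-daemonset-4fl5k > describe-driver-pod.log 2>&1
$ kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-4fl5k -c k8s-driver-manager --previous > driver-manager.log 2>&1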

There has been no update from you for a while, so we are assuming this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

Hi,
Please check the following cases and to-dos.

  1. If you are running on bare metal with nothing else installed on the system, please run
    $ bash setup.sh install

  2. If nouveau is installed and in use, run the commands below to get logs for debugging (a sketch for checking and blacklisting nouveau is included after this list).
    $ kubectl get pods
    $ kubectl get pod -n nvidia-gpu-operator
    $ kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-*****
    $ kubectl get pod -n nvidia-gpu-operator nvidia-cuda-validator-****
    The previous solution is:
    $ bash setup.sh uninstall
    $ kubectl delete crd clusterpolicies.nvidia.com
    $ sudo reboot
    $ bash setup.sh install

  3. If an NVIDIA GPU driver is pre-installed,
    run the commands below to uninstall the GPU driver.
    $ sudo apt purge nvidia-driver-*
    $ sudo apt autoremove
    $ sudo apt autoclean
    then,
    $ bash setup.sh uninstall
    $ kubectl delete crd clusterpolicies.nvidia.com
    $ bash setup.sh install
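
For case 2, a minimal sketch of how to check whether nouveau is loaded and how to blacklist it (the file path and module option are the commonly used ones, not taken from the replies above):

$ lsmod | grep nouveau
$ sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
$ sudo update-initramfs -u
$ sudo reboot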

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.