TAO Toolkit 4.0 setup issue - similar to a previous issue

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
I am using an RTX 3080 GPU

I am having a similar issue to the one in this thread: TAO Toolkit 4.0 setup issue

I attempted the nouveau blacklist fix mentioned in the previous thread, but it didn't resolve the problem. There also appears to be some missing information on how I can collect logs from the pod that failed to init.
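
For reference, these are the commands I am aware of for pulling logs from the failing init container (the pod and container names are taken from the describe output that follows; they will differ on another cluster), so please let me know if something else is needed:

$ kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-4fl5k -c k8s-driver-manager
$ kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-4fl5k -c k8s-driver-manager --previous
$ kubectl get events -n nvidia-gpu-operator --sort-by=.lastTimestamp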

This is the output of the following command for the failed pod:

$ kubectl describe pod -n nvidia-gpu-operator nvidia-driver-daemonset-4fl5k

Name: nvidia-driver-daemonset-4fl5k
Namespace: nvidia-gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Node: pmsi-tao/10.1.40.26
Start Time: Tue, 31 Jan 2023 15:29:17 -0500
Labels: app=nvidia-driver-daemonset
controller-revision-hash=589ff6c946
pod-template-generation=1
Annotations: cni.projectcalico.org/containerID: 24b288f073ba68db36601cf9771b818b3d82270de043eca91cb9a5efbe5b6af8
cni.projectcalico.org/podIP: 192.168.35.187/32
cni.projectcalico.org/podIPs: 192.168.35.187/32
Status: Pending
IP: 192.168.35.187
IPs:
IP: 192.168.35.187
Controlled By: DaemonSet/nvidia-driver-daemonset
Init Containers:
k8s-driver-manager:
Container ID: containerd://8b13234a3a23b093776430e2175490a4faf44480d868050d7d69598ceb77d4b1
Image: nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.3.0
Image ID: nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5b16056257acc51b517d9cdb1da3218693cefc214af93789e6e214fd2b4cacf1
Port:
Host Port:
Command:
driver-manager
Args:
uninstall_driver
State: Running
Started: Tue, 31 Jan 2023 15:33:07 -0500
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 31 Jan 2023 15:31:31 -0500
Finished: Tue, 31 Jan 2023 15:31:41 -0500
Ready: False
Restart Count: 5
Environment:
NODE_NAME: (v1:spec.nodeName)
NVIDIA_VISIBLE_DEVICES: void
ENABLE_AUTO_DRAIN: true
DRAIN_USE_FORCE: false
DRAIN_POD_SELECTOR_LABEL:
DRAIN_TIMEOUT_SECONDS: 0s
DRAIN_DELETE_EMPTYDIR_DATA: false
OPERATOR_NAMESPACE: nvidia-gpu-operator (v1:metadata.namespace)
Mounts:
/run/nvidia from run-nvidia (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jgcjn (ro)
Containers:
nvidia-driver-ctr:
Container ID:
Image: nvcr.io/nvidia/driver:510.47.03-ubuntu20.04
Image ID:
Port:
Host Port:
Command:
nvidia-driver
Args:
init
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment:
Mounts:
/dev/log from dev-log (rw)
/host-etc/os-release from host-os-release (ro)
/run/mellanox/drivers from run-mellanox-drivers (rw)
/run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
/run/nvidia from run-nvidia (rw)
/run/nvidia-topologyd from run-nvidia-topologyd (rw)
/var/log from var-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jgcjn (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: DirectoryOrCreate
var-log:
Type: HostPath (bare host directory volume)
Path: /var/log
HostPathType:
dev-log:
Type: HostPath (bare host directory volume)
Path: /dev/log
HostPathType:
host-os-release:
Type: HostPath (bare host directory volume)
Path: /etc/os-release
HostPathType:
run-nvidia-topologyd:
Type: HostPath (bare host directory volume)
Path: /run/nvidia-topologyd
HostPathType: DirectoryOrCreate
mlnx-ofed-usr-src:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers/usr/src
HostPathType: DirectoryOrCreate
run-mellanox-drivers:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers
HostPathType: DirectoryOrCreate
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
kube-api-access-jgcjn:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.driver=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  3m52s                 default-scheduler  Successfully assigned nvidia-gpu-operator/nvidia-driver-daemonset-4fl5k to pmsi-tao
  Normal   Pulled     98s (x5 over 3m52s)   kubelet            Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.3.0" already present on machine
  Normal   Created    98s (x5 over 3m52s)   kubelet            Created container k8s-driver-manager
  Normal   Started    98s (x5 over 3m52s)   kubelet            Started container k8s-driver-manager
  Warning  BackOff    60s (x10 over 3m24s)  kubelet            Back-off restarting failed container

Please run
$ bash setup.sh uninstall
$ kubectl delete crd clusterpolicies.nvidia.com
$ bash setup.sh install
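
To confirm that the uninstall and the CRD deletion actually completed before running the install again, something like the following can be checked (namespace and CRD name as used earlier in this thread):

$ kubectl get pods -n nvidia-gpu-operator
$ kubectl get crd | grep -i nvidia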

I attempted those steps as well, same result.

Can you upload all the commands and logs via the method shown in the image?
[image]
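
If it helps, one way to capture the requested output into files for upload is sketched below (the file names are just examples):

$ kubectl get pods -A > pods.log 2>&1
$ kubectl describe pod -n nvidia-gpu-operator nvidia-driver-daemonset-4fl5k > describe-driver-pod.log 2>&1
$ kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-4fl5k -c k8s-driver-manager --previous > driver-manager.log 2>&1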

There has been no update from you for a while, so we are assuming this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

Hi,
Please check the following cases and to-dos.

  1. If you are running on bare metal with nothing else installed on the system, please run
    $ bash setup.sh install

  2. If nouveau is installed and in use, run the commands below to get logs for debugging (a sketch for checking and blacklisting nouveau is included after this list).
    $ kubectl get pods
    $ kubectl get pod -n nvidia-gpu-operator
    $ kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-*****
    $ kubectl get pod -n nvidia-gpu-operator nvidia-cuda-validator-****
    The previous solution is:
    $ bash setup.sh uninstall
    $ kubectl delete crd clusterpolicies.nvidia.com
    $ sudo reboot
    $ bash setup.sh install

  3. If an NVIDIA GPU driver is pre-installed,
    run the commands below to uninstall the GPU driver.
    $ sudo apt purge nvidia-driver-*
    $ sudo apt autoremove
    $ sudo apt autoclean
    then,
    $ bash setup.sh uninstall
    $ kubectl delete crd clusterpolicies.nvidia.com
    $ bash setup.sh install
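
For case 2, a minimal sketch of how to check whether nouveau is loaded and how to blacklist it (the file path and module option are the commonly used ones, not taken from the replies above):

$ lsmod | grep nouveau
$ sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
$ sudo update-initramfs -u
$ sudo reboot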

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.