Hi,
I am trying to deploy the GPU Operator, but the deployment fails because pods are unable to pull images. I also tried specific versions instead of latest and hit the same issue.
kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS             RESTARTS        AGE
gpu-feature-discovery-wdv4c                                       0/1     Init:0/1           0               2m31s
gpu-operator-1761617510-node-feature-discovery-gc-595bfd845zwph   1/1     Running            1 (3m21s ago)   3d13h
gpu-operator-1761617510-node-feature-discovery-master-59d8shxxv   1/1     Running            1 (3m21s ago)   3d13h
gpu-operator-1761617510-node-feature-discovery-worker-rsh7v       1/1     Running            1 (3m21s ago)   3d13h
gpu-operator-5c5f48f846-h9w6c                                     1/1     Running            1 (3m21s ago)   3d13h
nvidia-container-toolkit-daemonset-2rsjb                          0/1     Init:0/1           0               2m31s
nvidia-dcgm-exporter-n2m99                                        0/1     Init:0/1           0               2m31s
nvidia-device-plugin-daemonset-dxjdm                              0/1     Init:0/1           0               2m31s
nvidia-driver-daemonset-zmst7                                     0/1     ImagePullBackOff   0               3d13h
nvidia-operator-validator-sk5lm                                   0/1     Init:0/4           0               2m31s
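To pick out only the pods that are not Running from a listing like the one above, a small filter can help. This is a sketch that assumes the default `kubectl get pods` column order (NAME, READY, STATUS, ...); the function name `not_running` is just a hypothetical helper:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print NAME and STATUS for every pod whose STATUS
# is not "Running". Reads `kubectl get pods` output on stdin and skips
# the header line. Assumes STATUS is the third whitespace-separated field.
not_running() {
  awk 'NR > 1 && $3 != "Running" { print $1, $3 }'
}

# Usage (requires kubectl access to the cluster):
#   kubectl get pods -n gpu-operator | not_running
```

On the listing above this leaves the driver pod in ImagePullBackOff plus the pods waiting in Init; only the driver pod itself reports an image pull error.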
> kubectl describe pod nvidia-driver-daemonset-zmst7 -n gpu-operator
Normal   Pulling  3m36s (x4 over 4m56s)  kubelet  Pulling image "nvcr.io/nvidia/driver:580.95.05-rocky8.10"
Warning  Failed   3m36s (x4 over 4m56s)  kubelet  Failed to pull image "nvcr.io/nvidia/driver:580.95.05-rocky8.10": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/driver:580.95.05-rocky8.10": failed to resolve reference "nvcr.io/nvidia/driver:580.95.05-rocky8.10": nvcr.io/nvidia/driver:580.95.05-rocky8.10: not found
Warning  Failed   3m36s (x4 over 4m56s)  kubelet  Error: ErrImagePull
Warning  Failed   3m25s (x4 over 4m32s)  kubelet  Error: ImagePullBackOff
Normal   BackOff  24s (x17 over 4m32s)   kubelet  Back-off pulling image "nvcr.io/nvidia/driver:580.95.05-rocky8.10"
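The requested tag looks like it is built from the driver version plus the node's OS ID and VERSION_ID (580.95.05 + rocky + 8.10). That naming pattern is an assumption inferred from the tag in the error, not taken from the operator's documentation. The sketch below (hypothetical helper `driver_image_ref`) reproduces the pattern from an os-release file so the tag can be compared with what the node reports:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the driver image reference that the failing
# pull suggests, i.e. nvcr.io/nvidia/driver:<driver-version>-<ID><VERSION_ID>.
# Assumption: this pattern is inferred from "580.95.05-rocky8.10" in the log.
driver_image_ref() {
  local driver_version=$1
  local os_release=${2:-/etc/os-release}
  # Source os-release in a subshell so ID/VERSION_ID don't leak into the caller.
  (
    . "$os_release"
    printf 'nvcr.io/nvidia/driver:%s-%s%s\n' "$driver_version" "$ID" "$VERSION_ID"
  )
}
```

Run on the GPU node as `driver_image_ref 580.95.05`. The NotFound error only says the registry has no image under that exact tag; it does not show which tags do exist.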
Here are the Helm commands I used (first with a pinned chart version, then with latest):
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --version=v25.10.0 --set driver.enabled=true --set toolkit.enabled=true --set devicePlugin.enabled=true
kubectl -n rag create secret generic ngc-credentials --from-literal=NGC_API_KEY='<license key redacted>'
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --version=latest --set driver.enabled=true --set toolkit.enabled=true --set devicePlugin.enabled=true
kubectl -n rag create secret generic ngc-credentials --from-literal=NGC_API_KEY='<license key redacted>'