Hi there,
Previously, with nvidia-driver-535-server* (535.247.01-0ubuntu0.22.04.1) on the Ubuntu GPU node (Ubuntu 22.04.5 LTS, GNU/Linux 6.8.0-47-generic x86_64), our k8s cluster could schedule GPU pods without issues. We had configured containerd and dockerd and deployed gpu-operator without the driver container, relying instead on the driver installed on the GPU node.
To upgrade the NVIDIA driver, I uninstalled the driver mentioned above using the following command from the docs:
apt remove --autoremove --purge -V nvidia-driver* libxnvctrl*
I then installed the new driver as described in the NVIDIA Driver Installation Guide (r575 documentation).
I tried both the open and the proprietary kernel modules. In both cases the node was recognized in the cluster again after rebooting, and the gpu-operator validators ran through without a problem. Pending pods that use GPUs also ran initially. My MIG config is applied correctly as well, and the MIG devices are created and exposed (the NVIDIA Device Plugin throws no errors).
However, once the pods complete, they try to release the GPUs but fail, and the pods stay in Terminating state indefinitely. Deleting them does not work: the container no longer exists in Docker, yet the GPUs are still not released for other pods.
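One way to check whether something on the node still holds the GPU device nodes after the container is gone is to scan the open file descriptors under /proc. The following is only a minimal sketch: `dev_holders` is a hypothetical helper name, it assumes a Linux node with /proc mounted, and it needs root to see other users' processes.

```shell
# Hypothetical helper: list processes that hold device nodes matching a pattern.
# "$1" is a shell pattern (default: /dev/nvidia*, which also covers /dev/nvidia-caps/*).
# Assumption: run as root, otherwise other users' /proc/<pid>/fd is not readable.
dev_holders() {
  pattern=${1:-/dev/nvidia*}
  for fd in /proc/[0-9]*/fd/*; do
    # Resolve the file descriptor symlink; skip entries that vanished or are unreadable.
    target=$(readlink "$fd" 2>/dev/null) || continue
    case "$target" in
      $pattern)
        pid=${fd#/proc/}
        pid=${pid%%/*}
        comm=$(cat "/proc/$pid/comm" 2>/dev/null) || comm='?'
        printf '%s\t%s\t%s\n' "$pid" "$comm" "$target"
        ;;
    esac
  done | sort -u
}
```

Called without arguments on the GPU node, it prints one line per PID, command name, and device node. An empty result would suggest that no user-space process still has the devices open, which would point the search towards kubelet or the device plugin rather than a leftover workload process.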
New pods are affected as well: when I schedule another pod that requests GPU resources (the GPUs are listed as allocatable on the node), the pod gets assigned to this node but stays in Pending state indefinitely, even though enough GPUs are unused.
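To compare what the node advertises with what is actually scheduled on it, something along these lines could be used. This is a sketch, not output from our cluster: `node_gpu_summary` is a hypothetical helper, the node name is a placeholder, and it assumes kubectl is configured for the cluster.

```shell
# Hypothetical helper: show what a node advertises and which pods are bound to it.
# "$1" = node name (placeholder, e.g. gpu-node-1).
node_gpu_summary() {
  node=$1
  command -v kubectl >/dev/null 2>&1 || { echo "kubectl not found" >&2; return 1; }
  # Capacity and allocatable maps; with mig.strategy=mixed the MIG resources
  # are expected to show up under nvidia.com/mig-* names.
  kubectl get node "$node" -o jsonpath='{.status.capacity}{"\n"}' || return 1
  kubectl get node "$node" -o jsonpath='{.status.allocatable}{"\n"}' || return 1
  # All pods bound to this node, including ones stuck in Terminating or Pending.
  kubectl get pods -A --field-selector "spec.nodeName=$node" -o wide
}
```

If the allocatable counts look correct but pods bound to the node never start, the scheduler has done its part and the problem is more likely on the kubelet, device plugin, or runtime side.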
In short, we cannot schedule pods with GPUs, and we get no errors or logs that show what to fix. The strange part is that it works initially after rebooting the node or restarting Docker, but then stops working, either once the gpu-operator validators have finished or after some amount of time. We do not know why, or which component is failing.
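For completeness, these are the places where an error would normally be expected to show up. The sketch below gathers them into one directory; `collect_gpu_debug` is a hypothetical helper, the pod name, namespace, and output path are placeholders, and the systemd unit names `kubelet` and `containerd` are assumptions about the node setup. Every step is skipped if the tool is missing and does not abort on failure.

```shell
# Hypothetical helper: collect state that may explain stuck GPU pods.
# "$1" = pod name, "$2" = namespace, "$3" = output directory (all placeholders).
collect_gpu_debug() {
  pod=$1; ns=$2; out=${3:-./gpu-debug}
  mkdir -p "$out" || return 1
  if command -v kubectl >/dev/null 2>&1; then
    # Pod status, conditions, and recent events in the namespace.
    kubectl --request-timeout=10s describe pod "$pod" -n "$ns" >"$out/pod-describe.txt" 2>&1 || true
    kubectl --request-timeout=10s get events -n "$ns" --sort-by=.lastTimestamp >"$out/events.txt" 2>&1 || true
  fi
  if command -v journalctl >/dev/null 2>&1; then
    # Assumption: kubelet and containerd run as systemd units with these names.
    journalctl -u kubelet --since "1 hour ago" --no-pager >"$out/kubelet.log" 2>&1 || true
    journalctl -u containerd --since "1 hour ago" --no-pager >"$out/containerd.log" 2>&1 || true
  fi
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi >"$out/nvidia-smi.txt" 2>&1 || true
  fi
  if command -v dmesg >/dev/null 2>&1; then
    # Kernel messages from the NVIDIA driver (NVRM lines, Xid errors).
    dmesg 2>/dev/null | grep -i -E 'NVRM|Xid' >"$out/dmesg-nvidia.txt" || true
  fi
  echo "$out"
}
```

A call such as `collect_gpu_debug my-gpu-pod default /tmp/gpu-debug` on the GPU node prints the output directory once it has finished. If `nvidia-smi` itself hangs at this point, that alone would be a useful data point.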
If necessary, I can include the nvidia-bug-report output, but here are the relevant tool versions:
Docker and the NVIDIA Container Toolkit are up to date.
Kubernetes Version: v1.30.4
This is how I installed gpu-operator:
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.3.0 \
  --set mig.strategy=mixed \
  --set nfd.enabled=false \
  --set driver.enabled=false \
  --set toolkit.enabled=false \
  -f values.yaml
where values.yaml includes the necessary tolerations.
NVIDIA Device Plugin: nvcr.io/nvidia/k8s-device-plugin:v0.17.1
These are the currently installed NVIDIA driver details:
apt list --installed | grep nvidia
libnvidia-cfg1-575/unknown,now 575.57.08-0ubuntu1 amd64 [installed,automatic]
libnvidia-common-575/unknown,now 575.57.08-0ubuntu1 all [installed,automatic]
libnvidia-compute-575/unknown,now 575.57.08-0ubuntu1 amd64 [installed]
libnvidia-container-tools/unknown,now 1.17.8-1 amd64 [installed,automatic]
libnvidia-container1/unknown,now 1.17.8-1 amd64 [installed,automatic]
libnvidia-decode-575/unknown,now 575.57.08-0ubuntu1 amd64 [installed,automatic]
libnvidia-encode-575/unknown,now 575.57.08-0ubuntu1 amd64 [installed,automatic]
libnvidia-extra-575/unknown,now 575.57.08-0ubuntu1 amd64 [installed,automatic]
libnvidia-fbc1-575/unknown,now 575.57.08-0ubuntu1 amd64 [installed,automatic]
libnvidia-gl-575/unknown,now 575.57.08-0ubuntu1 amd64 [installed,automatic]
libnvidia-gpucomp-575/unknown,now 575.57.08-0ubuntu1 amd64 [installed,automatic]
linux-signatures-nvidia-6.8.0-47-generic/jammy-updates,jammy-security,now 6.8.0-47.47~22.04.1+1 amd64 [installed,automatic]
nvidia-compute-utils-575/unknown,now 575.57.08-0ubuntu1 amd64 [installed,automatic]
nvidia-container-runtime/unknown,now 3.13.0-1 all [installed]
nvidia-container-toolkit-base/unknown,now 1.17.8-1 amd64 [installed,automatic]
nvidia-container-toolkit/unknown,now 1.17.8-1 amd64 [installed]
nvidia-dkms-575/unknown,now 575.57.08-0ubuntu1 amd64 [installed]
nvidia-driver-575/unknown,now 575.57.08-0ubuntu1 amd64 [installed]
nvidia-firmware-575/unknown,now 575.57.08-0ubuntu1 amd64 [installed,automatic]
nvidia-kernel-common-575/unknown,now 575.57.08-0ubuntu1 amd64 [installed,automatic]
nvidia-kernel-source-575/unknown,now 575.57.08-0ubuntu1 amd64 [installed,automatic]
nvidia-modprobe/unknown,now 575.57.08-0ubuntu1 amd64 [installed,automatic]
nvidia-persistenced/unknown,now 575.57.08-1ubuntu1 amd64 [installed,automatic]
nvidia-settings/unknown,now 575.57.08-0ubuntu1 amd64 [installed,automatic]
nvidia-utils-575/unknown,now 575.57.08-0ubuntu1 amd64 [installed,automatic]
xserver-xorg-video-nvidia-575/unknown,now 575.57.08-0ubuntu1 amd64 [installed,automatic]
Any help or hint would be highly appreciated.