vGPU pods stuck after installation

Hi,

I’m trying to use the GPU Operator with vGPU support on k3s, following this article. After I install the operator, the vGPU pods get stuck in the Init state, and the vGPU manager pod eventually goes into CrashLoopBackOff. I haven’t been able to find the root cause or a similar issue in the forum or the GitHub issues yet. I can provide more outputs from the host if requested. Any kind of help is appreciated.

  • The server is vGPU certified. (Supermicro 1029U-TR4 w/ two NVIDIA T4 GPUs)
  • SR-IOV is enabled. (BIOS)
  • VT-d is enabled. (BIOS)
  • intel_iommu is enabled. (/etc/default/grub; verification commands right after this list)
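
In case it helps, these are the commands I’d run on the host to double-check the IOMMU/SR-IOV state; nothing here is specific to my setup beyond the NVIDIA PCI vendor ID 10de:

cat /proc/cmdline | grep -o intel_iommu=on   # confirm the kernel parameter actually took effect
dmesg | grep -i -e DMAR -e IOMMU             # IOMMU initialization messages
lspci -d 10de: -vvv | grep -i sr-iov         # SR-IOV capability on the NVIDIA devices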

Pod status in the gpu-operator namespace:

gpu-operator   gpu-operator-1727453251-node-feature-discovery-gc-854cf464lp2ck   1/1     Running     0                53m
gpu-operator   gpu-operator-1727453251-node-feature-discovery-master-8656xbdjm   1/1     Running     0                53m
gpu-operator   gpu-operator-1727453251-node-feature-discovery-worker-25mqp       1/1     Running     0                53m
gpu-operator   gpu-operator-84c6b4697b-hlshg                                     1/1     Running     0                53m
gpu-operator   nvidia-sandbox-device-plugin-daemonset-8vp72                      1/1     Running     0                50m
gpu-operator   nvidia-sandbox-validator-zw6w7                                    1/1     Running     0                53m
gpu-operator   nvidia-vgpu-device-manager-zddgb                                  0/1     Init:0/1    0                40m
gpu-operator   nvidia-vgpu-manager-daemonset-4cm5s                               0/1     Init:0/1    12 (5m42s ago)   49m
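
If it’s useful, I can also share the output of these; the pod names come from the listing above, everything else is plain kubectl:

kubectl describe pod nvidia-vgpu-manager-daemonset-4cm5s -n gpu-operator      # events and init container status
kubectl describe pod nvidia-vgpu-device-manager-zddgb -n gpu-operator
kubectl get events -n gpu-operator --sort-by=.metadata.creationTimestamp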

When I check the allocatable resources on the node, I can see the vGPU device that I’m trying to use:

allocatable:
  cpu: "80"
  ephemeral-storage: "4411267110320"
  hugepages-1Gi: "0"
  hugepages-2Mi: 2Gi
  memory: 261654520Ki
  nvidia.com/GRID_T4-2Q: "1"
  pods: "110"
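
For reference, this is roughly how I read the allocatable resources; the node name robolaunch-internal appears in the driver-manager log further below:

kubectl get node robolaunch-internal -o jsonpath='{.status.allocatable}'
kubectl describe node robolaunch-internal | grep -A 10 Allocatable   # human-readable view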

Here is my installation command. I disabled the driver and the toolkit because they are already installed on the host.

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set sandboxWorkloads.enabled=true \
  --set driver.enabled=false \
  --set toolkit.enabled=false \
  --set vgpuManager.enabled=true \
  --set vgpuManager.repository=${PRIVATE_REGISTRY} \
  --set vgpuManager.image=vgpu-manager \
  --set vgpuManager.version=550.90.05 \
  --set vgpuManager.imagePullSecrets={${REGISTRY_SECRET_NAME}}
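
To rule out a values problem on my side, I can also dump what the chart actually received; the release name below is inferred from the node-feature-discovery pod names above:

helm list -n gpu-operator
helm get values gpu-operator-1727453251 -n gpu-operator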

Logs of the crashing/stuck pods

kubectl logs -f nvidia-vgpu-manager-daemonset-4cm5s -n gpu-operator -c k8s-driver-manager
NVIDIA GPU driver is already pre-installed on the node, disabling the containerized driver on the node
node/robolaunch-internal labeled

kubectl logs -f nvidia-vgpu-device-manager-zddgb -n gpu-operator -c vgpu-manager-validation
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
(the same message keeps repeating)
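
I haven’t pasted the main container logs yet because I wasn’t sure of the container names; this is how I’d list them and grab everything in one go (plain kubectl, nothing assumed beyond the pod name):

kubectl get pod nvidia-vgpu-manager-daemonset-4cm5s -n gpu-operator -o jsonpath='{.spec.initContainers[*].name} {.spec.containers[*].name}'
kubectl logs nvidia-vgpu-manager-daemonset-4cm5s -n gpu-operator --all-containers --prefix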

Verification of vGPU creation

mdevctl list
16d7dda2-f888-4c28-9c3c-2352daa88a8c 0000:af:00.0 nvidia-231 (defined)
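
For completeness, this is how I’d double-check that nvidia-231 really corresponds to the GRID T4-2Q profile (mdevctl types comes from the same mdevctl package as the listing above):

mdevctl types | grep -B 1 -A 4 nvidia-231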

Host driver

It’s installed using the host-driver .deb package downloaded from the NVIDIA Licensing Portal (NLP - Software Download), with the command sudo apt install ./nvidia-vgpu-ubuntu-550_550.90.05_amd64.deb.
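
On the host side, I can also check the services and kernel modules that this .deb is supposed to install; the service names below are my understanding of the vGPU host driver packaging, so please correct me if they differ:

systemctl status nvidia-vgpud nvidia-vgpu-mgr --no-pager
lsmod | grep -i nvidia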

nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.05              Driver Version: 550.90.05      CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:AF:00.0 Off |                  Off |
| N/A   53C    P8             18W /   70W |      97MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       On  |   00000000:D8:00.0 Off |                  Off |
| N/A   54C    P8             17W /   70W |      97MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+