Hi,
I’m trying to use the GPU Operator with vGPU support on k3s, following this article. After I install the operator, the vGPU pods get stuck in the Init state, and then the vGPU manager pod goes into CrashLoopBackOff. I couldn’t find the root cause or a similar issue in the forum or the GitHub issues yet. I can provide more outputs from the host if requested. Any help is appreciated.
- The server is vGPU certified. (Supermicro 1029U-TR4 w/ two NVIDIA T4 GPUs)
- SR-IOV is enabled. (BIOS)
- VT-d is enabled. (BIOS)
- intel_iommu is enabled. (/etc/default/grub; the relevant kernel command line fragment is shown right after this list)
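For reference, the IOMMU part is configured roughly like this (the GRUB_CMDLINE_LINUX value below is only the relevant fragment, not my full line, and the verification commands are what I can re-run and share):

# /etc/default/grub (relevant fragment only)
GRUB_CMDLINE_LINUX="intel_iommu=on"

# verify the running kernel actually has it
grep intel_iommu /proc/cmdline

# confirm IOMMU groups were created
ls /sys/kernel/iommu_groups/ | head

After the operator install, the pods look like this: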
gpu-operator gpu-operator-1727453251-node-feature-discovery-gc-854cf464lp2ck 1/1 Running 0 53m
gpu-operator gpu-operator-1727453251-node-feature-discovery-master-8656xbdjm 1/1 Running 0 53m
gpu-operator gpu-operator-1727453251-node-feature-discovery-worker-25mqp 1/1 Running 0 53m
gpu-operator gpu-operator-84c6b4697b-hlshg 1/1 Running 0 53m
gpu-operator nvidia-sandbox-device-plugin-daemonset-8vp72 1/1 Running 0 50m
gpu-operator nvidia-sandbox-validator-zw6w7 1/1 Running 0 53m
gpu-operator nvidia-vgpu-device-manager-zddgb 0/1 Init:0/1 0 40m
gpu-operator nvidia-vgpu-manager-daemonset-4cm5s 0/1 Init:0/1 12 (5m42s ago) 49m
When I check the allocatable resources on the node, I can see the vGPU device that I am trying to use, as shown below.
allocatable:
cpu: "80"
ephemeral-storage: "4411267110320"
hugepages-1Gi: "0"
hugepages-2Mi: 2Gi
memory: 261654520Ki
nvidia.com/GRID_T4-2Q: "1"
pods: "110"
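(For reference, the allocatable block above is taken from the node object; it can be reproduced with something like the command below, using the node name that appears in the driver-manager log further down.)

kubectl get node robolaunch-internal -o yaml | grep -A 8 'allocatable:'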
Here is my installation command. I disabled the driver and the toolkit because they are already installed on the host.
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set sandboxWorkloads.enabled=true \
--set driver.enabled=false \
--set toolkit.enabled=false \
--set vgpuManager.enabled=true \
--set vgpuManager.repository=${PRIVATE_REGISTRY} \
--set vgpuManager.image=vgpu-manager \
--set vgpuManager.version=550.90.05 \
--set vgpuManager.imagePullSecrets={${REGISTRY_SECRET_NAME}}
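In case it helps, these are commands I can run to confirm the chart values were actually applied (release name taken from the pod names above; the ClusterPolicy resource is created by the chart):

# values the release was installed with
helm get values gpu-operator-1727453251 -n gpu-operator

# vGPU manager section of the rendered ClusterPolicy
kubectl get clusterpolicies.nvidia.com -o yaml | grep -B 2 -A 6 'vgpuManager:'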
Logs of the crashing/stuck pods
kubectl logs -f nvidia-vgpu-manager-daemonset-4cm5s -n gpu-operator -c k8s-driver-manager
NVIDIA GPU driver is already pre-installed on the node, disabling the containerized driver on the node
node/robolaunch-internal labeled
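Since this init container keeps restarting, the log above may be from a fresh attempt; I can also share the previous attempt and the pod events (commands for reference, pod name from the listing above):

kubectl logs nvidia-vgpu-manager-daemonset-4cm5s -n gpu-operator -c k8s-driver-manager --previous
kubectl describe pod nvidia-vgpu-manager-daemonset-4cm5s -n gpu-operator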
kubectl logs -f nvidia-vgpu-device-manager-zddgb -n gpu-operator -c vgpu-manager-validation
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
(the same message keeps repeating)
Verification of vGPU creation
mdevctl list
16d7dda2-f888-4c28-9c3c-2352daa88a8c 0000:af:00.0 nvidia-231 (defined)
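To cross-check the profile, the supported mdev types on that PCI device can be listed as well (sysfs path built from the bus address above; nvidia-231 is the type used in the mdev definition, and if the path differs on this setup, mdevctl types should still show it):

mdevctl types
ls /sys/bus/pci/devices/0000:af:00.0/mdev_supported_types/
cat /sys/bus/pci/devices/0000:af:00.0/mdev_supported_types/nvidia-231/name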
Host driver
It’s installed using the .deb file (Host Drivers) downloaded from the NVIDIA Licensing Portal (NLP - Software Download), with the command sudo apt install ./nvidia-vgpu-ubuntu-550_550.90.05_amd64.deb
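For completeness, the host-side pieces that ship with that package can also be checked; I can attach the output of these (service and module names as I understand them for the KVM vGPU host driver, so correct me if they differ):

systemctl status nvidia-vgpud.service nvidia-vgpu-mgr.service
lsmod | grep nvidia_vgpu_vfio
dmesg | grep -i vgpu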
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.05 Driver Version: 550.90.05 CUDA Version: N/A |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 On | 00000000:AF:00.0 Off | Off |
| N/A 53C P8 18W / 70W | 97MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla T4 On | 00000000:D8:00.0 Off | Off |
| N/A 54C P8 17W / 70W | 97MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
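If the vGPU-specific queries from the host driver would help, I can attach that output as well:

nvidia-smi vgpu
nvidia-smi vgpu -q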