Problem configuring vGPU access using KubeVirt

1. Issue or feature description

We are trying to configure an OpenShift environment to use NVIDIA vGPU with the NVIDIA GPU Operator. We followed the steps described in the guide in the NVIDIA vGPU documentation.

I have a trial license from NVIDIA to use the vGPU software from their portal. I downloaded the "Linux KVM all supported" package, version 13.7 (should I install the RHEL drivers instead?). As described in the tutorial, I extracted NVIDIA-Linux-x86_64-470.182.02-vgpu-kvm.run and built the vGPU Manager image using the driver repository.
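
For reference, I built the vGPU Manager image roughly along the lines of the documented procedure, something like the following (the registry and OS tag values are placeholders for our internal values, and the exact paths may differ slightly):

$ git clone https://gitlab.com/nvidia/container-images/driver
$ cd driver/vgpu-manager/rhel8
$ cp /path/to/NVIDIA-Linux-x86_64-470.182.02-vgpu-kvm.run .
$ # PRIVATE_REGISTRY and OS_TAG are placeholders, not the actual values we used
$ docker build \
    --build-arg DRIVER_VERSION=470.182.02 \
    -t ${PRIVATE_REGISTRY}/vgpu-manager:470.182.02-${OS_TAG} .
$ docker push ${PRIVATE_REGISTRY}/vgpu-manager:470.182.02-${OS_TAG}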

I have also configured the ClusterPolicy, and all the pods related to the GPU Operator are in the Running state.
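
For context, the sandbox/vGPU part of the ClusterPolicy is essentially the following sketch (the repository value is a placeholder for our private registry; other fields are left at their defaults):

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  sandboxWorkloads:
    enabled: true
    defaultWorkload: vm-vgpu
  vgpuManager:
    enabled: true
    repository: <private-registry>   # placeholder
    image: vgpu-manager
    version: "470.182.02"
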
I configured CNV (KubeVirt) on the cluster and edited the HyperConverged CR to permit the mediated device so that VMs can use the GPU (all the steps as described in the NVIDIA GPU Operator guide).
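
The HyperConverged change is essentially the following sketch of what the guide describes (the mdevNameSelector value is my best reading of the device name matching the nvidia.com/GRID_A100D-40C resource):

apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  permittedHostDevices:
    mediatedDevices:
      - mdevNameSelector: "GRID A100D-40C"
        resourceName: "nvidia.com/GRID_A100D-40C"
        # the guide may also set externalResourceProvider: true when the
        # sandbox device plugin advertises the resource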

I am trying to deploy a new VM with RHEL 7.9 (according to the documentation, this guest OS is supported by driver version 13.7).

When I try to create the VM, configured as follows:

spec:
      domain:
        cpu:
          cores: 4
          sockets: 1
          threads: 1
        devices:
          disks:
            - disk:
                bus: virtio
              name: cloudinitdisk
            - bootOrder: 1
              disk:
                bus: virtio
              name: rootdisk
          gpus:
            - deviceName: nvidia.com/GRID_A100D-40C
              name: a100

It fails to start, and the following warnings appear in the VM events:

Generated from virt-handler
4 times in the last 1 minute
unknown error encountered sending command SyncVMI: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Generated from virt-handler
7 times in the last 0 minutes
failed to detect VMI pod: dial unix //pods/efab0ff2-c256-49e3-9068-61bfec42dc49/volumes/kubernetes.io~empty-dir/sockets/launcher-sock: connect: connection refused

In the VM virt-launcher pod I get the following errors:

{"component":"virt-launcher","level":"warning","msg":"PCI_RESOURCE_NVIDIA_COM_GRID_A100D-40C not set for resource nvidia.com/GRID_A100D-40C","pos":"addresspool.go:50",}
{"component":"virt-launcher","level":"error","msg":"Unable to read from monitor: Connection reset by peer","pos":"qemuMonitorIORead:495","subcomponent":"libvirt","thread":"91",}
{"component":"virt-launcher","level":"error","msg":"At least one cgroup controller is required: No such device or address","pos":"virCgroupDetectControllers:455","subcomponent":"libvirt","thread":"45",}
{"component":"virt-launcher","level":"info","msg":"Process 08e8d621-0fa7-5488-9dd4-70540b814b5e and pid 86 is gone!","pos":"monitor.go:148",}
{"component":"virt-launcher","level":"info","msg":"Waiting on final notifications to be sent to virt-handler.","pos":"virt-launcher.go:277","}
{"component":"virt-launcher","level":"info","msg":"Timed out waiting for final delete notification. Attempting to kill domain","pos":"virt-launcher.go:297","timestamp":"2023-05-14T09:36:59.481498Z"}

Finally, the virt-launcher pod fails, and so does the VM.

I can't see any related logs in the Operator pods; only the sandbox-device-plugin has a relevant log entry:

2023/05/14 09:01:21 In allocate
2023/05/14 09:01:21 Allocated devices map[MDEV_PCI_RESOURCE_NVIDIA_COM_GRID_A100D-40C:01b514e9-2afb-4bd8-a82d-755eb045885a]

The sandbox-device-plugin also showed this error when it started, but it is in the Running state:

2023/05/11 12:25:46 GRID_A100D-40C Device plugin server ready
2023/05/11 12:25:46 healthCheck(GRID_A100D-40C): invoked
2023/05/11 12:25:46 healthCheck(GRID_A100D-40C): Loading NVML
2023/05/11 12:25:46 healthCheck(GRID_A100D-40C): Failed to initialize NVML: could not load NVML library

On the host itself, using dmesg, I can see the following errors:

[] [nvidia-vgpu-vfio] a581675d-19eb-4f50-aa64-6aeb378c58c3: ERESTARTSYS received during open, waiting for 25000 milliseconds for operation to complete
[] [nvidia-vgpu-vfio] a581675d-19eb-4f50-aa64-6aeb378c58c3: start failed. status: 0x0 Timeout Occured

Running nvidia-smi vgpu from the nvidia-vgpu-manager-daemonset pod, within the openshift-driver-toolkit-ctr container:

$ nvidia-smi vgpu -q
GPU 00000000:82:00.0
    Active vGPUs                      : 0

$ nvidia-smi vgpu
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.02             Driver Version: 470.182.02                |
|---------------------------------+------------------------------+------------+
| GPU  Name                       | Bus-Id                       | GPU-Util   |
|      vGPU ID     Name           | VM ID     VM Name            | vGPU-Util  |
|=================================+==============================+============|
|   0  NVIDIA A10                 | 00000000:82:00.0             |   0%       |
+---------------------------------+------------------------------+------------+

2. Environment details

  1. OpenShift version 4.10
  2. Bare metal node with A100 GPU
  3. GPU operator version: 22.9.1
  4. OpenShift Virtualization: 4.10.1
  5. vGPU driver version 13.7

3. Steps to reproduce the issue

  1. Deploy an OpenShift 4.10 cluster on bare metal nodes
  2. Configure the GPU Operator with the vGPU configuration
  3. Configure the CNV operator
  4. Create a VM that requests the new GPU hardware resource.

Do you have any suggestions as to why this behavior happens?
Let me know if I can provide any additional information.