vGPU timeout status 0x65, VFIO error, QEMU/KVM RHEL7.6

cal.rubbo · February 14, 2019, 4:12pm

I have created a vGPU with UUID def87179-9c53-42d7-b224-a5d281037b84. The license server is running, and I’ve provided GRID-Virtual App and QUADRO-DWS resources to the MAC address of the VM.

I get the following output when I try to start my VM:

[root@instance-1 ~]# dmesg
[nvidia-vgpu-vfio] def87179-9c53-42d7-b224-a5d281037b84: start failed. status: 0x65 Timeout Occured

[root@instance-1 ~]# virsh start win10_1
error: Failed to start domain win10_1
error: internal error: process exited while connecting to monitor: Verify all devices in group 0 are bound to vfio-pci or pci-stub and not already in use
2019-02-13T15:11:50.129364Z qemu-kvm: -device vfio-pci,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/def87179-9c53-42d7-b224-a5d281037b84,display=off,bus=pci.0,addr=0x8: vfio: failed to get device def87179-9c53-42d7-b224-a5d281037b84
2019-02-13T15:11:50.129455Z qemu-kvm: -device vfio-pci,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/def87179-9c53-42d7-b224-a5d281037b84,display=off,bus=pci.0,addr=0x8: Device initialization failed.
2019-02-13T15:11:50.129479Z qemu-kvm: -device vfio-pci,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/def87179-9c53-42d7-b224-a5d281037b84,display=off,bus=pci.0,addr=0x8: Device 'vfio-pci' could not be initialized

I’ve tried GRID P100-1Q, P100-16Q, and P100-1A vGPUs with the same results. Further, while I can see the device’s UUID listed under mdev/devices, nvidia-smi reports the following:

[root@instance-1 ~]# nvidia-smi vgpu -q
GPU 00000000:00:04.0
    Active vGPUs              : 0

[root@instance-1 ~]# nvidia-smi vgpu -c
GPU 00000000:00:04.0
    GRID P100-1Q

I am running qemu-kvm version 1.5.3 and RHEL 7.6 with kernel 3.10.0-957.el7.x86_64. Here’s the relevant portion of my VM’s XML file:

<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
  <source>
    <address uuid='def87179-9c53-42d7-b224-a5d281037b84'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
</hostdev>
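
For anyone following along, here is a minimal sketch of how one might confirm which vGPU type actually backs an mdev UUID by reading sysfs directly. It assumes the standard kernel mdev layout, where /sys/bus/mdev/devices/<uuid>/mdev_type links to a type directory containing a name file. The function name mdev_type_name and the MDEV_ROOT override are hypothetical; the override exists only so the function can be exercised without a GPU.

```shell
#!/bin/sh
# Sketch only: print the mdev type name for a UUID, or "missing"/"unknown".
# MDEV_ROOT is a hypothetical override; by default the real sysfs path is used.
mdev_type_name() {
    uuid="$1"
    root="${MDEV_ROOT:-/sys/bus/mdev/devices}"
    dev="$root/$uuid"

    # The device directory only exists after the mdev has been created.
    if [ ! -e "$dev" ]; then
        echo "missing"
        return 1
    fi

    # mdev_type points at the supported-type directory; its "name" file
    # (assumed to be provided by the vendor driver) holds the readable name.
    if [ -r "$dev/mdev_type/name" ]; then
        cat "$dev/mdev_type/name"
    else
        echo "unknown"
    fi
}
```

On the host this would be called as mdev_type_name def87179-9c53-42d7-b224-a5d281037b84. A result of "missing" would suggest the mdev was never created or did not survive a reboot.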

sschaber · February 18, 2019, 7:30am

Hi,

Did you disable ECC memory on the P100?

Regards
Simon

cal.rubbo · February 18, 2019, 11:56am

Yes, I forgot to mention that! I did disable ECC memory.
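
A rough way to double-check the ECC state, sketched under the assumption that nvidia-smi on the host supports the ecc.mode.current and ecc.mode.pending query fields. The helper name ecc_state is hypothetical. It reads "current, pending" lines on stdin and flags the case where ECC was switched off but the change has not taken effect yet.

```shell
#!/bin/sh
# Sketch only: classify ECC state from lines of the form "current, pending",
# e.g. as produced by:
#   nvidia-smi --query-gpu=ecc.mode.current,ecc.mode.pending --format=csv,noheader
ecc_state() {
    while IFS=, read -r current pending; do
        # Strip the padding spaces that follow the CSV comma.
        current=$(echo "$current" | tr -d ' ')
        pending=$(echo "$pending" | tr -d ' ')

        if [ "$current" != "$pending" ]; then
            # The mode was changed but is not active yet.
            echo "reboot-pending"
        elif [ "$current" = "Enabled" ]; then
            echo "ecc-enabled"
        elif [ "$current" = "Disabled" ]; then
            echo "ecc-disabled"
        else
            # Anything else (for example a GPU without ECC) is left unclassified.
            echo "unknown"
        fi
    done
}
```

Piping the nvidia-smi query above into ecc_state prints one result per GPU. If it prints "reboot-pending", the ECC change has probably not been applied yet.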

dominiaz · February 22, 2019, 10:06am

I have a similar problem with a Tesla P40.

I have installed vGPU Manager 7.1 on Debian 9.

When I start a VM in Proxmox, I get this error:

Verify all devices in group 79 are bound to vfio-<bus> or pci-stub and not already in use

dmesg:

[  150.834555] iommu: Adding device 00000000-0000-0000-0000-000000000100 to group 79
[  150.834557] vfio_mdev 00000000-0000-0000-0000-000000000100: MDEV: group_id = 79
[  161.498679] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: start failed. status: 0x65 Timeout Occured

NVIDIA, please help us :)
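
Since both reports show the same dmesg line, here is a small sketch for pulling the UUID and status code out of the kernel log so that failures can be compared across hosts. The pattern is based only on the two log lines quoted in this thread, so the format is an assumption. The function name vgpu_start_failures is hypothetical.

```shell
#!/bin/sh
# Sketch only: read kernel log text on stdin and print "<uuid> <status>" for
# each nvidia-vgpu-vfio start failure. Lines that do not match are dropped.
vgpu_start_failures() {
    sed -n 's/.*\[nvidia-vgpu-vfio\] \([0-9a-fA-F-]*\): start failed\. status: \(0x[0-9a-fA-F]*\).*/\1 \2/p'
}
```

Typical use would be dmesg | vgpu_start_failures. It works with or without the leading timestamp.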

sschaber · February 24, 2019, 10:37am

Debian is not supported at all, so your situation is different…

dominiaz · February 26, 2019, 12:54am

Are you going to support Linux distributions other than Red Hat soon?

You should support Proxmox too…

So please tell me about this driver on the NVIDIA download page:

NVIDIA vGPU for Linux KVM - which Linux distribution is it built for?

sschaber · February 27, 2019, 11:02am

As I said, we currently only support vGPU for RHEL KVM as you can see here: