vGPU timeout status 0x65, VFIO error, QEMU/KVM RHEL7.6

I have created a vGPU with UUID def87179-9c53-42d7-b224-a5d281037b84. The license server is running, and I’ve provided GRID-Virtual App and QUADRO-DWS resources to the MAC address of the VM.

I get the following output when I try to start my VM:

[root@instance-1 ~]# dmesg
[nvidia-vgpu-vfio] def87179-9c53-42d7-b224-a5d281037b84: start failed. status: 0x65 Timeout Occured

[root@instance-1 ~]# virsh start win10_1
error: Failed to start domain win10_1
error: internal error: process exited while connecting to monitor: Verify all devices in group 0 are bound to vfio-pci or pci-stub and not already in use
2019-02-13T15:11:50.129364Z qemu-kvm: -device vfio-pci,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/def87179-9c53-42d7-b224-a5d281037b84,display=off,bus=pci.0,addr=0x8: vfio: failed to get device def87179-9c53-42d7-b224-a5d281037b84
2019-02-13T15:11:50.129455Z qemu-kvm: -device vfio-pci,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/def87179-9c53-42d7-b224-a5d281037b84,display=off,bus=pci.0,addr=0x8: Device initialization failed.
2019-02-13T15:11:50.129479Z qemu-kvm: -device vfio-pci,id=hostdev0,sysfsdev=/sys/bus/mdev/devices/def87179-9c53-42d7-b224-a5d281037b84,display=off,bus=pci.0,addr=0x8: Device 'vfio-pci' could not be initialized
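
In case it helps anyone comparing logs, here is a minimal Python sketch that pulls the vGPU start failures out of dmesg text. The regular expression is only modelled on the single nvidia-vgpu-vfio line shown above, so treat the pattern and the function name find_start_failures as my own assumptions rather than a documented log format.

```python
import re

# Pattern modelled on the one log line in this post; other driver
# versions may word the message differently (assumption).
START_FAILED = re.compile(
    r"\[nvidia-vgpu-vfio\]\s+(?P<uuid>[0-9a-fA-F-]{36}):\s+start failed\.\s+"
    r"status:\s+(?P<status>0x[0-9a-fA-F]+)\s*(?P<message>.*)"
)

# Sample line copied from the dmesg output above.
SAMPLE = (
    "[nvidia-vgpu-vfio] def87179-9c53-42d7-b224-a5d281037b84: "
    "start failed. status: 0x65 Timeout Occured"
)


def find_start_failures(dmesg_text):
    """Return (uuid, status_code, message) for every start failure found."""
    failures = []
    for line in dmesg_text.splitlines():
        match = START_FAILED.search(line)
        if match:
            failures.append(
                (
                    match.group("uuid"),
                    int(match.group("status"), 16),  # e.g. "0x65" becomes 101
                    match.group("message").strip(),
                )
            )
    return failures


if __name__ == "__main__":
    for uuid, status, message in find_start_failures(SAMPLE):
        print(f"{uuid}: status {status:#x} ({message})")
```

On a real host the text could come from the output of dmesg instead of SAMPLE.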

I’ve tried GRID P100-1Q, P100-16Q, and P100-1A vGPUs with the same results. Further, although I can see the device’s UUID listed in mdev/devices, nvidia-smi reports no active vGPUs:

[root@instance-1 ~]# nvidia-smi vgpu -q
GPU 00000000:00:04.0
    Active vGPUs              : 0

[root@instance-1 ~]# nvidia-smi vgpu -c
GPU 00000000:00:04.0
    GRID P100-1Q

I am running qemu-kvm version 1.5.3 and RHEL 7.6 with kernel 3.10.0-957.el7.x86_64. Here’s the relevant portion of my VM’s XML file:

<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
  <source>
    <address uuid='def87179-9c53-42d7-b224-a5d281037b84'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
</hostdev>
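
To rule out a typo in the UUID, here is a small sketch that extracts the mdev UUIDs from a hostdev fragment like the one above, so they can be compared with the device that was created. The helper name mdev_uuids is hypothetical, and the sketch only looks at hostdev elements with type='mdev'.

```python
import uuid
import xml.etree.ElementTree as ET

# Copied from the <hostdev> element in this post.
HOSTDEV_XML = """
<hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='off'>
  <source>
    <address uuid='def87179-9c53-42d7-b224-a5d281037b84'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
</hostdev>
"""


def mdev_uuids(xml_text):
    """Return the UUIDs of all mdev hostdev entries in the given XML."""
    root = ET.fromstring(xml_text.strip())
    found = []
    # iter() also yields the root element, so this works for a bare
    # <hostdev> fragment as well as for a full <domain> document.
    for hostdev in root.iter("hostdev"):
        if hostdev.get("type") != "mdev":
            continue
        address = hostdev.find("source/address")
        if address is None or address.get("uuid") is None:
            continue
        # uuid.UUID raises ValueError on a malformed value.
        found.append(str(uuid.UUID(address.get("uuid"))))
    return found


if __name__ == "__main__":
    print(mdev_uuids(HOSTDEV_XML))
```

The same function could be fed the output of virsh dumpxml for the domain.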

Hi,

Did you disable ECC memory on the P100?

Regards
Simon

Yes, I forgot to mention it! I did disable ECC memory.

I have a similar problem with a Tesla P40.

I have installed vGPU Manager 7.1 on Debian 9.

When I start a VM in Proxmox, I get this error:

Verify all devices in group 79 are bound to vfio-<bus> or pci-stub and not already in use

dmesg:

[  150.834555] iommu: Adding device 00000000-0000-0000-0000-000000000100 to group 79
[  150.834557] vfio_mdev 00000000-0000-0000-0000-000000000100: MDEV: group_id = 79
[  161.498679] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: start failed. status: 0x65 Timeout Occured
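
For reference, the gap between the mdev being added to its group and the failure can be worked out from the timestamps. This is a rough sketch based only on the three lines above; the helper names are mine, and reading the gap as the driver's timeout interval is my assumption.

```python
import re

# The three dmesg lines from this post.
SAMPLE = """\
[  150.834555] iommu: Adding device 00000000-0000-0000-0000-000000000100 to group 79
[  150.834557] vfio_mdev 00000000-0000-0000-0000-000000000100: MDEV: group_id = 79
[  161.498679] [nvidia-vgpu-vfio] 00000000-0000-0000-0000-000000000100: start failed. status: 0x65 Timeout Occured
"""

# Matches the "[ seconds.micros] message" prefix that dmesg prints.
TIMESTAMPED = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")


def parse_dmesg(text):
    """Return (timestamp_seconds, message) for each timestamped line."""
    entries = []
    for line in text.splitlines():
        match = TIMESTAMPED.match(line.strip())
        if match:
            entries.append((float(match.group("ts")), match.group("msg")))
    return entries


def seconds_until_failure(text):
    """Seconds between the first 'MDEV: group_id' line and the first 'start failed' line."""
    entries = parse_dmesg(text)
    group_ts = next((ts for ts, msg in entries if "MDEV: group_id" in msg), None)
    fail_ts = next((ts for ts, msg in entries if "start failed" in msg), None)
    if group_ts is None or fail_ts is None:
        return None
    return fail_ts - group_ts


if __name__ == "__main__":
    print(f"{seconds_until_failure(SAMPLE):.3f} s")
```

For my log that is 161.498679 - 150.834557, so the start fails a little under 11 seconds after the device joins the group.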

Nvidia please help us :)

Debian is not supported at all, so your situation is different…

Are you going to support Linux distributions other than Red Hat soon?

You should support Proxmox too…

So please tell us about the driver on the NVIDIA download page:

NVIDIA vGPU for Linux KVM - which Linux distribution is it built for?

As I said, we currently only support vGPU for RHEL KVM as you can see here:
https://docs.nvidia.com/grid/latest/product-support-matrix/index.html