Problem with K1 and VMware View: Device 8:0.0 is already in use.

I have a Dell R720 with a K1 board in it, on which I am testing vDGA with VMware View 6.

The host will only give me one of two options for the K1: assign all of its GPUs to PCIe passthrough, or none. I am not sure whether that is expected behavior.

However my problem lies in that when I assign the PCIe passthrough video cards to a VM, the first one will boot fine, and all subsequent VMs will refuse to start and display the error: Device 8:0.0 is already in use.

VM1 is assigned to 7:0.0
VM2 is assigned to 8:0.0

I have tried moving VM2 to 9:0.0 and A:0.0 with the same results: only one VM can run at any given time.
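Side note for anyone matching these addresses against logs: the same PCI device gets written several ways in this thread (8:0.0, 008:00.0, 0000:08:00.0). Below is a minimal Python sketch that normalizes them to one form. It assumes the short forms are hexadecimal (A:0.0 suggests they are) and that a missing segment means 0000; the function name normalize_bdf is made up for illustration.

```python
import re

# Accepts the spellings seen in this thread, e.g. "8:0.0", "A:0.0",
# "008:00.0" and "0000:08:00.0". Assumption: all fields are hexadecimal.
_BDF = re.compile(
    r"^(?:(?P<seg>[0-9a-fA-F]{4}):)?"   # optional 4-digit segment
    r"(?P<bus>[0-9a-fA-F]{1,3}):"       # bus, possibly zero-padded
    r"(?P<slot>[0-9a-fA-F]{1,2})\."     # slot (device) number
    r"(?P<func>[0-7])$"                 # function number
)


def normalize_bdf(text: str) -> str:
    """Return the address as ssss:bb:dd.f in lowercase hex."""
    match = _BDF.match(text.strip())
    if match is None:
        raise ValueError(f"not a PCI address: {text!r}")
    seg = int(match.group("seg") or "0", 16)  # assumption: default segment 0
    bus = int(match.group("bus"), 16)
    slot = int(match.group("slot"), 16)
    func = int(match.group("func"), 16)
    if bus > 0xFF or slot > 0x1F:
        raise ValueError(f"bus or slot out of range: {text!r}")
    return f"{seg:04x}:{bus:02x}:{slot:02x}.{func:x}"
```

With this, normalize_bdf("8:0.0") and normalize_bdf("0000:08:00.0") give the same string, so addresses from the error dialog and from the host can be compared directly.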

Has anyone else had this problem and been able to shed some light on it?

Hi Jeremy,

Can you post some screenshots of the VM settings in vSphere, and also of the PCI device settings under the host?

Other information:

BIOS settings on R720:
VT = Enabled
Memory Mapped I/O above 4GB = Enabled
I/OAT DMA Engine = Enabled

[Screenshots: PCIe Passthrough; Startup error; VM01; VM06]

Hi Jeremy,

The hypervisor and VM settings appear correct, but just to eliminate one potential issue, can you change this BIOS setting

Memory Mapped I/O above 4GB = Enabled

to Disabled and retest, please?

Thanks

Hi Jeremy,

If you look in Device Manager in the VM are you seeing all 4 GPUs showing up there?

It sounds like the one VM is claiming all of the GPUs instead of just the one.

To confirm, you haven’t installed the Nvidia VIB for vSGA, correct?

-Mike

Hey, thanks for the responses, sorry I was out for the weekend.

I tried with Memory Mapped I/O set to Enabled as well, to no avail.

Confirmed that the VIB is installed.

In the device manager for the VM it only shows a single card. Would the card have multiple entries if it is claiming all?

If you are attempting to use passthrough then you don’t want the VIB installed as vSGA is likely claiming some of the GPUs. I would try uninstalling it and then attempt to passthrough again.

I would expect all of them to show in the VM if they were all being claimed.

-Mike

Things went a bit crazy again, just got a chance to do all those things.

Removed VIB, did not change the problem.

Only a single VM could grab control of the card still.

Maybe a firmware issue?
Is there a good way to find the firmware of this card?

Any other thoughts would be great.

How long have you had the card? There were firmware updates but those were almost a year or so ago and nothing since. Also, how old is the server? Then some basics, which power supplies do you have in the server? Only one K1, right, and all 6 power pins are connected on the back of the card? I assume you reseated the card?

I’m more or less having the same issue in a XenServer 6.2 pool…same messages…same hardware configs.

Hello,
Is there any solution for this? We have exactly the same problem with all our GRID K1 cards. We have now tested it on three R720s, each with the newest BIOS and a GRID K1.

The hypervisor always shows "the device is already in use".

XenServer handles GPU passthrough differently, we’d need to see screenshots / error messages.




vmware.log
2015-11-25T16:44:19.686Z| vmx| I120: PCIPassthru: Failed to register device 0000:08:00.0 error = 0x10
2015-11-25T16:44:19.686Z| vmx| I120: Msg_Post: Error
2015-11-25T16:44:19.686Z| vmx| I120: [msg.pciPassthru.createAdapterFailedDeviceInUse] Device 008:00.0 is already in use.
2015-11-25T16:44:19.686Z| vmx| I120: ----------------------------------------
2015-11-25T16:44:19.687Z| vmx| I120: Vigor_MessageRevoke: message 'msg.pciPassthru.createAdapterFailedDeviceInUse' (seq 53295) is revoked
2015-11-25T16:44:19.687Z| vmx| I120: Module DevicePowerOn power on failed.
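If you need to dig these failures out of a longer vmware.log, a small Python sketch like the one below could help. It only matches the "Failed to register device" line format shown in the excerpt above; the function name is made up for illustration, and the sketch makes no claim about what error code 0x10 means.

```python
import re

# Matches the failure line as it appears in the excerpt above.
_FAIL = re.compile(
    r"PCIPassthru: Failed to register device (\S+) error = (0x[0-9a-fA-F]+)"
)


def failed_passthru_devices(log_text):
    """Return (timestamp, device address, error code) for each failure line."""
    results = []
    for line in log_text.splitlines():
        match = _FAIL.search(line)
        if match:
            # The timestamp is the first "|"-separated field of the line.
            timestamp = line.split("|", 1)[0].strip()
            results.append((timestamp, match.group(1), int(match.group(2), 16)))
    return results


# Two lines copied from the log excerpt in this thread.
SAMPLE_LOG = """\
2015-11-25T16:44:19.686Z| vmx| I120: PCIPassthru: Failed to register device 0000:08:00.0 error = 0x10
2015-11-25T16:44:19.687Z| vmx| I120: Module DevicePowerOn power on failed.
"""

if __name__ == "__main__":
    for timestamp, device, code in failed_passthru_devices(SAMPLE_LOG):
        print(timestamp, device, hex(code))
```

Running it over the log of each VM that refuses to start would show whether the failing address is always the one assigned to that VM.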

Just to confirm this, can you run the following at the ESXi shell:

esxcli software vib list | grep -i nvidia

then also

vmkload_mod -l | grep nvidia

After that run

nvidia-smi

and post the output from each here.

Whilst this shouldn’t have any impact, it seems that the hypervisor is blocking access to the PCI devices, so eliminating anything else that could be blocking the resource first should help pin it down.

We opened a ticket for this problem with VMware on 18.11.2015. They can’t find any fault and now say “please contact NVIDIA”, as the configuration is correct.

We have uninstalled the VIBs, as described in the NVIDIA documentation for vDGA!

I can give you the output from:
[root@esxi-06:~] esxcli hardware pci list -c 0x0300 -m 0xffe
0000:07:00.0
Address: 0000:07:00.0
Segment: 0x0000
Bus: 0x07
Slot: 0x00
Function: 0x0
VMkernel Name:
Vendor Name: NVIDIA Corporation
Device Name: GK107GL [GRID K1]
Configured Owner: VM Passthru
Current Owner: VM Passthru
Vendor ID: 0x10de
Device ID: 0x0ff2
SubVendor ID: 0x10de
SubDevice ID: 0x1012
Device Class: 0x0300
Device Class Name: VGA compatible controller
Programming Interface: 0x00
Revision ID: 0xa1
Interrupt Line: 0x0f
IRQ: 255
Interrupt Vector: 0x41
PCI Pin: 0x00
Spawned Bus: 0x00
Flags: 0x0401
Module ID: 19
Module Name: pciPassthru
Chassis: 0
Physical Slot: 4294967295
Slot Description: PCI6; relative bdf 01:00.0
Passthru Capable: true
Parent Device: PCI 0:6:8:0
Dependent Device: PCI 0:5:0:0
Reset Method: Bridge reset
FPT Sharable: true

0000:08:00.0
Address: 0000:08:00.0
Segment: 0x0000
Bus: 0x08
Slot: 0x00
Function: 0x0
VMkernel Name:
Vendor Name: NVIDIA Corporation
Device Name: GK107GL [GRID K1]
Configured Owner: VM Passthru
Current Owner: VM Passthru
Vendor ID: 0x10de
Device ID: 0x0ff2
SubVendor ID: 0x10de
SubDevice ID: 0x1012
Device Class: 0x0300
Device Class Name: VGA compatible controller
Programming Interface: 0x00
Revision ID: 0xa1
Interrupt Line: 0x0e
IRQ: 255
Interrupt Vector: 0x00
PCI Pin: 0x00
Spawned Bus: 0x00
Flags: 0x0401
Module ID: 19
Module Name: pciPassthru
Chassis: 0
Physical Slot: 4294967295
Slot Description: PCI6; relative bdf 02:00.0
Passthru Capable: true
Parent Device: PCI 0:6:9:0
Dependent Device: PCI 0:5:0:0
Reset Method: Bridge reset
FPT Sharable: true

0000:09:00.0
Address: 0000:09:00.0
Segment: 0x0000
Bus: 0x09
Slot: 0x00
Function: 0x0
VMkernel Name:
Vendor Name: NVIDIA Corporation
Device Name: GK107GL [GRID K1]
Configured Owner: VM Passthru
Current Owner: VM Passthru
Vendor ID: 0x10de
Device ID: 0x0ff2
SubVendor ID: 0x10de
SubDevice ID: 0x1012
Device Class: 0x0300
Device Class Name: VGA compatible controller
Programming Interface: 0x00
Revision ID: 0xa1
Interrupt Line: 0x0f
IRQ: 255
Interrupt Vector: 0x00
PCI Pin: 0x00
Spawned Bus: 0x00
Flags: 0x0401
Module ID: 19
Module Name: pciPassthru
Chassis: 0
Physical Slot: 4294967295
Slot Description: PCI6; relative bdf 03:00.0
Passthru Capable: true
Parent Device: PCI 0:6:16:0
Dependent Device: PCI 0:5:0:0
Reset Method: Bridge reset
FPT Sharable: true

0000:0a:00.0
Address: 0000:0a:00.0
Segment: 0x0000
Bus: 0x0a
Slot: 0x00
Function: 0x0
VMkernel Name:
Vendor Name: NVIDIA Corporation
Device Name: GK107GL [GRID K1]
Configured Owner: VM Passthru
Current Owner: VM Passthru
Vendor ID: 0x10de
Device ID: 0x0ff2
SubVendor ID: 0x10de
SubDevice ID: 0x1012
Device Class: 0x0300
Device Class Name: VGA compatible controller
Programming Interface: 0x00
Revision ID: 0xa1
Interrupt Line: 0x0e
IRQ: 255
Interrupt Vector: 0x00
PCI Pin: 0x00
Spawned Bus: 0x00
Flags: 0x0401
Module ID: 19
Module Name: pciPassthru
Chassis: 0
Physical Slot: 4294967295
Slot Description: PCI6; relative bdf 04:00.0
Passthru Capable: true
Parent Device: PCI 0:6:17:0
Dependent Device: PCI 0:5:0:0
Reset Method: Bridge reset
FPT Sharable: true

0000:11:00.0
Address: 0000:11:00.0
Segment: 0x0000
Bus: 0x11
Slot: 0x00
Function: 0x0
VMkernel Name:
Vendor Name: Matrox Electronics Systems Ltd.
Device Name: G200eR2
Configured Owner: Unknown
Current Owner: VMkernel
Vendor ID: 0x102b
Device ID: 0x0534
SubVendor ID: 0x1028
SubDevice ID: 0x048c
Device Class: 0x0300
Device Class Name: VGA compatible controller
Programming Interface: 0x00
Revision ID: 0x00
Interrupt Line: 0x0b
IRQ: 255
Interrupt Vector: 0x00
PCI Pin: 0x00
Spawned Bus: 0x00
Flags: 0x0221
Module ID: -1
Module Name: None
Chassis: 0
Physical Slot: 4294967295
Slot Description: Embedded Video
Passthru Capable: true
Parent Device: PCI 0:16:0:0
Dependent Device: PCI 0:16:0:0
Reset Method: Bridge reset
FPT Sharable: true
[root@esxi-06:~]
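For anyone comparing listings like the one above across hosts, here is a rough Python sketch that parses the "Key: Value" records and groups the passthrough-owned devices by their Dependent Device field. In this listing all four K1 GPUs report the same dependent device (PCI 0:5:0:0); whether that has anything to do with the "already in use" error is not established in this thread. The function names are made up, and the sample is trimmed to three fields per device.

```python
import re
from collections import defaultdict

# A record starts with a bare address line such as "0000:07:00.0".
_ADDRESS = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")
# Field lines look like "Current Owner: VM Passthru"; the value may be empty.
_FIELD = re.compile(r"^([A-Za-z][A-Za-z ]*):\s*(.*)$")


def parse_pci_list(text):
    """Parse pasted 'esxcli hardware pci list' output into {address: fields}."""
    devices = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if _ADDRESS.match(line):
            current = devices.setdefault(line, {})
            continue
        field = _FIELD.match(line)
        if current is not None and field:
            current[field.group(1).strip()] = field.group(2).strip()
    return devices


def group_by_dependent_device(devices, owner="VM Passthru"):
    """Group the addresses owned by `owner` by their Dependent Device value."""
    groups = defaultdict(list)
    for address, fields in devices.items():
        if fields.get("Current Owner") == owner:
            groups[fields.get("Dependent Device", "")].append(address)
    return dict(groups)


# Trimmed from the listing above; only three fields per device are kept.
SAMPLE = """\
0000:07:00.0
Address: 0000:07:00.0
Current Owner: VM Passthru
Dependent Device: PCI 0:5:0:0

0000:08:00.0
Address: 0000:08:00.0
Current Owner: VM Passthru
Dependent Device: PCI 0:5:0:0

0000:09:00.0
Address: 0000:09:00.0
Current Owner: VM Passthru
Dependent Device: PCI 0:5:0:0

0000:0a:00.0
Address: 0000:0a:00.0
Current Owner: VM Passthru
Dependent Device: PCI 0:5:0:0

0000:11:00.0
Address: 0000:11:00.0
Current Owner: VMkernel
Dependent Device: PCI 0:16:0:0
"""

if __name__ == "__main__":
    print(group_by_dependent_device(parse_pci_list(SAMPLE)))
```

The grouping just makes it easy to spot at a glance which passthrough devices hang off the same upstream device on each of the three servers.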

As you can see, all cores are presented to the hypervisor correctly. The first VM starts with no problems, but if you start the second one, the vSphere client only shows "device already in use" and the ESXi log shows this:

2015-11-25T16:44:19.686Z| vmx| I120: PCIPassthru: Failed to register device 0000:08:00.0 error = 0x10
2015-11-25T16:44:19.686Z| vmx| I120: Msg_Post: Error
2015-11-25T16:44:19.686Z| vmx| I120: [msg.pciPassthru.createAdapterFailedDeviceInUse] Device 008:00.0 is already in use.
2015-11-25T16:44:19.686Z| vmx| I120: ----------------------------------------
2015-11-25T16:44:19.687Z| vmx| I120: Vigor_MessageRevoke: message 'msg.pciPassthru.createAdapterFailedDeviceInUse' (seq 53295) is revoked
2015-11-25T16:44:19.687Z| vmx| I120: Module DevicePowerOn power on failed.

I think only a few people will have this problem, because Enterprise Plus customers use vGPU. We have tested this procedure with three identical Dell R720 servers.

To answer your questions, I have installed the driver.

[root@esxi-06:~] esxcli software vib list | grep -i nvidia
NVIDIA-kepler-VMware_ESXi_6.0_Host_Driver 352.83-1OEM.600.0.0.2494585 NVIDIA VMwareAccepted 2016-04-04
[root@esxi-06:~]

[root@esxi-06:~] nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Running vmkload_mod -l | grep nvidia gives no output.

I thought these commands were for vSGA or vGPU …

The problem is still the same: the first VM starts, the second shows "already in use". VMware called me today and advised me to open a new case with NVIDIA support.

VMware Case ID: 15808632611

I hope you can help us here! In VMware ESXi 5.5 the same setup was working!

Why did you install the driver?

The purpose of these checks was to confirm the driver is completely removed and no module remains behind.
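As a rough illustration of that check, the Python sketch below takes the text output of the two commands (the VIB list and the loaded-module list) and reports whether any line still mentions nvidia. The helper names are made up, and it assumes you have captured each command's output as a string; it does not run anything on the host.

```python
def nvidia_lines(command_output):
    """Return the lines of a command's output that mention nvidia."""
    return [
        line.strip()
        for line in command_output.splitlines()
        if "nvidia" in line.lower()
    ]


def driver_fully_removed(vib_list_output, module_list_output):
    """True only if neither the VIB list nor the module list mentions nvidia."""
    return not nvidia_lines(vib_list_output) and not nvidia_lines(module_list_output)


# VIB line as posted earlier in this thread; the module list was empty.
POSTED_VIB_LIST = (
    "NVIDIA-kepler-VMware_ESXi_6.0_Host_Driver 352.83-1OEM.600.0.0.2494585 "
    "NVIDIA VMwareAccepted 2016-04-04"
)
POSTED_MODULE_LIST = ""

if __name__ == "__main__":
    print(driver_fully_removed(POSTED_VIB_LIST, POSTED_MODULE_LIST))
```

For the output posted above this reports that the driver is not fully removed, because the VIB is still listed even though no module is loaded.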

If your setup in 6.0 is identical to your 5.5 setup which is working, then it points to an issue with vSphere / ESXi.

I’ve checked the VMware HCL, and the K1 on the R720 is supported in vSphere 6.0 U2 with the Dell R720 BIOS at 2.4.3:

https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=vdga&productid=33815&vcl=true

What BIOS is your R720 currently at?

Ah, OK. The system was clean. We have also tried this with a fresh install of ESXi, with the same result. The BIOS is at 2.5.4 on all servers. We have been trying this since ESXi 6.0.0! We know the HCL shows that the system should be compatible. VMware says there might be a problem in the NVIDIA card BIOS. But then the question is: why was the card working in 5.5?

If the card works with 5.5 there’s no issue with the card.

The vBIOS on the cards is unchanged since Nov 2013, and Dell have posted it for download since April 2014, so I doubt that’s your issue.

http://www.dell.com/support/home/us/en/19/Drivers/DriversDetails?driverId=1YCT8

Does 5.5 work on the same hardware with the same SBIOS?

The hardware was the same, but we have upgraded the SBIOS over time. With every update we hoped that the VMs would now start, but no luck.

Is there any way to see which card BIOS is actually installed?