I have a Dell R720 with a GRID K1 board in it, and I am testing vDGA with VMware View 6.
The K1 only gives me two options: assign all of its GPUs to PCIe passthrough or none of them. I am not sure whether that is the way it is supposed to work.
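A quick sanity check from the ESXi shell (lspci should be available on a stock install) to confirm that all four GPU functions on the K1 are visible to the host:

lspci | grep -i nvidia    # should list four GPU functions for a single K1 (in my case 07:00.0 through 0a:00.0)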
However, my problem is that when I assign the passthrough GPUs to VMs, the first VM boots fine, but every subsequent VM refuses to start and displays the error: Device 8:0.0 is already in use.
VM 1 is assigned to 7:0.0
VM 2 is assigned to 8:0.0
I have tried moving VM 2 to 9:0.0 and A:0.0 with the same results; only one VM can run at any given time.
Has anyone else had this problem and can shed some light on it?
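In case it matters, this is roughly how I checked which device each VM is actually pointing at in its .vmx (the datastore and VM names below are placeholders for mine):

grep -i pciPassthru /vmfs/volumes/<datastore>/<vm-name>/<vm-name>.vmx    # shows the pciPassthru0.id / .present entries for that VM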
If you are attempting to use passthrough, you don’t want the NVIDIA VIB installed, as vSGA is likely claiming some of the GPUs. I would try uninstalling it and then attempt passthrough again.
I would expect all of them to show in the VM if they were all being claimed.
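Removal is roughly along these lines (a sketch only - the VIB name is a placeholder, use whatever the list command reports, and the host needs maintenance mode and a reboot afterwards):

esxcli software vib list | grep -i nvidia              # note the exact VIB name
esxcli system maintenanceMode set --enable true        # enter maintenance mode first
esxcli software vib remove -n <NVIDIA-vib-name>        # placeholder - substitute the name from the list output
reboot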
How long have you had the card? There were firmware updates, but those were almost a year ago and nothing since. Also, how old is the server? Then some basics: which power supplies do you have in the server? Only one K1, right, and all six power pins are connected on the back of the card? I assume you have reseated the card?
Just to confirm this, can you run the following at the ESXi shell:
esxcli software vib list | grep -i nvidia
then also
vmkload_mod -l | grep nvidia
After that run
nvidia-smi
and post the output from each here.
Whilst this shouldn’t make any impact, it seems the hypervisor is blocking access to the PCI devices, so eliminating anything else that could be claiming the resource first should help pin it down.
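It may also be worth checking which driver currently owns each GPU function on the host (assuming vmkchdev is present in your build); for vDGA the devices should show up as passthru rather than being claimed by a vmkernel module:

vmkchdev -l | grep 10de    # 10de is the NVIDIA PCI vendor ID; check whether each function is marked passthru or vmkernel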
We opened a ticket for this problem with VMware on 18.11.2015. They can’t find any fault and now say "please contact NVIDIA"; the configuration is correct.
We have uninstalled the VIBs, as described in the NVIDIA documentation for vDGA!
As you can see, all GPUs are presented to the hypervisor correctly. The first VM starts with no problems, but if you start the second one, the vSphere client only shows "device already in use" and the ESXi log shows this:
2015-11-25T16:44:19.686Z| vmx| I120: PCIPassthru: Failed to register device 0000:08:00.0 error = 0x10
2015-11-25T16:44:19.686Z| vmx| I120: Msg_Post: Error
2015-11-25T16:44:19.686Z| vmx| I120: [msg.pciPassthru.createAdapterFailedDeviceInUse] Device 008:00.0 is already in use.
2015-11-25T16:44:19.686Z| vmx| I120: ----------------------------------------
2015-11-25T16:44:19.687Z| vmx| I120: Vigor_MessageRevoke: message ‘msg.pciPassthru.createAdapterFailedDeviceInUse’ (seq 53295) is revoked
2015-11-25T16:44:19.687Z| vmx| I120: Module DevicePowerOn power on failed.
I think only a few people will have this problem, because Enterprise Plus customers use vGPU. We have tested this procedure on three identical Dell R720 servers.
[root@esxi-06:~] nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
vmkload_mod -l | grep nvidia > gives no output
I thought these commands were for vSGA or vGPU …
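Another thing that can be checked on the host, purely as a diagnostic, is the passthrough quirks table:

grep -i 10de /etc/vmware/passthru.map    # shows the reset method / sharing flags ESXi applies to NVIDIA devices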
The problem is still the same: the first VM starts, the second shows "already in use". VMware called me today and advised me to open a new case with NVIDIA support.
VMware Case ID: 15808632611
I hope you can help us here! The same setup was working with VMware ESXi 5.5!
Ah OK - the system was clean. We have also tried this with a fresh install of ESXi, with the same result. The BIOS is at 2.5.4 on all servers. We have been trying this since ESXi 6.0.0! We know that the HCL shows the system should be compatible. VMware says there might be a problem in the NVIDIA card BIOS. But then the question is: why was the card working in 5.5?
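If anyone wants to dig further, the vmkernel log on the failing host may have more context around the failed power-on than the vmware.log snippet above (standard log location on ESXi):

grep -i passthru /var/log/vmkernel.log | tail -n 20    # last passthrough-related messages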