M60 vGPU with Xorg "(EE) No devices detected"

We have a new M60 in our Dell R720, VMware ESXi 6.0/vSphere 6. vGPU profiles
work fine with a Windows 10 VM, but not in CentOS (tried 6.8 and 7), where Xorg.0.log
always reports "(EE) No devices detected." and exits:

[ 13572.243] (II) LoadModule: "glx"
[ 13572.243] (II) Loading /usr/lib64/xorg/modules/extensions/libglx.so
[ 13572.248] (II) Module glx: vendor="NVIDIA Corporation"
[ 13572.248]    compiled for 4.0.2, module version = 1.0.0
[ 13572.248]    Module class: X.Org Server Extension
[ 13572.248] (II) NVIDIA GLX Module  361.45.09  Tue May 10 08:44:16 PDT 2016
[ 13572.248] (II) LoadModule: "nvidia"
[ 13572.249] (II) Loading /usr/lib64/xorg/modules/drivers/nvidia_drv.so
[ 13572.249] (II) Module nvidia: vendor="NVIDIA Corporation"
[ 13572.249]    compiled for 4.0.2, module version = 1.0.0
[ 13572.249]    Module class: X.Org Video Driver
[ 13572.249] (II) NVIDIA dlloader X Driver  361.45.09  Tue May 10 08:22:21 PDT 2016
[ 13572.249] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
[ 13572.249] (--) using VT number 7

[ 13572.252] (EE) No devices detected.
[ 13572.252] (EE) 
Fatal server error:
[ 13572.252] (EE) no screens found(EE) 
[ 13572.252] (EE)

This is running 361.45.09 drivers which appear okay on the hypervisor side as well as the guest VM.
The guest VM runs nvidia-smi and sees a GRID M60-4Q vGPU profile. xorg.conf was generated by nvidia-xconfig --enable-all-gpus --use-display-device=none. Licensing appears correctly set up. The GPU has been put into graphical mode. Kernel module is loaded, and dmesg shows nothing untoward. The R720 was previously running with a K1 and K2, which are now removed to keep things simple. And to reiterate, the Win10 VM works, with OpenGL renderer string showing the M60 vGPU profile. I’ve exhausted my forum / internet searching.

Anyone have ideas to try?

nvidia-bug-report.log.gz:
https://mft.opentext.com/MFT/Transfer?action=GetFile&name=37b52528-58cc-4060-9cfc-f8d2fb458dcf&TID=e2f8e371-e0a5-40e2-b833-68e5b2c77f14&nojava=true
vmware.log.gz:
https://mft.opentext.com/MFT/Transfer?action=GetFile&name=b9093aab-0050-4c95-a234-373f5874d76b&TID=e2f8e371-e0a5-40e2-b833-68e5b2c77f14&nojava=true
(URLs will expire on 2016-09-13)

Thanks

I don’t think CentOS is an OS officially supported by VMware (and consequently by NVIDIA), so you might want to consider that. CentOS is supported by Citrix Linux VDA at the moment, and given its similarity to RHEL I would expect it to work from our side. However, you should look carefully at which OSs, and even which versions, VMware/Citrix support.

There are a few common setup issues with CentOS / RHEL detailed in our knowledge base, could you have a look at them: http://nvidia.custhelp.com/app/answers/list/st/5/kw/centos%20grid/page/1
And see if anything rings a bell?

Rachel

That configuration will run CentOS 7 perfectly well. It may not be "supported" by every vendor in the stack, but it should work.

However,

First observation - the M60 is not certified in the Dell R720; you need the R730.

Second - Double check that nouveau is completely disabled and not restarting after you installed the NVIDIA driver. You need to block it in several locations to ensure it’s not capturing the hardware and preventing it being detected properly.
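As a rough way to run that check, here is a small Python sketch. It assumes the usual Linux conventions: the first field of each line in /proc/modules is a loaded module name, and a modprobe.d file blocks a module with a "blacklist nouveau" line. The function name nouveau_status is made up for illustration, and it does not look at the initramfs or the kernel command line, which may also need attention.

```python
def nouveau_status(proc_modules_text, modprobe_conf_text):
    """Return (loaded, blacklisted) for the nouveau kernel module.

    proc_modules_text: contents of /proc/modules
    modprobe_conf_text: concatenated contents of the modprobe.d *.conf files
    """
    # /proc/modules lists one module per line; the name is the first field.
    loaded = any(
        line.split()[0] == "nouveau"
        for line in proc_modules_text.splitlines()
        if line.strip()
    )
    # A blacklist entry is a non-comment line of the form "blacklist nouveau".
    blacklisted = any(
        line.split() == ["blacklist", "nouveau"]
        for line in modprobe_conf_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )
    return loaded, blacklisted
```

On the guest you would feed it the text of /proc/modules and of the files under /etc/modprobe.d/. A result of (False, True) is what you want to see after a reboot.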

Also, what remoting solution are you using?

You reference vSphere as the underlying hypervisor, but there’s no mention of the remoting solution, without which there are no display devices attached. Horizon should add this to xorg.conf when the agent is installed.

Also, just to confirm that you’re using the Linux driver from the bundle downloaded from

https://nvidia.flexnetoperations.com/control/nvda/login

These are the drivers required for M60.

You should also amend the license settings in gridd.conf (though that’s not directly relevant to this issue).

Thanks for the replies.

Ah, thanks. The GRID Release Notes explicitly include CentOS, and that VMware link does include both 6.x and 7 for ESXi 6 U2, but I guess you’re saying vGPU profiles for VMware are a separate support issue. I’ve been unable to separate hypervisor support from Horizon support (which we’re not using, see below) in the VMware links I’ve found. Do you have a more explicit link? Anyway, this is a bit tangential since Jason says it should work on CentOS 7.

I’ve reviewed that (and seen most of those posts already) but don’t see anything relevant.

Yes, this is where we got the drivers.

Also done, and the license server sees the VM has a license registered.

This was my mistake, it is the R730.

Yes, nouveau is blacklisted, nvidia module is loaded.

This is perhaps the issue. We (OpenText) are an ISV with our own remoting solution (ETX). Perhaps this is my misunderstanding, since previously we ran bare metal with K1/K2. Are you saying the NVIDIA driver will not start X.org without a special xorg.conf? I.e. nvidia-xconfig --use-display-device=none doesn’t work with vGPU like it does for bare metal?

I only want the headless X.org to run, and all remoting is my own.

Thanks.

After a bunch of testing, the solution is to not use nvidia-xconfig to generate xorg.conf. X.org won’t start with the generated ServerLayout, Monitor and Screen sections (even with UseDisplayDevice “None”). The Device section also needs an explicit BusID entry.

A minimal working config is, e.g.:

Section "DRI"
	Mode 0666
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "GRID M60-4Q"
    BusID          "PCI:2:0:0"
EndSection
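
One thing to watch with the BusID line: xorg.conf expects decimal numbers in the form PCI:bus:device:function, while lspci prints the address in hex (e.g. 0b:00.0). A small Python sketch of the conversion is below. The function name lspci_to_xorg_busid is made up for illustration, and it assumes PCI domain 0, so any leading domain field is simply dropped.

```python
def lspci_to_xorg_busid(address):
    """Convert an lspci address such as '0000:02:00.0' to 'PCI:2:0:0'.

    Assumption: the device is in PCI domain 0; a leading domain field
    (the '0000' in '0000:02:00.0') is discarded.
    """
    parts = address.strip().split(":")
    if len(parts) == 3:
        parts = parts[1:]  # drop the domain field
    if len(parts) != 2:
        raise ValueError("unexpected PCI address: %r" % address)
    bus_hex, dev_func = parts
    dev_hex, func_hex = dev_func.split(".")
    # lspci prints hex; xorg.conf wants decimal.
    return "PCI:%d:%d:%d" % (int(bus_hex, 16), int(dev_hex, 16), int(func_hex, 16))
```

For the config above, lspci_to_xorg_busid("02:00.0") gives "PCI:2:0:0". The difference only shows up once a field goes past 9, e.g. bus 0b becomes 11.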

I ran into the exact same problem when installing headless xorg for the Amazon g3 instance and this works like magic. Could you share the full working config? Thanks!

@JasJuang, I didn’t see this post for a year, sorry.

The final solution I used was:

nvidia-xconfig --enable-all-gpus --use-display-device=none --busid=<BUSID>

Value(s) for <BUSID> can be found by running

nvidia-xconfig --query-gpu-info | grep BusID

If you have more than one GPU enabled (as is likely with --enable-all-gpus) you’ll need to manually edit your xorg.conf to give a unique BusID value in each Device section; the above --busid argument will specify the same BusID for every one.
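
If you'd rather script that manual edit, something like the following Python sketch might do. It is only an illustration: assign_bus_ids is a made-up helper, it assumes each Device section already contains a BusID line (as it should after running with --busid), and it hands out the IDs you pass in the order the Device sections appear in the file.

```python
import re


def assign_bus_ids(conf_text, bus_ids):
    """Give each Device section in an xorg.conf its own BusID.

    bus_ids is a list such as ["PCI:2:0:0", "PCI:3:0:0"], consumed in the
    order the Device sections appear. Only existing BusID lines are
    rewritten; a section without one is left unchanged.
    """
    ids = iter(bus_ids)
    current = None
    in_device = False
    out = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        if re.match(r'(?i)^Section\s+"Device"', stripped):
            in_device = True
            current = next(ids, None)
            if current is None:
                raise ValueError("more Device sections than BusIDs supplied")
        elif re.match(r"(?i)^EndSection", stripped):
            in_device = False
        elif in_device and re.match(r"(?i)^BusID\b", stripped):
            # Keep the original indentation, replace only the value.
            indent = line[: len(line) - len(line.lstrip())]
            line = '%sBusID          "%s"' % (indent, current)
        out.append(line)
    return "\n".join(out) + "\n"
```

Read xorg.conf into a string, pass it in together with the BusID values reported by the query command above, and write the result back after checking it by eye.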