Can I do remote direct rendering with Tesla P4 on CentOS 7?

Hello,

I have been trying to test remote direct rendering on a CentOS 7.3 box with a Tesla P4 on board (I should add that the host motherboard also has an integrated graphics chip). Although I can connect a screen to this box for debugging, it is meant to be a headless (remote) server that I would use from a local computer through either a VNC client or vglconnect.

Before going much further in the details my question is:

Can I do remote direct rendering on a Tesla P4?

Assuming that I can, here are the steps I have taken so far, without success:

  1. Installed CentOS 7.3 with:
    “Server with GUI”, “mate-desktop-environment”, “mate-desktop”, “xfce-desktop”

and:

  systemctl enable gdm.service
  systemctl set-default graphical.target

Do I actually need graphical.target as the default, even though this is a headless server?

  2. Installed the NVIDIA driver with:
rpm -i nvidia-diag-driver-local-repo-rhel7-390.30-1.0-1.x86_64.rpm
yum clean all
yum install cuda-drivers
reboot
  3. Ran nvidia-xconfig:
nvidia-xconfig --use-display-device=none --busid="PCI:1:0:0" --virtual=1280x1024
  4. Enabled Direct Rendering Manager kernel modesetting:
modprobe -r nvidia-drm ; modprobe nvidia-drm modeset=1

Do I actually need this step?

  5. Installed VirtualGL, TurboVNC and TigerVNC (so far I have only tried TigerVNC), then stopped GDM and ran:

vglserver_config
  6. Restarted gdm.service, and even attempted to start /usr/bin/X :0 directly.

In either case X tries to start but fails:

systemctl status gdm.service
● gdm.service - GNOME Display Manager
   Loaded: loaded (/usr/lib/systemd/system/gdm.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2018-03-12 17:12:17 PDT; 1h 10min ago
  Process: 3081 ExecStartPost=/bin/bash -c TERM=linux /usr/bin/clear > /dev/tty1 (code=exited, status=0/SUCCESS)
 Main PID: 3078 (gdm)
   CGroup: /system.slice/gdm.service
           └─3078 /usr/sbin/gdm

Mar 12 17:12:19 myserver.server.com gdm[3078]: GdmDisplay: display lasted 1.000655 seconds
Mar 12 17:12:20 myserver.server.com gdm[3078]: Child process 3106 was already dead.
Mar 12 17:12:20 myserver.server.com gdm[3078]: GdmDisplay: display lasted 0.962944 seconds
Mar 12 17:12:21 myserver.server.com gdm[3078]: Child process 3110 was already dead.
Mar 12 17:12:21 myserver.server.com gdm[3078]: GdmDisplay: display lasted 0.939461 seconds
Mar 12 17:12:22 myserver.server.com gdm[3078]: Child process 3123 was already dead.
Mar 12 17:12:22 myserver.server.com gdm[3078]: GdmDisplay: display lasted 0.959484 seconds
Mar 12 17:12:23 myserver.server.com gdm[3078]: Child process 3127 was already dead.
Mar 12 17:12:23 myserver.server.com gdm[3078]: GdmDisplay: display lasted 0.958884 seconds
Mar 12 17:12:23 myserver.server.com gdm[3078]: GdmLocalDisplayFactory: maximum number of X display failures reached: check X server log for errors


The relevant part of Xorg.0.log follows:

[  3817.519] (II) Loading sub module "fb"
[  3817.519] (II) LoadModule: "fb"
[  3817.519] (II) Loading /usr/lib64/xorg/modules/libfb.so
[  3817.519] (II) Module fb: vendor="X.Org Foundation"
[  3817.519]    compiled for 1.17.2, module version = 1.0.0
[  3817.519]    ABI class: X.Org ANSI C Emulation, version 0.4
[  3817.519] (II) Loading sub module "wfb"
[  3817.519] (II) LoadModule: "wfb"
[  3817.520] (II) Loading /usr/lib64/xorg/modules/libwfb.so
[  3817.520] (II) Module wfb: vendor="X.Org Foundation"
[  3817.520]    compiled for 1.17.2, module version = 1.0.0
[  3817.520]    ABI class: X.Org ANSI C Emulation, version 0.4
[  3817.520] (II) Loading sub module "ramdac"
[  3817.520] (II) LoadModule: "ramdac"
[  3817.520] (II) Module "ramdac" already built-in
[  3817.521] (**) NVIDIA(0): Depth 24, (--) framebuffer bpp 32
[  3817.521] (==) NVIDIA(0): RGB weight 888
[  3817.521] (==) NVIDIA(0): Default visual is TrueColor
[  3817.521] (==) NVIDIA(0): Using gamma correction (1.0, 1.0, 1.0)
[  3817.521] (**) NVIDIA(0): Option "UseDisplayDevice" "None"
[  3817.521] (**) NVIDIA(0): Enabling 2D acceleration
[  3817.521] (**) NVIDIA(0): Option "UseDisplayDevice" set to "none"; enabling NoScanout
[  3817.521] (**) NVIDIA(0):     mode
[  3817.521] (EE) NVIDIA(0): Failed to initialize the GLX module; please check in your X
[  3817.521] (EE) NVIDIA(0):     log file that the GLX module has been loaded in your X
[  3817.521] (EE) NVIDIA(0):     server, and that the module is the NVIDIA GLX module.  If
[  3817.521] (EE) NVIDIA(0):     you continue to encounter problems, Please try
[  3817.521] (EE) NVIDIA(0):     reinstalling the NVIDIA driver.
[  3818.188] (EE) NVIDIA(GPU-0): UseDisplayDevice "None" is not supported with GRID
[  3818.188] (EE) NVIDIA(GPU-0):     displayless
[  3818.188] (EE) NVIDIA(GPU-0): Failed to select a display subsystem.

I can still connect with TigerVNC and vglconnect, but with TigerVNC I get indirect rendering (checked with glxinfo), while with vglconnect I get:

vglrun glxinfo
name of display: localhost:10.0
[VGL] ERROR: Could not open display :0.

Here is the output of nvidia-smi:

Mon Mar 12 17:54:55 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:01:00.0 Off |                    0 |
| N/A   71C    P0    25W /  75W |      0MiB /  7611MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here is my xorg.conf:

# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig:  version 390.30  (buildmeister@swio-display-x64-rhel04-14)  Wed Jan 31 22:46:17 PST 2018

Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0"
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
    FontPath        "/usr/share/fonts/default/Type1"
EndSection

Section "InputDevice"

    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/input/mice"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"

    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection
Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    HorizSync       28.0 - 33.0
    VertRefresh     43.0 - 72.0
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "Tesla P4"
    BusID          "PCI:1:0:0"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "UseDisplayDevice" "None"
    SubSection     "Display"
        Virtual     1280 1024
        Depth       24
    EndSubSection
EndSection

As you may have guessed, if you have followed me this far, I am a bit at a loss. Is there any way I could upload the nvidia-bug-report.log.gz?

Any help would be greatly appreciated.

Thanks!

nvidia-bug-report.log (412 KB)

You can attach files to existing posts: when you hover the mouse over a post, a paperclip icon appears.
There’s something wrong with your NVIDIA GLX install, but diagnosing it would need the whole log.
To get rid of the error “UseDisplayDevice "None" is not supported…”, try a minimal xorg.conf:

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "Tesla P4"
    BusID          "PCI:1:0:0"
EndSection

Hello generix,

Thanks for your answer. I have attached the log to my original post. With the minimal xorg.conf, can I still attach a monitor to the box (this is just so that I don’t have to connect as root over the network)?

Thanks!
nvidia-bug-report.log (412 KB)

The Tesla doesn’t have any display connectors. When you connect a monitor to the box, it’s connected to the Intel iGPU, so you will always see the text console.
GLX doesn’t work because the module path isn’t set correctly (from Xorg.0.log):

[    46.843] (==) ModulePath set to "/usr/lib64/xorg/modules"

so Mesa gets loaded instead of Nvidia glx. Check if the directory

/usr/lib64/nvidia/xorg/

contains a file ‘libglx.so’.
Then try the following xorg.conf:

Section "Files"
	ModulePath   "/usr/lib64/nvidia/xorg"
	ModulePath   "/usr/lib64/xorg/modules"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "Tesla P4"
    BusID          "PCI:1:0:0"
EndSection

Report back with a new nvidia-bug-report.
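
For anyone following along, the check described above could be scripted roughly as follows. This is only a sketch: the helper name glx_modules_loaded is made up for illustration, and the paths in the commented usage lines are the ones quoted in this thread and may differ on other installs.

```shell
#!/bin/sh
# Print each distinct libglx path that an Xorg log reports loading.
# If only /usr/lib64/xorg/modules/extensions/libglx.so shows up, the
# X server picked the stock GLX module rather than the NVIDIA one.
glx_modules_loaded() {
    grep -o 'Loading [^ ]*libglx[^ ]*' "$1" | awk '{print $2}' | sort -u
}

# Example on the server (not run here):
# ls -l /usr/lib64/nvidia/xorg/libglx.so*
# glx_modules_loaded /var/log/Xorg.0.log
```

The function only reads the log, so it is safe to run against any Xorg.N.log while the server is up.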

Hello generix,

Thanks so much for your update. After following your suggestions and installing acpid, I think I have made some progress, but I still need your help:

  1. I now see that the X server kind of starts:
ps aux | grep X
root       987  0.0  0.0 216884  4688 ?        Ss   14:38   0:00 /usr/bin/abrt-watch-log -F Backtrace /var/log/Xorg.0.log -- /usr/bin/abrt-dump-xorg -xD
root      1161  0.0  0.0  13936   752 tty1     S+   14:38   0:00 /bin/xinit /usr/libexec/initial-setup/firstboot-windowmanager /usr/libexec/initial-setup/initial-setup-graphical --no-stdout-log -- /bin/Xorg :9 -ac -nolisten tcp
root      1162  0.1  1.5 307872 118516 tty2    S<s+ 14:38   0:06 /bin/Xorg :9 -ac -nolisten tcp
root     21203  0.0  0.0 112648   956 pts/0    S+   15:45   0:00 grep --color=auto X

but it seems to be opening on DISPLAY 9 instead of 0, not sure why…

  2. gdm seems to be dead:
systemctl status gdm.service
● gdm.service - GNOME Display Manager
   Loaded: loaded (/usr/lib/systemd/system/gdm.service; enabled; vendor preset: enabled)
   Active: inactive (dead)

and attempting to start it hangs the prompt.

  3. Connecting from a client with vglconnect gives:
/opt/VirtualGL/bin/vglconnect -s -x myusername@myserver.server.com

VirtualGL Client 64-bit v2.5.2 (Build 20170302)
vglclient is already running on this X display and accepting unencrypted
   connections on port 4242.

myusername@myserver.server.com's password: 
[myusername@myserver ~]$ echo $DISPLAY 
XXX.XX.XX.XXX:0
[myusername@myserver ~]$ vglrun -q 30 -samp 4x glxgears 
Error: couldn't open display XXX.XX.XX.XXX:0

Considering that X on the remote box seems to be running on :9, I tried the following in my vglconnect session:

[myusername@myserver ~]$ export DISPLAY=XXX.XX.XX.XXX:9
[myusername@myserver ~]$ vglrun -q 30 -samp 4x glxgears 
Error: couldn't open display XXX.XX.XX.XXX:9

So even though some sort of X server is running on the remote machine, I still don’t seem to be able to connect.

Here is /var/log/Xorg.0.log:

Release Date: 2015-06-16
[    46.987] X Protocol Version 11, Revision 0
[    46.987] Build Operating System:  2.6.32-573.18.1.el6.x86_64 
[    46.987] Current Operating System: Linux myserver.server.com 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64
[    46.987] Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-514.26.2.el7.x86_64 root=/dev/mapper/centos_myserver-root ro rd.lvm.lv=centos_myserver/root crashkernel=auto rd.lvm.lv=centos_myserver/swap rhgb quiet rd.driver.blacklist=nouveau nouveau.modset=0 nouveau.modeset=0 video=vesa:off
[    46.987] Build Date: 06 November 2016  12:43:39AM
[    46.987] Build ID: xorg-x11-server 1.17.2-22.el7 
[    46.987] Current version of pixman: 0.34.0
[    46.987]    Before reporting problems, check http://wiki.x.org
        to make sure that you have the latest version.
[    46.987] Markers: (--) probed, (**) from config file, (==) default setting,
        (++) from command line, (!!) notice, (II) informational,
        (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
[    46.988] (==) Log file: "/var/log/Xorg.0.log", Time: Tue Mar 13 14:02:36 2018
[    46.989] (==) Using config file: "/etc/X11/xorg.conf"
[    46.989] (==) Using config directory: "/etc/X11/xorg.conf.d"
[    46.989] (==) Using system config directory "/usr/share/X11/xorg.conf.d"
[    46.989] Parse error on line 2 of section Files in file /etc/X11/xorg.conf
        "ModulePath="/usr/lib64/nvidia/xorg"" is not a valid keyword in this section.
[    46.989] (EE) Problem parsing the config file
[    46.989] (EE) Error parsing the config file
[    46.989] (EE) 
Fatal server error:
[    46.989] (EE) no screens found(EE) 
[    46.989] (EE) 
Please consult the The X.Org Foundation support 
         at http://wiki.x.org
 for help. 
[    46.989] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
[    46.989] (EE) 
[    46.989] (EE) Server terminated with error (1). Closing log file.

I will attach my Xorg.9.log as a file since it is a bit longer.

  4. I realized I have not updated grub to make the following setting persistent:
nvidia-drm modeset=1

Should I?

  5. Last but not least, I now seem to have a truly headless system: I only get an unblinking cursor on the monitor attached to the remote box. Any idea why? I am attaching the new nvidia-bug-report.

Thanks!
nvidia-bug-report.log (412 KB)

Xorg.9.log (15.7 KB)
nvidia-bug-report.log (385 KB)

The nvidia-bug-report you attached is an old one from 2 days ago.
You have 2 iGPUs, and I don’t know the output layout of your box. Try connecting the monitor to another output, if one is available.
The X server that’s running is a setup screen started by the CentOS graphical initial setup. The X server on the NVIDIA GPU is not starting because of a typo in xorg.conf:

[    46.989] Parse error on line 2 of section Files in file /etc/X11/xorg.conf
        "ModulePath="/usr/lib64/nvidia/xorg"" is not a valid keyword in this section.

remove the ‘=’ between ModulePath and the path. It has to match the xorg.conf from post #4.
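
A quick way to spot this kind of typo before restarting X is to grep for the keyword glued to an '='. The sketch below assumes the config lives at /etc/X11/xorg.conf, as in the log above; find_modulepath_typo and fix_modulepath are hypothetical helper names, and the fix is printed to stdout rather than written in place so it can be reviewed first.

```shell
#!/bin/sh
# Print (with line numbers) any line where ModulePath is followed by '='.
# xorg.conf expects whitespace between the keyword and the quoted path.
find_modulepath_typo() {
    grep -n 'ModulePath[[:space:]]*=' "$1"
}

# Print a corrected copy to stdout; the original file is left untouched.
fix_modulepath() {
    sed 's/ModulePath[[:space:]]*=[[:space:]]*/ModulePath   /' "$1"
}

# Example on the server (not run here):
# find_modulepath_typo /etc/X11/xorg.conf
# fix_modulepath /etc/X11/xorg.conf | diff /etc/X11/xorg.conf -
```

find_modulepath_typo returns a non-zero status when nothing matches, so it can also be used in a shell conditional.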

Hello again,
Sorry about posting the wrong log. Here is the new one.
Thanks :)
nvidia-bug-report.log (385 KB)

Ok, looks like you found the typo yourself. The Xserver is running on the Nvidia, GLX is loading fine.

nvidia-drm modeset=1

is not needed in your configuration.
Maybe use x11vnc first to have a simple connection to the running xserver to see what kind of setup is going on there.
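
Before pointing x11vnc at anything, it can help to see which displays actually have an X server behind them. Local X servers normally create a socket named X<N> under /tmp/.X11-unix for display :N; the sketch below just lists those entries. list_x_displays is a hypothetical helper name, and the x11vnc line in the comment is only an example invocation for the :9 server mentioned in this thread.

```shell
#!/bin/sh
# Print ":N" for every entry named X<N> in the X11 socket directory
# (defaults to /tmp/.X11-unix; another directory can be passed as $1).
list_x_displays() {
    dir=${1:-/tmp/.X11-unix}
    for s in "$dir"/X[0-9]*; do
        [ -e "$s" ] || continue      # glob matched nothing
        echo ":${s##*/X}"
    done
}

# Example on the server (not run here):
# list_x_displays
# x11vnc -display :9 -localhost     # may also need -auth, depending on the setup
```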

Hello generix,

I corrected the typo in /etc/X11/xorg.conf over two hours ago, but after restarting the system /var/log/Xorg.0.log did not get rewritten (possibly something terminates before X gets that far). Note when each log was last written; the xorg.conf has since been corrected and the system rebooted at least twice:

ls -lt /var/log/Xorg.*
-rw-r--r--. 1 root root 15986 Mar 13 16:36 /var/log/Xorg.9.log
-rw-r--r--. 1 root root 16744 Mar 13 16:35 /var/log/Xorg.9.log.old
-rw-r--r--. 1 root root  1984 Mar 13 14:02 /var/log/Xorg.5.log
-rw-r--r--. 1 root root  1984 Mar 13 14:02 /var/log/Xorg.4.log
-rw-r--r--. 1 root root  1984 Mar 13 14:02 /var/log/Xorg.3.log
-rw-r--r--. 1 root root  1984 Mar 13 14:02 /var/log/Xorg.2.log
-rw-r--r--. 1 root root  1984 Mar 13 14:02 /var/log/Xorg.1.log
-rw-r--r--. 1 root root  1984 Mar 13 14:02 /var/log/Xorg.0.log

As for the graphics cards in my system, here they are:

lspci -k | grep -EA2 'VGA|3D'
00:02.0 VGA compatible controller: Intel Corporation 82Q35 Express Integrated Graphics Controller (rev 02)
	Subsystem: Dell OptiPlex 755
	Kernel driver in use: i915
--
01:00.0 3D controller: NVIDIA Corporation Device 1bb3 (rev a1)
	Subsystem: NVIDIA Corporation Device 11d8
	Kernel driver in use: nvidia

Thanks :)

One more comment: I just reran the nvidia-bug-report tool, but the weird thing is that it does not show that libglx should now be coming from the NVIDIA version (as Xorg.9.log does):

grep libglx.so nvidia-bug-report.log
   executing: '/bin/chcon -t textrel_shlib_t /usr/lib64/xorg/modules/extensions/libglx.so.304.125'...
[    46.092] (II) Loading /usr/lib64/xorg/modules/extensions/libglx.so
[    47.002] (II) Loading /usr/lib64/xorg/modules/extensions/libglx.so
[    47.888] (II) Loading /usr/lib64/xorg/modules/extensions/libglx.so
[    48.739] (II) Loading /usr/lib64/xorg/modules/extensions/libglx.so
[    49.656] (II) Loading /usr/lib64/xorg/modules/extensions/libglx.so

nvidia-bug-report.log (385 KB)
Xorg.9.log (15.7 KB)

You currently have a working X server at :9. It comes from the CentOS initial setup, which is displaying a license agreement to accept. So you can either connect to it using x11vnc and click on ‘Finish configuration’, or disable it using the systemctl commands from the end of this thread:
https://www.centos.org/forums/viewtopic.php?t=54709
After a reboot, your installation should work normally.
Grepping the nvidia-bug-report doesn’t yield anything because Xorg.9.log is not included in it, and that’s where everything is happening now.
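
Since the active server now logs to Xorg.9.log rather than Xorg.0.log, it is easier to look at whichever log was written last instead of assuming the display number. A minimal sketch, assuming the logs are in /var/log as in the listing above; newest_xorg_log is a made-up helper name.

```shell
#!/bin/sh
# Print the most recently modified Xorg.<N>.log in a directory
# (Xorg.<N>.log.old files do not match the pattern and are skipped).
newest_xorg_log() {
    ls -t "${1:-/var/log}"/Xorg.*.log 2>/dev/null | head -n 1
}

# Example on the server (not run here):
# log=$(newest_xorg_log)
# grep -i 'libglx' "$log"
```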