CentOS 7 headless with nVidia drivers installed, OpenGL not using nVidia drivers, only llvmpipe

3352 tty1 Ssl+ 0:02 /usr/bin/X -core -noreset :0 -seat seat0 -auth /var/run/lightdm/root/:0 -nolisten tcp vt1 -novtswitch -background none
4001 pts/0 Sl 0:13 /opt/TurboVNC/bin/Xvnc :1 -desktop TurboVNC: takim2:1 (rsalomon) -auth /home/rsalomon/.Xauthority -geometry 1240x900 -depth 24 -rfbwait 120000 -rfbauth /home/rsalomon/.vnc/passwd -x509cert /home/rsalomon/.vnc/x509_cert.pem -x509key /home/rsalomon/.vnc/x509_private.pem -rfbport 5901 -fp catalogue:/etc/X11/fontpath.d -deferupdate 1 -dridir /usr/lib64/dri -registrydir /usr/lib64/xorg
5342 pts/1 S+ 0:00 grep --color=auto X

Also, just to keep up to date, here’s my current X config, which I’m not sure is correct since this should be a headless system, but X does indeed start without issue, and is using the NVIDIA driver
xorg.conf (1.3 KB)

The ps output confirms that the X server is running and that only one is running. Since glxinfo -display :0 gives you the correct output, vglserver_config also succeeded and you have access to the X server.

I just tested this, and the only way I could get vglrun to hang instead of working or bailing out instantly was by using a TCP connection instead of a Unix socket connection to a non-answering host.

Does
vglrun -d localhost:0 glxinfo
hang or instantly give an error?

It just hangs.
Also I should clarify, it doesn’t freeze/hang the terminal, it merely gives no output at all

I don’t understand, does it return immediately or do you have to hit ctrl+c to stop it?

Ah patience was needed.

It wasn’t returning at all at first, I could type things in the lines below, it was displaying what I type onto the terminal, but not actually returning the command prompt.

HOWEVER
after a long while, it finally returned, with the following before returning:

name of display: :1.0
[VGL] ERROR: Could not open display localhost:0.

Now if I retry the command, it returns immediately with that same output:

name of display: :1.0
[VGL] ERROR: Could not open display localhost:0.

To clarify, this is within a terminal window within the TurboVNC session, connected to the server with the Nvidia GPU

Just to make sure,
glxinfo -display :0
still gives you an output?
If you run
vglrun -d :0 glxinfo
(NB: without localhost) does it also return after a while, and if so, with which error message?

glxinfo -display :0
still returns output

For vglrun -d :0 glxinfo

at first it immediately displays
name of display: :1.0

and is sitting there, not returned to prompt yet,
I started it at 10:08PM, I’ll see if in 5-10 minutes or so it’ll return more

Doesn’t make sense. Do you have selinux enabled? Anything in journal?

I will also add, this is with a VirtualGL install using rpm -i VirtualGL-3.0.x86_64.rpm

Nope, SELinux is disabled.
However, FIPS is enabled on this host.

I didn’t see anything of note in journalctl.
I also didn’t see anything of note in any of the /var/log files

I didn’t yet try the 3.0 version. Care to uninstall it and use the 2.6.5 version instead? That’s well tested for years now.

Ok, will do. Might be a delay in next comment, will be in a meeting soon. I’ll let ya know tho!

I will add first that my original attempts were with VirtualGL-2.5.2-1.el7.x86_64 , which seems to be the latest available in yum in our current synced repos.

I’ll try the version you noted from rpm

Huh. After a long delay, it eventually came back with the appropriate output.

This system must have a few separate issues though, since initially I was trying this in xfce4-terminal, which seemed to crash upon receiving any large amount of output printed to it.

I installed xterm and used that now, and it doesn’t have that issue.
However, vglrun still seems to carry with it quite a large delay.

My most recent run I started at 11:54.

I’ll keep the window open with Always On Top enabled, so I can note when it finally returns the result of vglrun -d :0 glxinfo again

The result finally came back… at 12:20pm.

So, a delay of a bit more than 25 minutes.
That’s so weird.

Are you accidentally using the nvidia gpu on the mars rover?
It’s really weird. As I said, there’s no magic involved: it should either work or fail instantly. Working after a looong time, I can’t make heads or tails of it. Better to contact the VirtualGL dev, he should know what’s going on.

Hah exactly! Extremely weird.

Also will do! I had already had in mind that from here this seemed VirtualGL specific, as you noted, but thanks anyway for sticking with me a bit longer to help poke at this!

I’ll reach out to them, and will update here with the result!
I also am working just a half day today and Friday, full day tomorrow, but I’ll keep you posted!

OK! Interesting development!

I was testing using my user account, not root.
I tested the vglrun command using root, and it returned immediately, with the correct info.

So, perhaps permissions issue or an AD issue, since this box is joined to our domain?

I’ll check this with VirtualGL support, just wanted to add a finding here as well

Found the root of the issue, or at least part of the story!
GLCache and AutoFS NFS-mounted home!

Using either export __GL_SHADER_DISK_CACHE=0 to disable the shader cache, or
export __GL_SHADER_DISK_CACHE_PATH=/tmp

enables everything to work immediately!
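
For anyone who wants to apply this workaround only where the home directory is network-mounted, here is a minimal sketch. It assumes GNU stat (where stat -f -c %T prints the filesystem type). The helper name cache_dir_for_fstype and the per-user /tmp/nv-glcache-UID directory are made up for illustration; only the __GL_SHADER_DISK_CACHE_PATH variable comes from the post above.

```shell
# Hypothetical helper: map a filesystem type to a local cache directory.
# Prints a per-user path under /tmp for NFS types, nothing otherwise.
cache_dir_for_fstype() {
    case "$1" in
        nfs*) echo "/tmp/nv-glcache-$(id -u)" ;;
        *)    echo "" ;;
    esac
}

# GNU stat: -f reports on the filesystem, %T is its type in human-readable form.
home_fstype=$(stat -f -c %T "${HOME:-/}" 2>/dev/null || echo unknown)
cache_dir=$(cache_dir_for_fstype "$home_fstype")

# Only redirect the shader cache if the local directory could be prepared.
if [ -n "$cache_dir" ] && mkdir -p "$cache_dir" && chmod 700 "$cache_dir"; then
    export __GL_SHADER_DISK_CACHE_PATH="$cache_dir"
fi
```

This would need to be sourced (for example from a shell profile) before vglrun is started, so that the variable ends up in the application's environment. On a local home directory it changes nothing.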

The VirtualGL dev said they’ve never seen this behavior before, and wondered if there could be something in the driver that would cause this bad interaction with NFS-mounted homes.
I write this knowing there could be something further in our infrastructure that could be the cause, but I first wanted to check with you in case there is some particular way the cache is handled that would cause this, considering that it’s stable and repeatable behavior, not random.

Also for the record, I found this out via

strace vglrun +tr glxinfo

which, in the lagged time period, yielded output like this:

poll([{fd=4, events=POLLIN}], 1, -1) = 1 ([{fd=4, revents=POLLIN}])
recvmsg(4, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1\0023\0\0\0\0\0H\2@\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 32
getpid() = 16364
getpid() = 16364
getpid() = 16364
getpid() = 16364
getuid() = 37869
geteuid() = 37869
getgid() = 37869
getegid() = 37869
getuid() = 37869
geteuid() = 37869
getgid() = 37869
getegid() = 37869
stat("/home/username/.nv/GLCache", {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0
mkdir("/home/username/.nv/", 0700) = -1 EEXIST (File exists)
mkdir("/home/username/.nv/GLCache", 0700) = -1 EEXIST (File exists)
mkdir("/home/username/.nv/GLCache/9f7e9a18c64449cd7a0049bdadc5d015/", 0700) = -1 EEXIST (File exists)
mkdir("/home/username/.nv/GLCache/9f7e9a18c64449cd7a0049bdadc5d015/cf07cf62221389dd/", 0700) = -1 EEXIST (File exists)
open("/home/username/.nv/GLCache/9f7e9a18c64449cd7a0049bdadc5d015/cf07cf62221389dd/c1b003dd27d07ca9.toc", O_RDWR|O_CLOEXEC) = 15

fcntl(15, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_CUR, l_start=0, l_len=1}) = -1 EIO (Input/output error)
close(15) = 0
getuid() = 37869
geteuid() = 37869
getgid() = 37869
getegid() = 37869
getuid() = 37869
geteuid() = 37869
getgid() = 37869
getegid() = 37869
stat("/home/username/.nv/GLCache", {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0
mkdir("/home/username/.nv/", 0700) = -1 EEXIST (File exists)
mkdir("/home/username/.nv/GLCache", 0700) = -1 EEXIST (File exists)
mkdir("/home/username/.nv/GLCache/9f7e9a18c64449cd7a0049bdadc5d015/", 0700) = -1 EEXIST (File exists)
mkdir("/home/username/.nv/GLCache/9f7e9a18c64449cd7a0049bdadc5d015/cf07cf62221389dd/", 0700) = -1 EEXIST (File exists)
open("/home/username/.nv/GLCache/9f7e9a18c64449cd7a0049bdadc5d015/cf07cf62221389dd/c1b003dd27d07caa.toc", O_RDWR|O_CLOEXEC) = 15
fcntl(15, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_CUR, l_start=0, l_len=1}) = -1 EIO (Input/output error)

and it repeats from there
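
To get a feel for how often the lock attempt fails, the trace can be written to a file (strace's -o option) and the failing calls counted. A rough sketch follows; the helper name count_lock_errors and the sample file are only for illustration, and the pattern is written against the fcntl lines shown above.

```shell
# Hypothetical helper: count fcntl(F_SETLK) calls that failed with EIO
# in an strace log, e.g. one written by: strace -f -o FILE vglrun +tr glxinfo
count_lock_errors() {
    # grep -c prints the count; it exits non-zero when the count is 0,
    # so "|| true" keeps the helper usable under "set -e".
    grep -c 'fcntl(.*F_SETLK.*= -1 EIO' "$1" || true
}

# Small sample built from lines of the trace above.
trace=$(mktemp)
cat > "$trace" <<'EOF'
fcntl(15, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_CUR, l_start=0, l_len=1}) = -1 EIO (Input/output error)
close(15) = 0
getuid() = 37869
fcntl(15, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_CUR, l_start=0, l_len=1}) = -1 EIO (Input/output error)
EOF

count_lock_errors "$trace"
rm -f "$trace"
```

On a real trace, a large count here would suggest that most of the delay is the driver retrying the lock on one cache file after another.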

home directory edited here since this is public
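
Since the failing call is a file lock, a quick way to check whether locking works at all on the NFS home, independent of the driver, might look like the sketch below. Note the assumptions: it uses flock(1) from util-linux, which calls flock(2) rather than the fcntl(F_SETLK) seen in the trace. As far as I know, the Linux NFS client emulates flock(2) with byte-range locks by default, so it should exercise the same lock service, but it is not the identical call. The helper name probe_lock is made up.

```shell
# Hypothetical helper: report whether a lock can be taken on a scratch file in DIR.
# Prints "ok", "failed", or "missing"; always returns 0.
probe_lock() {
    dir="$1"
    [ -d "$dir" ] || { echo "missing"; return 0; }
    probe="$dir/.lock-probe.$$"
    touch "$probe" 2>/dev/null || { echo "failed"; return 0; }
    # timeout(1) bounds the wait, since a broken lock service may stall for a long time.
    if timeout 10 flock -n "$probe" true 2>/dev/null; then
        echo "ok"
    else
        echo "failed"
    fi
    rm -f "$probe"
}

probe_lock "${TMPDIR:-/tmp}"      # local scratch directory, for comparison
probe_lock "$HOME/.nv/GLCache"    # cache location seen in the strace output
```

If the first call prints ok and the second prints failed, that points at file locking on the home mount rather than at VirtualGL itself.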

Nice find. Without further inspection, I suspect a bug in the NVIDIA driver if the shader cache can’t be created because there is no home directory?
Please check your winbind settings.