Broken OpenGL in 515.48 Linux Driver

For the first time in our experience, the Tesla/datacentre Linux NV*.run package driver 515.48.07 (with CUDA 11.7) that we use on Red Hat/CentOS 7.9 breaks remote-login X Window sessions for ordinary users (but not root). We have replicated this on multiple machines and various GPU cards.

It is clear that there is a serious OpenGL/config problem with this driver as ordinary remote X users experience:

  • black backgrounds
  • missing menu bars and headers in GNOME
  • segmentation faults when trying to test OpenGL with glxgears

This does not happen for the root user in a remote session, and interestingly everything is fine for all users if they are directly on a server console display.

So our conclusion is that 515.48 was never thoroughly tested and is not fit for purpose, and we have rolled back to the previous stable 510.73.08 (CUDA 11.6). We have never had such an experience in many previous versions of the stable datacentre drivers. For clarity, we have tried --uninstall and a yum autoremove of older nvidia/cuda debris, to no effect.

We couldn't find any obvious message in /var/log/messages or the Xorg logs.
So I am open to suggestions, or ideally input from an NVIDIA engineer :)

We are experiencing identical things with the same driver/kernel at our institution.

Running glxgears randomly crashes the entire remote desktop session.

We have A30s on the nodes.

Output on terminal

$ glxgears
Running synchronized to the vertical refresh.  The framerate should be
approximately the same as the monitor refresh rate.
7026 frames in 5.0 seconds = 1405.105 FPS
X connection to :3.0 broken (explicit kill or server shutdown).
$ glxgears
XIO:  fatal IO error 13 (Permission denied) on X server ":3.0"
      after 32 requests (32 known processed) with 5 events remaining.
$ glxgears
Segmentation fault
$ glxgears
XIO:  fatal IO error 13 (Permission denied) on X server ":3.0"
      after 32 requests (32 known processed) with 5 events remaining.
$ glxgears
XIO:  fatal IO error 13 (Permission denied) on X server ":3.0"
      after 32 requests (32 known processed) with 5 events remaining.
$ glxgears
Segmentation fault

The next glxgears command causes the X Window session to terminate.
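
For anyone who wants to reproduce this in a more repeatable way, here is a minimal sketch of a loop that runs the test several times and labels each outcome. The function names classify_status and run_trials are made up for illustration, and it assumes the coreutils timeout command is available. Status 139 is the shell's usual report for a process killed by SIGSEGV (128 + 11), and 124 is what timeout returns when it had to stop the command. Since glxgears normally runs until it is closed, "timeout" here just means the run survived the full 5 seconds.

```shell
#!/bin/sh
# Hypothetical helper: turn an exit status into a short label.
classify_status() {
    case "$1" in
        0)   echo "ok" ;;
        124) echo "timeout" ;;      # stopped by timeout, i.e. it kept running
        139) echo "segfault" ;;     # 128 + SIGSEGV (11)
        *)   echo "error($1)" ;;    # anything else, e.g. an XIO fatal error exit
    esac
}

# Hypothetical helper: run_trials N CMD [ARGS...]
# Runs CMD N times, each limited to 5 seconds, and prints one line per run.
run_trials() {
    n=$1
    shift
    i=1
    while [ "$i" -le "$n" ]; do
        timeout 5 "$@" >/dev/null 2>&1
        status=$?
        echo "run $i: $(classify_status "$status")"
        i=$((i + 1))
    done
}
```

Something like `run_trials 6 glxgears` from inside the remote session should then give one line per attempt. Bear in mind that, as described above, repeated runs can take the whole session down, so save your work first.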

Thanks for that reply - interesting to see we aren’t alone.
Our tests also all hit the segmentation fault very quickly with glxgears in remote user mode.

I note you were on an A30. We had this on V100s, A5000s and some older but still supported cards too, so it's not hardware-specific.

regards

Murray

UPDATE: further to the above, we find that the more recent 515.65 datacentre driver is even worse than 515.48 and just gives a black remote screen.

So we are about to conclude that there is a systemic issue with the current development roadmap for X11 drivers.