Broken OpenGL in 515.48 Linux Driver

For the first time in our experience, the Tesla/datacentre Linux NV*.run package driver 515.48.07 (with CUDA 11.7) that we use on RedHat/CentOS 7.9 breaks remote-login X Window sessions for ordinary users (but not root). We have replicated this on multiple machines and various GPU cards.

It is clear that there is a serious OpenGL/config problem with this driver, as ordinary remote X users experience:

  • black backgrounds
  • missing menu bars and headers in GNOME
  • segmentation faults when trying to test OpenGL with glxgears

This does not happen for the root user remotely, and interestingly it is fine for all users if they are directly on a server console display.

So our conclusion is that 515.48 was never thoroughly tested and is not fit for purpose, and we have rolled back to the previous stable 510.73.08 (CUDA 11.6). We have never had such an experience in many previous versions of the stable datacentre drivers. And for clarity, we have tried --uninstall and a yum autoremove of older nvidia/cuda debris, to no effect.

We couldn't find any obvious message in /var/log/messages or the Xorg logs.
So I am open to suggestions, or ideally an Nvidia engineer :)
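
For anyone wanting to repeat the log check, here is a minimal sketch of the kind of scan described above. The function name scan_log is made up for illustration, the search patterns (the kernel's NVRM prefix, "segfault", and Xorg's (EE) error marker) are only a guess at what might be relevant, and the Xorg log path in the usage comment is an assumption, since a remote session may write its log elsewhere.

```shell
# Hypothetical helper: scan one log file for lines that may relate to the problem.
scan_log() {
    if [ ! -r "$1" ]; then
        echo "$1: not readable"
        return 0
    fi
    # Print matching lines with their line numbers, or say that nothing matched.
    grep -nE 'NVRM|segfault|\(EE\)' "$1" || echo "$1: no matching lines"
}

# Usage (paths are assumptions; reading /var/log/messages usually needs root):
# scan_log /var/log/messages
# scan_log /var/log/Xorg.0.log
```

In our case a scan like this turned up nothing obvious, which is part of why the fault is hard to pin down.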

We are experiencing identical things with the same driver/kernel at our institution.

Running glxgears randomly crashes the entire remote desktop session.

We have A30 on the nodes.

Output on terminal

$ glxgears
Running synchronized to the vertical refresh.  The framerate should be
approximately the same as the monitor refresh rate.
7026 frames in 5.0 seconds = 1405.105 FPS
X connection to :3.0 broken (explicit kill or server shutdown).
$ glxgears
XIO:  fatal IO error 13 (Permission denied) on X server ":3.0"
      after 32 requests (32 known processed) with 5 events remaining.
$ glxgears
Segmentation fault
$ glxgears
XIO:  fatal IO error 13 (Permission denied) on X server ":3.0"
      after 32 requests (32 known processed) with 5 events remaining.
$ glxgears
XIO:  fatal IO error 13 (Permission denied) on X server ":3.0"
      after 32 requests (32 known processed) with 5 events remaining.
$ glxgears
Segmentation fault

The next glxgears command causes the X Window session to terminate.
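
If it helps others reproduce this, the repeated runs above could be scripted roughly as follows. This is only a sketch: repeat_and_report is a made-up helper name, and the usage line assumes the coreutils timeout command is available to stop glxgears after a few seconds, since it otherwise runs until its window is closed.

```shell
# Hypothetical helper: run a command N times and report how each run ended.
# An exit status above 128 conventionally means the process was killed by a
# signal (for example 139 = 128 + 11, SIGSEGV).
repeat_and_report() {
    runs=$1
    shift
    i=1
    while [ "$i" -le "$runs" ]; do
        status=0
        "$@" >/dev/null 2>&1 || status=$?
        if [ "$status" -gt 128 ]; then
            echo "run $i: killed by signal $((status - 128))"
        else
            echo "run $i: exit status $status"
        fi
        i=$((i + 1))
    done
}

# Usage (assumption: timeout is installed; it exits with 124 when the limit is hit):
# repeat_and_report 5 timeout 10 glxgears
```

Be aware that on an affected machine this may take down the whole remote session, as described above.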

Thanks for that reply - interesting to see we aren’t alone.
Our tests with glxgears in remote user mode all hit the segmentation fault really quickly too.

I note you were on an A30. We had this on V100s, A5000s and some older but still supported cards too, so it's not hardware specific.

regards

Murray