For the first time in our experience, the Tesla/datacentre Linux NV*.run driver package 515.48.07 (with CUDA 11.7) that we use on RedHat/CentOS 7.9 breaks remote-login X Window sessions for ordinary users (but not root). We have replicated this on multiple machines and various GPU cards.
It is clear that there is a serious OpenGL/config problem with this driver as ordinary remote X users experience:
black backgrounds
missing menu bars and headers in GNOME
segmentation faults when trying to test OpenGL with glxgears
This does not happen for root in a remote session - and, interestingly, everything is fine for all users if they are directly on a server console display.
So our conclusion is that 515.48 was never thoroughly tested and is not fit for purpose, and we have rolled back to the previous stable 510.73.08 (CUDA 11.6) - we have never had such an experience in many previous versions of the stable datacentre drivers. And for clarity, we have tried --uninstall and a yum autoremove of older nvidia/cuda debris, to no effect.
We couldn't find any obvious message in /var/log/messages or the Xorg logs.
So we are open to suggestions - or, ideally, input from an NVIDIA engineer :)
We are experiencing identical things with the same driver/kernel at our institution.
Running glxgears randomly crashes the entire remote desktop session.
We have A30s on the nodes.
Output on terminal:
$ glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
7026 frames in 5.0 seconds = 1405.105 FPS
X connection to :3.0 broken (explicit kill or server shutdown).
$ glxgears
XIO: fatal IO error 13 (Permission denied) on X server ":3.0"
after 32 requests (32 known processed) with 5 events remaining.
$ glxgears
Segmentation fault
$ glxgears
XIO: fatal IO error 13 (Permission denied) on X server ":3.0"
after 32 requests (32 known processed) with 5 events remaining.
$ glxgears
XIO: fatal IO error 13 (Permission denied) on X server ":3.0"
after 32 requests (32 known processed) with 5 events remaining.
$ glxgears
Segmentation fault
The next glxgears command causes the X Window session to terminate.
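For anyone who wants to reproduce the sequence above in a more repeatable way, here is a minimal sketch, not something we ran for the output above. It assumes GNU `timeout` is available, and the function names `classify_status` and `repeat_test` are made up for illustration. It relies on the usual shell convention that a process killed by SIGSEGV shows up as exit status 139 (128 + 11), and that `timeout` returns 124 when it had to stop a command that was still running.

```shell
#!/bin/sh
# Hypothetical helpers for counting how often a command crashes.

# Map an exit status to a short label.
classify_status() {
    case "$1" in
        0)   echo "clean-exit" ;;
        124) echo "still-running" ;;   # timeout(1) stopped it after the limit
        139) echo "segfault" ;;        # 128 + SIGSEGV (11)
        *)   echo "other-failure" ;;   # e.g. the X connection was broken
    esac
}

# Run a command a number of times, each under a time limit,
# and print one labelled line per run.
# Usage: repeat_test RUNS SECONDS COMMAND [ARGS...]
repeat_test() {
    runs=$1
    secs=$2
    shift 2
    i=1
    while [ "$i" -le "$runs" ]; do
        timeout "$secs" "$@" >/dev/null 2>&1
        status=$?
        echo "run $i: $(classify_status "$status")"
        i=$((i + 1))
    done
}
```

After sourcing the file, something like `repeat_test 10 5 glxgears`, run once as an ordinary remote user and once as root, should make the difference between the two cases easy to compare. A "still-running" label means glxgears survived the full 5 seconds. Be aware that on an affected system this may take down the remote desktop session, as described above.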
Thanks for that reply - interesting to see we aren’t alone.
Our tests with glxgears in remote user mode all gave the segmentation fault really quickly too.
I note you were on an A30 - we had this on V100s, A5000s and some older but still supported cards too, so it's not hardware specific.