XServer lockup in Ubuntu 19.10 (kernel 5.3.0) with RTX4000

OS: Kubuntu 19.10 (earlier versions’ installers end in an unusable state because of garbled X server output)

(OpenSuse 15 Leap will install beautifully but upon startx, the X server shows garbled graphics, even in recovery mode, leaving the installation unusable.)

nvidia-settings says this system is ineligible for the NVIDIA Prime switch to Intel graphics. The mobo has integrated VGA graphics, but the BIOS offers no option to select them while a card is installed. So one would need to remove the RTX4000 card to get Intel graphics, AFAIK.

This is a dual-boot machine; the graphics work well in Windows 10 without incident, as well as with the nouveau open-source driver in Linux.

Note that the display port “1” is covered by part of the case so the cable is plugged into display port “2”. (Just in case that may be affecting anything.)

The symptoms are that the X server freezes completely and nvidia-smi says something akin to “the device has fallen off the bus”. (Please see the end of my log file ‘nvidia-driver.log’.) The ‘nvidia-bug-report.log’ is attached as well. I wrote a script that runs nvidia-smi every five seconds and runs the crash report if nvidia-smi says the GPU is missing.
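For anyone who wants to reproduce that kind of watchdog, here is a minimal sketch in Python. It is not the script I actually ran. It assumes that a non-zero exit code or the phrase "fallen off the bus" in the nvidia-smi output means the GPU is gone, and that `nvidia-bug-report.sh` (shipped with the driver) is on the PATH. The names `gpu_missing`, `poll_once`, and `watch` are made up for illustration.

```python
import subprocess
import time

POLL_SECONDS = 5  # polling interval described in the post

# Assumption: substrings that indicate a lost GPU; extend to match your own logs.
FAILURE_MARKERS = ("fallen off the bus",)


def gpu_missing(returncode, output):
    """Return True if an nvidia-smi run looks like the GPU is gone."""
    text = output.lower()
    return returncode != 0 or any(marker in text for marker in FAILURE_MARKERS)


def poll_once(run=subprocess.run):
    """Run nvidia-smi once and report whether the GPU appears to be missing."""
    try:
        proc = run(["nvidia-smi"], capture_output=True, text=True, timeout=30)
    except subprocess.TimeoutExpired:
        # A hung nvidia-smi is treated the same as a failed one.
        return True
    return gpu_missing(proc.returncode, proc.stdout + proc.stderr)


def watch():
    """Poll until the GPU disappears, then collect a bug report."""
    while not poll_once():
        time.sleep(POLL_SECONDS)
    # nvidia-bug-report.sh usually needs root to gather everything.
    subprocess.run(["nvidia-bug-report.sh"], check=False)
```

Calling `watch()` as root from a terminal or a systemd unit starts the loop. The `run` parameter on `poll_once` exists only so the check can be exercised without a GPU present.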

The computer and RTX4000 card were selected to work together for high-performance computing using OpenCV with CUDA, but on a Linux platform.

It’s a six-core Intel(R) Xeon(R) E-2276ME CPU @ 2.80GHz, on an Adlink motherboard.

The issue occurs with the proprietary nvidia-drivers-430, -435, and -440. It does not happen with the nouveau driver. But the nouveau driver does not support the resolution of the monitor I happen to have attached. Additionally, the nouveau driver cannot be a long-term fallback because we need to have CUDA working on this machine.

Thank you for your help.

nvidia-driver.log (717.0 KB)

nvidia-bug-report.log (1.1 MB)

You’re running into an XID 79 error, most often caused by overheating or insufficient power. The RTX4000 draws a lot of power with high spikes. Please check/replace PSU or cables.
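If you want to confirm the Xid code yourself, the kernel log can be scanned for the driver's Xid lines. The following is a rough sketch, assuming the lines have the commonly seen shape `NVRM: Xid (<PCI address>): <code>, <details>`; the regular expression and the `find_xids` helper are illustrative, not something the driver provides.

```python
import re

# Assumption: Xid events appear in the kernel log roughly as
#   NVRM: Xid (<PCI address>): <code>, <details>
XID_RE = re.compile(r"NVRM: Xid \((?P<dev>[^)]+)\): (?P<code>\d+),?\s*(?P<detail>.*)")


def find_xids(log_text):
    """Return (device, xid_code, detail) tuples for each Xid line in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("dev"), int(match.group("code")), match.group("detail").strip())
            )
    return events
```

Feeding it the text of `dmesg` or `journalctl -k` and filtering for code 79 should show whether every lockup is the same event.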

Thank you. That provides some insight into the nature of the issue. But how come I can use Windows 10 Workstation Professional for hours on end without any lockups? In Windows, I hear the fans spinning up and slowing down to handle heat generated by CPU/GPU usage. Sometimes it’s caused by Windows 10 doing things in the background related to its hands-off update scheme. But regardless of fan speed, the system remains stable. I’m not doing any intensive 3D rendering or GPU computing in either OS. Just IDEs, browser tabs, text editors, etc.

I re-seated all the power connections that I could access without removing the PSU, to no avail. The 8-pin connection to the PSU appears to be seated. I don’t know what the wattage rating of the PSU is. It’s a Seasonic Focus Platinum. Since I did not build this rig, I will have to disassemble it to find the wattage of the PSU. Seasonic makes good PSUs, but apparently they do not deem it necessary to print the wattage on the side that one sees when the PSU is deployed in the case. The PSU very well could be on the lean side for such a card as the RTX4000. But again, neither Windows 10 nor nouveau in Linux seems to run into trouble with it.

Additionally, when the issue occurs in Linux, the fans spin at max speed. Sometimes not right away when it locks up, but within about half a minute, they are consistently at or near maximum speed. Which suggests that the CPU may be taxed by something related to the GPU lockup? I will add to my logging to gather CPU stats.
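As a starting point for the CPU logging, per-core load can be derived from two snapshots of `/proc/stat`. This is only a sketch, assuming the modern field layout (user, nice, system, idle, iowait, irq, softirq, steal); `parse_proc_stat` and `core_usage` are names I made up.

```python
def parse_proc_stat(text):
    """Map each 'cpuN' line of /proc/stat to (busy_ticks, total_ticks)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip non-CPU lines and the aggregate 'cpu' line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # user, nice, system, idle, iowait, irq, softirq, steal
        ticks = [int(value) for value in fields[1:9]]
        idle = ticks[3] + ticks[4]  # idle + iowait
        total = sum(ticks)
        stats[fields[0]] = (total - idle, total)
    return stats


def core_usage(before, after):
    """Percent busy per core between two parse_proc_stat() snapshots."""
    usage = {}
    for core, (busy1, total1) in after.items():
        busy0, total0 = before.get(core, (0, 0))
        elapsed = total1 - total0
        usage[core] = 100.0 * (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return usage
```

Reading `/proc/stat` at each five-second poll and logging `core_usage(previous, current)` would show whether a single core is pinned when the lockup happens.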

When the GPU falls off the bus, Xorg goes into a loop consuming 100% CPU on one core. Also, the GPU goes into an emergency mode, spinning the fans to 100%. So that is rather normal behaviour under the circumstances.
Please also check if you have this Seasonic model:
https://knowledge.seasonic.com/article/20-focus-plus-and-gpu-potential-compatibility-issues
You can try mitigating it by limiting clocks using nvidia-smi
https://forums.developer.nvidia.com/t/quadro-rtx-6000-causes-hpe-server-to-power-off-peaks-way-over-power-limit/72215/8
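A small sketch of what limiting clocks could look like when scripted, assuming a driver whose nvidia-smi supports the `-lgc` (lock GPU clocks) and `-rgc` (reset GPU clocks) options. The helper names are hypothetical, and the clock values you pass in are yours to choose; nothing here is a recommended setting for the RTX4000.

```python
import subprocess


def lock_gpu_clocks_cmd(min_mhz, max_mhz, gpu_index=0):
    """Build the nvidia-smi command that locks GPU clocks to [min_mhz, max_mhz]."""
    if not 0 < min_mhz <= max_mhz:
        raise ValueError("expected 0 < min_mhz <= max_mhz")
    return ["nvidia-smi", "-i", str(gpu_index), "-lgc", f"{min_mhz},{max_mhz}"]


def reset_gpu_clocks_cmd(gpu_index=0):
    """Build the nvidia-smi command that removes a previous clock lock."""
    return ["nvidia-smi", "-i", str(gpu_index), "-rgc"]


def apply(cmd):
    """Run one of the commands above; nvidia-smi needs root for these options."""
    return subprocess.run(cmd, check=True)
```

For example, `apply(lock_gpu_clocks_cmd(300, 1500))` run as root would cap the clocks at a placeholder 1500 MHz, and `apply(reset_gpu_clocks_cmd())` would undo it. If the lockups stop with a lower cap, that points further toward power spikes.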