X freezes, NVIDIA(GPU-0): WAIT

X has been freezing a lot lately, at least once almost every day. The screen freezes completely; the cursor can still move, but the system does not respond to keyboard input (apart from SysRq: I can force a reboot, but it won't even switch ttys after SysRq+r).
The system also does not seem to recover on its own, because simply waiting doesn't help.

[  6270.576] (EE) NVIDIA(GPU-0): WAIT (2, 8, 0x8000, 0x0000fc50, 0x0000fc6c)
[  6277.576] (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x0000fc50, 0x0000fc6c)

(Full xorg.log.old: https://paste.pound-python.org/show/mngWuMPLALEJ0XoYmEer/)
Neither dmesg nor syslog seems to report anything out of the ordinary, as far as I can tell. Today's syslog: https://paste.pound-python.org/show/cBhNiWAMSQwcau1XIQ77/

Anyone got any idea what’s going on here?
Is this maybe a driver bug, or is my GPU dying?

(Emerge --info: https://paste.pound-python.org/show/3cxt0UERiLWimv3QqYYS/)
x11-drivers/nvidia-drivers-418.30
x11-base/xorg-server-1.20.3
media-libs/mesa-19.0.0_rc2

When I ssh into the machine and try to reload the nvidia, nvidia_drm, and nvidia_modeset modules, rmmod -f nvidia never completes, and the machine then refuses to shut down. Here's a syslog: https://paste.pound-python.org/show/iWv1yZjShTSXp8nqHWON/
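For reference, this is the unload sequence I'm attempting (a sketch; the service name assumes Gentoo's OpenRC setup, and a wedged GPU can still make rmmod hang regardless):

```shell
# Sketch of the unload sequence. The modules stack as
# nvidia_drm -> nvidia_modeset -> nvidia, so they have to be removed
# in that order, after nothing is using the GPU anymore.
MODULES="nvidia_drm nvidia_modeset nvidia"

rc-service xdm stop        # stops sddm/X on Gentoo; systemd: systemctl stop sddm
for m in $MODULES; do
    rmmod "$m" || { echo "rmmod $m failed" >&2; break; }
done
```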
And some interesting lines from this log:

Feb 11 09:56:10 andrew-gentoo-pc kernel: [ 2151.026289] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 11 09:56:20 andrew-gentoo-pc kernel: [ 2161.122295] WARNING: CPU: 11 PID: 5556 at /var/tmp/portage/x11-drivers/nvidia-drivers-418.30/work/kernel/nvidia/nv-rsync.c:44 nv_destroy_rsync_info+0x25/0x30 [nvidia]

SDDM fails to stop X:

[10:04:03.092] (WW) DAEMON: Signal received: SIGTERM
[10:04:03.092] (II) DAEMON: Socket server stopping...
[10:04:03.092] (II) DAEMON: Socket server stopped.
[10:04:03.092] (II) DAEMON: Display server stopping...
[10:04:08.097] (WW) DAEMON: QProcess: Destroyed while process ("/usr/libexec/sddm-helper") is still running.
[10:04:08.098] (II) DAEMON: Display server stopping...
[10:04:13.103] (WW) DAEMON: QProcess: Destroyed while process ("/usr/bin/X") is still running.

Also, maybe I should add that I am using the boot parameter "nvidia-drm.modeset=1", because nvidia's documentation claims that this eliminates/reduces tearing (and enables frame synchronization between the GPUs, I think). However, I still see tearing sometimes, but only on the monitor connected to the nvidia GPU.
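In case it helps anyone reproducing this: the parameter can be checked at runtime (the sysfs path is the standard module-parameter location; it prints Y or N once nvidia_drm is loaded):

```shell
# Check whether nvidia-drm.modeset=1 was actually applied.
modeset=$(cat /sys/module/nvidia_drm/parameters/modeset)
if [ "$modeset" = "Y" ]; then
    echo "PRIME sync available (modeset on)"
else
    echo "modeset off - expect tearing on iGPU outputs"
fi
```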

I also have the following in /usr/share/sddm/scripts/Xsetup:

xrandr --setprovideroutputsource modesetting NVIDIA-0
xrandr --auto --output HDMI-1-2 --mode 1600x900 --pos 3360x90 --output DVI-D-0 --mode 1920x1080 --pos 1440x0 --output DP-1-2 --mode 1440x900 --pos 0x90

The first line enables PRIME; nvidia's documentation has just "xrandr --auto" as the second line.
However, when I use the --auto option, KDE completely messes up the monitor configuration (even though sddm detects it just fine): all monitors are placed on top of each other, in a layout that resembles the duplication configuration but is not quite the same, because the resolutions don't match. It used to work fine with two monitors, but ever since I added the third I have to specify the correct configuration manually.

I do not have efifb enabled, because when I enable it I get a low-resolution framebuffer on the monitor connected to the nvidia GPU and no framebuffer on the monitors connected to the intel GPU.
With it disabled I get the opposite: no framebuffer on the monitor connected to the nvidia GPU, but a high-resolution framebuffer on the monitors connected to the intel GPU.

I have also had problems with the HDMI output of the nvidia GPU (see my other thread here: [SOLVED] Problems with nvidia-drivers and 2nd monitor on IGPU).
The monitor would slowly turn completely white whenever I logged in from sddm, switched from a tty to X, or changed the monitor configuration.
I have not had this problem since switching to the DVI-D output, indicating that it was not a problem with the monitor but with the GPU.

See also my thread on the Gentoo forums; I tried to copy most of it here, but I might have missed something.

nvidia-bug-report.log.gz (104 KB)

The most recent crashes included Xid 62 and 56, i.e. something was failing in the display engine. Taking the odd problem with your monitor into account, this might as well be an electrical problem, the monitor overloading the output circuitry. Please disconnect the monitor currently connected to the nvidia GPU to test whether you get a stable system. If no crashes occur, connect one of the other monitors to it.
BTW, since you're using KDE and are setting the monitor placement with xrandr, it's often better to disable the kscreen2 service, because it can have strange side effects.
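If you prefer doing that from a shell instead of System Settings, something like this should work (an assumption on my part: Plasma 5, where kscreen2 runs as a kded5 module; check the module name against your version):

```shell
# Stop kded5 from autoloading the kscreen2 module at login...
kwriteconfig5 --file kded5rc --group Module-kscreen2 --key autoload false
# ...and unload it from the currently running session.
qdbus org.kde.kded5 /kded unloadModule kscreen2
```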

I have switched the monitors around: the one that was connected via DVI-D to the nvidia GPU is now connected to the intel one, and vice versa.
(I can't disconnect it completely, because then there would be no monitor connected to the nvidia GPU; the BIOS would default to the iGPU and disable nvidia completely. It only supports dual GPUs when the dedicated GPU is the default, and it can only be the default if it has a monitor connected to boot on.)

After rebooting I immediately noticed that everything is significantly smaller, even though the resolutions are the same. I also no longer have any tearing on the middle monitor (which used to be connected to nvidia); instead I have the same tearing on the right monitor (which used to be connected to intel).

With PRIME enabled, nvidia is doing all the rendering, right? Why, then, would I get tearing on the only monitor that is directly connected to it? I would expect the other monitors to maybe have some tearing, since there is an extra step involved for them.

I have also disabled kscreen2. I did not remove it completely, because it is part of the plasma-meta package; I would have to remove the meta package manually and install everything except kscreen.

I’ll test this setup and see how it goes.

Could you maybe explain a bit more about the theory behind this overloading? I was under the impression that the monitor sends nothing but basic information about itself back to the PC. Why would this only be a problem over HDMI and not over DVI-D?

[EDIT] With kscreen2 disabled I no longer get the annoying OSD pop up asking me to select the monitor configuration every time I start KDE, so thanks for that :)

[EDIT2] Now my "good" monitor is connected to intel, while my shitty second-hand monitor that I randomly found somewhere is connected to nvidia. Would this have a performance/quality effect on the rendering on my "good" monitor (apart from the tearing having moved to the other monitor)?

My speculation regarding the monitor is about bad grounding, leading to a potential difference that charges the capacitors in the signal path over time.
Regarding tearing: you've already enabled PRIME sync via the kernel parameter nvidia-drm.modeset=1, but that only affects the iGPU-connected monitors. To work around tearing on the nvidia-connected monitor, create an xorg.conf snippet in /etc/X11/xorg.conf.d modeled on /usr/share/X11/xorg.conf.d/50-nvidia-drm-outputclass.conf and add

Option "MetaModes" "DP-0: nvidia-auto-select {ForceCompositionPipeline=On}"
Option "UseNvKmsCompositionPipeline" "false"

Adjust DP-0 to the output you’re actually using.
Since the picture for the iGPU-connected monitors is rendered on the nvidia GPU and then sent over the PCI bus to system memory, there is of course some performance cost to be paid. To what extent can only be measured.
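To put a rough number on the copy itself (a back-of-the-envelope sketch, assuming 4 bytes per pixel at 60 Hz and the two iGPU resolutions from your Xsetup script):

```shell
# Bytes per second copied over the bus for the two iGPU-connected
# monitors: width * height * 4 bytes/pixel * 60 Hz each.
hdmi=$((1600 * 900 * 4 * 60))   # HDMI-1-2
dp=$((1440 * 900 * 4 * 60))     # DP-1-2
echo "$(( (hdmi + dp) / 1000000 )) MB/s"   # prints 656 MB/s
```

That is small next to PCIe 3.0 x16 (roughly 16 GB/s), so the copy bandwidth itself is unlikely to be the bottleneck; any visible cost comes from the extra synchronization steps.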
Edit: changed true to false.

Interesting; bad grounding does seem plausible. I have connected the line-in input of my PC to the headphone output of that same monitor, the aim being to forward audio from the monitor's other input through the PC to the same set of speakers. As soon as I plug the audio cable into the monitor, though, I hear a lot of noise. I bought an insulated audio cable to try to minimize it, but there is still a lot of noise. Last week I gave up on trying to improve it and figured it was some grounding issue as well.

I'm having some problems with the xorg.conf files. I looked at the Arch wiki for an example and put together the following file:

# Fix nvidia tearing
Section "Device"
    Identifier "Nvidia Card"
    Driver     "nvidia"
    VendorName "NVIDIA Corporation"
    BoardName  "GeForce GTX 1060"
EndSection

Section "Screen"
    Identifier "Screen0"
    Device     "Nvidia Card"
    Option     "metamodes" "DVI-D-0: nvidia-auto-select +0+0 {ForceCompositionPipeline=On, ForceFullCompositionPipeline=On}"
    Option     "UseNvKmsCompositionPipeline" "false"
    Option     "AllowIndirectGLXProtocol" "off"
    Option     "TripleBuffer" "on"
EndSection

It does seem to help with tearing; however, I completely lose the monitors connected to the iGPU. They don't even show up in xrandr anymore. What am I missing? I tried replacing nvidia-auto-select with the exact resolution/refresh rate and position, but that didn't help.

Just delete the xorg.conf and create a snippet /etc/X11/xorg.conf.d/20-nvidia-prime.conf

Section "OutputClass"
    Identifier     "nvidia-prime"
    MatchDriver    "nvidia-drm"
    Driver         "nvidia"
    Option         "metamodes" "DVI-D-0: nvidia-auto-select +0+0 {ForceCompositionPipeline=On}"
    Option         "UseNvKmsCompositionPipeline" "false"
    Option         "PrimaryGPU" "true"
    Option         "AllowEmptyInitialConfiguration" "true"
EndSection

Edit: added option to run without connected monitors.

BTW, since you connected a different monitor to the nvidia GPU, I hope you also used different cables/adapters, as those are the most common cause of bad grounding.

Awesome, thanks, that did the trick. Despite my many years of using Linux, the logic and structure behind an xorg config file remain a mystery to me. I suppose the key here was to use OutputClass instead of Screen, which I guess applies the options to the GPU's outputs instead of to a specific screen.

When I switched monitors I also switched cables; however, both are the same DVI-D cables, bought in the same store at the same time, so I don't think they differ in any way.

The Arch wiki claims I need the following options as well, so I added those too:

Option         "AllowIndirectGLXProtocol" "off"
Option         "TripleBuffer" "on"

Tearing is gone now. I used to get it all the time when scrolling through PDFs, always at the same position about halfway up the monitor; I didn't get it when I scrolled through a PDF just now.

I’ll test this configuration to see if X will freeze again or not. So far so good though.

X hasn't crashed for nearly 4 days now, so I guess the monitor was indeed causing the problem. I'm marking this as solved, thanks for your help :)

So apparently I was a bit too quick to think this was fixed, because today it happened again. Interestingly enough, xorg.log doesn't show the NVIDIA(GPU-0): WAIT line; in fact, it doesn't show anything out of the ordinary. Syslog shows me:

Feb 20 17:50:34 andrew-gentoo-pc pulseaudio[4002]: [pulseaudio] module-loopback.c: Too many underruns, increasing latency to 12.00 ms
Feb 20 17:51:39 andrew-gentoo-pc dbus-daemon[3644]: [session uid=105 pid=3642] Reloaded configuration
Feb 20 17:51:39 andrew-gentoo-pc dbus-daemon[3865]: [session uid=1000 pid=3863] Reloaded configuration
Feb 20 17:51:39 andrew-gentoo-pc dbus-daemon[3889]: [session uid=1000 pid=3887] Reloaded configuration
Feb 20 17:55:31 andrew-gentoo-pc dbus-daemon[3644]: [session uid=105 pid=3642] Reloaded configuration
Feb 20 17:55:31 andrew-gentoo-pc dbus-daemon[3889]: [session uid=1000 pid=3887] Reloaded configuration
Feb 20 17:55:31 andrew-gentoo-pc dbus-daemon[3865]: [session uid=1000 pid=3863] Reloaded configuration
Feb 20 17:55:31 andrew-gentoo-pc dbus-daemon[3644]: [session uid=105 pid=3642] Reloaded configuration
Feb 20 17:55:31 andrew-gentoo-pc dbus-daemon[3865]: [session uid=1000 pid=3863] Reloaded configuration
Feb 20 18:03:09 andrew-gentoo-pc kernel: [ 1805.233703] NVRM: GPU at PCI:0000:01:00: GPU-e7705309-e4c3-14d6-c2b9-31ba8aaf6cd9
Feb 20 18:03:09 andrew-gentoo-pc kernel: [ 1805.233705] NVRM: GPU Board Serial Number:
Feb 20 18:03:09 andrew-gentoo-pc kernel: [ 1805.233706] NVRM: Xid (PCI:0000:01:00): 62, 29c10(9310) 84029c1c 1041a100
Feb 20 18:03:42 andrew-gentoo-pc pulseaudio[4002]: [null-sink] asyncq.c: q overrun, queuing locally
Feb 20 18:03:42 andrew-gentoo-pc pulseaudio[4002]: [null-sink] asyncq.c: q overrun, queuing locally
Feb 20 18:03:42 andrew-gentoo-pc kernel: [ 1838.216218] sysrq: SysRq : Keyboard mode set to system default
Feb 20 18:03:42 andrew-gentoo-pc pulseaudio[4002]: [null-sink] asyncq.c: q overrun, queuing locally
Feb 20 18:03:42 andrew-gentoo-pc last message repeated 8 times
Feb 20 18:03:43 andrew-gentoo-pc kernel: [ 1839.043901] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 20 18:03:45 andrew-gentoo-pc kernel: [ 1841.043932] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Feb 20 18:03:47 andrew-gentoo-pc exiting on signal 15

Also, sddm.log didn't print the line about being unable to stop X. However, the frozen X server remained visible on the monitor connected to the nvidia GPU.

Did you ever discover a final fix? I am facing the same issue on a Gentoo system.