Xid errors on NVIDIA Quadro K420 after hibernation

Hi,

My system runs Arch Linux with i3wm and an NVIDIA Quadro K420 graphics card connected via DisplayPort to an HP Z23n monitor. For the past two months I have been getting Xid errors most of the time when I resume the system from hibernation. If I hibernate for a very short time, say a few minutes, everything is fine, but if the hibernation lasts longer, say 2+ hours, the system hangs and becomes jerky on resume and the logs fill with Xid errors. I tried disabling the compositor, but the result is the same. In fact, at the time of writing the compositor is not running, and I just woke the system from hibernation but had to reboot because of the Xid errors.

~❯ journalctl --since="2018-01-29" | grep Xid
Jan 29 01:53:19 aries kernel: NVRM: Xid (PCI:0000:01:00): 31, Ch 00000000, engmask 00000101, intr 10000000
Jan 29 01:53:26 aries kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 00000010
Jan 29 01:53:34 aries kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 00000010
Jan 29 13:07:20 aries kernel: NVRM: Xid (PCI:0000:01:00): 31, Ch 00000000, engmask 00000101, intr 10000000
Jan 29 13:07:26 aries kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 00000008
Jan 29 13:07:34 aries kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 0000000a
Jan 29 13:07:42 aries kernel: NVRM: Xid (PCI:0000:01:00): 8, Channel 0000000a
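
To see at a glance which Xid codes dominate, the grep output above can be tallied per code. This is only a rough sketch: `xid_summary` is a made-up helper name, and it assumes the log lines keep the `NVRM: Xid (PCI:...): <code>,` layout shown above.

```shell
# Summarise NVRM Xid events per code from kernel log lines read on stdin.
# xid_summary is a hypothetical helper, not part of any NVIDIA tooling.
xid_summary() {
    # Extract the numeric Xid code that follows "Xid (PCI:...): "
    sed -n 's/.*NVRM: Xid (PCI:[^)]*): \([0-9][0-9]*\),.*/\1/p' |
        sort -n |
        uniq -c |
        awk '{ print "Xid " $2 ": " $1 }'
}

# Usage on the affected machine (kernel messages only):
#   journalctl -k --since="2018-01-29" | xid_summary
```

Comparing the counts after a short and after a long hibernation may help show whether the same codes appear in both cases.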

Also, this card is just a couple of months old and it is a new system, so a thermal issue is unlikely. Apart from hibernation the system runs fine: there are no crashes or freezes in normal use, and I only run into problems when I hibernate.

Would appreciate any help on this, thanks!

nvidia-bug-report.log.gz (196 KB)

Edit: to add to the above post, my NVIDIA settings are:

~❯ inxi -G
Graphics:  Card: NVIDIA GK107GL [Quadro K420]
           Display Server: X.Org 1.19.6 driver: nvidia Resolution: 1920x1080@60.00hz
           OpenGL: renderer: Quadro K420/PCIe/SSE2 version: 4.5.0 NVIDIA 387.34
~❯

Early KMS is in use with the kernel parameter nvidia-drm.modeset=1, and the initramfs contains nvidia, nvidia_modeset, nvidia_uvm and nvidia_drm, since I'm running rootless Xorg.

nouveau is automatically blacklisted by nvidia

The DPI is set manually to 96 using a 20-nvidia.conf file, because by default it was set to 94x95 for a 23" 1920x1080 monitor.

GFXPAYLOAD_LINUX is set to text and GRAPHICAL_TERMINAL_OUTPUT is disabled.
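
The early KMS setup described above can be double-checked on the running system. A minimal sketch follows; `has_nvidia_modeset` is a hypothetical helper, and the initramfs path in the comments is assumed to be the Arch default, so adjust it if your image lives elsewhere.

```shell
# Return success if the kernel command line passed in $1 enables NVIDIA DRM modesetting.
# has_nvidia_modeset is a hypothetical helper name used only for this sketch.
has_nvidia_modeset() {
    case " $1 " in
        *" nvidia-drm.modeset=1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On the affected machine (paths assumed, not taken from the post):
#   has_nvidia_modeset "$(cat /proc/cmdline)" && echo "modeset requested at boot"
#   cat /sys/module/nvidia_drm/parameters/modeset       # Y when modesetting is active
#   lsinitcpio /boot/initramfs-linux.img | grep nvidia  # modules packed into the initramfs
```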

Thanks

Bump. Is anyone looking into this? I have attached the NVIDIA logs to the first message.

Does this also happen when using suspend, or only hibernate? Did you check with earlier drivers from before this started?

@generix, thanks for responding. I really cannot say for sure, since it is very unpredictable. A few hours ago I suspended for 30 minutes and then hibernated for 15 minutes, and on both occasions I did not get any Xid errors, but yesterday all my attempts to hibernate led to Xid errors. A few weeks ago I also got these errors after suspending.

I do not know what is causing these errors. Here's what I have tried:

  1. Xid errors sometimes disappear with the compositor (compton) disabled, but not always. Today there were no Xid errors even with compton enabled (currently I'm only using backend="glx" plus fading and shadows; all other functions are disabled).

  2. Xid errors sometimes disappear with the older 4.9 kernel, but yesterday using 4.9 did not help.

So I don't think this is caused by the compositor or the kernel. I have not tried any older NVIDIA driver; I have been using the current one (387.34) from the official Arch repos. I see 390 has entered testing. Any idea if this is fixed in the 390 series?

Also, I'm not using the iGPU at all; I have deactivated it in the BIOS, so I do not have the xf86-video-intel driver installed.

Any ideas what I can do to fix this?

I’d downgrade to kernel 4.9 and driver 375.x to see if that combo is stable. Drivers 378 and up have a known memory management bug, maybe your issues are connected to this. 390 should have a workaround but the results are mixed so far afaik.
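
Before and after any downgrade it helps to confirm which kernel and driver are actually loaded. A rough sketch, where `driver_in_flagged_range` is a made-up helper that only checks the major version against the "378 and up" range mentioned above; it says nothing about whether the bug is really present.

```shell
# Succeed if driver version $1 (e.g. 387.34) is in the range flagged above (378 and up).
# driver_in_flagged_range is a hypothetical helper name used only for this sketch.
driver_in_flagged_range() {
    major=${1%%.*}          # keep the part before the first dot
    [ "$major" -ge 378 ]
}

# On the affected machine:
#   uname -r                                   # running kernel version
#   driver_in_flagged_range "$(modinfo -F version nvidia)" && echo "driver is 378 or newer"
```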

Downgrading to kernel 4.9 is not an option. 4.14 will soon become the LTS kernel and mainline will be 4.15; besides, there are some cgroup errors in 4.9 that have been fixed in kernels > 4.10. If there is a known memory bug in NVIDIA drivers > 378, then it should be fixed rather than worked around by downgrading the driver, the kernel, or both.