Intermittent nvidia-modeset timeout failures (NV_ERR_TIMEOUT) on Linux kernel 5.19.3

Since upgrading to kernel 5.19 and driver 515.65.01, I’ve been experiencing modeset failures on startup.

I boot with nvidia-drm.modeset=1 in the kernel parameters so that PRIME works correctly and the iGPU can also drive the extra monitors.
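For anyone reproducing this, it can help to confirm that the parameter actually took effect on a given boot. The snippet below is a minimal sketch, assuming the nvidia_drm module exposes its modeset parameter at the usual sysfs path and reports it as Y or N; the function name modeset_enabled is just an illustration.

```python
from pathlib import Path

# Assumed location of the nvidia_drm "modeset" parameter in sysfs.
DEFAULT_PARAM = "/sys/module/nvidia_drm/parameters/modeset"


def modeset_enabled(param_path=DEFAULT_PARAM):
    """Return True for 'Y', False for any other value, None if unreadable.

    None usually means the module is not loaded or the path differs
    on this system.
    """
    try:
        value = Path(param_path).read_text().strip()
    except OSError:
        return None
    return value == "Y"


if __name__ == "__main__":
    print("nvidia-drm modeset enabled:", modeset_enabled())
```

A result of None on a failing boot would point at the module not being loaded at all rather than at the timeout itself.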

This setup has been working correctly for months, but now roughly half of the time the modeset times out:

Aug 24 11:52:27 ArmchairTraveller kernel: nvidia-modeset: ERROR: GPU:0: Display engine push buffer channel allocation failed: 0x65 (Call timed out [NV_ERR_TIMEOUT])
Aug 24 11:52:27 ArmchairTraveller kernel: nvidia-modeset: ERROR: GPU:0: Failed to allocate display engine core DMA push buffer
Aug 24 11:52:39 ArmchairTraveller kernel: nvidia-modeset: ERROR: GPU:0: Display engine push buffer channel allocation failed: 0x65 (Call timed out [NV_ERR_TIMEOUT])
Aug 24 11:52:39 ArmchairTraveller kernel: nvidia-modeset: ERROR: GPU:0: Failed to allocate display engine core DMA push buffer
Aug 24 11:52:51 ArmchairTraveller kernel: nvidia-modeset: ERROR: GPU:0: Display engine push buffer channel allocation failed: 0x65 (Call timed out [NV_ERR_TIMEOUT])
Aug 24 11:52:51 ArmchairTraveller kernel: nvidia-modeset: ERROR: GPU:0: Failed to allocate display engine core DMA push buffer
Aug 24 11:52:55 ArmchairTraveller kernel: BUG: kernel NULL pointer dereference, address: 0000000000000070
Aug 24 11:52:55 ArmchairTraveller kernel: #PF: supervisor read access in kernel mode
Aug 24 11:52:55 ArmchairTraveller kernel: #PF: error_code(0x0000) - not-present page
Aug 24 11:52:55 ArmchairTraveller kernel: PGD 0 P4D 0
Aug 24 11:52:55 ArmchairTraveller kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Aug 24 11:52:55 ArmchairTraveller kernel: CPU: 12 PID: 981 Comm: Xorg Tainted: P           OE     5.19.3-arch1-1 #1 83cb97ae0c76841ed5ae1e3429386aa2a602dddd
Aug 24 11:52:55 ArmchairTraveller kernel: Hardware name: Micro-Star International Co., Ltd. MS-7D32/MAG Z690 TOMAHAWK WIFI (MS-7D32), BIOS H.20 03/02/2022
Aug 24 11:52:55 ArmchairTraveller kernel: RIP: 0010:_nv002522kms+0x18/0x70 [nvidia_modeset]
Aug 24 11:52:55 ArmchairTraveller kernel: Code: ff c6 44 24 2f 01 eb af 66 2e 0f 1f 84 00 00 00 00 00 41 54 55 49 89 fc 53 89 d5 41 b8 04 00 00 00 ba 02 01 02 00 48 83 ec 10 <8b> 46 70 8b 3d ff dc 0f 00 48 8d 4c 24 0c 89 ee 89 44 24 0c e8 bf
Aug 24 11:52:55 ArmchairTraveller kernel: RSP: 0018:ffffb59949d2bc70 EFLAGS: 00010282
Aug 24 11:52:55 ArmchairTraveller kernel: RAX: 0000000000000000 RBX: 0000000020020000 RCX: 000000000001800c
Aug 24 11:52:55 ArmchairTraveller kernel: RDX: 0000000000020102 RSI: 0000000000000000 RDI: ffff9e0486b83008
Aug 24 11:52:55 ArmchairTraveller kernel: RBP: 0000000000010009 R08: 0000000000000004 R09: 000000008040003e
Aug 24 11:52:55 ArmchairTraveller kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff9e0486b83008
Aug 24 11:52:55 ArmchairTraveller kernel: R13: ffff9e0486b830a0 R14: 0000000000000fff R15: 0000000000010008
Aug 24 11:52:55 ArmchairTraveller kernel: FS:  00007fde50af9980(0000) GS:ffff9e0c10300000(0000) knlGS:0000000000000000
Aug 24 11:52:55 ArmchairTraveller kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 24 11:52:55 ArmchairTraveller kernel: CR2: 0000000000000070 CR3: 000000013b0a4002 CR4: 0000000000770ee0
Aug 24 11:52:55 ArmchairTraveller kernel: PKRU: 55555554
Aug 24 11:52:55 ArmchairTraveller kernel: Call Trace:
Aug 24 11:52:55 ArmchairTraveller kernel:  <TASK>
Aug 24 11:52:55 ArmchairTraveller kernel:  _nv002521kms+0xb3/0x150 [nvidia_modeset 65a1b9434afc207f79e740d6611c0e4e25e2120d]
Aug 24 11:52:55 ArmchairTraveller kernel:  _nv002295kms+0x4da/0x720 [nvidia_modeset 65a1b9434afc207f79e740d6611c0e4e25e2120d]
Aug 24 11:52:55 ArmchairTraveller kernel:  ? __check_object_size+0x1f8/0x250
Aug 24 11:52:55 ArmchairTraveller kernel:  ? _nv000448kms+0xa0/0xa0 [nvidia_modeset 65a1b9434afc207f79e740d6611c0e4e25e2120d]
Aug 24 11:52:55 ArmchairTraveller kernel:  _nv000633kms+0x34/0x50 [nvidia_modeset 65a1b9434afc207f79e740d6611c0e4e25e2120d]
Aug 24 11:52:55 ArmchairTraveller kernel:  nvKmsIoctl+0x94/0x1d0 [nvidia_modeset 65a1b9434afc207f79e740d6611c0e4e25e2120d]
Aug 24 11:52:55 ArmchairTraveller kernel:  nvkms_ioctl+0x11b/0x190 [nvidia_modeset 65a1b9434afc207f79e740d6611c0e4e25e2120d]
Aug 24 11:52:55 ArmchairTraveller kernel:  nvidia_frontend_unlocked_ioctl+0x39/0x50 [nvidia 70b02d69ccb657a69795c50c60564b1b5a9176b9]
Aug 24 11:52:55 ArmchairTraveller kernel:  __x64_sys_ioctl+0x91/0xd0
Aug 24 11:52:55 ArmchairTraveller kernel:  do_syscall_64+0x5c/0x90
Aug 24 11:52:55 ArmchairTraveller kernel:  ? syscall_exit_to_user_mode+0x1b/0x40
Aug 24 11:52:55 ArmchairTraveller kernel:  ? do_syscall_64+0x6b/0x90
Aug 24 11:52:55 ArmchairTraveller kernel:  ? syscall_exit_to_user_mode+0x1b/0x40
Aug 24 11:52:55 ArmchairTraveller kernel:  ? do_syscall_64+0x6b/0x90
Aug 24 11:52:55 ArmchairTraveller kernel:  ? do_syscall_64+0x6b/0x90
Aug 24 11:52:55 ArmchairTraveller kernel:  entry_SYSCALL_64_after_hwframe+0x63/0xcd
Aug 24 11:52:55 ArmchairTraveller kernel: RIP: 0033:0x7fde514689e
Aug 24 11:52:55 ArmchairTraveller kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
Aug 24 11:52:55 ArmchairTraveller kernel: RSP: 002b:00007ffe666e6480 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Aug 24 11:52:55 ArmchairTraveller kernel: RAX: ffffffffffffffda RBX: 00000000c0106d00 RCX: 00007fde514689ef
Aug 24 11:52:55 ArmchairTraveller kernel: RDX: 00007ffe666e64e0 RSI: 00000000c0106d00 RDI: 0000000000000013
Aug 24 11:52:55 ArmchairTraveller kernel: RBP: 00007ffe666e64e0 R08: 00007ffe666e5a70 R09: 00007ffe666e5a8c
Aug 24 11:52:55 ArmchairTraveller kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000013
Aug 24 11:52:55 ArmchairTraveller kernel: R13: 00007ffe666e6530 R14: 00005579f70a3950 R15: 00007fde4fe8ec10
Aug 24 11:52:55 ArmchairTraveller kernel:  </TASK>
Aug 24 11:52:55 ArmchairTraveller kernel: Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf>
Aug 24 11:52:55 ArmchairTraveller kernel:  snd_compress iwlmei i2c_i801 mei_me spi_intel ac97_bus i2c_smbus snd_pcm_dmaengine igc mei uvcvideo videobuf2_vmalloc xone_dongle(OE) mousedev xone_gip_bus(OE) snd_hda_codec_hdmi i915 videobuf2_memops btusb snd_us>
Aug 24 11:52:55 ArmchairTraveller kernel:  irqbypass vfio_virqfd vfio_iommu_type1 vfio
Aug 24 11:52:55 ArmchairTraveller kernel: Unloaded tainted modules: pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpu>
Aug 24 11:52:55 ArmchairTraveller kernel:  pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq()>
Aug 24 11:52:55 ArmchairTraveller kernel:  pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():>
Aug 24 11:52:55 ArmchairTraveller kernel:  pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1>
Aug 24 11:52:55 ArmchairTraveller kernel:  acpi_cpufreq():1
Aug 24 11:52:55 ArmchairTraveller kernel: CR2: 0000000000000070
Aug 24 11:52:55 ArmchairTraveller kernel: ---[ end trace 0000000000000000 ]---
Aug 24 11:52:55 ArmchairTraveller kernel: RIP: 0010:_nv002522kms+0x18/0x70 [nvidia_modeset]
Aug 24 11:52:55 ArmchairTraveller kernel: Code: ff c6 44 24 2f 01 eb af 66 2e 0f 1f 84 00 00 00 00 00 41 54 55 49 89 fc 53 89 d5 41 b8 04 00 00 00 ba 02 01 02 00 48 83 ec 10 <8b> 46 70 8b 3d ff dc 0f 00 48 8d 4c 24 0c 89 ee 89 44 24 0c e8 bf
Aug 24 11:52:55 ArmchairTraveller kernel: RSP: 0018:ffffb59949d2bc70 EFLAGS: 00010282
Aug 24 11:52:55 ArmchairTraveller kernel: RAX: 0000000000000000 RBX: 0000000020020000 RCX: 000000000001800c
Aug 24 11:52:55 ArmchairTraveller kernel: RDX: 0000000000020102 RSI: 0000000000000000 RDI: ffff9e0486b83008
Aug 24 11:52:55 ArmchairTraveller kernel: RBP: 0000000000010009 R08: 0000000000000004 R09: 000000008040003e
Aug 24 11:52:55 ArmchairTraveller kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff9e0486b83008
Aug 24 11:52:55 ArmchairTraveller kernel: R13: ffff9e0486b830a0 R14: 0000000000000fff R15: 0000000000010008
Aug 24 11:52:55 ArmchairTraveller kernel: FS:  00007fde50af9980(0000) GS:ffff9e0c10300000(0000) knlGS:0000000000000000
Aug 24 11:52:55 ArmchairTraveller kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 24 11:52:55 ArmchairTraveller kernel: CR2: 0000000000000070 CR3: 000000013b0a4002 CR4: 0000000000770ee0
Aug 24 11:52:55 ArmchairTraveller kernel: PKRU: 55555554

My best guess so far is that gdm is starting faster or the drivers are slower, because that would explain the intermittent failure.

Error report log:
nvidia-bug-report.log.gz (436.3 KB) (I generated this during a successful session. As I said, the failure is intermittent, so if the journal entries in the report only cover the current session, it might not contain the error.)

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Thanks @generix; the log is now attached to the post above.

The error was caught in the logs but didn’t yield more information than you already posted. I doubt a timing issue since there are about 8 seconds between driver loading and X starting.
Does this only happen on cold boots or also sometimes on reboot?

Ah, so far it does happen on reboots, but not every time I reboot.

So far it hasn’t happened when I fully power-cycle the machine, but I have done perhaps 30 reboots/power cycles since I reported this, so that’s not a lot of data.

I’ve also been testing a different display manager (LightDM) instead of GDM, and so far I don’t have this issue when I start in LightDM. But I’ve only restarted 4 times into LightDM, so I’d need more time to test that hypothesis.

I also noticed that when it happens it keeps crashing over and over, which gives me a few seconds to get into tty3 and get journal logs live or even stop gdm (or reboot).
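Since the crash repeats, the journal for a failing boot can be scanned for the two signatures shown in the log above. This is only a sketch: the regular expressions are derived from the messages in this thread, and the summarize helper is a hypothetical name.

```python
import re

# Patterns taken from the journal lines posted in this thread.
TIMEOUT_RE = re.compile(r"nvidia-modeset: ERROR: GPU:\d+: .*\[NV_ERR_TIMEOUT\]")
OOPS_RE = re.compile(r"BUG: kernel NULL pointer dereference")


def summarize(lines):
    """Count modeset timeouts and NULL-dereference oopses in journal lines."""
    counts = {"timeouts": 0, "oops": 0}
    for line in lines:
        if TIMEOUT_RE.search(line):
            counts["timeouts"] += 1
        elif OOPS_RE.search(line):
            counts["oops"] += 1
    return counts


sample = [
    "Aug 24 11:52:27 ArmchairTraveller kernel: nvidia-modeset: ERROR: GPU:0: "
    "Display engine push buffer channel allocation failed: 0x65 "
    "(Call timed out [NV_ERR_TIMEOUT])",
    "Aug 24 11:52:27 ArmchairTraveller kernel: nvidia-modeset: ERROR: GPU:0: "
    "Failed to allocate display engine core DMA push buffer",
    "Aug 24 11:52:55 ArmchairTraveller kernel: BUG: kernel NULL pointer "
    "dereference, address: 0000000000000070",
]
print(summarize(sample))  # {'timeouts': 1, 'oops': 1}
```

The same function could be fed the kernel messages of a previous boot, for example the saved output of journalctl -k -b -1, assuming persistent journaling is enabled.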

I could run a battery of tests covering the 4 combinations (reboot and power-cycle, gdm and lightdm), if you think that’d be useful.
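One way to keep such a test matrix honest is to log every start and tally failures per combination. The sketch below assumes each run is recorded by hand as a (boot type, display manager, failed) tuple; the entries in example_runs are placeholders for illustration, not measured results.

```python
from collections import defaultdict


def failure_rates(runs):
    """Tally failures per (boot_type, display_manager) combination.

    runs: iterable of (boot_type, display_manager, failed) tuples.
    Returns {(boot_type, display_manager): (failures, total, rate)}.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [failures, total runs]
    for boot_type, display_manager, failed in runs:
        entry = totals[(boot_type, display_manager)]
        entry[0] += int(failed)
        entry[1] += 1
    return {key: (f, n, f / n) for key, (f, n) in totals.items()}


# Placeholder data only, to show the shape of the input.
example_runs = [
    ("reboot", "gdm", True),
    ("reboot", "gdm", False),
    ("power-cycle", "lightdm", False),
]

for combo, (failures, total, rate) in failure_rates(example_runs).items():
    print(combo, f"{failures}/{total} failed ({rate:.0%})")
```

With only a handful of runs per combination the rates are not meaningful, so the totals are worth reporting alongside them.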

Btw, the kernel was upgraded to 5.19.4 last night.

I ran a number of tests, and it seems to happen on both reboot and power cycle.

I have been testing LightDM instead of GDM and in the last 4 starts with LightDM the issue hasn’t happened.