PopOS 22.04 RTX 3080 Ti Laptop GPU - screen freezes for 3 seconds after high load applied on system

After applying high CPU load to the system, my screen starts to freeze. After a few seconds it unfreezes, and the cycle repeats until I restart the system. Restarting GDM does change this state.

It’s possible to move the mouse cursor during the freeze, but the entire GNOME UI is unresponsive.

In journalctl I can see the following message: “NVIDIA: Wait for channel idle timed out”. It appears right after the screen unfreezes. At the same moment I can also see that the Brave browser process with the parameter “--type=gpu-process” is at 99.9% CPU usage; the usage then drops until the next freeze. Closing the Brave browser does not help.

I can reproduce this bug on NVIDIA driver 525.85.05, but it doesn’t seem to be reproducible on 515.65.01.

I am attaching my NVIDIA logs.
nvidia-bug-report.log.gz (456.3 KB)


I was able to reproduce this bug on 515.65.01 as well, but the system regained stability on its own.
It froze once and worked flawlessly afterwards.

[  489.653283] NVRM: GPU at PCI:0000:01:00: GPU-dcd5b7d8-c699-64ce-a07f-0e86e71601e3
[  489.653289] NVRM: Xid (PCI:0000:01:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
[  497.915358] NVRM: Xid (PCI:0000:01:00): 16, pid='<unknown>', name=<unknown>, Head 00000000 Count 0001e6c1
[  506.107363] NVRM: Xid (PCI:0000:01:00): 16, pid='<unknown>', name=<unknown>, Head 00000000 Count 0001e6c2
[  514.299394] NVRM: Xid (PCI:0000:01:00): 16, pid='<unknown>', name=<unknown>, Head 00000000 Count 0001e6c3
[  522.491346] NVRM: Xid (PCI:0000:01:00): 16, pid='<unknown>', name=<unknown>, Head 00000000 Count 0001e6c4
[  530.683310] NVRM: Xid (PCI:0000:01:00): 16, pid='<unknown>', name=<unknown>, Head 00000000 Count 0001e6c5
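For anyone comparing similar logs, here is a minimal Python sketch that extracts the Xid events from kernel log text, counts them per code, and measures the time between repeats. The names parse_xid_events, summarize and SAMPLE are made up for this illustration, and the regular expression only assumes the line layout shown in the excerpt above.

```python
import re
from collections import Counter

# Matches kernel log lines shaped like the excerpt above, e.g.
# [  497.915358] NVRM: Xid (PCI:0000:01:00): 16, pid=...
# Assumption: the layout is "[timestamp] NVRM: Xid (device): code, ...".
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \((?P<dev>[^)]+)\): (?P<code>\d+),"
)


def parse_xid_events(log_text):
    """Return a list of (timestamp_seconds, pci_device, xid_code) tuples."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (float(match.group("ts")), match.group("dev"), int(match.group("code")))
            )
    return events


def summarize(events):
    """Count events per Xid code and list the gaps between repeats of a code."""
    counts = Counter(code for _, _, code in events)
    gaps = {}
    last_seen = {}
    for ts, _, code in events:
        if code in last_seen:
            gaps.setdefault(code, []).append(round(ts - last_seen[code], 3))
        last_seen[code] = ts
    return counts, gaps


# The Xid lines from the log excerpt in this post.
SAMPLE = """\
[  489.653283] NVRM: GPU at PCI:0000:01:00: GPU-dcd5b7d8-c699-64ce-a07f-0e86e71601e3
[  489.653289] NVRM: Xid (PCI:0000:01:00): 62, pid='<unknown>', name=<unknown>, 0000(0000) 00000000 00000000
[  497.915358] NVRM: Xid (PCI:0000:01:00): 16, pid='<unknown>', name=<unknown>, Head 00000000 Count 0001e6c1
[  506.107363] NVRM: Xid (PCI:0000:01:00): 16, pid='<unknown>', name=<unknown>, Head 00000000 Count 0001e6c2
[  514.299394] NVRM: Xid (PCI:0000:01:00): 16, pid='<unknown>', name=<unknown>, Head 00000000 Count 0001e6c3
[  522.491346] NVRM: Xid (PCI:0000:01:00): 16, pid='<unknown>', name=<unknown>, Head 00000000 Count 0001e6c4
[  530.683310] NVRM: Xid (PCI:0000:01:00): 16, pid='<unknown>', name=<unknown>, Head 00000000 Count 0001e6c5
"""

if __name__ == "__main__":
    counts, gaps = summarize(parse_xid_events(SAMPLE))
    print(dict(counts))
    print(gaps)
```

On the excerpt this finds one Xid 62 followed by five Xid 16 events, with the Xid 16 events spaced roughly 8.19 seconds apart. The same functions could be fed the output of dmesg or journalctl -k to see whether the pattern holds over a longer run.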

Does that also happen if you disconnect the external monitor? Please also try switching the kernel to voluntary preemption with the kernel parameter
preempt=voluntary

Do you mean that I should try to disconnect the external monitor while my screen is in a freezing cycle or should I try to induce this bug while no external screen is connected?

Also, what does it mean to switch the kernel to voluntary preemption? I don’t need a detailed answer, just a general idea of what it would mean for my system. I would just like to know whether it’s better to stay on the older drivers (which isn’t really a problem for me) or to run the kernel in voluntary preemption mode.

this.

Preemption modes:
none for servers
voluntary for desktops
full for low-latency desktops
The effect depends on your usual workload and is usually not really noticeable. The NVIDIA driver sometimes doesn’t like being interrupted.
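To check which mode was actually requested at boot, a small Python sketch like the one below can look for preempt= in the kernel command line. The function names are hypothetical, and the sketch makes two assumptions: that the kernel was built with CONFIG_PREEMPT_DYNAMIC (otherwise the parameter has no effect), and that the last preempt= occurrence on the command line is the one that counts.

```python
def preempt_mode_from_cmdline(cmdline):
    """Return the value of the last preempt= parameter, or None if absent.

    Assumption: when the parameter appears more than once, the last
    occurrence is the one that takes effect.
    """
    mode = None
    for token in cmdline.split():
        if token.startswith("preempt="):
            mode = token.split("=", 1)[1]
    return mode


def read_boot_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    with open(path) as handle:
        return handle.read()


# Example with a made-up command line; on a real system, pass
# read_boot_cmdline() instead of the literal string.
example = "BOOT_IMAGE=/vmlinuz root=UUID=abc quiet splash preempt=voluntary"
requested = preempt_mode_from_cmdline(example)
```

A result of None only means no mode was requested on the command line; the kernel then uses whatever default it was built with.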

I understand.

The strange thing is, I can’t reproduce this bug again under any conditions (even with full preemption).
I followed the same steps that worked before, but now the bug does not appear at all.
I also ran additional stress tests on the whole laptop (CPU, RAM, GPU, SSD) in every combination, including while repeating the steps that previously triggered the error. Nothing happened, not even a single stutter.

All I’ve done is upgrade to 525.85.05, which I’ve done about 5 times before, but this time it seems to have just randomly fixed the problem. I also had the same external monitor plugged in the whole time.

Thank you for your support @generix. If I find out that there are hidden conditions for reproducing this bug, I will post more information.

EDIT → Actually, I did do one thing, though its effect is unclear to me: I changed the preemption mode to voluntary and then immediately changed it back to full, with a couple of restarts along the way. That this fixed anything seems very far-fetched, though.

I’m back. It seems that voluntary preemption changes the way the bug behaves, but it does not prevent it from occurring.

Unlike in ‘full’ mode, the whole system remains stable until I interact with the Brave browser.

Any interaction with the browser window (trying to move or resize it, even just clicking on it) causes the system to freeze, but as long as I don’t touch the browser the system remains stable.

I will revert my drivers to 515.65.01 and wait for a newer version to be published by PopOS.
Is there anything I can do to report the bug to the driver development team?
