Kernel 5.4 and 5.6, NOT 4.19 - Full system freeze/lockup when playing back video - GTX 1660

Hello team!

Got a bug report for you for a very frustrating issue I encountered recently.

When playing random videos (not always the same ones) using MPV (often launched via Ranger, the file manager) or more reliably when playing a video using the Olive video editor (which is GPU accelerated), I would receive a full system lockup.

Symptoms:

  • Full System Lockup/Hang/Freeze
  • Often, the system shuts down on its own within 5 minutes, though not always
  • Ctrl + Alt + F2/3/4/etc does not work to switch to a TTY
  • Ctrl + Alt + Backspace does not work to kill Xorg
  • Alt + Sysrq (REISUB) does not appear to work at all.
  • An SSH session from another machine shows nothing in the logs, and the connection hangs as soon as the crash happens. Pings go unanswered, the tmux session doesn’t respond, etc.
  • DOES NOT APPEAR TO HANG WITH KERNEL 4.19 (still waiting for a freeze, might happen, but a lot more stable than 5.4/5.6)

Specs:
OS: Manjaro, latest updates as of yesterday
Card: GTX 1660
Driver: 440.82-1
CUDA: 10.2.89-5
Kernels: 5.6.3, 5.4.31, 4.19.114
Logs:
nvidia-bug-report-kernel-4.19.114.log (664.9 KB)
nvidia-bug-report-kernel-5.4.31.log (221.5 KB)
Logs were taken by running sudo nvidia-bug-report.sh after Olive has opened, but before playing a video. I can’t easily take a log as I’m playing the video…

(I can provide a log for kernel 5.6 as well, if desired)

There’s not really anything noticeable in the logs. Did you create it right after the freeze happened (and you rebooted)?
Does that nvidia card have a USB-C connector?

I’ll take a set of logs, post reboot. I’d originally thought that all logs were reset when rebooted, but based on your response maybe not.

There is no USB port of any kind on the card’s plate (where the display ports are).

Please check if removing the (defunct anyway) USB-C devices has any influence by creating a file
/lib/udev/rules.d/90-remove-nvidia-usb.rules
with contents

# Remove NVIDIA USB xHCI Host Controller devices, if present
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x0c0330", ATTR{remove}="1"

# Remove NVIDIA USB Type-C UCSI devices, if present
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x0c8000", ATTR{remove}="1"

and post the output of
sudo lspci -d 10de*
after reboot. Then check if it freezes when playing video.
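To confirm whether the two rules actually removed anything, one option is to filter the numeric lspci output for the class codes the rules match. This is only a minimal sketch: the function name `nvidia_usb_functions` is made up for illustration, and it assumes `lspci -n` prints the class as four hex digits in the second field (for example `0c03:`).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any NVIDIA or pciutils tooling).
# Reads `lspci -n -d 10de:` output on stdin and prints only the PCI functions
# whose class is 0c03 (USB controller) or 0c80 (other serial bus controller).
# These are the first four hex digits of the class values 0x0c0330 and
# 0x0c8000 matched by the udev rules above.
nvidia_usb_functions() {
    awk '$2 == "0c03:" || $2 == "0c80:" { print $1, $2, $3 }'
}

# Usage (not executed here):
#   sudo lspci -n -d 10de: | nvidia_usb_functions
```

If the helper prints nothing after the reboot, no NVIDIA USB or UCSI function is left on the bus, so either the rules worked or the card never exposed one.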

Okay, here’s a log taken after the freeze occurred and I rebooted.

Will execute your USB-C removal trick next.

nvidia-bug-report-kernel-5.4.33-post-reboot.log (217.9 KB)

Also, is that the correct lspci command? I haven’t rebooted with the udev rules yet, but I went ahead and tried it and got the following

[host@domain ~ ]$ sudo lspci -d 10de*
lspci: -d: ':' expected

Yes, typo.

sudo lspci -d 10de:*

Forgot: the new logs didn’t bring up anything new, seems like a complete freeze, the log just stops.

Okay, added the rules, rebooted, tried editing in Olive again, and full system lockup again.

Checking the output of lspci, there was a USB device before, but it’s gone now.

[user@domain ~ ]$ sudo lspci -d 10de:*
01:00.0 VGA compatible controller: NVIDIA Corporation TU116 [GeForce GTX 1660] (rev a1)
01:00.1 Audio device: NVIDIA Corporation TU116 High Definition Audio Controller (rev a1)

Tough luck. I don’t really know what to look at, especially that full system lock-up irritates me.
What kind of hw accel are you using in mpv?
I didn’t find any hw accel that Olive uses besides OpenGL for effects. OTOH, the docs for that aren’t really extensive. Do you know what it uses exactly?

Output from MPV:

 (+) Video --vid=1 (*) (h264 1920x1080 50.000fps)
 (+) Audio --aid=1 --alang=eng (*) (opus 2ch 48000Hz)
Using hardware decoding (nvdec).
AO: [pulse] 48000Hz stereo 2ch float
VO: [gpu] 1920x1080 cuda[nv12]
(Paused) AV: 00:00:06 / 00:09:47 (1%) A-V:  0.002 DS: 2.000/4 Dropped: 2

As for Olive - I’m not exactly sure. I may have to do a little source code diving to be sure.

Based on this, it looks like it uses mostly Qt and FFMPEG, so it might be FFMPEG libraries doing the GPU level decoding.
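Since the mpv output above reports nvdec, one way to narrow things down is to check which decoding path mpv picks for a given run and then compare against a run with hardware decoding turned off. The sketch below is only an illustration: `mpv_hwdec_backend` is a hypothetical helper, and it assumes mpv prints the same "Using hardware decoding (...)." line shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads mpv's terminal output on stdin and prints the
# hardware decoding backend mpv reports, or "software" if no such line exists.
mpv_hwdec_backend() {
    local backend
    # Extract the text between the parentheses of the first matching line.
    backend="$(sed -n 's/^Using hardware decoding (\(.*\))\.$/\1/p' | head -n 1)"
    printf '%s\n' "${backend:-software}"
}

# Usage (not executed here; video.mkv is a placeholder file name):
#   mpv video.mkv 2>&1 | tee mpv.log
#   mpv_hwdec_backend < mpv.log
```

Playing the same file with `mpv --hwdec=no` forces software decoding. If the freeze only shows up when the helper reports nvdec, that would point at the NVDEC/CUDA path rather than at display output in general, though a single run proves little given how sporadic the freeze is.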

Found another crash - Crashed when loading World of Warcraft, during the loading screen.

This was a first. I don’t game much, but I did yesterday on the same kernel (I think) with no problems, and today it crashed hard.

Took a log after I rebooted, attaching now.

nvidia-bug-report-kernel-5.4.33-post-WOW-crash.log (1.1 MB)

Please check if loading the module alone provokes the freeze:
sudo modprobe nvidia-uvm
Also, please run
sudo journalctl -b -1 --no-pager | grep kernel > kernel.log
after a freeze/reboot and attach the output file.
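Note that `journalctl -b -1` only returns data if journald keeps logs across reboots (for example when /var/log/journal exists). To skim a previous-boot kernel log faster, a small filter for the usual trouble markers can help. This is a rough sketch: `suspicious_kernel_lines` is a hypothetical name, and the pattern list is an illustrative assumption, not an exhaustive set of messages the NVIDIA driver or the kernel can emit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads kernel log lines on stdin and keeps only those
# containing markers that commonly accompany GPU faults, lockups, or hardware
# errors. The pattern list is an assumption chosen for illustration.
suspicious_kernel_lines() {
    # `|| true` keeps the exit status at 0 when nothing matches.
    grep -E 'NVRM|Xid|BUG:|Oops|soft lockup|hard LOCKUP|Hardware Error|Call Trace' || true
}

# Usage (not executed here):
#   sudo journalctl -k -b -1 --no-pager | suspicious_kernel_lines
```

An empty result is consistent with a hard freeze where the log simply stops, since the kernel never got the chance to write anything.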

Okay, got a new crash to report: Using Lutris, I launched Battlenet, in order to finish downloading WOW. After it finished, I closed Battlenet, and then relaunched it again from Lutris. The account auth screen popped up, I logged in, and then - crash/full system lockup.

Note: This was after I’d executed sudo modprobe nvidia-uvm. There was no immediate crash when I loaded the module, so I went about continuing to use the computer (and load Lutris/Battle.net, as stated above).

Nvidia bug report and sudo journalctl -b-1 --no-pager | grep kernel are attached.

kernel-post-battlenet-crash.log (120.5 KB)
nvidia-bug-report-kernel-5.4.33-post-battlenet-crash.log (217.1 KB)

Nothing, again. I’m leaning towards this being some mainboard/bios flaw, no idea why this gets triggered by the 5.4+ kernels.

Bugger. I can’t say that I disagree with you. Just incredibly frustrating that my system is otherwise rock solid, with little reason for me to upgrade to a new mobo/proc/etc.

I might just stick with kernel 4.19 for as long as I can. Luckily, it is an LTS kernel. I’ll keep testing different kernels and run stability tests to triple check things. For instance, ran Memtest86+ all night last night, over 3+ passes, and 0 errors.

I’ll run more CPU and GPU tests over the coming days - see if anything pops up.

You could do a kernel bisect to find the commit that causes this. Probably less work than trying this and that.
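For reference, a bisect session could look roughly like the plan below. This is a sketch under assumptions: it treats mainline tag v4.19 as good and v5.4 as bad because 4.19.114 appears stable and 5.4.31 freezes, which ignores changes that exist only in the stable branches. `print_bisect_plan` is a hypothetical helper that only prints the commands and runs nothing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints (does not run) a git bisect plan between a
# known-good and a known-bad tag. The commands are meant to be run by hand
# inside a clone of the kernel source tree.
print_bisect_plan() {
    local good="$1" bad="$2"
    cat <<EOF
git bisect start
git bisect bad $bad
git bisect good $good
# Build, boot, and test each kernel that git checks out, then mark it:
#   git bisect good    (no freeze after a reasonable test period)
#   git bisect bad     (freeze reproduced)
# Once git names the first bad commit:
git bisect log > bisect.log
git bisect reset
EOF
}

# Usage:
#   print_bisect_plan v4.19 v5.4
```

Because the freeze is sporadic, marking a kernel as good needs a long enough test run, otherwise the bisect can converge on the wrong commit.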

Circling back on this - I ended up building a new rig, so new processor/motherboard, same graphics card, kernel, etc. The primary issue of the system locking up when playing a video on Kernel 5.4/5.6 appears to have gone away, thankfully.

However, I did still get an Xorg lockup when attempting to share my screen using Firefox in Jitsi. I had to switch TTYs, kill Firefox, and then everything was working again.

Brave (browser) shares the screen just fine, so that’s my workaround for now…

Linux Driver: 440.82-17 (Manjaro, Kernel 5.6)

This sounds exactly like my problems with a GTX1050Ti (GTX1050Ti apparently causing system reboot). Have to say that kernel (5.6) was one thing I’d not considered – I’ve even tried checking whether it was HDMI, because when I had the card in another machine working fine (with identical kernel & drivers) that was using the Display Port output.

It’s all a bit messy. I can sometimes provoke the same crash from the Windows side of the machine, but not as reliably. And it seems like some recent mesa updates have made things worse, so it’s not just changes to the nvidia drivers themselves which affect things. And similarly I can’t justify replacing mobo/CPU just in the hope of fixing this, when everything else is fine.