Thanks for replying. I’m really frustrated.
My system has been unstable for months. I’m getting a crash a day at this point. I tested it last night for a couple of hours in Windows to see if I could get it to crash. I ran Prime95 and 3DMark burn. It did not crash.
If I run Windows and then boot Linux, the computer has audio issues. I have to power it off and back on to get audio to work properly. The symptom is that the speaker output jack thinks something is being plugged in when nothing is there, and it switches audio output on that port on and off at random. Audio doesn’t work in that state.
Gentoo is pretty good about installing packages cleanly:
server ~ # equery list nvidia-drivers
 * Searching for nvidia-drivers ...
[IP-] [ ] x11-drivers/nvidia-drivers-450.80.02:0/450
server ~ # dmesg | grep -i nvidia
[ 6.110204] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:08:00.1/sound/card0/input2
[ 6.110289] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:08:00.1/sound/card0/input3
[ 6.110371] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:08:00.1/sound/card0/input4
[ 6.110456] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:08:00.1/sound/card0/input5
[ 6.110538] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:03.1/0000:08:00.1/sound/card0/input6
[ 6.110624] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:03.1/0000:08:00.1/sound/card0/input7
[ 10.630273] nvidia: loading out-of-tree module taints kernel.
[ 10.630281] nvidia: module license 'NVIDIA' taints kernel.
[ 10.646730] nvidia-nvlink: Nvlink Core is being initialized, major device number 245
[ 10.647189] nvidia 0000:08:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 10.850882] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 450.80.02 Wed Sep 23 01:13:39 UTC 2020
[ 11.102774] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[ 11.655834] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 450.80.02 Wed Sep 23 00:48:09 UTC 2020
[ 11.660950] [drm] [nvidia-drm] [GPU ID 0x00000800] Loading driver
[ 11.660953] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:08:00.0 on minor 0
[ 15.977342] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
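The output above only covers the first 16 seconds after boot, so it shows the driver loading but nothing from around a crash. Here is a minimal sketch of pulling out only the later nvidia/NVRM lines from saved `dmesg` text. The function name `nvidia_lines_after` and the cutoff value are hypothetical, and it assumes the default `[ seconds.micros]` timestamp prefix seen above.

```python
import re

# Matches the "[ seconds.micros] message" prefix used in the dmesg output above.
TIMESTAMP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def nvidia_lines_after(log_text, min_seconds):
    """Return (timestamp, message) pairs for nvidia/NVRM lines logged after min_seconds."""
    hits = []
    for line in log_text.splitlines():
        match = TIMESTAMP_RE.match(line)
        if not match:
            continue  # skip lines without a kernel timestamp
        stamp = float(match.group(1))
        message = match.group(2)
        lowered = message.lower()
        if stamp > min_seconds and ("nvidia" in lowered or "nvrm" in lowered):
            hits.append((stamp, message))
    return hits
```

Saving `dmesg` to a file right after recovering from a video lockup and running the text through `nvidia_lines_after(text, 60.0)` would show whether the driver logged anything at the time of the failure, rather than only at boot.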
There is one thing that is unique to this system: I have an old Marvell PCIe SAS RAID controller running to give me three extra SATA connectors. There are six hard drives in a RAID6 array. The hard drives (Device Model: TOSHIBA MG06ACA600EY) are inexpensive, I think because they have issues. There have been many firmware updates since I got them. There is a long lag (usually about a second) when you first access them, as if they are asleep. I used hdparm to dial back their power saving, but there is still a lag.
I’m also running three security cameras (one of them is 4K, the other two are 2K). Their data comes in over Ethernet and is streamed through SHM to the boot NVMe.
There is a lot of I/O happening on this hardware all the time. I can’t reproduce that in Windows. So that could be a big difference between the testing setups.
The hardware also hangs on the second pass of Memtest86+, but I have not found a system in my house that will run Memtest86+ at all. The laptops I tried all hang with a blank screen before the test even begins. So I can’t be sure that Memtest86+ works and is reporting actual issues.
I told AMD about the issue with Memtest86+ and they said it was likely a defective CPU. But because the CPU is out of warranty, I’m SOL. Bummer.
Also, when it crashes, it’s always related to video. Video is always the thing that dies. Sometimes I can recover it: hitting CTRL-ALT-F1 brings up a console, and switching back to X with CTRL-ALT-F7 produces a black screen and cursor. If I repeat the sequence (CTRL-ALT-F1, then CTRL-ALT-F7) over and over, X usually comes back on the third iteration. Some of the apps have to be minimized and restored to fix their graphics, but everything starts working again. So the thought keeps haunting me: could this be a bad GTX1050gt?
The problems I’m having happen even when DPMS is not active. I can be watching a video in mpv (an mplayer derivative) and the video will lock up while audio keeps playing. The same is true of YouTube videos. Often switching CTRL-ALT-F1 to CTRL-ALT-F7 three times will recover those as well.
Then there’s the other problem. If I wake the display at exactly the wrong time (just after DPMS activates), video stays black. CTRL-ALT-F1 works, but switching back with CTRL-ALT-F7 results in a black screen. However, if I unplug the LG TV from the Yamaha audio receiver and plug it back in, video is restored.
I’ve been studying these issues, trying things, for months. And it’s just getting worse. I’m thinking of going back to an old kernel to see if it makes a difference.
Which reminds me, I’m also booting the kernel with ‘idle=nomwait rcu_nocbs=0-11 pci=msi’ because that was how you got the Ryzen 5 1600 to be stable back in the day. I wonder if I should remove those now. I haven’t tried that yet.
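When testing with and without those parameters, it helps to confirm what the running kernel was actually booted with. A minimal sketch, assuming a simple whitespace-separated command line (quoted values and repeated parameters are not handled); the helper names `parse_cmdline`, `active_workarounds`, and `read_cmdline` are hypothetical:

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


# The three workaround parameters quoted above.
WORKAROUNDS = ("idle", "rcu_nocbs", "pci")


def active_workarounds(cmdline):
    """Return the workaround parameters that are present, with their values."""
    params = parse_cmdline(cmdline)
    return {name: params[name] for name in WORKAROUNDS if name in params}


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel."""
    with open(path) as handle:
        return handle.read().strip()
```

Calling `active_workarounds(read_cmdline())` after each reboot shows which of the three are still in effect, so a stability change can be tied to the right boot configuration.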
It is far more likely to crash when the CPUs are busy, specifically CPU 0, which does all the I/O. I have seen that. When the system is doing a compile (Gentoo is a source-based Linux distro), everything crawls in X. Even though the builds are supposed to run at lower priority, the combination of I/O and compiling in the background seems to make X really laggy. If I kill the security cameras (ZoneMinder), the lag is noticeably less.
I’m looking at a hardware upgrade as the only choice to fix this and I’m still not convinced it’s a hardware issue. I want to wait for Zen 4 or at least until Zen 3 has a cheaper six-core option.
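The claim that CPU 0 handles the I/O can be checked against the interrupt counters rather than by eye. A minimal sketch that sums the per-CPU columns of `/proc/interrupts`-style text; the function names are hypothetical, and the counts are cumulative since boot, so two samples taken a few seconds apart would need to be subtracted to get a current rate:

```python
def interrupts_per_cpu(text):
    """Sum interrupt counts per CPU from /proc/interrupts-style text."""
    lines = text.splitlines()
    if not lines:
        return {}
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    totals = {cpu: 0 for cpu in cpus}
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue
        counts = fields[1:1 + len(cpus)]  # skip the "NN:" label
        # Rows such as ERR: carry a single global count, not one per CPU.
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue
        for cpu, count in zip(cpus, counts):
            totals[cpu] += int(count)
    return totals


def busiest_share(totals):
    """Return (cpu, fraction of all interrupts) for the busiest CPU, or None."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return None
    cpu = max(totals, key=totals.get)
    return cpu, totals[cpu] / grand_total
```

If `busiest_share` reports CPU0 with a share close to 1.0 while the cameras and the RAID array are active, that would support the idea that nearly all interrupt handling lands on one CPU.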
Honestly, I don’t want to replace things when there may not actually be a hardware issue. It could be software, right?
Edit: I also told the Linux kernel not to use CPU0 and CPU1 (core 0) to see if that would help. I can see the I/O on core 0 as blips in htop. But even that didn’t help stability.
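To verify that the exclusion took effect, the kernel's own view can be compared with the intended CPU set. The sketch below assumes the exclusion was done with the `isolcpus` boot parameter (the post does not say how), in which case `/sys/devices/system/cpu/isolated` should list the isolated CPUs in the same range format as `rcu_nocbs=0-11`. The helper name `parse_cpu_list` is hypothetical, and stride syntax in CPU lists is not handled.

```python
def parse_cpu_list(spec):
    """Expand a kernel CPU list such as "0-1,4" into a set of CPU numbers."""
    cpus = set()
    for part in spec.strip().split(","):
        if not part:
            continue  # an empty file means no CPUs are listed
        first, sep, last = part.partition("-")
        if sep:
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(first))
    return cpus


def read_isolated(path="/sys/devices/system/cpu/isolated"):
    """Read the set of CPUs the kernel reports as isolated."""
    with open(path) as handle:
        return parse_cpu_list(handle.read())
```

If `read_isolated()` does not return `{0, 1}`, the scheduler was never actually kept off core 0, and the stability result from that experiment would not mean much.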