Hi all,
Just chiming in with my experiences.
My hardware: GeForce RTX 2060 in an Asus Prime X570-P with a Ryzen 3900X. I have two monitors.
My software: Ubuntu 20.04 with kernel 5.8.9-050809-generic and Nvidia driver 450.66 and X.Org X Server 1.20.8.
I’m experiencing three types of hangs:
Hang type A) The X display freezes but the mouse pointer is still alive. The keyboard does not work.
Hang type B) The X display is frozen with mouse and keyboard unresponsive.
Hang type C) Complete system hang. The display is frozen and the machine has crashed.
Hang type A sometimes turns into hang type B. With A and B I can ssh into the machine to see what’s going on, but I’m never able to restart X. The machine never shuts down cleanly after a shutdown command is issued.
Hang type C is much rarer and I can’t even ping the machine.
I’ve noticed that hang types A and B have always (as far as I can tell) occurred when a video is playing in Firefox or Chrome, usually when I am using the activity switcher, which shows an overview of the windows, or after I return to the machine having paused a YouTube video. These types of hang happen once a day or more frequently. When I increase the minimum GPU clock rates (using sudo nvidia-smi -pm ENABLED; sudo nvidia-smi -lgc 1000,2000) the crashes are less frequent, perhaps every 3 or 4 days.
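For anyone who wants to script the clock workaround, here is a minimal Python sketch that builds the same two nvidia-smi invocations. The function names clock_lock_commands and apply are made up for illustration, and the script defaults to a dry run that only prints the commands, since actually running them needs root and an NVIDIA GPU.

```python
import subprocess


def clock_lock_commands(min_mhz, max_mhz):
    """Return the two nvidia-smi invocations from the post as argv lists."""
    if min_mhz > max_mhz:
        raise ValueError("min_mhz must not exceed max_mhz")
    return [
        ["nvidia-smi", "-pm", "ENABLED"],                # enable persistence mode
        ["nvidia-smi", "-lgc", f"{min_mhz},{max_mhz}"],  # lock the GPU clock range
    ]


def apply(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False (needs root)."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run: prints the commands without touching the GPU.
    apply(clock_lock_commands(1000, 2000))
```

The lock can be undone later with sudo nvidia-smi -rgc, which resets the GPU clocks to their defaults.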
I have also reduced the VDDCR SOC Voltage from 1.1V to 1.04375V as suggested by NickB, while also not reducing the GPU clock rates with nvidia-smi -lgc. This initially seemed to improve matters, giving a few days of good uptime, but I then got two type A hangs in quick succession this evening.
The type C hang only seems to happen when the machine is unattended, usually at night, and this has only happened a few times. Because I can’t log into the machine I’m not able to determine the cause of the hang, so I’m focussing on types A and B only for now.
During a type A or B hang, here is the output from nvidia-smi:
Mon Sep 28 21:57:47 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 450.66       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    Off  | 00000000:05:00.0  On |                  N/A |
| 25%   39C    P8    14W / 160W |    638MiB /  5926MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1845      G   /usr/lib/xorg/Xorg                 78MiB |
|    0   N/A  N/A      2453      G   /usr/lib/xorg/Xorg                315MiB |
|    0   N/A  N/A    402270      G   .../zoom-client/99/zoom/zoom       60MiB |
|    0   N/A  N/A    407831      G   ...token=7240939308061852906       47MiB |
+-----------------------------------------------------------------------------+
Top shows the irq/128-nvidia process has gotten stuck:
top - 21:58:19 up 2 days, 2:15, 3 users, load average: 4.89, 5.16, 4.17
Tasks: 549 total, 5 running, 544 sleeping, 0 stopped, 0 zombie
%Cpu(s): 4.2 us, 4.3 sy, 0.0 ni, 91.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 48163.2 total, 26576.2 free, 7086.4 used, 14500.6 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 40438.9 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2453 root 20 0 426440 257020 135636 R 100.0 0.5 290:59.97 Xorg
1925 root -51 0 0 0 0 R 99.4 0.0 151:02.75 irq/128-nvidia
794997 juan 20 0 6331968 2.5g 2.4g S 2.9 5.4 25:50.58 VirtualBoxVM
2365 juan 9 -11 3318324 26764 21624 S 0.6 0.1 28:41.09 pulseaudio
2567 juan 20 0 6230064 512480 128060 S 0.6 1.0 116:01.79 gnome-shell
794942 juan 20 0 34392 9760 7808 S 0.6 0.0 0:55.20 VBoxXPCOMIPCD
794948 juan 20 0 918636 25308 16544 S 0.6 0.1 1:51.22 VBoxSVC
1 root 20 0 317304 13908 8436 S 0.0 0.0 0:04.25 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.13 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H-kblockd
9 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
10 root 20 0 0 0 0 S 0.0 0.0 0:06.49 ksoftirqd/0
11 root 20 0 0 0 0 I 0.0 0.0 0:38.52 rcu_sched
12 root rt 0 0 0 0 S 0.0 0.0 0:00.21 migration/0
13 root -51 0 0 0 0 S 0.0 0.0 0:00.00 idle_inject/0
14 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/0
15 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/1
16 root -51 0 0 0 0 S 0.0 0.0 0:00.00 idle_inject/1
17 root rt 0 0 0 0 S 0.0 0.0 0:00.54 migration/1
18 root 20 0 0 0 0 S 0.0 0.0 0:02.17 ksoftirqd/1
20 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/1:0H-kblockd
21 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cpuhp/2
Even if I kill the Xorg process, the irq/128-nvidia process stays pegged at 100%.
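To catch this state over ssh without reading the whole listing, a small Python sketch like the following could flag the pegged processes. It assumes the default column layout shown above (as produced by top -b -n 1, with %CPU in the ninth column); the function name busy_processes and the 95% threshold are arbitrary choices for illustration.

```python
def busy_processes(top_lines, threshold=95.0):
    """Return (pid, command, cpu) for process rows at or above the threshold."""
    busy = []
    for line in top_lines:
        fields = line.split()
        # A process row has at least 12 fields and starts with a numeric PID;
        # this skips the summary lines and the column header.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        try:
            cpu = float(fields[8])  # %CPU column in the default layout
        except ValueError:
            continue
        if cpu >= threshold:
            busy.append((int(fields[0]), " ".join(fields[11:]), cpu))
    return busy


# Rows copied from the top output above.
sample = [
    "PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND",
    "2453 root 20 0 426440 257020 135636 R 100.0 0.5 290:59.97 Xorg",
    "1925 root -51 0 0 0 0 R 99.4 0.0 151:02.75 irq/128-nvidia",
    "2567 juan 20 0 6230064 512480 128060 S 0.6 1.0 116:01.79 gnome-shell",
]

for pid, command, cpu in busy_processes(sample):
    print(pid, command, cpu)
```

On the sample rows this reports only Xorg and irq/128-nvidia, the two processes stuck in the R state.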
Syslog shows this:
Sep 28 21:26:42 yaffle kernel: [179011.023461] NVRM: GPU at PCI:0000:05:00: GPU-392ec1a3-7517-dd87-46d9-90bee037fd48
Sep 28 21:26:42 yaffle kernel: [179011.023470] NVRM: GPU Board Serial Number:
Sep 28 21:26:42 yaffle kernel: [179011.023474] NVRM: Xid (PCI:0000:05:00): 8, pid=2567, Channel 00000020
Sep 28 21:26:52 yaffle kernel: [179021.255534] GpuWatchdog[19305]: segfault at 0 ip 000055cfefe69de7 sp 00007fa683c206d0 error 6 in signal-desktop[55cfecc8b000+53d3000]
Sep 28 21:26:52 yaffle kernel: [179021.255542] Code: 7d b7 00 79 09 48 8b 7d a0 e8 75 53 d3 fe 8b 83 00 01 00 00 85 c0 0f 84 91 00 00 00 48 8b 03 48 89 df be 01 00 00 00 ff 50 68 <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 b7 b0 6f 02 01 80 7d 87 00
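For comparing hangs across several days of logs, the Xid lines can be pulled out with a short Python sketch like the one below. The regular expression only covers the exact line shape seen in the excerpt above (a numeric pid followed by a comma); other driver versions may format the line differently, so treat the pattern and the function name find_xids as assumptions.

```python
import re

# Matches NVRM Xid lines of the form seen in the syslog excerpt above.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<xid>\d+), pid=(?P<pid>\d+)"
)


def find_xids(lines):
    """Return (pci_bus, xid_code, pid) for every Xid line found."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("bus"), int(match.group("xid")), int(match.group("pid")))
            )
    return events


# Lines copied from the syslog excerpt above.
sample = [
    "Sep 28 21:26:42 yaffle kernel: [179011.023470] NVRM: GPU Board Serial Number:",
    "Sep 28 21:26:42 yaffle kernel: [179011.023474] NVRM: Xid (PCI:0000:05:00): 8, pid=2567, Channel 00000020",
]

print(find_xids(sample))
```

Note that pid 2567 in the Xid line is the gnome-shell process in the top listing above.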
Hopefully this can be of use to someone,
Juan.