irq/139-nvidia crashing Ubuntu 20.04 under high load

I have a problem that I have described on several forums without resolution. In summary, under high CPU load, a process shown in “top” as irq/139-nvidia climbs to 100% CPU (on a 12-core system). As a consequence, the system becomes totally unresponsive; I cannot even switch to a text console (Ctrl-Alt-F7). I do not know how to trigger it deliberately, although it occurs daily. It occurs most reliably when I run scientific software across many cores (Phenix for crystallography, for example) and switch between different apps (Brave/Zoom/Skype).

For more details I have written about this problem here:

System: Host: medsciradhoping Kernel: 5.4.0-37-generic x86_64 bits: 64 Desktop: Gnome 3.36.2

Distro: Ubuntu 20.04 LTS (Focal Fossa)

Machine: Type: Desktop System: Gigabyte product: X570 AORUS PRO WIFI v: -CF

Mobo: Gigabyte model: X570 AORUS PRO WIFI

UEFI [Legacy]: American Megatrends v: F11 date: 12/06/2019

CPU: Topology: 12-Core model: AMD Ryzen 9 3900X bits: 64 type: MT MCP L2 cache: 6144 KiB

Speed: 2201 MHz min/max: 2200/3800 MHz Core speeds (MHz): 1: 2192 2: 2190 3: 2196 4: 2194 5: 2190 6: 2187 7: 2195 8: 2196 9: 2195 10: 2185 11: 2192 12: 2189 13: 2194 14: 2195 15: 2196 16: 2194 17: 2186 18: 2196 19: 2195 20: 2196 21: 2197 22: 2197 23: 2197 24: 2191

Graphics: Device-1: NVIDIA TU106 [GeForce RTX 2060 SUPER] driver: nvidia v: 440.64

Display: x11 server: X.Org 1.20.8 driver: nvidia

unloaded: fbdev,modesetting,nouveau,vesa resolution: 1920x1080~60Hz

OpenGL: renderer: GeForce RTX 2060 SUPER/PCIe/SSE2 v: 4.6.0 NVIDIA 440.64

Audio: Device-1: NVIDIA TU106 High Definition Audio driver: snd_hda_intel

Device-2: AMD Starship/Matisse HD Audio driver: snd_hda_intel

Sound Server: ALSA v: k5.4.0-37-generic

Network: Device-1: Intel Wi-Fi 6 AX200 driver: iwlwifi

IF: wlp4s0 state

Device-2: Intel I211 Gigabit Network driver: igb

IF: enp5s0 state: up speed: 1000 Mbps duplex

Drives: Local Storage: total: 931.51 GiB used: 191.63 GiB (20.6%)

ID-1: /dev/nvme0n1 vendor: Sabrent model: Rocket 4.0 1TB size: 931.51 GiB

Partition: ID-1: / size: 915.40 GiB used: 191.63 GiB (20.9%) fs: ext4 dev: /dev/nvme0n1p5

Sensors: System Temperatures: cpu: 53.2 C mobo: N/A gpu: nvidia temp: 31 C

Fan Speeds (RPM): N/A gpu: nvidia fan: 41%

Info: Processes: 548 Uptime: 23m Memory: 31.37 GiB used: 4.29 GiB (13.7%) Shell: bash

inxi: 3.0.38
$ apt search nvidia-driver | fgrep 'installed'



WARNING: apt does not have a stable CLI interface. Use with caution in scripts.



nvidia-driver-440/focal,now 440.82+really.440.64-0ubuntu6 amd64 [installed]

xserver-xorg-video-nvidia-440/focal,now 440.82+really.440.64-0ubuntu6 amd64 [installed,automatic]

You might be running into this:
https://forums.developer.nvidia.com/t/random-xid-61-and-xorg-lock-up/79731/240
Ryzen 3rd gen + Nvidia Turing gpu

Perhaps. I am not sure; I think this could be a distinct problem. I will review that forum thread. It appears my specific complaint is replicated elsewhere.

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post. You will have to rename the file ending to something else since the forum software doesn’t accept .gz files (nifty!).

Attached. I restarted it today at 11:46am after the irq/139-nvidia process “crashed” it.

nvidia-bug-report.log.gz.txt (953.9 KB)

Jun 14 11:00:47 axoneme kernel: [ 4115.637580] NVRM: Xid (PCI:0000:09:00): 61, pid=1179, 0cec(3098) 00000000 00000000

Sorry, same issue, XID 61
The irq thread going nuts is just a side effect of the GPU no longer responding correctly.
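If anyone else wants to check their own logs for the same thing, here is a rough Python sketch that pulls Xid events out of kernel log text. The regular expression is an assumption based only on the single log line quoted above; other driver versions may format the line differently. The names `XidEvent`, `find_xid_events` and `SAMPLE` are made up for this illustration.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Pattern assumed from the one NVRM line quoted in this thread:
#   NVRM: Xid (PCI:0000:09:00): 61, pid=1179, 0cec(3098) 00000000 00000000
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<detail>.*)"
)


class XidEvent(NamedTuple):
    """One Xid report (hypothetical helper type for this sketch)."""
    pci: str     # PCI address exactly as printed by the driver
    xid: int     # Xid error number, e.g. 61
    detail: str  # remainder of the line, left unparsed


def find_xid_events(lines: Iterable[str]) -> Iterator[XidEvent]:
    """Yield an XidEvent for every line that contains an NVRM Xid report."""
    for line in lines:
        match = XID_RE.search(line)
        if match:
            yield XidEvent(
                pci=match["pci"],
                xid=int(match["xid"]),
                detail=match["detail"].strip(),
            )


# The log line from this thread, used as sample input.
SAMPLE = (
    "Jun 14 11:00:47 axoneme kernel: [ 4115.637580] "
    "NVRM: Xid (PCI:0000:09:00): 61, pid=1179, 0cec(3098) 00000000 00000000"
)

for event in find_xid_events([SAMPLE]):
    print(f"Xid {event.xid} on PCI {event.pci}: {event.detail}")
```

To use it on a real machine, pass it the lines of your kernel log (for example the saved output of `dmesg`) instead of `SAMPLE`. Any Xid 61 entries around the time of a lock-up would point to the same issue.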

Damn. That thread looks like a mess. I'll join the line. Thanks.

Some better info from that thread:
https://forums.developer.nvidia.com/t/random-xid-61-and-xorg-lock-up/79731/185