Nvidia driver for 2080 Ti causes one AMD CPU core to lock up (Ubuntu)

After I ssh into my Ubuntu 20.04 machine, I quickly start getting the error message kernel:[ 632.398797] watchdog: BUG: soft lockup - CPU#17 stuck for 22s! [irq/119-nvidia:1148] in my terminal every 10 seconds or so.

Nothing I tried seemed to fix it, so I actually did an entire fresh installation of Ubuntu. As soon as I installed the Nvidia driver, I had the issue again (on the same CPU, #17).

Using top, I can see that irq/119-nvidia is using 100% of a CPU core.

Any ideas what this could be?

The complete syslog entries:

Apr 14 02:05:47 derek-20 kernel: [ 1352.403887] watchdog: BUG: soft lockup - CPU#17 stuck for 22s! [irq/119-nvidia:1148]
Apr 14 02:05:47 derek-20 kernel: [ 1352.403889] Modules linked in: cmac algif_hash algif_skcipher af_alg bnep binfmt_misc nvidia_uvm(OE) nvidia_drm(POE) nls_iso8859_1 nvidia_modeset(POE) snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi iwlmvm edac_mce_amd snd_hda_intel btusb snd_intel_dspcfg btrtl nvidia(POE) btbcm kvm_amd snd_hda_codec btintel input_leds ccp snd_hda_core mac80211 kvm snd_hwdep snd_pcm bluetooth libarc4 crct10dif_pclmul ghash_clmulni_intel snd_seq_midi snd_seq_midi_event snd_rawmidi aesni_intel crypto_simd iwlwifi cryptd glue_helper snd_seq ecdh_generic ucsi_ccg typec_ucsi rapl ecc wmi_bmof mxm_wmi joydev snd_seq_device cfg80211 typec drm_kms_helper snd_timer efi_pstore cec rc_core fb_sys_fops snd syscopyarea k10temp sysfillrect sysimgblt soundcore mac_hid sch_fq_codel parport_pc ppdev lp drm parport ip_tables x_tables autofs4 hid_generic usbhid hid crc32_pclmul r8169 ahci libahci i2c_piix4 realtek nvme xhci_pci i2c_nvidia_gpu xhci_pci_renesas nvme_core wmi
Apr 14 02:05:47 derek-20 kernel: [ 1352.403914] CPU: 17 PID: 1148 Comm: irq/119-nvidia Tainted: P           OEL    5.8.0-49-generic #55~20.04.1-Ubuntu
Apr 14 02:05:47 derek-20 kernel: [ 1352.403914] Hardware name: Micro-Star International Co., Ltd. MS-7C35/MEG X570 UNIFY (MS-7C35), BIOS A.40 07/11/2020
Apr 14 02:05:47 derek-20 kernel: [ 1352.404219] RIP: 0010:_nv018508rm+0x24f/0x270 [nvidia]
Apr 14 02:05:47 derek-20 kernel: [ 1352.404220] Code: 00 00 4c 89 f6 48 89 df 49 8b 86 08 05 00 00 e8 e7 7a ae cf be 00 00 81 02 bf 95 df 5e 0e 31 c0 e8 76 f6 c8 ff e8 f1 44 3b 00 <eb> fe 48 8b 04 25 a8 01 00 00 0f 0b be 00 00 77 02 bf 95 df 5e 0e
Apr 14 02:05:47 derek-20 kernel: [ 1352.404221] RSP: 0018:ffff9b8581d47d60 EFLAGS: 00000282
Apr 14 02:05:47 derek-20 kernel: [ 1352.404222] RAX: 0000000000000000 RBX: ffff8f412fd58008 RCX: 0000000000000020
Apr 14 02:05:47 derek-20 kernel: [ 1352.404223] RDX: 0000000000000001 RSI: ffff8f413074dd14 RDI: 0000000000000001
Apr 14 02:05:47 derek-20 kernel: [ 1352.404223] RBP: ffff8f413074dd20 R08: 0000000000000020 R09: ffff8f413074dd08
Apr 14 02:05:47 derek-20 kernel: [ 1352.404224] R10: ffff8f412fd58008 R11: ffff8f412fd59098 R12: ffff8f4137622008
Apr 14 02:05:47 derek-20 kernel: [ 1352.404225] R13: 0000000000000010 R14: ffff8f412fcd6008 R15: 000000000001ffdf
Apr 14 02:05:47 derek-20 kernel: [ 1352.404226] FS:  0000000000000000(0000) GS:ffff8f416ee40000(0000) knlGS:0000000000000000
Apr 14 02:05:47 derek-20 kernel: [ 1352.404227] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 14 02:05:47 derek-20 kernel: [ 1352.404228] CR2: 00005627647baef0 CR3: 0000000fe4ba8000 CR4: 0000000000340ee0
Apr 14 02:05:47 derek-20 kernel: [ 1352.404228] Call Trace:
Apr 14 02:05:47 derek-20 kernel: [ 1352.404529]  ? _nv030154rm+0x14c/0x190 [nvidia]
Apr 14 02:05:47 derek-20 kernel: [ 1352.404838]  ? _nv028760rm+0x9f9/0xdc0 [nvidia]
Apr 14 02:05:47 derek-20 kernel: [ 1352.405141]  ? _nv028768rm+0x15d/0x400 [nvidia]
Apr 14 02:05:47 derek-20 kernel: [ 1352.405327]  ? _nv000710rm+0xa9/0x240 [nvidia]
Apr 14 02:05:47 derek-20 kernel: [ 1352.405329]  ? irq_finalize_oneshot.part.0+0xf0/0xf0
Apr 14 02:05:47 derek-20 kernel: [ 1352.405513]  ? rm_isr_bh+0x1c/0x60 [nvidia]
Apr 14 02:05:47 derek-20 kernel: [ 1352.405673]  ? nvidia_isr_kthread_bh+0x1f/0x40 [nvidia]
Apr 14 02:05:47 derek-20 kernel: [ 1352.405675]  ? irq_thread_fn+0x28/0x60
Apr 14 02:05:47 derek-20 kernel: [ 1352.405677]  ? irq_thread+0xda/0x170
Apr 14 02:05:47 derek-20 kernel: [ 1352.405678]  ? irq_forced_thread_fn+0x80/0x80
Apr 14 02:05:47 derek-20 kernel: [ 1352.405680]  ? kthread+0x114/0x150
Apr 14 02:05:47 derek-20 kernel: [ 1352.405681]  ? irq_thread_check_affinity+0xf0/0xf0
Apr 14 02:05:47 derek-20 kernel: [ 1352.405682]  ? kthread_park+0x90/0x90
Apr 14 02:05:47 derek-20 kernel: [ 1352.405684]  ? ret_from_fork+0x22/0x30

Old bug report:
nvidia-bug-report.log.gz (2.0 KB)
New bug report, with startx -- -logverbose 6 run first and the --safe-mode option:
nvidia-bug-report.log.gz (2.0 KB)

dmesg from Ubuntu 20.04 install with driver 460 installed:
20.04_dmesg_with_driver.txt (127.4 KB)

dmesg from Ubuntu 18.04 before driver installed (using Nouveau):
18.04_dmesg_no_driver.txt (99.4 KB)

Please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post.

Thank you for taking a look @generix . I have attached the log to the original post now.

I should add that an external display attached to my machine via my 2080 Ti’s DisplayPort worked with the fresh installation of Ubuntu, but has not worked since installing the driver.

On the previous Ubuntu install I had the same issue, and removing the driver gave me a fractured repeating pattern on the screen.

Also, the activity I did right before this first happened was adding a systemd service unit to start Dropbox on startup, following this guide: How to Install Dropbox on a Headless Ubuntu Server. Not sure if that could be related, as I imagine it would have been wiped out by the fresh install.
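For context, the unit was along these lines (a sketch only; the user name and the dropboxd path are placeholders for my actual setup):

```ini
# Hypothetical sketch of the Dropbox service unit.
# User= and the ExecStart path are placeholders; they differ per machine.
[Unit]
Description=Dropbox daemon
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=derek
ExecStart=/home/derek/.dropbox-dist/dropboxd
Restart=on-failure

[Install]
WantedBy=multi-user.target
```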

That log unfortunately contains nearly nothing. Please try running the script with --safe-mode option.

@generix unfortunately it appears to still be hanging up in the same place. I also found another webpage with instructions to run startx -- -logverbose 6 first, but that command also appears to hang.

I have attached it to the post, but it does look very similar to me.

Let me know if you have any other ideas for troubleshooting. I have a new CPU arriving tomorrow in case that is the issue. Planning to try fresh installs of both Ubuntu and Windows.

Please try getting a dmesg output and attach that.

@generix just added the dmesg to my original post. I also included a dmesg from installing Ubuntu 18.04, before I added the drivers. No luck so far with other operating systems; I even tried flashing the BIOS, but it didn’t seem to affect anything.

I also noticed the following warnings when I installed the driver:

W: Possible missing firmware /lib/firmware/rtl_nic/rtl8125a-3.fw for module r8169
W: Possible missing firmware /lib/firmware/rtl_nic/rtl8168fp-3.fw for module r8169

The driver is just crashing out of nowhere; it might even be a defective GPU (which is supported by the pattern you were seeing when switching back to nouveau once).
I’d try disabling the X server from starting and then running gpu-burn for 10 minutes to check.
To rule out a simple thing, did you already try reseating the card in its PCIe slot?
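For reference, a rough sketch of building and running gpu-burn (assuming the CUDA toolkit is installed and the commonly used wilicc repository on GitHub):

```shell
# Sketch: build gpu-burn and stress the GPU for 10 minutes.
# Assumes the CUDA toolkit is installed under /usr/local/cuda.
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
make
./gpu_burn 600   # argument is the test duration in seconds
```

If the GPU is dying, gpu-burn will typically report compute errors or trigger the same lockup well before the 10 minutes are up.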

Thank you again @generix. Yes, I actually took apart the whole build and reassembled it a couple of times. I might be able to try the card on a friend’s motherboard next week in case it’s a problem with the slot itself.

To clarify for an amateur, by disabling the X server do you mean just running sudo service lightdm stop? Or is it editing grub/systemd as instructed here:

I am about to be away from my setup for the weekend, but will look more into using gpu-burn when I am back.

Should I run it after installing the Nvidia driver, or before, while I still have Nouveau?

I am very afraid that it is indeed a defective GPU. With the recent demand surge, I will sadly have to pay more than double what I originally paid to replace it, which would make it the most expensive thing I own outside of my car.

Not that complicated, just run
sudo systemctl disable display-manager
to stop the X server from starting on boot. If you have the opportunity to check the GPU in another system, that’s the more reliable way. For a simple test, did you try starting the system with the monitor physically disconnected?
It might also just be a system memory failure in a very crucial region. Normally, the nvidia driver would output specific error messages in case of a GPU failure, not just crash.

Starting without the X server does seem to help me get a little further. I am able to attach a more complete bug report log and dmesg output, which might be helpful:

dmesg_driver_working.txt (88.3 KB)
nvidia-bug-report_working.log.gz (272.4 KB)

Unfortunately I can’t seem to build gpu-burn because of an error with CUDA:

derek@derek-20:~/gpu-burn$ make
g++  -O3 -Wno-unused-result -I/usr/local/cuda/include -c gpu_burn-drv.cpp
gpu_burn-drv.cpp:50:10: fatal error: cuda.h: No such file or directory
   50 | #include <cuda.h>
      |          ^~~~~~~~
compilation terminated.
make: *** [Makefile:32: gpu_burn-drv.o] Error 1

My nvidia-smi output does say CUDA Version: 11.2, but following the official steps to install the CUDA toolkit, I hit a snag:

The following packages have unmet dependencies:
 cuda : Depends: cuda-11-3 (>= 11.3.0) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

Would love any more insight you can provide @generix

Never install the full ‘cuda’ package, as it will overwrite the driver. Just install the toolkit, e.g.
sudo apt install cuda-toolkit-11-2
Unfortunately, the logs also showed XID 44 and 62 errors, so this is very likely a broken GPU. You could try limiting clocks
sudo nvidia-smi -lgc 300,1500
then start gdm
sudo systemctl start gdm
and check if this gives you a stable login. This would only be a temporary workaround for a dying GPU, though.
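On undoing the limit: assuming a driver recent enough to support locked GPU clocks, the lock set by -lgc can be cleared again:

```shell
# Reset the GPU clock lock previously applied with `nvidia-smi -lgc`
sudo nvidia-smi -rgc
```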