Ubuntu 18.04 kernel 5.7.0-994-generic experiences "soft lockup" with nvidia 510.47.03, Quadro RTX 3000 Mobile

I recently installed nvidia driver 510.47.03, and for the second time, after leaving the machine unattended for a few hours, it has become unresponsive. The screen is on, but moving the mouse and using the keyboard get no response. The kernel log strongly suggests that the nvidia kernel module is to blame.

This is the first related kernel message:

Apr 04 19:29:22 ed-Precision-7540 kernel: NVRM: GPU at PCI:0000:01:00: GPU-a0eb5e6a-d015-b1c5-8e42-37af04a93cb6
Apr 04 19:29:22 ed-Precision-7540 kernel: NVRM: Xid (PCI:0000:01:00): 61, pid=8132, 0d20(323c) 00000000 00000000
Apr 04 19:29:52 ed-Precision-7540 kernel: watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [irq/165-nvidia:1217]
Apr 04 19:29:52 ed-Precision-7540 kernel: Modules linked in: xt_comment ip6table_filter ip6_tables ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs ext2 xt_conntrack xt_MASQ
Apr 04 19:29:52 ed-Precision-7540 kernel:  cdc_ether ac97_bus snd_hda_codec_hdmi efi_pstore usbnet snd_pcm_dmaengine cfg80211 r8152 cmdlinepart snd_hda_intel intel_spi_pc
Apr 04 19:29:52 ed-Precision-7540 kernel:  ip_tables x_tables autofs4 btrfs blake2b_generic algif_skcipher af_alg dm_crypt raid10 raid456 async_raid6_recov async_memcpy a
Apr 04 19:29:52 ed-Precision-7540 kernel: CPU: 5 PID: 1217 Comm: irq/165-nvidia Tainted: P           OE     5.7.0-994-generic #202005112208
Apr 04 19:29:52 ed-Precision-7540 kernel: Hardware name: Dell Inc. Precision 7540/0499T7, BIOS 1.17.0 12/08/2021
Apr 04 19:29:52 ed-Precision-7540 kernel: RIP: 0010:_nv018899rm+0x24f/0x280 [nvidia]
Apr 04 19:29:52 ed-Precision-7540 kernel: Code: 49 8b b5 e0 03 00 00 48 89 df e8 fc d6 30 00 bf 95 df 5e 0e be 00 00 9f 02 31 c0 e8 7b ff c8 ff bf 9f 02 00 00 e8 31 81 3c
Apr 04 19:29:52 ed-Precision-7540 kernel: RSP: 0018:ffff9f0d80ed3d40 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
Apr 04 19:29:52 ed-Precision-7540 kernel: RAX: 0000000000000000 RBX: ffff8bd40e3d0008 RCX: 0000000000000020
Apr 04 19:29:52 ed-Precision-7540 kernel: RDX: 0000000000000001 RSI: ffff8bd40e3c2d14 RDI: 0000000000000001
Apr 04 19:29:52 ed-Precision-7540 kernel: RBP: ffff8bd40e3c2d20 R08: 0000000000000020 R09: ffff8bd40e3c2d08
Apr 04 19:29:52 ed-Precision-7540 kernel: R10: ffff8bd40e3d0008 R11: 0000000000000000 R12: ffff8bd40e3d23b0
Apr 04 19:29:52 ed-Precision-7540 kernel: R13: ffff8bd40e0e8008 R14: ffff8bd40e395008 R15: ffff8bd40bb2b008
Apr 04 19:29:52 ed-Precision-7540 kernel: FS:  0000000000000000(0000) GS:ffff8bd41c340000(0000) knlGS:0000000000000000
Apr 04 19:29:52 ed-Precision-7540 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 04 19:29:52 ed-Precision-7540 kernel: CR2: 00007f479bf22758 CR3: 0000001d3900a001 CR4: 00000000003606e0
Apr 04 19:29:52 ed-Precision-7540 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 04 19:29:52 ed-Precision-7540 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 04 19:29:52 ed-Precision-7540 kernel: Call Trace:
Apr 04 19:29:52 ed-Precision-7540 kernel:  ? _nv029379rm+0xe7/0x120 [nvidia]
Apr 04 19:29:52 ed-Precision-7540 kernel:  ? _nv026122rm+0x60/0xc0 [nvidia]
Apr 04 19:29:52 ed-Precision-7540 kernel:  ? _nv010182rm+0x173/0x2c0 [nvidia]
Apr 04 19:29:52 ed-Precision-7540 kernel:  ? _nv026136rm+0xc2/0x1f0 [nvidia]
Apr 04 19:29:52 ed-Precision-7540 kernel:  ? _nv000654rm+0x118/0x150 [nvidia]
Apr 04 19:29:52 ed-Precision-7540 kernel:  ? irq_finalize_oneshot.part.0+0xf0/0xf0
Apr 04 19:29:52 ed-Precision-7540 kernel:  ? rm_isr_bh+0x1c/0x60 [nvidia]
Apr 04 19:29:52 ed-Precision-7540 kernel:  ? nvidia_isr_kthread_bh+0x1f/0x40 [nvidia]
Apr 04 19:29:52 ed-Precision-7540 kernel:  ? irq_thread_fn+0x28/0x60
Apr 04 19:29:52 ed-Precision-7540 kernel:  ? irq_thread+0xda/0x170
Apr 04 19:29:52 ed-Precision-7540 kernel:  ? irq_forced_thread_fn+0x80/0x80
Apr 04 19:29:52 ed-Precision-7540 kernel:  ? kthread+0x104/0x140
Apr 04 19:29:52 ed-Precision-7540 kernel:  ? irq_thread_check_affinity+0xf0/0xf0
Apr 04 19:29:52 ed-Precision-7540 kernel:  ? kthread_park+0x90/0x90
Apr 04 19:29:52 ed-Precision-7540 kernel:  ? ret_from_fork+0x1f/0x40

Here is the kernel log from the relevant time period:
kernel.log (1.6 MB)

And the nvidia bug report (not collected during the problem as I can’t access the machine):
nvidia-bug-report.log.gz (430.0 KB)
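
For reference, the kernel log above was extracted from the journal with something along these lines (assuming the journal is persistent across boots; time window adjusted to the hang, of course):

# kernel messages from the previous boot, i.e. the one during which the hang happened
journalctl -k -b -1 -o short --no-pager > kernel.log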

Hi edmcman,
Can you please downgrade the nvidia driver and confirm whether you still see the same issue, so we know the problem is specific to nvidia driver 510.47.03?
I tried to reproduce the issue locally on one of my test systems, but no luck.

Hi @amrits,

I already did that because I didn’t want my machine to lock up each night :-) I downgraded to 455.45.01 because I think that is what I was using previously. Is that too old, or would you like me to try a newer version? If so, what do you recommend? I don’t really understand the driver numbering convention.

My machine did not lock up last night on 455.45.01. I should say that I’ve only experienced the lockup on two nights, which happened to be the first two nights I was using the 510.47.03 driver, so there aren’t a ton of data points.
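
For what it’s worth, the downgrade itself was nothing fancy, roughly the following (this assumes the .run installers from NVIDIA’s site and gdm3 as the display manager; if you install via Ubuntu packages it would be an apt operation instead):

# from a text console, stop the display manager so the driver can be swapped
sudo systemctl stop gdm3
# remove 510.47.03 and install the older driver, then reboot
sudo sh NVIDIA-Linux-x86_64-510.47.03.run --uninstall
sudo sh NVIDIA-Linux-x86_64-455.45.01.run
sudo reboot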

Thanks for providing the data point. We are trying to analyze the issue from the provided logs, as I do not have a local repro.
There is an Xid 61 in the dmesg log at the time of the crash, but the full bug report was taken after a reboot, so I cannot see anything wrong in it. Can you please capture a bug report again as soon as Xid 61 is seen (without rebooting, perhaps over ssh if the system becomes inaccessible)?

Apr 04 19:29:22 ed-Precision-7540 kernel: NVRM: GPU at PCI:0000:01:00: GPU-a0eb5e6a-d015-b1c5-8e42-37af04a93cb6
Apr 04 19:29:22 ed-Precision-7540 kernel: NVRM: Xid (PCI:0000:01:00): 61, pid=8132, 0d20(323c) 00000000 00000000
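
Something along these lines could be left running in an ssh session so the report is captured automatically the moment an Xid appears (just a sketch; adjust as needed):

#!/bin/sh
# block until the next "NVRM: Xid" line shows up in the kernel log,
# then immediately collect a bug report (run as root)
dmesg --follow | grep --line-buffered -m1 "NVRM: Xid" && nvidia-bug-report.sh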

Sure, I’ll work on this.

Ok, I am back on 510, and should be set up to automatically run the nvidia debug script tomorrow morning. So if the problem occurs tonight and the machine is still functioning enough to do stuff, I should have a debug log. 🤞

“Unfortunately” there was not a hang last night. I’ll update back once there is.

Bad news. There was a hang last night, but I was unable to collect a log. I had scheduled the log collection with the at command for 7:00 am, but the log was actually collected shortly after I restarted the machine, which suggests that the machine was not functional enough to run the at job. In retrospect, this is not surprising, as the kernel log is also completely empty for several hours after the hang, and that never happens during normal operation.
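
For reference, the scheduling itself was nothing more than something like this (paths illustrative):

# queue the debug script for 7:00 am; the report ends up in root's home directory
echo "cd /root && /usr/bin/nvidia-bug-report.sh" | sudo at 07:00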

What can I do to help debug? I don’t have much experience debugging kernel crashes. I did not try to use magic sysrq, but in retrospect I probably should have.

You suspect that the system does not respond to ssh once it hangs.
If so, can you please try one more time and update us?

I am still working on getting this using magic sysrq.

I haven’t had a lockup since my last message. I’m hoping that with magic sysrq I’ll be able to collect the debug logs.

One more update. I had another lockup. I was unable to do anything useful with magic sysrq. It did seem to be responding at first, but after I did alt + magic sysrq + v to restore the framebuffer console, the machine stopped responding to further magic sysrq commands.
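
For anyone following along: the magic SysRq keys only do anything if they are enabled in the first place, which means something like:

# enable all SysRq functions for the running kernel
sudo sysctl kernel.sysrq=1
# and persist across reboots (file name is arbitrary)
echo "kernel.sysrq=1" | sudo tee /etc/sysctl.d/90-sysrq.conf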

There is a bit of good news though. I was trying to use kdump but wasn’t getting any crash dumps, and I think it is because of Launchpad Bug #1918085 (“Package linux-crashdump in 20.04 configures kernel...”) against the linux-meta package, which is easy to work around. So hopefully next time I get a lockup I’ll get a kdump file for you.
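
For anyone hitting the same thing, kdump’s state is easy to sanity-check once the workaround is in place, e.g.:

# the kernel command line should contain a crashkernel=... reservation
cat /proc/cmdline
# kdump-tools should report "current state: ready to kdump"
kdump-config show
# 1 here means the crash kernel is actually loaded
cat /sys/kernel/kexec_crash_loaded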

For some reason I went a long time without any crashes, but finally had one. I was able to collect a kdump, but I don’t have the debug symbols for that kernel so it’s mostly useless. I’m rebuilding the kernel so I have the debug symbols.

I was able to collect a kdump finally. Sadly, I did not have kernel.softlockup_panic enabled, so I forced the kdump using magic sysrq.
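
For next time, turning that on is just a sysctl, something like:

# panic on a detected soft lockup, which in turn triggers kdump
sudo sysctl kernel.softlockup_panic=1
echo "kernel.softlockup_panic=1" | sudo tee /etc/sysctl.d/90-softlockup.conf
# keyboard-less equivalent of alt + sysrq + c; this crashes the machine immediately
echo c | sudo tee /proc/sysrq-trigger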

Is there any information I can collect from here that would be helpful?

crash> bt 339
PID: 339    TASK: ffff96557beb97c0  CPU: 5   COMMAND: "irq/164-nvidia"
 #0 [fffffe0000114e40] crash_nmi_callback at ffffffffb5c67777
 #1 [fffffe0000114e50] nmi_handle at ffffffffb5c38754
 #2 [fffffe0000114ea8] default_do_nmi at ffffffffb5c389ca
 #3 [fffffe0000114ed0] do_nmi at ffffffffb5c38bfe
 #4 [fffffe0000114ef0] end_repeat_nmi at ffffffffb68016e0
    [exception RIP: _nv018899rm+591]
    RIP: ffffffffc0f6d55f  RSP: ffffb6eac0813d40  RFLAGS: 00000282
    RAX: 0000000000000000  RBX: ffff9655793c4008  RCX: 0000000000000020
    RDX: 0000000000000001  RSI: ffff96557d285d14  RDI: 0000000000000001
    RBP: ffff96557d285d20   R8: 0000000000000020   R9: ffff96557d285d08
    R10: ffff9655793c4008  R11: ffff96561c7d4000  R12: ffff9655793c63b0
    R13: ffff96557c150008  R14: ffff96557ef6b008  R15: ffff96557b8eb808
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #5 [ffffb6eac0813d40] _nv018899rm at ffffffffc0f6d55f [nvidia]
 #6 [ffffb6eac0813d68] _nv029379rm at ffffffffc0f6ad87 [nvidia]
 #7 [ffffb6eac0813d98] _nv026122rm at ffffffffc112a580 [nvidia]
 #8 [ffffb6eac0813db8] _nv010182rm at ffffffffc112c0a3 [nvidia]
 #9 [ffffb6eac0813de8] _nv026136rm at ffffffffc112c472 [nvidia]
#10 [ffffb6eac0813e18] _nv000654rm at ffffffffc152f6e8 [nvidia]
#11 [ffffb6eac0813e48] rm_isr_bh at ffffffffc152ffec [nvidia]
#12 [ffffb6eac0813e68] nvidia_isr_kthread_bh at ffffffffc0bd388f [nvidia]
#13 [ffffb6eac0813e78] irq_thread_fn at ffffffffb5d1f298
#14 [ffffb6eac0813ea0] irq_thread at ffffffffb5d1fb9a
#15 [ffffb6eac0813f08] kthread at ffffffffb5ccce44
#16 [ffffb6eac0813f50] ret_from_fork at ffffffffb68001ff

More good news. I was able to get a kdump during one of the soft lockups. In fact, it only took a few hours. I think soft lockups may be occurring for a while before the machine becomes completely unusable, which means that perhaps I can actually run the nvidia debugging tool. Fingers crossed for that.

Can you please share the complete log for triage purposes?

The kdump is a memory dump rather than a log file. I can’t share it directly because this is my work machine and it might have sensitive information in it. If you are familiar with kdumps, I can extract any information you are interested in via the crash tool.
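
For example, I could run something along these lines against it (paths are illustrative, and it needs the matching vmlinux with debug symbols):

# log   = kernel ring buffer captured in the dump
# ps    = process list at the time of the crash
# bt -a = backtrace of the active task on every CPU
# mod   = loaded modules (shows the tainting nvidia module)
crash /usr/lib/debug/boot/vmlinux-5.7.0-994-generic /var/crash/<timestamp>/dump.<timestamp> <<'EOF'
log
ps
bt -a
mod
quit
EOF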

I’m still optimistic that I’ll be able to run the nvidia log collector, but no luck on that yet.

More good news/bad news. I experienced a soft lockup in which I still had some control over the machine, and was able to run the nvidia debug script.

The bad news is that the script hung, and the resulting report barely contains any useful information.

nvidia-bug-report.log.gz (2.0 KB)

Here is the process tree:

You can see that it gets stuck while executing cat on the power file.

Also relevant is that the irq/164-nvidia process is using 100% CPU at this time.
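
For what it’s worth, if it happens again I’ll also try to grab the kernel stacks of the stuck cat and of the spinning interrupt thread directly, something like (PIDs are whatever ps shows at the time):

# kernel stack of the cat blocked on the power file
sudo cat /proc/<pid-of-stuck-cat>/stack
# kernel stack of the busy irq thread; may need several samples since it is running
sudo cat /proc/$(pgrep -x 'irq/164-nvidia')/stack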

I’m not really sure what else I can do to help at this point.

Indeed, the attached report doesn’t contain useful information, and unfortunately I have not been able to reproduce the issue so far.
Please try with the latest released driver and share a bug report if the issue persists.

Please try with the latest released driver and share a bug report if the issue persists.