3090 Hangs under load

My MSI RTX 3090 SUPRIM hangs under load. In most cases the hung process can't be killed, and the machine needs a physical shutdown. I often see this process when running top: irq/119-nvidia. This post seems to describe a similar issue: Nvidia driver for 2080 ti causes one AMD CPU to lock up (Ubuntu)

OS: PopOS 22.04
Kernel: Linux 6.2.6-76060206-generic #202303130630~1685473338~22.04~995127e SMP PREEMPT_DYNAMIC Tue M
Driver Version: 535.113.01

Here is an extract from dmesg:
[ 763.804989] NVRM: GPU at PCI:0000:08:00: GPU-803cfee9-e1dc-44df-af41-6cd5dc491cec
[ 763.804993] NVRM: Xid (PCI:0000:08:00): 62, pid='<unknown>', name=<unknown>, 00000000 00000000 00000000 00000000 00220030 00300000 00000000 00000000
[ 763.805664] NVRM: Xid (PCI:0000:08:00): 45, pid=6006, name=python3, Ch 00000008
[ 763.822723] NVRM: Xid (PCI:0000:08:00): 62, pid=6006, name=python3, 00000000 00000000 00000000 00000000 00220030 00300000 00000000 00000000
[ 763.823385] NVRM: Xid (PCI:0000:08:00): 45, pid=6006, name=python3, Ch 00000008
[ 763.824437] NVRM: Xid (PCI:0000:08:00): 45, pid=6006, name=python3, Ch 00000009
[ 763.825483] NVRM: Xid (PCI:0000:08:00): 45, pid=6006, name=python3, Ch 0000000a
[ 763.826536] NVRM: Xid (PCI:0000:08:00): 45, pid=6006, name=python3, Ch 0000000b
[ 763.827583] NVRM: Xid (PCI:0000:08:00): 45, pid=6006, name=python3, Ch 0000000c
[ 763.828628] NVRM: Xid (PCI:0000:08:00): 45, pid=6006, name=python3, Ch 0000000d
[ 763.829673] NVRM: Xid (PCI:0000:08:00): 45, pid=6006, name=python3, Ch 0000000e
[ 763.830724] NVRM: Xid (PCI:0000:08:00): 45, pid=6006, name=python3, Ch 0000000f
[ 936.631529] NVRM: Xid (PCI:0000:08:00): 45, pid=6006, name=python3, Ch 00000009
[ 936.632607] NVRM: Xid (PCI:0000:08:00): 45, pid=6006, name=python3, Ch 0000000a
[ 936.633662] NVRM: Xid (PCI:0000:08:00): 45, pid=6006, name=python3, Ch 0000000b
[ 936.634722] NVRM: Xid (PCI:0000:08:00): 45, pid=6006, name=python3, Ch 0000000c
[ 936.635775] NVRM: Xid (PCI:0000:08:00): 45, pid=6006, name=python3, Ch 0000000d
[ 936.636827] NVRM: Xid (PCI:0000:08:00): 45, pid=6006, name=python3, Ch 0000000e
[ 936.637883] NVRM: Xid (PCI:0000:08:00): 45, pid=6006, name=python3, Ch 0000000f
[ 967.137809] INFO: task python3:6499 blocked for more than 120 seconds.
[ 967.166224] Tainted: P OE 6.5.6-76060506-generic #202310061235~1697396945~22.04~9283e32
[ 967.166652] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 967.167074] task:python3 state:D stack:0 pid:6499 ppid:5990 flags:0x00000002
[ 967.167078] Call Trace:
[ 967.167080] <TASK>
[ 967.167083] __schedule+0x2cc/0x750
[ 967.167090] schedule+0x63/0x110
[ 967.167093] schedule_timeout+0x157/0x170
[ 967.167097] ___down_common+0xfd/0x160
[ 967.167104] __down_common+0x22/0xd0
[ 967.167107] __down+0x1d/0x30
[ 967.167109] down+0x54/0x80
[ 967.167112] nvidia_ioctl+0x1d8/0x8c0 [nvidia]
[ 967.167292] nvidia_frontend_unlocked_ioctl+0x5b/0xa0 [nvidia]
[ 967.167457] __x64_sys_ioctl+0xa3/0xf0
[ 967.167461] do_syscall_64+0x5b/0x90
[ 967.167464] ? srso_alias_return_thunk+0x5/0x7f
[ 967.167467] ? __rseq_handle_notify_resume+0x37/0x70
[ 967.167471] ? srso_alias_return_thunk+0x5/0x7f
[ 967.167473] ? exit_to_user_mode_loop+0xe5/0x130
[ 967.167477] ? srso_alias_return_thunk+0x5/0x7f
[ 967.167479] ? exit_to_user_mode_prepare+0x9b/0xb0
[ 967.167482] ? srso_alias_return_thunk+0x5/0x7f
[ 967.167484] ? syscall_exit_to_user_mode+0x37/0x60
[ 967.167487] ? srso_alias_return_thunk+0x5/0x7f
[ 967.167489] ? do_syscall_64+0x67/0x90
[ 967.167491] ? srso_alias_return_thunk+0x5/0x7f
[ 967.167493] ? do_syscall_64+0x67/0x90
[ 967.167495] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 967.167498] RIP: 0033:0x7ffaf711ab3f
[ 967.167504] RSP: 002b:00007ff9b9edba30 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 967.167507] RAX: ffffffffffffffda RBX: 00007ff9b9edbb40 RCX: 00007ffaf711ab3f
[ 967.167508] RDX: 00007ff9b9edbb40 RSI: 00000000c020462a RDI: 0000000000000008
[ 967.167510] RBP: 00007ff9b9edbae0 R08: 00007ff9b9edbb40 R09: 00007ff9b9edbb5c
[ 967.167511] R10: 000000005c000003 R11: 0000000000000246 R12: 00000000c020462a
[ 967.167512] R13: 0000000000000008 R14: 00007ff9b9edbb5c R15: 00007ff9b9edbaa0
[ 967.167516] </TASK>

These Xid error messages in particular concern me:
Xid (PCI:0000:08:00): 45
Xid (PCI:0000:08:00): 62
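
For anyone comparing logs, here is a rough Python sketch for tallying Xid events per GPU from dmesg text. The function name count_xids and the regular expression are only an illustration based on the line format in the extract above; they are not part of any NVIDIA tool, and other driver versions may format these lines differently.

```python
import re
from collections import Counter

# Assumed line format, taken from the dmesg extract in this post:
# [ 763.805664] NVRM: Xid (PCI:0000:08:00): 45, pid=6006, name=python3, Ch 00000008
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),")


def count_xids(dmesg_text):
    """Return a Counter keyed by (pci_address, xid_code)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Two lines copied from the extract above, used as sample input.
sample = (
    "[ 763.805664] NVRM: Xid (PCI:0000:08:00): 45, pid=6006, name=python3, Ch 00000008\n"
    "[ 763.822723] NVRM: Xid (PCI:0000:08:00): 62, pid=6006, name=python3, 00000000\n"
)
for (pci, xid), n in sorted(count_xids(sample).items()):
    print(f"{pci} Xid {xid}: {n} event(s)")
```

To run it against a real log, the output of dmesg could be saved to a file and its contents passed to count_xids in place of the sample string.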

I have tried a second 3090 in the same machine without any problems, so I have a sinking feeling that my card is glorked.

Any insight would be appreciated.
