Ubuntu 22.04 - GPU Falls off Bus - Unable to determine the device handle for GPU0000:01:00.0: Unknown Error

Hi, all,
I have a new 3090. It can run a heavy workload overnight just fine, but every so often it seems to deadlock on an operation.

I had nvidia-smi dmon running while it died:

# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    292     81      -    68     33      0      0   9501   1020
    0    279     81      -    97     50      0      0   9501   1095
    0    296     81      -    91     44      0      0   9501    990
    0    288     81      -    55     27      0      0   9501   1515
    0    297     82      -    65     34      0      0   9501    780
    0    288     82      -    56     32      0      0   9501    787
    0    284     80      -    57     38      0      0   9501   1785
    0    300     81      -    55     40      0      0   9501   1785
    0    299     79      -    82     57      0      0   9501   1155
    0    287     80      -    88     59      0      0   9501   1440
    0    278     81      -    99     65      0      0   9501   1560
    0    279     79      -   100     58      0      0   9501   1365
    0    277     76      -    56     36      0      0   9501   1920
    0    289     76      -    63     47      0      0   9501   1920
    0    288     77      -    88     60      0      0   9501   1905
    0    282     78      -   100     51      0      0   9501   1725
    0    283     77      -   100     49      0      0   9501   1905
    0      -     -     -

Note that the last line is incomplete. I did not truncate it; nvidia-smi did.
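For anyone who wants to quantify how much the core clock is jumping around in a capture like this, here is a rough Python sketch. It assumes the dmon text has been saved as a string, that the first `#` line names the columns, and that rows with missing fields (like the final one above) should be skipped. The function names `parse_dmon` and `pclk_summary` are made up for illustration.

```python
from typing import Dict, List, Optional

# A few rows copied from the capture above, including the truncated last row.
SAMPLE = """\
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    288     81      -    55     27      0      0   9501   1515
    0    297     82      -    65     34      0      0   9501    780
    0    288     82      -    56     32      0      0   9501    787
    0    284     80      -    57     38      0      0   9501   1785
    0      -     -     -
"""


def parse_dmon(text: str) -> List[Dict[str, str]]:
    """Turn `nvidia-smi dmon` text into one dict per complete sample row."""
    columns: Optional[List[str]] = None
    rows: List[Dict[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # First header line holds column names; the second holds units.
            if columns is None:
                columns = line.lstrip("#").split()
            continue
        values = line.split()
        if columns is None or len(values) != len(columns):
            continue  # incomplete row, e.g. the sample cut off by the crash
        rows.append(dict(zip(columns, values)))
    return rows


def pclk_summary(rows: List[Dict[str, str]]) -> Dict[str, int]:
    """Report min, max and the largest sample-to-sample change in pclk (MHz)."""
    clocks = [int(r["pclk"]) for r in rows if r["pclk"].isdigit()]
    jumps = [abs(b - a) for a, b in zip(clocks, clocks[1:])]
    return {
        "min": min(clocks),
        "max": max(clocks),
        "largest_jump": max(jumps) if jumps else 0,
    }


if __name__ == "__main__":
    print(pclk_summary(parse_dmon(SAMPLE)))
```

On the four complete rows in the sample, the core clock swings between 780 MHz and 1785 MHz from one sample to the next, which is the kind of switching this summary is meant to surface.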

Here’s some choice output from dmesg with duplicates omitted:

[    5.605621] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    5.652453] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.105.17  Tue Mar 28 18:02:59 UTC 2023
[    5.663885] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.105.17  Tue Mar 28 22:18:37 UTC 2023
[    5.681370] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    5.681372] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[    5.912934] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[    5.916542] nvidia-uvm: Loaded the UVM driver, major device number 507.
[    9.173026] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[    9.173101] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s31f6: link becomes ready
[   17.852287] loop5: detected capacity change from 0 to 8
[350775.385809] FS-Cache: Loaded
[350775.403050] FS-Cache: Netfs 'cifs' registered for caching
[350775.405237] Key type cifs.spnego registered
[350775.405245] Key type cifs.idmap registered
[350775.405500] Malformed UNC in devname
...
[350775.405519] CIFS: VFS: Malformed UNC in devname
[350920.653476] CIFS: No dialect specified on mount. Default has changed to a more secure dialect, SMB2.1 or later (e.g. SMB3.1.1), from CIFS (SMB1). To use the less secure SMB1 dialect to access old servers which do not support SMB3.1.1 (or even SMB3 or SMB2.1) specify vers=1.0 on mount.
[350920.653485] CIFS: Attempting to mount \\synology\MLData
[350920.695369] CIFS: Status code returned 0xc000006d STATUS_LOGON_FAILURE
[350920.695391] CIFS: VFS: \\synology Send error in SessSetup = -13
[351076.195760] CIFS: Attempting to mount \\synology\MLData
[352377.832353] CIFS: Attempting to mount \\synology\MLData
...
[362713.785429] perf: interrupt took too long (2512 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[398329.008318] Lockdown: mdadm: /dev/mem,kmem,port is restricted; see man kernel_lockdown.7
[398384.755321] Lockdown: mdadm: /dev/mem,kmem,port is restricted; see man kernel_lockdown.7
[440008.299189] NVRM: GPU at PCI:0000:01:00: GPU-508f8624-3013-b396-84aa-c207917faf36
[440008.299192] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[440008.299194] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[440008.300548] NVRM: A GPU crash dump has been created. If possible, please run
                NVRM: nvidia-bug-report.sh as root to collect this data before
                NVRM: the NVIDIA kernel module is unloaded.
[440553.072866] sysrq: Show backtrace of all active CPUs
[440553.072884] NMI backtrace for cpu 1
[440553.072885] CPU: 1 PID: 107535 Comm: nvidia-bug-repo Tainted: P           O      5.15.0-70-generic #77-Ubuntu
[440553.072887] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.072888] Call Trace:
[440553.072889]  <TASK>
[440553.072890]  show_stack+0x52/0x5c
[440553.072894]  dump_stack_lvl+0x4a/0x63
[440553.072896]  dump_stack+0x10/0x16
[440553.072897]  nmi_cpu_backtrace.cold+0x4d/0x93
[440553.072899]  ? lapic_can_unplug_cpu+0x90/0x90
[440553.072902]  nmi_trigger_cpumask_backtrace+0xec/0x100
[440553.072905]  arch_trigger_cpumask_backtrace+0x19/0x20
[440553.072908]  sysrq_handle_showallcpus+0x17/0x20
[440553.072910]  __handle_sysrq.cold+0xc9/0x1a6
[440553.072912]  ? apparmor_file_permission+0x70/0x160
[440553.072914]  write_sysrq_trigger+0x28/0x40
[440553.072916]  proc_reg_write+0x5b/0xa0
[440553.072918]  ? __cond_resched+0x1a/0x50
[440553.072921]  vfs_write+0xc4/0x270
[440553.072923]  ksys_write+0x67/0xf0
[440553.072924]  __x64_sys_write+0x19/0x20
[440553.072926]  do_syscall_64+0x59/0xc0
[440553.072928]  ? syscall_exit_to_user_mode+0x27/0x50
[440553.072930]  ? __x64_sys_close+0x11/0x50
[440553.072932]  ? do_syscall_64+0x69/0xc0
[440553.072934]  ? __x64_sys_close+0x11/0x50
[440553.072935]  ? do_syscall_64+0x69/0xc0
[440553.072937]  ? irqentry_exit_to_user_mode+0x9/0x20
[440553.072939]  ? irqentry_exit+0x1d/0x30
[440553.072940]  ? exc_page_fault+0x89/0x170
[440553.072942]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[440553.072943] RIP: 0033:0x7f5bb4424a37
[440553.072946] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[440553.072947] RSP: 002b:00007ffcfe47d618 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[440553.072949] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5bb4424a37
[440553.072950] RDX: 0000000000000002 RSI: 000055e971460560 RDI: 0000000000000001
[440553.072951] RBP: 000055e971460560 R08: 000055e971456f02 R09: 0000000000000000
[440553.072952] R10: 000055e971456f01 R11: 0000000000000246 R12: 0000000000000001
[440553.072953] R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
[440553.072955]  </TASK>
[440553.072956] Sending NMI from CPU 1 to CPUs 0,2-7:
[440553.072960] NMI backtrace for cpu 5
[440553.072961] CPU: 5 PID: 0 Comm: swapper/5 Tainted: P           O      5.15.0-70-generic #77-Ubuntu
[440553.072963] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.072964] RIP: 0010:mwait_idle_with_hints.constprop.0+0x4f/0xa0
[440553.072966] Code: 48 89 d1 65 48 8b 04 25 c0 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 78 b7 b7 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 fb 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[440553.072967] RSP: 0018:ffffa924400fbdf0 EFLAGS: 00000046
[440553.072968] RAX: 0000000000000040 RBX: 0000000000000001 RCX: 0000000000000001
[440553.072969] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000040
[440553.072970] RBP: ffffa924400fbe18 R08: 000190ae40932dc6 R09: 0000000000000000
[440553.072971] R10: 0000000000000001 R11: 071c71c71c71c71c R12: 0000000000000006
[440553.072971] R13: ffffffffaf4d49c0 R14: 0000000000000006 R15: ffffffffaf4d4c48
[440553.072972] FS:  0000000000000000(0000) GS:ffff9ae2b6540000(0000) knlGS:0000000000000000
[440553.072974] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.072975] CR2: 0000562e6e0374d8 CR3: 0000000324610005 CR4: 00000000003706e0
[440553.072976] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.072976] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.072977] Call Trace:
[440553.072977]  <TASK>
[440553.072978]  ? intel_idle_ibrs+0x4d/0xd0
[440553.072980]  cpuidle_enter_state+0x97/0x620
[440553.072982]  ? tick_nohz_stop_tick+0x16a/0x1d0
[440553.072984]  cpuidle_enter+0x2e/0x50
[440553.072985]  cpuidle_idle_call+0x142/0x1e0
[440553.072987]  do_idle+0x83/0xf0
[440553.072988]  cpu_startup_entry+0x20/0x30
[440553.072990]  start_secondary+0x12a/0x180
[440553.072992]  secondary_startup_64_no_verify+0xc2/0xcb
[440553.072995]  </TASK>
[440553.072996] NMI backtrace for cpu 3
[440553.072997] CPU: 3 PID: 0 Comm: swapper/3 Tainted: P           O      5.15.0-70-generic #77-Ubuntu
[440553.072999] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.073000] RIP: 0010:mwait_idle_with_hints.constprop.0+0x4f/0xa0
[440553.073002] Code: 48 89 d1 65 48 8b 04 25 c0 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 78 b7 b7 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 fb 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[440553.073004] RSP: 0018:ffffa924400ebdf0 EFLAGS: 00000046
[440553.073005] RAX: 0000000000000020 RBX: 0000000000000001 RCX: 0000000000000001
[440553.073006] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000020
[440553.073007] RBP: ffffa924400ebe18 R08: 000190ae416871a1 R09: 00000000000c3500
[440553.073008] R10: 0000000000000007 R11: 071c71c71c71c71c R12: 0000000000000004
[440553.073008] R13: ffffffffaf4d49c0 R14: 0000000000000004 R15: ffffffffaf4d4b78
[440553.073009] FS:  0000000000000000(0000) GS:ffff9ae2b64c0000(0000) knlGS:0000000000000000
[440553.073010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.073011] CR2: 00007f31bf953a70 CR3: 0000000324610003 CR4: 00000000003706e0
[440553.073012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.073013] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.073014] Call Trace:
[440553.073014]  <TASK>
[440553.073015]  ? intel_idle_ibrs+0x4d/0xd0
[440553.073017]  cpuidle_enter_state+0x97/0x620
[440553.073019]  cpuidle_enter+0x2e/0x50
[440553.073020]  cpuidle_idle_call+0x142/0x1e0
[440553.073022]  do_idle+0x83/0xf0
[440553.073024]  cpu_startup_entry+0x20/0x30
[440553.073025]  start_secondary+0x12a/0x180
[440553.073027]  secondary_startup_64_no_verify+0xc2/0xcb
[440553.073030]  </TASK>
[440553.073030] NMI backtrace for cpu 7
[440553.073031] CPU: 7 PID: 0 Comm: swapper/7 Tainted: P           O      5.15.0-70-generic #77-Ubuntu
[440553.073033] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.073034] RIP: 0010:mwait_idle_with_hints.constprop.0+0x4f/0xa0
[440553.073035] Code: 48 89 d1 65 48 8b 04 25 c0 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 78 b7 b7 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 fb 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[440553.073036] RSP: 0018:ffffa9244010bdf0 EFLAGS: 00000046
[440553.073037] RAX: 0000000000000040 RBX: 0000000000000001 RCX: 0000000000000001
[440553.073038] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000040
[440553.073039] RBP: ffffa9244010be18 R08: 000190ae414ab521 R09: 00000000000c3500
[440553.073039] R10: 0000000000000007 R11: 071c71c71c71c71c R12: 0000000000000006
[440553.073040] R13: ffffffffaf4d49c0 R14: 0000000000000006 R15: ffffffffaf4d4c48
[440553.073041] FS:  0000000000000000(0000) GS:ffff9ae2b65c0000(0000) knlGS:0000000000000000
[440553.073042] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.073042] CR2: 000055a281de74d8 CR3: 0000000324610005 CR4: 00000000003706e0
[440553.073043] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.073044] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.073044] Call Trace:
[440553.073045]  <TASK>
[440553.073045]  ? intel_idle_ibrs+0x4d/0xd0
[440553.073047]  cpuidle_enter_state+0x97/0x620
[440553.073048]  cpuidle_enter+0x2e/0x50
[440553.073049]  cpuidle_idle_call+0x142/0x1e0
[440553.073051]  do_idle+0x83/0xf0
[440553.073052]  cpu_startup_entry+0x20/0x30
[440553.073053]  start_secondary+0x12a/0x180
[440553.073055]  secondary_startup_64_no_verify+0xc2/0xcb
[440553.073057]  </TASK>
[440553.073058] NMI backtrace for cpu 0
[440553.073059] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P           O      5.15.0-70-generic #77-Ubuntu
[440553.073061] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.073061] RIP: 0010:mwait_idle_with_hints.constprop.0+0x4f/0xa0
[440553.073064] Code: 48 89 d1 65 48 8b 04 25 c0 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 78 b7 b7 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 fb 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[440553.073065] RSP: 0018:ffffffffaf203d88 EFLAGS: 00000046
[440553.073066] RAX: 0000000000000010 RBX: 0000000000000003 RCX: 0000000000000001
[440553.073067] RDX: 0000000000000000 RSI: ffffffffaf4d49c0 RDI: 0000000000000010
[440553.073068] RBP: ffffffffaf203da8 R08: 000190ae41667cf8 R09: 0000000000030d40
[440553.073069] R10: 0000000000000007 R11: 071c71c71c71c71c R12: 0000000000000003
[440553.073070] R13: ffffffffaf4d49c0 R14: 0000000000000003 R15: ffffffffaf4d4b10
[440553.073071] FS:  0000000000000000(0000) GS:ffff9ae2b6400000(0000) knlGS:0000000000000000
[440553.073072] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.073073] CR2: 00005619c6c68000 CR3: 0000000324610002 CR4: 00000000003706f0
[440553.073074] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.073074] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.073075] Call Trace:
[440553.073076]  <TASK>
[440553.073076]  ? intel_idle+0x30/0x50
[440553.073078]  cpuidle_enter_state+0x97/0x620
[440553.073081]  cpuidle_enter+0x2e/0x50
[440553.073082]  cpuidle_idle_call+0x142/0x1e0
[440553.073084]  do_idle+0x83/0xf0
[440553.073085]  cpu_startup_entry+0x20/0x30
[440553.073086]  rest_init+0xd3/0x100
[440553.073088]  ? acpi_enable_subsystem+0x20b/0x217
[440553.073090]  arch_call_rest_init+0xe/0x23
[440553.073092]  start_kernel+0x4a9/0x4ca
[440553.073094]  x86_64_start_reservations+0x24/0x2a
[440553.073095]  x86_64_start_kernel+0xfb/0x106
[440553.073097]  secondary_startup_64_no_verify+0xc2/0xcb
[440553.073100]  </TASK>
[440553.073100] NMI backtrace for cpu 4
[440553.073101] CPU: 4 PID: 0 Comm: swapper/4 Tainted: P           O      5.15.0-70-generic #77-Ubuntu
[440553.073103] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.073103] RIP: 0010:mwait_idle_with_hints.constprop.0+0x4f/0xa0
[440553.073105] Code: 48 89 d1 65 48 8b 04 25 c0 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 78 b7 b7 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 fb 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[440553.073106] RSP: 0018:ffffa924400f3df0 EFLAGS: 00000046
[440553.073107] RAX: 0000000000000040 RBX: 0000000000000001 RCX: 0000000000000001
[440553.073108] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000040
[440553.073108] RBP: ffffa924400f3e18 R08: 000190ae414a42b3 R09: 0000000000000000
[440553.073109] R10: 0000000000000001 R11: 071c71c71c71c71c R12: 0000000000000006
[440553.073110] R13: ffffffffaf4d49c0 R14: 0000000000000006 R15: ffffffffaf4d4c48
[440553.073111] FS:  0000000000000000(0000) GS:ffff9ae2b6500000(0000) knlGS:0000000000000000
[440553.073111] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.073112] CR2: 00007f2dbc64ea50 CR3: 0000000324610003 CR4: 00000000003706e0
[440553.073113] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.073114] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.073114] Call Trace:
[440553.073115]  <TASK>
[440553.073115]  ? intel_idle_ibrs+0x4d/0xd0
[440553.073117]  cpuidle_enter_state+0x97/0x620
[440553.073118]  ? tick_nohz_stop_tick+0x16a/0x1d0
[440553.073119]  cpuidle_enter+0x2e/0x50
[440553.073120]  cpuidle_idle_call+0x142/0x1e0
[440553.073122]  do_idle+0x83/0xf0
[440553.073123]  cpu_startup_entry+0x20/0x30
[440553.073125]  start_secondary+0x12a/0x180
[440553.073126]  secondary_startup_64_no_verify+0xc2/0xcb
[440553.073129]  </TASK>
[440553.073129] NMI backtrace for cpu 2
[440553.073131] CPU: 2 PID: 106415 Comm: python Tainted: P           O      5.15.0-70-generic #77-Ubuntu
[440553.073133] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.073133] RIP: 0010:entry_SYSCALL_64_after_hwframe+0x57/0xcb
[440553.073136] Code: 45 31 e4 45 31 ed 45 31 f6 45 31 ff 48 89 e7 48 63 f0 66 90 b9 48 00 00 00 65 48 8b 14 25 c8 fb 01 00 89 d0 48 c1 ea 20 0f 30 <0f> 1f 44 00 00 e8 07 3a fa ff 0f 1f 44 00 00 48 8b 4c 24 58 4c 8b
[440553.073137] RSP: 0018:ffffa92441bb3f58 EFLAGS: 00000046
[440553.073139] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000048
[440553.073140] RDX: 0000000000000000 RSI: 0000000000000018 RDI: ffffa92441bb3f58
[440553.073141] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[440553.073141] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[440553.073142] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[440553.073143] FS:  00007fb025f0ab80(0000) GS:ffff9ae2b6480000(0000) knlGS:0000000000000000
[440553.073144] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.073145] CR2: 000056038f0b0cc0 CR3: 0000000808942004 CR4: 00000000003706e0
[440553.073146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.073146] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.073147] Call Trace:
[440553.073148]  <TASK>
[440553.073150]  </TASK>
[440553.073150] NMI backtrace for cpu 6
[440553.073151] CPU: 6 PID: 0 Comm: swapper/6 Tainted: P           O      5.15.0-70-generic #77-Ubuntu
[440553.073153] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.073154] RIP: 0010:mwait_idle_with_hints.constprop.0+0x4f/0xa0
[440553.073158] Code: 48 89 d1 65 48 8b 04 25 c0 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 78 b7 b7 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 fb 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[440553.073159] RSP: 0018:ffffa92440103df0 EFLAGS: 00000046
[440553.073161] RAX: 0000000000000040 RBX: 0000000000000001 RCX: 0000000000000001
[440553.073162] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000040
[440553.073163] RBP: ffffa92440103e18 R08: 000190ae410d4c9a R09: 0000000000000000
[440553.073165] R10: 0000000000000001 R11: 071c71c71c71c71c R12: 0000000000000006
[440553.073166] R13: ffffffffaf4d49c0 R14: 0000000000000006 R15: ffffffffaf4d4c48
[440553.073167] FS:  0000000000000000(0000) GS:ffff9ae2b6580000(0000) knlGS:0000000000000000
[440553.073169] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.073170] CR2: 00007f5bb43a2db0 CR3: 0000000324610005 CR4: 00000000003706e0
[440553.073172] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.073173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.073174] Call Trace:
[440553.073175]  <TASK>
[440553.073176]  ? intel_idle_ibrs+0x4d/0xd0
[440553.073179]  cpuidle_enter_state+0x97/0x620
[440553.073181]  ? tick_nohz_stop_tick+0x16a/0x1d0
[440553.073184]  cpuidle_enter+0x2e/0x50
[440553.073186]  cpuidle_idle_call+0x142/0x1e0
[440553.073189]  do_idle+0x83/0xf0
[440553.073191]  cpu_startup_entry+0x20/0x30
[440553.073193]  start_secondary+0x12a/0x180
[440553.073196]  secondary_startup_64_no_verify+0xc2/0xcb
[440553.073200]  </TASK>
[440556.401698] snd_hda_intel 0000:01:00.1: can't change power state from D3cold to D0 (config space inaccessible)
[440556.778840] snd_hda_codec_hdmi hdaudioC1D0: Unable to sync register 0x4f0800. -5
[440556.778860] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
[440556.778862] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
...
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 07)
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:17.0 SATA controller: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1b.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #17 (rev f1)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1)
00:1c.2 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #3 (rev f1)
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Z170 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V (rev 31)
01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev ff)
01:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev ff)
03:00.0 USB controller: ASMedia Technology Inc. ASM1142 USB 3.1 Host Controller
04:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 04)
06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO

And the output of lspci after the crash:

$ lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 07)
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:17.0 SATA controller: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1b.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #17 (rev f1)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1)
00:1c.2 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #3 (rev f1)
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Z170 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V (rev 31)
01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev ff)
01:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev ff)
03:00.0 USB controller: ASMedia Technology Inc. ASM1142 USB 3.1 Host Controller
04:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 04)
06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
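One detail worth pulling out of this listing: both NVIDIA functions report `(rev ff)`. As far as I know, a revision of ff generally means reads of the device's config space are coming back as all ones, which would fit the "config space inaccessible" line in dmesg. That interpretation is an assumption, not something the logs state. A small sketch for flagging such devices in saved lspci text (the function name is hypothetical):

```python
import re
from typing import List, Tuple

# Lines copied from the lspci output above.
LSPCI_SAMPLE = """\
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V (rev 31)
01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev ff)
01:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev ff)
03:00.0 USB controller: ASMedia Technology Inc. ASM1142 USB 3.1 Host Controller
"""

REV_FF = re.compile(r"^(\S+)\s+(.*?)\s*\(rev ff\)\s*$")


def devices_with_rev_ff(lspci_text: str) -> List[Tuple[str, str]]:
    """Return (pci_address, description) for every line ending in '(rev ff)'."""
    hits = []
    for line in lspci_text.splitlines():
        match = REV_FF.match(line.strip())
        if match:
            hits.append((match.group(1), match.group(2)))
    return hits


if __name__ == "__main__":
    for address, description in devices_with_rev_ff(LSPCI_SAMPLE):
        print(address, description)
```

Run against the sample, this picks out only 01:00.0 and 01:00.1, the GPU and its HDMI audio function.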

Any input that helps me narrow this down to a PSU issue, hardware issue, or driver issue would be appreciated.

The full nvidia log dump is attached:
controlnet-training-crash-nvidia-bug-report.log.gz (341.4 KB)
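To track when the failure fires across several runs without scrolling through the whole kernel log, something like the following can pull just the Xid lines out of saved dmesg text. It is a sketch that assumes the line format shown above (`NVRM: Xid (PCI:...): <code>, ...`); the regex and the function name are my own, not part of any NVIDIA tool.

```python
import re
from typing import Dict, List

# Lines copied from the dmesg excerpt above.
DMESG_SAMPLE = """\
[398384.755321] Lockdown: mdadm: /dev/mem,kmem,port is restricted; see man kernel_lockdown.7
[440008.299189] NVRM: GPU at PCI:0000:01:00: GPU-508f8624-3013-b396-84aa-c207917faf36
[440008.299192] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[440008.299194] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
"""

XID_LINE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+NVRM: Xid \((?P<pci>[^)]+)\): "
    r"(?P<code>\d+),\s*(?P<detail>.*)"
)


def find_xid_events(dmesg_text: str) -> List[Dict[str, object]]:
    """Extract Xid events (uptime in seconds, PCI id, Xid code, detail text)."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_LINE.search(line)
        if match:
            events.append(
                {
                    "uptime_s": float(match.group("uptime")),
                    "pci": match.group("pci"),
                    "xid": int(match.group("code")),
                    "detail": match.group("detail"),
                }
            )
    return events


if __name__ == "__main__":
    for event in find_xid_events(DMESG_SAMPLE):
        days = event["uptime_s"] / 86400
        print(f"Xid {event['xid']} on {event['pci']} after {days:.1f} days of uptime")
```

For the excerpt above this reports a single Xid 79 event roughly five days into uptime.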

Following up, I’m less inclined to think this is a PSU or thermal issue.

Based on some advice in other threads, I ran gpu_burn for an hour with doubles and the GPU held up just fine. When I encounter the crash above, the failure happens within seconds or minutes. I guess that makes it a driver issue? Maybe.

$ ./gpu_burn -d 3600
Using compare file: compare.ptx
Burning for 3600 seconds.
GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-508f8624-3013-b396-84aa-c207917faf36)
Initialized device 0 with 24257 MB of memory (23646 MB available, using 21282 MB of it), using DOUBLES
Results are 536870912 bytes each, thus performing 39 iterations
10.0%  proc'd: 156 (544 Gflop/s)   errors: 0   temps: 75 C
        Summary at:   Fri Apr 28 06:28:43 PM UTC 2023

20.2%  proc'd: 351 (540 Gflop/s)   errors: 0   temps: 77 C
        Summary at:   Fri Apr 28 06:34:48 PM UTC 2023

30.2%  proc'd: 507 (540 Gflop/s)   errors: 0   temps: 77 C
        Summary at:   Fri Apr 28 06:40:49 PM UTC 2023

40.2%  proc'd: 702 (540 Gflop/s)   errors: 0   temps: 77 C
        Summary at:   Fri Apr 28 06:46:49 PM UTC 2023

50.2%  proc'd: 858 (540 Gflop/s)   errors: 0   temps: 77 C
        Summary at:   Fri Apr 28 06:52:50 PM UTC 2023

60.4%  proc'd: 1053 (540 Gflop/s)   errors: 0   temps: 77 C
        Summary at:   Fri Apr 28 06:58:55 PM UTC 2023

70.4%  proc'd: 1209 (540 Gflop/s)   errors: 0   temps: 78 C
        Summary at:   Fri Apr 28 07:04:56 PM UTC 2023

80.4%  proc'd: 1404 (540 Gflop/s)   errors: 0   temps: 78 C
        Summary at:   Fri Apr 28 07:10:56 PM UTC 2023

90.4%  proc'd: 1599 (540 Gflop/s)   errors: 0   temps: 78 C
        Summary at:   Fri Apr 28 07:16:57 PM UTC 2023

100.0%  proc'd: 1755 (540 Gflop/s)   errors: 0   temps: 77 C
Killing processes with SIGTERM (soft kill)

Killing processes with SIGKILL (force kill)
done

Tested 1 GPUs:
        GPU 0: OK

Another small update: I’ve tried updating my drivers from 525 to 530. I can still reproduce the issue.

I am able to attach gdb to the python process when the device deadlocks. I don’t have debug symbols, so I can’t view the details, but I can see pieces:

#1  0x00007f1286a5168b in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
(gdb) down
#0  0x00007f12a7112cab in sched_yield () at ../sysdeps/unix/syscall-template.S:120
120     in ../sysdeps/unix/syscall-template.S                                     

Looks like there’s a wait on a syscall. My guess is that it’s probably related to a crash handler rather than a root cause, though. I can’t really say from here without symbols.

#11 0x00007f122dc1433c in ?? () from /home/joseph/.pyenv/versions/controlnet/lib/python3.8/site-packages/torch/lib/libcudart-e409450e.so.11.0
(gdb) up
#12 0x00007f122dc6a566 in cudaLaunchKernel () from /home/joseph/.pyenv/versions/controlnet/lib/python3.8/site-packages/torch/lib/libcudart-e409450e.so.11.0
(gdb) up
#13 0x00007f1230eab5f9 in void at::native::gpu_kernel_impl<at::native::CUDAFunctor_add<float> >(at::TensorIteratorBase&, at::native::CUDAFunctor_add<float> const&)
    () from /home/joseph/.pyenv/versions/controlnet/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cu.so

Is it possible for a custom kernel function (CUDA Functor Add) to trigger a GPU failure?

Here’s the full thread list. Looks like a bunch of waiting. Fans are on full-tilt.

(gdb) info threads
  Id   Target Id                                         Frame
  1    Thread 0x7f12a6f21b80 (LWP 4438) "python"         0x00007f1286cda79c in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
  2    Thread 0x7f122adff640 (LWP 4494) "python"         __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f122d58eae0 <thread_status+96>) at ./nptl/futex-internal.c:57
  3    Thread 0x7f122a5fe640 (LWP 4495) "python"         __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f122d58eb60 <thread_status+224>) at ./nptl/futex-internal.c:57
  4    Thread 0x7f1229dfd640 (LWP 4496) "python"         __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f122d58ebe0 <thread_status+352>) at ./nptl/futex-internal.c:57
  5    Thread 0x7f12295fc640 (LWP 4497) "python"         __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f122d58ec60 <thread_status+480>) at ./nptl/futex-internal.c:57
  6    Thread 0x7f1224dfb640 (LWP 4498) "python"         __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f122d58ece0 <thread_status+608>) at ./nptl/futex-internal.c:57
  7    Thread 0x7f12205fa640 (LWP 4499) "python"         __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f122d58ed60 <thread_status+736>) at ./nptl/futex-internal.c:57
  8    Thread 0x7f121ddf9640 (LWP 4500) "python"         __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7f122d58ede0 <thread_status+864>) at ./nptl/futex-internal.c:57
  9    Thread 0x7f120367e640 (LWP 4735) "python"         0x00007f129c818b31 in ?? () from /home/joseph/.pyenv/versions/controlnet/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  10   Thread 0x7f1202e7d640 (LWP 4736) "python"         0x00007f129c818b31 in ?? () from /home/joseph/.pyenv/versions/controlnet/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  11   Thread 0x7f120267c640 (LWP 4737) "python"         0x00007f129c818b31 in ?? () from /home/joseph/.pyenv/versions/controlnet/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1
  12   Thread 0x7f10e1172640 (LWP 4744) "cuda-EvtHandlr" 0x00007f12a7122d7f in __GI___poll (fds=0x55939d7964a0, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
* 13   Thread 0x7f10e0971640 (LWP 4745) "cuda-EvtHandlr" 0x00007f1286cda799 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
  14   Thread 0x7f10dbd9e640 (LWP 4746) "python"         __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x7f10dbd9dde0, op=393, expected=0, futex_word=0x559389e86698) at ./nptl/futex-internal.c:57
  15   Thread 0x7f1201e3b640 (LWP 4747) "python"         __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7f1201e39ed0, op=393, expected=0, futex_word=0x7f10f0000fe0) at ./nptl/futex-internal.c:57
  16   Thread 0x7f120163a640 (LWP 4748) "python"         __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7f12016390b0, op=393, expected=0, futex_word=0x7f0fb8001200) at ./nptl/futex-internal.c:57
  17   Thread 0x7f1200d79640 (LWP 4749) "python"         __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x5594846c867c) at ./nptl/futex-internal.c:57
  18   Thread 0x7f11fbfff640 (LWP 4750) "python"         __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7f11fbffe0b0, op=393, expected=0, futex_word=0x7f0e90001200) at ./nptl/futex-internal.c:57

EDIT: I reached out to the manufacturer and to some folks on the PyTorch forums to see whether it was possible to come at this from a software angle. It looks like I had cause and effect swapped: the crash happened BEFORE this deadlock, and all the waits are only symptoms of it. I’m currently rerunning gpu_burn with tensor cores enabled to see if perhaps the error is coming from those.

I can reproduce the issue 100% of the time, but I’m trying to narrow down the cause. There just aren’t enough hours in the day. :)

A final update until a resolution. The issue may have been solved by one of three things: enabling persistence mode, reducing dynamic clock switching, or lowering the clock speed cap.

I was planning to just cap the device speed to see if it would help with stability but got a notice that ‘persistence mode’ was disabled.

Here are the exact steps in the terminal. I haven’t been able to reproduce the issue since then, whereas before it always happened within a few minutes of training:

$ sudo nvidia-smi -lgc 0,1000
GPU clocks set to "(gpuClkMin 0, gpuClkMax 1000)" for GPU 00000000:01:00.0

Warning: persistence mode is disabled on device 00000000:01:00.0. See the Known Issues section of the nvidia-smi(1) man page for more information. Run with [--help | -h] switch to get more information on how to enable persistence mode.
All done.
$ sudo nvidia-smi -pm 1
Enabled persistence mode for GPU 00000000:01:00.0.
All done.
$ sudo nvidia-smi -lgc 0,1000
GPU clocks set to "(gpuClkMin 0, gpuClkMax 1000)" for GPU 00000000:01:00.0
All done.

Normally, the clock maxes out at 1.9 GHz but averages 1.3 GHz during training, so with the 1000 MHz cap I’m running at about 75% speed. That’s fine for me, since the training machine is sitting in a corner for weeks at a time. Completing a training job slowly is better than not completing one quickly.

If I can find the nerve to divide and conquer, I’ll try leaving persistence mode on with the clocks uncapped, and then the reverse, to narrow it down.
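The divide-and-conquer runs amount to a 2x2 matrix: persistence mode on or off, clock cap on or off. A minimal sketch that just builds the nvidia-smi command lines for each combination is below. The `-pm` and `-lgc` flags are the ones used above; `-rgc` (reset GPU clocks) is not used anywhere in this thread, and I am assuming it undoes `-lgc`. Nothing here executes the commands, and the function names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Optional


def nvidia_smi_commands(persistence: bool, cap_mhz: Optional[int]) -> List[List[str]]:
    """Build the nvidia-smi invocations for one test configuration."""
    commands = [["sudo", "nvidia-smi", "-pm", "1" if persistence else "0"]]
    if cap_mhz is None:
        # Assumption: -rgc restores default clocks after an earlier -lgc.
        commands.append(["sudo", "nvidia-smi", "-rgc"])
    else:
        commands.append(["sudo", "nvidia-smi", "-lgc", f"0,{cap_mhz}"])
    return commands


def build_matrix(cap_mhz: int = 1000) -> List[Dict[str, object]]:
    """Enumerate all four persistence/cap combinations with their commands."""
    return [
        {
            "persistence": persistence,
            "cap_mhz": cap,
            "commands": nvidia_smi_commands(persistence, cap),
        }
        for persistence, cap in product([True, False], [cap_mhz, None])
    ]


if __name__ == "__main__":
    for config in build_matrix():
        print(f"persistence={config['persistence']} cap={config['cap_mhz']}")
        for command in config["commands"]:
            print("   ", " ".join(command))
```

Running the same training job once per configuration and noting which ones hit the Xid error would separate the effect of persistence mode from the effect of the clock cap.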

To preempt the “are you sure it’s not the PSU?” comments, I am able to draw significantly more power overnight during gpu_burn runs. These training runs are far from the most stressful workloads for this GPU. They are, however, the ones that cause the most dynamic clock switching.

Apologies for bumping this old issue but I have new updates and it doesn’t look like anyone will be notified, so I think it’s okay.

I upgraded the PSU from 650 watts to 1200 watts. We’re now at less than 50% of max draw even at full tilt, but the problem still appears. I can say with almost perfect certainty that it’s not a PSU issue.

I reported the issue to Zotac, who finally agreed to RMA the card after a lot of back and forth. Their support system directs every reply to a different person working from the same script, so it was a few days of repeating, “I have tried upgrading the drivers. No, it is not a power issue. Yes, it is not a software issue.” Ultimately, though, they said that if they couldn’t find the issue I’d be billed for it, and I’m not sure I trust them to be able to reproduce it. So I’ll probably just suck it up and deal with the card failing every once in a while until I can afford to replace it.