Hi, all,
I have a new 3090. It can run a heavy workload overnight just fine, but every so often it seems to deadlock on an operation.
I had nvidia-smi dmon running while it died:
# gpu pwr gtemp mtemp sm mem enc dec mclk pclk
# Idx W C C % % % % MHz MHz
0 292 81 - 68 33 0 0 9501 1020
0 279 81 - 97 50 0 0 9501 1095
0 296 81 - 91 44 0 0 9501 990
0 288 81 - 55 27 0 0 9501 1515
0 297 82 - 65 34 0 0 9501 780
0 288 82 - 56 32 0 0 9501 787
0 284 80 - 57 38 0 0 9501 1785
0 300 81 - 55 40 0 0 9501 1785
0 299 79 - 82 57 0 0 9501 1155
0 287 80 - 88 59 0 0 9501 1440
0 278 81 - 99 65 0 0 9501 1560
0 279 79 - 100 58 0 0 9501 1365
0 277 76 - 56 36 0 0 9501 1920
0 289 76 - 63 47 0 0 9501 1920
0 288 77 - 88 60 0 0 9501 1905
0 282 78 - 100 51 0 0 9501 1725
0 283 77 - 100 49 0 0 9501 1905
0 - - -
Note that the last line is incomplete. I did not truncate it; nvidia-smi did.
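In case it is useful to anyone comparing similar logs, here is a minimal Python sketch for separating complete dmon samples from cut-off rows like the last one above. The function name `parse_dmon` is my own, and the sketch assumes the first `#` header row names the columns, as in the output pasted above.

```python
def parse_dmon(text):
    """Split nvidia-smi dmon output into complete samples and incomplete rows."""
    columns = None
    samples, incomplete = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Assumption: the first header row names the columns,
            # and any later header row (units) can be skipped.
            if columns is None:
                columns = line.lstrip("#").split()
            continue
        fields = line.split()
        # A row counts as complete only if it has one field per column.
        if columns is not None and len(fields) == len(columns):
            samples.append(dict(zip(columns, fields)))
        else:
            incomplete.append(line)
    return samples, incomplete


if __name__ == "__main__":
    log = """\
# gpu pwr gtemp mtemp sm mem enc dec mclk pclk
# Idx W C C % % % % MHz MHz
0 282 78 - 100 51 0 0 9501 1725
0 283 77 - 100 49 0 0 9501 1905
0 - - -
"""
    samples, incomplete = parse_dmon(log)
    print("last complete sample:", samples[-1])
    print("incomplete rows:", incomplete)
```

The last complete sample is the final reading the GPU reported before it stopped responding, so it is the one worth comparing against the crash timestamp in dmesg.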
Here is some selected output from dmesg, with duplicates omitted:
[ 5.605621] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 5.652453] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 525.105.17 Tue Mar 28 18:02:59 UTC 2023
[ 5.663885] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 525.105.17 Tue Mar 28 22:18:37 UTC 2023
[ 5.681370] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 5.681372] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[ 5.912934] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[ 5.916542] nvidia-uvm: Loaded the UVM driver, major device number 507.
[ 9.173026] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[ 9.173101] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s31f6: link becomes ready
[ 17.852287] loop5: detected capacity change from 0 to 8
[350775.385809] FS-Cache: Loaded
[350775.403050] FS-Cache: Netfs 'cifs' registered for caching
[350775.405237] Key type cifs.spnego registered
[350775.405245] Key type cifs.idmap registered
[350775.405500] Malformed UNC in devname
...
[350775.405519] CIFS: VFS: Malformed UNC in devname
[350920.653476] CIFS: No dialect specified on mount. Default has changed to a more secure dialect, SMB2.1 or later (e.g. SMB3.1.1), from CIFS (SMB1). To use the less secure SMB1 dialect to access old servers which do not support SMB3.1.1 (or even SMB3 or SMB2.1) specify vers=1.0 on mount.
[350920.653485] CIFS: Attempting to mount \\synology\MLData
[350920.695369] CIFS: Status code returned 0xc000006d STATUS_LOGON_FAILURE
[350920.695391] CIFS: VFS: \\synology Send error in SessSetup = -13
[351076.195760] CIFS: Attempting to mount \\synology\MLData
[352377.832353] CIFS: Attempting to mount \\synology\MLData
...
[362713.785429] perf: interrupt took too long (2512 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[398329.008318] Lockdown: mdadm: /dev/mem,kmem,port is restricted; see man kernel_lockdown.7
[398384.755321] Lockdown: mdadm: /dev/mem,kmem,port is restricted; see man kernel_lockdown.7
[440008.299189] NVRM: GPU at PCI:0000:01:00: GPU-508f8624-3013-b396-84aa-c207917faf36
[440008.299192] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[440008.299194] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[440008.300548] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[440553.072866] sysrq: Show backtrace of all active CPUs
[440553.072884] NMI backtrace for cpu 1
[440553.072885] CPU: 1 PID: 107535 Comm: nvidia-bug-repo Tainted: P O 5.15.0-70-generic #77-Ubuntu
[440553.072887] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.072888] Call Trace:
[440553.072889] <TASK>
[440553.072890] show_stack+0x52/0x5c
[440553.072894] dump_stack_lvl+0x4a/0x63
[440553.072896] dump_stack+0x10/0x16
[440553.072897] nmi_cpu_backtrace.cold+0x4d/0x93
[440553.072899] ? lapic_can_unplug_cpu+0x90/0x90
[440553.072902] nmi_trigger_cpumask_backtrace+0xec/0x100
[440553.072905] arch_trigger_cpumask_backtrace+0x19/0x20
[440553.072908] sysrq_handle_showallcpus+0x17/0x20
[440553.072910] __handle_sysrq.cold+0xc9/0x1a6
[440553.072912] ? apparmor_file_permission+0x70/0x160
[440553.072914] write_sysrq_trigger+0x28/0x40
[440553.072916] proc_reg_write+0x5b/0xa0
[440553.072918] ? __cond_resched+0x1a/0x50
[440553.072921] vfs_write+0xc4/0x270
[440553.072923] ksys_write+0x67/0xf0
[440553.072924] __x64_sys_write+0x19/0x20
[440553.072926] do_syscall_64+0x59/0xc0
[440553.072928] ? syscall_exit_to_user_mode+0x27/0x50
[440553.072930] ? __x64_sys_close+0x11/0x50
[440553.072932] ? do_syscall_64+0x69/0xc0
[440553.072934] ? __x64_sys_close+0x11/0x50
[440553.072935] ? do_syscall_64+0x69/0xc0
[440553.072937] ? irqentry_exit_to_user_mode+0x9/0x20
[440553.072939] ? irqentry_exit+0x1d/0x30
[440553.072940] ? exc_page_fault+0x89/0x170
[440553.072942] entry_SYSCALL_64_after_hwframe+0x61/0xcb
[440553.072943] RIP: 0033:0x7f5bb4424a37
[440553.072946] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[440553.072947] RSP: 002b:00007ffcfe47d618 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[440553.072949] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5bb4424a37
[440553.072950] RDX: 0000000000000002 RSI: 000055e971460560 RDI: 0000000000000001
[440553.072951] RBP: 000055e971460560 R08: 000055e971456f02 R09: 0000000000000000
[440553.072952] R10: 000055e971456f01 R11: 0000000000000246 R12: 0000000000000001
[440553.072953] R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
[440553.072955] </TASK>
[440553.072956] Sending NMI from CPU 1 to CPUs 0,2-7:
[440553.072960] NMI backtrace for cpu 5
[440553.072961] CPU: 5 PID: 0 Comm: swapper/5 Tainted: P O 5.15.0-70-generic #77-Ubuntu
[440553.072963] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.072964] RIP: 0010:mwait_idle_with_hints.constprop.0+0x4f/0xa0
[440553.072966] Code: 48 89 d1 65 48 8b 04 25 c0 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 78 b7 b7 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 fb 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[440553.072967] RSP: 0018:ffffa924400fbdf0 EFLAGS: 00000046
[440553.072968] RAX: 0000000000000040 RBX: 0000000000000001 RCX: 0000000000000001
[440553.072969] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000040
[440553.072970] RBP: ffffa924400fbe18 R08: 000190ae40932dc6 R09: 0000000000000000
[440553.072971] R10: 0000000000000001 R11: 071c71c71c71c71c R12: 0000000000000006
[440553.072971] R13: ffffffffaf4d49c0 R14: 0000000000000006 R15: ffffffffaf4d4c48
[440553.072972] FS: 0000000000000000(0000) GS:ffff9ae2b6540000(0000) knlGS:0000000000000000
[440553.072974] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.072975] CR2: 0000562e6e0374d8 CR3: 0000000324610005 CR4: 00000000003706e0
[440553.072976] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.072976] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.072977] Call Trace:
[440553.072977] <TASK>
[440553.072978] ? intel_idle_ibrs+0x4d/0xd0
[440553.072980] cpuidle_enter_state+0x97/0x620
[440553.072982] ? tick_nohz_stop_tick+0x16a/0x1d0
[440553.072984] cpuidle_enter+0x2e/0x50
[440553.072985] cpuidle_idle_call+0x142/0x1e0
[440553.072987] do_idle+0x83/0xf0
[440553.072988] cpu_startup_entry+0x20/0x30
[440553.072990] start_secondary+0x12a/0x180
[440553.072992] secondary_startup_64_no_verify+0xc2/0xcb
[440553.072995] </TASK>
[440553.072996] NMI backtrace for cpu 3
[440553.072997] CPU: 3 PID: 0 Comm: swapper/3 Tainted: P O 5.15.0-70-generic #77-Ubuntu
[440553.072999] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.073000] RIP: 0010:mwait_idle_with_hints.constprop.0+0x4f/0xa0
[440553.073002] Code: 48 89 d1 65 48 8b 04 25 c0 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 78 b7 b7 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 fb 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[440553.073004] RSP: 0018:ffffa924400ebdf0 EFLAGS: 00000046
[440553.073005] RAX: 0000000000000020 RBX: 0000000000000001 RCX: 0000000000000001
[440553.073006] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000020
[440553.073007] RBP: ffffa924400ebe18 R08: 000190ae416871a1 R09: 00000000000c3500
[440553.073008] R10: 0000000000000007 R11: 071c71c71c71c71c R12: 0000000000000004
[440553.073008] R13: ffffffffaf4d49c0 R14: 0000000000000004 R15: ffffffffaf4d4b78
[440553.073009] FS: 0000000000000000(0000) GS:ffff9ae2b64c0000(0000) knlGS:0000000000000000
[440553.073010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.073011] CR2: 00007f31bf953a70 CR3: 0000000324610003 CR4: 00000000003706e0
[440553.073012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.073013] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.073014] Call Trace:
[440553.073014] <TASK>
[440553.073015] ? intel_idle_ibrs+0x4d/0xd0
[440553.073017] cpuidle_enter_state+0x97/0x620
[440553.073019] cpuidle_enter+0x2e/0x50
[440553.073020] cpuidle_idle_call+0x142/0x1e0
[440553.073022] do_idle+0x83/0xf0
[440553.073024] cpu_startup_entry+0x20/0x30
[440553.073025] start_secondary+0x12a/0x180
[440553.073027] secondary_startup_64_no_verify+0xc2/0xcb
[440553.073030] </TASK>
[440553.073030] NMI backtrace for cpu 7
[440553.073031] CPU: 7 PID: 0 Comm: swapper/7 Tainted: P O 5.15.0-70-generic #77-Ubuntu
[440553.073033] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.073034] RIP: 0010:mwait_idle_with_hints.constprop.0+0x4f/0xa0
[440553.073035] Code: 48 89 d1 65 48 8b 04 25 c0 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 78 b7 b7 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 fb 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[440553.073036] RSP: 0018:ffffa9244010bdf0 EFLAGS: 00000046
[440553.073037] RAX: 0000000000000040 RBX: 0000000000000001 RCX: 0000000000000001
[440553.073038] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000040
[440553.073039] RBP: ffffa9244010be18 R08: 000190ae414ab521 R09: 00000000000c3500
[440553.073039] R10: 0000000000000007 R11: 071c71c71c71c71c R12: 0000000000000006
[440553.073040] R13: ffffffffaf4d49c0 R14: 0000000000000006 R15: ffffffffaf4d4c48
[440553.073041] FS: 0000000000000000(0000) GS:ffff9ae2b65c0000(0000) knlGS:0000000000000000
[440553.073042] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.073042] CR2: 000055a281de74d8 CR3: 0000000324610005 CR4: 00000000003706e0
[440553.073043] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.073044] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.073044] Call Trace:
[440553.073045] <TASK>
[440553.073045] ? intel_idle_ibrs+0x4d/0xd0
[440553.073047] cpuidle_enter_state+0x97/0x620
[440553.073048] cpuidle_enter+0x2e/0x50
[440553.073049] cpuidle_idle_call+0x142/0x1e0
[440553.073051] do_idle+0x83/0xf0
[440553.073052] cpu_startup_entry+0x20/0x30
[440553.073053] start_secondary+0x12a/0x180
[440553.073055] secondary_startup_64_no_verify+0xc2/0xcb
[440553.073057] </TASK>
[440553.073058] NMI backtrace for cpu 0
[440553.073059] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P O 5.15.0-70-generic #77-Ubuntu
[440553.073061] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.073061] RIP: 0010:mwait_idle_with_hints.constprop.0+0x4f/0xa0
[440553.073064] Code: 48 89 d1 65 48 8b 04 25 c0 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 78 b7 b7 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 fb 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[440553.073065] RSP: 0018:ffffffffaf203d88 EFLAGS: 00000046
[440553.073066] RAX: 0000000000000010 RBX: 0000000000000003 RCX: 0000000000000001
[440553.073067] RDX: 0000000000000000 RSI: ffffffffaf4d49c0 RDI: 0000000000000010
[440553.073068] RBP: ffffffffaf203da8 R08: 000190ae41667cf8 R09: 0000000000030d40
[440553.073069] R10: 0000000000000007 R11: 071c71c71c71c71c R12: 0000000000000003
[440553.073070] R13: ffffffffaf4d49c0 R14: 0000000000000003 R15: ffffffffaf4d4b10
[440553.073071] FS: 0000000000000000(0000) GS:ffff9ae2b6400000(0000) knlGS:0000000000000000
[440553.073072] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.073073] CR2: 00005619c6c68000 CR3: 0000000324610002 CR4: 00000000003706f0
[440553.073074] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.073074] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.073075] Call Trace:
[440553.073076] <TASK>
[440553.073076] ? intel_idle+0x30/0x50
[440553.073078] cpuidle_enter_state+0x97/0x620
[440553.073081] cpuidle_enter+0x2e/0x50
[440553.073082] cpuidle_idle_call+0x142/0x1e0
[440553.073084] do_idle+0x83/0xf0
[440553.073085] cpu_startup_entry+0x20/0x30
[440553.073086] rest_init+0xd3/0x100
[440553.073088] ? acpi_enable_subsystem+0x20b/0x217
[440553.073090] arch_call_rest_init+0xe/0x23
[440553.073092] start_kernel+0x4a9/0x4ca
[440553.073094] x86_64_start_reservations+0x24/0x2a
[440553.073095] x86_64_start_kernel+0xfb/0x106
[440553.073097] secondary_startup_64_no_verify+0xc2/0xcb
[440553.073100] </TASK>
[440553.073100] NMI backtrace for cpu 4
[440553.073101] CPU: 4 PID: 0 Comm: swapper/4 Tainted: P O 5.15.0-70-generic #77-Ubuntu
[440553.073103] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.073103] RIP: 0010:mwait_idle_with_hints.constprop.0+0x4f/0xa0
[440553.073105] Code: 48 89 d1 65 48 8b 04 25 c0 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 78 b7 b7 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 fb 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[440553.073106] RSP: 0018:ffffa924400f3df0 EFLAGS: 00000046
[440553.073107] RAX: 0000000000000040 RBX: 0000000000000001 RCX: 0000000000000001
[440553.073108] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000040
[440553.073108] RBP: ffffa924400f3e18 R08: 000190ae414a42b3 R09: 0000000000000000
[440553.073109] R10: 0000000000000001 R11: 071c71c71c71c71c R12: 0000000000000006
[440553.073110] R13: ffffffffaf4d49c0 R14: 0000000000000006 R15: ffffffffaf4d4c48
[440553.073111] FS: 0000000000000000(0000) GS:ffff9ae2b6500000(0000) knlGS:0000000000000000
[440553.073111] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.073112] CR2: 00007f2dbc64ea50 CR3: 0000000324610003 CR4: 00000000003706e0
[440553.073113] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.073114] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.073114] Call Trace:
[440553.073115] <TASK>
[440553.073115] ? intel_idle_ibrs+0x4d/0xd0
[440553.073117] cpuidle_enter_state+0x97/0x620
[440553.073118] ? tick_nohz_stop_tick+0x16a/0x1d0
[440553.073119] cpuidle_enter+0x2e/0x50
[440553.073120] cpuidle_idle_call+0x142/0x1e0
[440553.073122] do_idle+0x83/0xf0
[440553.073123] cpu_startup_entry+0x20/0x30
[440553.073125] start_secondary+0x12a/0x180
[440553.073126] secondary_startup_64_no_verify+0xc2/0xcb
[440553.073129] </TASK>
[440553.073129] NMI backtrace for cpu 2
[440553.073131] CPU: 2 PID: 106415 Comm: python Tainted: P O 5.15.0-70-generic #77-Ubuntu
[440553.073133] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.073133] RIP: 0010:entry_SYSCALL_64_after_hwframe+0x57/0xcb
[440553.073136] Code: 45 31 e4 45 31 ed 45 31 f6 45 31 ff 48 89 e7 48 63 f0 66 90 b9 48 00 00 00 65 48 8b 14 25 c8 fb 01 00 89 d0 48 c1 ea 20 0f 30 <0f> 1f 44 00 00 e8 07 3a fa ff 0f 1f 44 00 00 48 8b 4c 24 58 4c 8b
[440553.073137] RSP: 0018:ffffa92441bb3f58 EFLAGS: 00000046
[440553.073139] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000048
[440553.073140] RDX: 0000000000000000 RSI: 0000000000000018 RDI: ffffa92441bb3f58
[440553.073141] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[440553.073141] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[440553.073142] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[440553.073143] FS: 00007fb025f0ab80(0000) GS:ffff9ae2b6480000(0000) knlGS:0000000000000000
[440553.073144] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.073145] CR2: 000056038f0b0cc0 CR3: 0000000808942004 CR4: 00000000003706e0
[440553.073146] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.073146] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.073147] Call Trace:
[440553.073148] <TASK>
[440553.073150] </TASK>
[440553.073150] NMI backtrace for cpu 6
[440553.073151] CPU: 6 PID: 0 Comm: swapper/6 Tainted: P O 5.15.0-70-generic #77-Ubuntu
[440553.073153] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1902 06/24/2016
[440553.073154] RIP: 0010:mwait_idle_with_hints.constprop.0+0x4f/0xa0
[440553.073158] Code: 48 89 d1 65 48 8b 04 25 c0 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 14 66 90 0f 00 2d 78 b7 b7 00 b9 01 00 00 00 48 89 f8 0f 01 c9 <65> 48 8b 04 25 c0 fb 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b
[440553.073159] RSP: 0018:ffffa92440103df0 EFLAGS: 00000046
[440553.073161] RAX: 0000000000000040 RBX: 0000000000000001 RCX: 0000000000000001
[440553.073162] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000040
[440553.073163] RBP: ffffa92440103e18 R08: 000190ae410d4c9a R09: 0000000000000000
[440553.073165] R10: 0000000000000001 R11: 071c71c71c71c71c R12: 0000000000000006
[440553.073166] R13: ffffffffaf4d49c0 R14: 0000000000000006 R15: ffffffffaf4d4c48
[440553.073167] FS: 0000000000000000(0000) GS:ffff9ae2b6580000(0000) knlGS:0000000000000000
[440553.073169] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[440553.073170] CR2: 00007f5bb43a2db0 CR3: 0000000324610005 CR4: 00000000003706e0
[440553.073172] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[440553.073173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[440553.073174] Call Trace:
[440553.073175] <TASK>
[440553.073176] ? intel_idle_ibrs+0x4d/0xd0
[440553.073179] cpuidle_enter_state+0x97/0x620
[440553.073181] ? tick_nohz_stop_tick+0x16a/0x1d0
[440553.073184] cpuidle_enter+0x2e/0x50
[440553.073186] cpuidle_idle_call+0x142/0x1e0
[440553.073189] do_idle+0x83/0xf0
[440553.073191] cpu_startup_entry+0x20/0x30
[440553.073193] start_secondary+0x12a/0x180
[440553.073196] secondary_startup_64_no_verify+0xc2/0xcb
[440553.073200] </TASK>
[440556.401698] snd_hda_intel 0000:01:00.1: can't change power state from D3cold to D0 (config space inaccessible)
[440556.778840] snd_hda_codec_hdmi hdaudioC1D0: Unable to sync register 0x4f0800. -5
[440556.778860] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
[440556.778862] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
...
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 07)
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:17.0 SATA controller: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1b.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #17 (rev f1)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1)
00:1c.2 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #3 (rev f1)
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Z170 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V (rev 31)
01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev ff)
01:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev ff)
03:00.0 USB controller: ASMedia Technology Inc. ASM1142 USB 3.1 Host Controller
04:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 04)
06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
And the output of lspci after the crash:
$ lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 07)
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:17.0 SATA controller: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1b.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #17 (rev f1)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1)
00:1c.2 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #3 (rev f1)
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Z170 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V (rev 31)
01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev ff)
01:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev ff)
03:00.0 USB controller: ASMedia Technology Inc. ASM1142 USB 3.1 Host Controller
04:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 04)
06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
Any input that helps me narrow this down to a PSU issue, hardware issue, or driver issue would be appreciated.
The full NVIDIA bug report log is attached:
controlnet-training-crash-nvidia-bug-report.log.gz (341.4 KB)
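For anyone who wants to check their own dmesg for the same failure, here is a minimal Python sketch that pulls the Xid events out of the log text. The regular expression is based only on the format of the Xid line in my dmesg output above; other driver versions may format the line differently, and the name `find_xid_events` is my own.

```python
import re

# Pattern modelled on this dmesg line from the post:
# [440008.299192] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', ...
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid "
    r"\((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<detail>.*)"
)


def find_xid_events(dmesg_text):
    """Return (timestamp, pci_address, xid_code, detail) for each Xid line."""
    events = []
    for line in dmesg_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (
                    float(match["ts"]),
                    match["pci"],
                    int(match["xid"]),
                    match["detail"].strip(),
                )
            )
    return events


if __name__ == "__main__":
    sample = (
        "[440008.299192] NVRM: Xid (PCI:0000:01:00): 79, "
        "pid='<unknown>', name=<unknown>, GPU has fallen off the bus.\n"
        "[440008.299194] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.\n"
    )
    for ts, pci, xid, detail in find_xid_events(sample):
        print(f"{ts:.6f} {pci} Xid {xid}: {detail}")
```

Only the first of the two sample lines matches, because the second is a plain NVRM message without an Xid code.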