GPU is falling off the bus. Need help on how to proceed

nvidia-bug-report.log.gz (66.0 KB)

I’m getting GPU crashes in Linux. They appear to be random, but usually occur when watching YouTube videos in full screen. The screen freezes, but the computer does not: I can still log in remotely and access it.

Attached is the nvidia-bug-report.log.gz.

This is the kernel error from these events:

[ 2434.575193] NVRM: GPU at PCI:0000:01:00: GPU-11d3516a-065c-f341-452b-55043e080074
[ 2434.575196] NVRM: Xid (PCI:0000:01:00): 79, pid=0, GPU has fallen off the bus.
[ 2434.575198] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[ 2434.575209] NVRM: GPU 0000:01:00.0: GPU serial number is .
[ 2434.575216] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[ 2434.783134] irq 16: nobody cared (try booting with the "irqpoll" option)
[ 2434.783138] CPU: 2 PID: 0 Comm: swapper/2 Tainted: P O 5.15.16-gentoo #2
[ 2434.783139] Hardware name: Gigabyte Technology Co., Ltd. Z370 AORUS ULTRA GAMING 2.0/Z370 AORUS ULTRA GAMING
[ 2434.783140] Call Trace:
[ 2434.783142]
[ 2434.783143] dump_stack_lvl+0x34/0x44
[ 2434.783147] __report_bad_irq+0x30/0xa2
[ 2434.783149] note_interrupt.cold+0xb/0x61
[ 2434.783150] handle_irq_event+0x95/0xa0
[ 2434.783153] handle_fasteoi_irq+0x6e/0x1b0
[ 2434.783155] __common_interrupt+0x39/0x90
[ 2434.783158] common_interrupt+0x7b/0xa0
[ 2434.783161]
[ 2434.783161]
[ 2434.783161] asm_common_interrupt+0x1e/0x40
[ 2434.783164] RIP: 0010:cpuidle_enter_state+0xc5/0x310
[ 2434.783166] Code: 49 89 c5 0f 1f 44 00 00 31 ff e8 96 22 75 ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 2e 02 00 0
[ 2434.783168] RSP: 0018:ffffa393800d3ea8 EFLAGS: 00000202
[ 2434.783169] RAX: ffff99c5eeca9840 RBX: 0000000000000008 RCX: 000000000000001f
[ 2434.783170] RDX: 0000000000000000 RSI: 000000002d958513 RDI: 0000000000000000
[ 2434.783171] RBP: ffff99c5eecb2070 R08: 00000236e4623845 R09: 00000237ae88fc15
[ 2434.783172] R10: 0000000000000002 R11: ffff99c5eeca8864 R12: ffffffffb89c9b00
[ 2434.783173] R13: 00000236e4623845 R14: 0000000000000008 R15: 0000000000000000
[ 2434.783174] ? cpuidle_enter_state+0xaa/0x310
[ 2434.783175] cpuidle_enter+0x24/0x40
[ 2434.783177] do_idle+0x1c6/0x250
[ 2434.783179] cpu_startup_entry+0x14/0x20
[ 2434.783180] secondary_startup_64_no_verify+0xb0/0xbb
[ 2434.783182]
[ 2434.783182] handlers:
[ 2434.783183] [<00000000ffb4f5a5>] i801_isr
[ 2434.783186] Disabling IRQ #16

How can one fix this? Thank you!

This is most likely a hardware issue. Please monitor the GPU's temperatures and make sure the fans are working. Try reseating the GPU in its PCIe slot, check whether it works in another slot if one is available, and check whether it works in another system.
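
For the temperature and fan check, something like the following could be left running (for example over SSH) until the next freeze, so the last readings before the crash are on disk. This is only a rough sketch: it assumes nvidia-smi is on the PATH and that the card reports the `temperature.gpu`, `fan.speed` and `power.draw` query fields. The 85 °C limit and the `gpu_health.log` file name are placeholders I picked, not official values.

```python
import csv
import io
import os
import subprocess
import time

# Query fields as listed by `nvidia-smi --help-query-gpu`.
FIELDS = ["timestamp", "temperature.gpu", "fan.speed", "power.draw"]


def parse_row(line):
    """Turn one CSV line of nvidia-smi output into a dict keyed by FIELDS."""
    values = next(csv.reader(io.StringIO(line), skipinitialspace=True))
    return dict(zip(FIELDS, (v.strip() for v in values)))


def too_hot(row, limit_c=85):
    """True if the reported temperature exceeds limit_c (an assumed threshold)."""
    try:
        return int(row["temperature.gpu"]) > limit_c
    except ValueError:  # the field was not a number, e.g. "[N/A]"
        return False


def poll(interval_s=2.0, logfile="gpu_health.log"):
    """Append one reading per GPU to logfile until nvidia-smi stops working."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    with open(logfile, "a") as log:
        while True:
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                # Expected once the GPU has dropped off the bus.
                log.write("nvidia-smi failed: " + result.stdout.strip() + "\n")
                break
            for line in result.stdout.splitlines():
                flag = " HOT" if too_hot(parse_row(line)) else ""
                log.write(line + flag + "\n")
            # Push the data to disk so it survives a hard freeze.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval_s)


if __name__ == "__main__":
    poll()
```

If the last lines before an Xid 79 show normal temperatures and a spinning fan, overheating becomes less likely and the slot, power supply, or the card itself are the next things to check.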