Question on disabling a potentially failed GPU

Hi there,

We have a server with 8 GPUs, and some of them appear to be failing at the hardware level. The following kernel log points to this:

Mar 4 00:12:25 lux1-training-prod-001 kernel: [221342.906485] NVRM: _kgspBootGspRm: unexpected WPR2 already up, cannot proceed with booting GSP
Mar 4 00:12:25 lux1-training-prod-001 kernel: [221342.906486] NVRM: _kgspBootGspRm: (the GPU is likely in a bad state and may need to be reset)
Mar 4 00:12:25 lux1-training-prod-001 kernel: [221342.906493] NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
Mar 4 00:12:25 lux1-training-prod-001 kernel: [221342.908062] NVRM: GPU 0000:3d:00.0: RmInitAdapter failed! (0x62:0x40:1784)
Mar 4 00:12:25 lux1-training-prod-001 kernel: [221342.909274] NVRM: GPU 0000:3d:00.0: rm_init_adapter failed, device minor number 4
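
The messages say the GPU may need to be reset, so one thing I plan to try before scheduling the repair is resetting only that device. This is just a rough sketch, assuming the bad GPU is the one at 0000:3d:00.0 from the log above and that nothing is using it (and as far as I know, the nvidia-smi GPU reset is not guaranteed to be supported on GeForce boards):

# See how the device currently shows up on the PCIe bus
lspci -s 3d:00.0 -vvv

# Ask the driver to reset only this GPU; -i also accepts the PCI bus ID
sudo nvidia-smi --gpu-reset -i 00000000:3D:00.0

# If that is not supported, try a function-level reset through sysfs
# (the "reset" node only exists if the device supports it)
echo 1 | sudo tee /sys/bus/pci/devices/0000:3d:00.0/reset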

Not sure if it is related, but our Ubuntu 20.04 server (kernel 5.4.0-156-generic) also occasionally hits a CPU soft lockup that hangs the whole system. An example backtrace looks like this:

Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777906] CPU: 47 PID: 2231504 Comm: nvidia-smi Tainted: G OEL 5.4.0-156-generic #173-Ubuntu
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777907] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 3.8b 01/17/2023
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777915] RIP: 0010:smp_call_function_single+0x9b/0x110
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777918] Code: 65 8b 05 d0 87 8d 4e a9 00 01 1f 00 75 79 85 c9 75 40 48 c7 c6 80 0d 03 00 65 48 03 35 86 1f 8d 4e 8b 46 18 a8 01 74 09 f3 90 <8b> 46 18 a8 01 75 f7 83 4e 18 01 4c 89 c9 4c 89 c2 e8 7f fe ff ff
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777918] RSP: 0018:ffffa14b6d7dfba0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777920] RAX: 0000000000000001 RBX: 000000002048ac01 RCX: 0000000000000000
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777921] RDX: 0000000000000000 RSI: ffff8b4d80cf0d80 RDI: 0000000000000043
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777922] RBP: ffffa14b6d7dfbe8 R08: ffffffffb1647630 R09: 0000000000000000
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777923] R10: 0000000000000000 R11: ffff8b4d668d5800 R12: 0000000000000043
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777924] R13: 0000e9f61dfeb5a6 R14: 0000000000000000 R15: ffff8b3c8899b800
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777925] FS: 00007f7ef0117340(0000) GS:ffff8b4d80cc0000(0000) knlGS:0000000000000000
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777926] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777927] CR2: 00007f7ef02cf8fc CR3: 00000039f9fd0006 CR4: 00000000007606e0
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777928] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777929] PKRU: 55555554
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777930] Call Trace:
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777936] ? ktime_get+0x3e/0xa0
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777941] aperfmperf_snapshot_cpu+0x42/0x50
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777943] arch_freq_prepare_all+0x67/0xa0
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777948] cpuinfo_open+0x13/0x30
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777952] proc_reg_open+0x77/0x130
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777954] ? proc_put_link+0x10/0x10
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777958] do_dentry_open+0x143/0x3a0
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777960] vfs_open+0x2d/0x30
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777964] do_last+0x194/0x900
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777966] path_openat+0x8d/0x290
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777968] do_filp_open+0x91/0x100
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777971] ? __alloc_fd+0x46/0x150
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777974] do_sys_open+0x17e/0x290
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777976] __x64_sys_openat+0x20/0x30
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777980] do_syscall_64+0x57/0x190
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777983] entry_SYSCALL_64_after_hwframe+0x5c/0xc1
Mar 4 10:11:12 lux1-training-prod-001 kernel: [257270.777985] RIP: 0033:0x7f7ef022ff5b
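
In case it matters, the process named in the lockup is nvidia-smi, so one simple thing to watch is whether nvidia-smi invocations pile up without ever finishing when the machine starts to hang. A generic check (nothing NVIDIA-specific) would be something like:

# Show any nvidia-smi processes together with their state and how long they have been running
ps -eo pid,stat,wchan:32,etime,comm | awk 'NR==1 || /nvidia-smi/'

# The kernel also logs the affected CPU when a soft lockup is detected
dmesg | grep -i "soft lockup"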

We are using the NVIDIA 550.90.07 open driver on this server, and the GPU model is NVIDIA GeForce RTX 2080 Ti.

Since this issue does not happen on other servers with a similar configuration, I suspect the CPU soft lockups could be caused by the failed GPU and the driver's handling of it.

For this specific question, I would like to know whether there is a way to disable a specific GPU, so the nvidia kernel driver stops complaining about this PCIe device all the time and we can schedule a GPU repair for the server later.
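
For reference, the two approaches I am considering so far (corrections welcome; this is only a sketch and the module parameter is an assumption on my side, so please check the driver README for the exact syntax): removing the bad device from the PCI bus via sysfs so the driver never probes it, or restricting which GPUs the nvidia module binds to with the NVreg_AssignGpus module parameter. Assuming the bad GPU stays at 0000:3d:00.0:

# Option 1: detach the device from the kernel until the next PCI rescan or reboot,
# so the nvidia driver no longer sees it at all
echo 1 | sudo tee /sys/bus/pci/devices/0000:3d:00.0/remove

# Option 2: persistently bind the nvidia module only to the healthy GPUs
# (the bus IDs below are placeholders; list the seven good ones from lspci)
echo 'options nvidia NVreg_AssignGpus="0000:1a:00.0,0000:1b:00.0,..."' | sudo tee /etc/modprobe.d/disable-bad-gpu.conf
sudo update-initramfs -u

Would either of these be the recommended way, or is there a better supported option?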

