Kernel crash on hibernate using Nvidia driver v470.94, v470.82 - Tesla T4 GPU

Hi,

I am testing hibernation on a Tesla T4 GPU in an AWS instance running CentOS 7 with Linux kernel 5.15.

I installed NVIDIA driver versions 470.94 and 470.82.01 (separately), followed the instructions in “Chapter 21. Configuring Power Management Support”, and set the following options:
options nvidia NVreg_PreserveVideoMemoryAllocations=1 NVreg_TemporaryFilePath=/tmp-nvidia
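
As a quick sanity check on the options quoted above, here is a minimal Python sketch that parses a modprobe-style "options" line and reports which NVreg parameters it sets. The function name parse_nvidia_options and the sample string are assumptions for illustration only; on a real system the line would come from a file under /etc/modprobe.d/.

```python
import shlex

REQUIRED = ("NVreg_PreserveVideoMemoryAllocations", "NVreg_TemporaryFilePath")

def parse_nvidia_options(line):
    """Parse one modprobe 'options nvidia key=value ...' line into a dict.

    Returns an empty dict if the line does not configure the nvidia module.
    (Hypothetical helper, not part of any NVIDIA tool.)
    """
    tokens = shlex.split(line, comments=True)
    if len(tokens) < 2 or tokens[0] != "options" or tokens[1] != "nvidia":
        return {}
    params = {}
    for token in tokens[2:]:
        key, sep, value = token.partition("=")
        if sep:  # keep only key=value tokens
            params[key] = value
    return params

# Sample line mirroring the settings in this post (illustrative input).
sample = ("options nvidia NVreg_PreserveVideoMemoryAllocations=1 "
          "NVreg_TemporaryFilePath=/tmp-nvidia")
params = parse_nvidia_options(sample)
missing = [name for name in REQUIRED if name not in params]
print(params)
print("missing:", missing)
```

If "missing" is non-empty, one of the two settings did not make it into the options line.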

Running the “systemctl hibernate” command produces the following kernel crash (with both driver versions):

=============================================================================================
[root@localhost ~]# systemctl hibernate
[ 256.752371] PM: hibernation: hibernation entry
[ 262.216283] Filesystems sync: 5.464 seconds
[ 262.216297] Freezing user space processes … (elapsed 0.001 seconds) done.
[ 262.217490] OOM killer disabled.
[ 262.218599] PM: hibernation: Preallocating image memory
[ 263.118489] PM: hibernation: Allocated 65323 pages for snapshot
[ 263.118506] PM: hibernation: Allocated 261292 kbytes in 0.89 seconds (293.58 MB/s)
[ 263.118513] Freezing remaining freezable tasks … (elapsed 0.000 seconds) done.
[ 263.120883] Disabling non-boot CPUs …
[ 263.122303] smpboot: CPU 1 is now offline
[ 263.123950] smpboot: CPU 2 is now offline
[ 263.227221] smpboot: CPU 3 is now offline
[ 263.227446] PM: hibernation: Creating image:
[ 263.287031] PM: hibernation: Need to copy 64977 pages
[ 263.400985] PM: hibernation: Image created (64977 pages copied)
[ 263.401018] BUG: unable to handle page fault for address: ffff83081ee41005
[ 263.401025] #PF: supervisor read access in kernel mode
[ 263.401028] #PF: error_code(0x0001) - permissions violation
[ 263.401032] PGD 8000000000000062 P4D 8000000000000062 PUD c2c2c2c2c2c2c2c2
[ 263.401038] Oops: 0001 [#1] SMP NOPTI
[ 263.401042] CPU: 0 PID: 8087 Comm: systemd-sleep Tainted: P O 5.15.0-xspot #9
[ 263.401047] RIP: e030:xen_convert_trap_info+0x5b/0x90
[ 263.401056] Code: 72 ff 45 31 db 45 31 e4 eb 15 45 84 ed 74 04 41 83 c4 01 49 8d 43 01 4d 39 f3 74 31 49 89 c3 4c 89 de 48 c1 e6 04 48 03 75 02 <0f> b6 46 05 83 e0 1e 3c 0e 75 d5 44 89 e2 44 89 df 48 c1 e2 04 48
[ 263.401063] RSP: e02b:ffffc90040aafd10 EFLAGS: 00010086
[ 263.401068] RAX: 0000000000001000 RBX: ffffffff82ad7360 RCX: 0000000000000000
[ 263.401073] RDX: 0000000000000100 RSI: ffff83081ee41000 RDI: ffffffff82b9bdb7
[ 263.401078] RBP: ffffffff82b9bdb7 R08: aaaaaaaaaaaaaaaa R09: ffff88813fe16310
[ 263.401083] R10: 0000000000007ff0 R11: 0000000000000000 R12: 0000000000000000
[ 263.401087] R13: 0000000000000000 R14: 00000000000000ff R15: ffff8881032de6e0
[ 263.401099] FS: 00007f02c9c86780(0000) GS:ffff88813fe00000(0000) knlGS:0000000000000000
[ 263.401105] CS: 10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 263.401109] CR2: ffff83081ee41005 CR3: 0000000101114000 CR4: 0000000000010660
[ 263.401115] Call Trace:
[ 263.401119] xen_load_idt+0x3f/0x80
[ 263.401125] restore_processor_state+0x7f/0x2d0
[ 263.401132] ? swsusp_save+0x30c/0x31a
[ 263.401139] hibernation_snapshot+0x13e/0x310
[ 263.401145] hibernate.cold+0x8b/0x205
[ 263.401150] state_store+0xc6/0xd0
[ 263.401157] kernfs_fop_write_iter+0x10c/0x190
[ 263.401168] new_sync_write+0x11d/0x1b0
[ 263.401175] vfs_write+0x154/0x240
[ 263.401180] ksys_write+0x5a/0xd0
[ 263.401185] do_syscall_64+0x43/0xc0
[ 263.401193] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 263.401200] RIP: 0033:0x7f02c9363ba0
[ 263.401207] Code: Unable to access opcode bytes at RIP 0x7f02c9363b76.
[ 263.401211] RSP: 002b:00007ffe2d7f0ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 263.401221] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f02c9363ba0
[ 263.401225] RDX: 0000000000000005 RSI: 00007f02c9c93000 RDI: 0000000000000004
[ 263.401230] RBP: 00007f02c9c93000 R08: 00007f02c9c86780 R09: 00007f02c9c86780
[ 263.401235] R10: 0000000000000022 R11: 0000000000000246 R12: 0000564a68dac530
[ 263.401240] R13: 0000000000000005 R14: 0000564a68dac530 R15: 00007ffe2d7f1030
[ 263.401248] Modules linked in: nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) [last unloaded: vfio]
[ 263.401259] CR2: ffff83081ee41005
[ 263.401263] ---[ end trace 44a56e39a9ab76c9 ]---
[ 263.401266] RIP: e030:xen_convert_trap_info+0x5b/0x90
[ 263.401274] Code: 72 ff 45 31 db 45 31 e4 eb 15 45 84 ed 74 04 41 83 c4 01 49 8d 43 01 4d 39 f3 74 31 49 89 c3 4c 89 de 48 c1 e6 04 48 03 75 02 <0f> b6 46 05 83 e0 1e 3c 0e 75 d5 44 89 e2 44 89 df 48 c1 e2 04 48
[ 263.401284] RSP: e02b:ffffc90040aafd10 EFLAGS: 00010086
[ 263.401291] RAX: 0000000000001000 RBX: ffffffff82ad7360 RCX: 0000000000000000
[ 263.401295] RDX: 0000000000000100 RSI: ffff83081ee41000 RDI: ffffffff82b9bdb7
[ 263.401300] RBP: ffffffff82b9bdb7 R08: aaaaaaaaaaaaaaaa R09: ffff88813fe16310
[ 263.401308] R10: 0000000000007ff0 R11: 0000000000000000 R12: 0000000000000000
[ 263.401313] R13: 0000000000000000 R14: 00000000000000ff R15: ffff8881032de6e0
[ 263.401321] FS: 00007f02c9c86780(0000) GS:ffff88813fe00000(0000) knlGS:0000000000000000
[ 263.401325] CS: 10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 263.401329] CR2: 00007f02c9363b76 CR3: 0000000101114000 CR4: 0000000000010660
[ 263.401556] ------------[ cut here ]------------
[ 263.401565] WARNING: CPU: 0 PID: 0 at kernel/time/timekeeping.c:824 ktime_get+0x86/0x90
[ 263.401574] Modules linked in: nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) [last unloaded: vfio]
[ 263.401580] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P D O 5.15.0-xspot #9
[ 263.401587] RIP: e030:ktime_get+0x86/0x90
[ 263.401591] Code: d0 40 a4 01 48 0f 45 c5 8b 35 a6 40 a4 01 41 39 f4 75 a1 48 0f af d0 48 8d 04 3a 48 d3 e8 48 01 d8 5b 5d 41 5c c3 f3 90 eb 8a <0f> 0b eb 84 66 0f 1f 44 00 00 41 54 55 31 ed 53 44 8b 25 73 40 a4
[ 263.401601] RSP: e02b:ffffffff82403de0 EFLAGS: 00010002
[ 263.401604] RAX: 0000000000000001 RBX: ffff88813fe1b160 RCX: 0000000000000100
[ 263.401607] RDX: ffffffffffffff01 RSI: 0000000000000000 RDI: ffffffff82af2980
[ 263.401611] RBP: 00000000000000e4 R08: 00000000000000f0 R09: 0000000000000000
[ 263.401614] R10: 0000000000000000 R11: 0000000000000003 R12: ffffffff82415580
[ 263.401617] R13: 0000000000000000 R14: ffffffff82415118 R15: 0000000000000000
[ 263.401624] FS: 0000000000000000(0000) GS:ffff88813fe00000(0000) knlGS:0000000000000000
[ 263.401630] CS: 10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 263.401635] CR2: 00007f02c9363b76 CR3: 000000000240c000 CR4: 0000000000010660
[ 263.401640] Call Trace:
[ 263.401644] tick_nohz_idle_enter+0x2a/0x50
[ 263.401649] do_idle+0x41/0x240
[ 263.401657] cpu_startup_entry+0x14/0x20
[ 263.401663] start_kernel+0x5ff/0x624
[ 263.401670] xen_start_kernel+0x4ee/0x4f7
[ 263.401676] startup_xen+0x3e/0x3e
[ 263.401682] ---[ end trace 44a56e39a9ab76ca ]---

==============================================================================================

The same crash is seen even with the kernel hibernate call approach.
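
For context, here is a small read-only Python sketch of the kernel-side interface. It assumes the "kernel hibernate call" refers to the standard sysfs files /sys/power/state and /sys/power/disk (the state_store frame in the trace is consistent with a write to /sys/power/state, but that is an inference). The helper names are hypothetical, and nothing here triggers hibernation.

```python
from pathlib import Path

def supported_sleep_states(state_path="/sys/power/state"):
    """Return the sleep states the kernel advertises, e.g. ['freeze', 'mem', 'disk']."""
    return Path(state_path).read_text().split()

def current_hibernate_mode(disk_path="/sys/power/disk"):
    """Return the hibernation mode shown in [brackets] in /sys/power/disk, or None."""
    for mode in Path(disk_path).read_text().split():
        if mode.startswith("[") and mode.endswith("]"):
            return mode[1:-1]
    return None

if __name__ == "__main__":
    try:
        states = supported_sleep_states()
        print("states:", states, "| hibernate available:", "disk" in states)
        print("hibernate mode:", current_hibernate_mode())
    except OSError as exc:
        # The files are absent on kernels built without sleep support.
        print("cannot read sysfs power files:", exc)
```

Hibernation itself is requested by writing "disk" to /sys/power/state as root, which this sketch deliberately does not do.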

I have attached nvidia-bug-report.log.gz (1.2 MB) for reference.
Please advise on the situation and possible resolution.

Thanks

Judging by the backtrace, that looks more like a Xen issue.
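
To make that reading of the backtrace concrete, here is a rough Python sketch that pulls the faulting symbol and the reliable call-trace frames out of an Oops and lists those with an xen_ prefix. The helper summarize_oops and its regular expressions are assumptions written for this log layout, not a general dmesg parser; the sample is an excerpt of the log posted above.

```python
import re

# Kernel "symbol+offset/size" notation, e.g. "xen_load_idt+0x3f/0x80".
# A leading "? " marks a frame the unwinder considers unreliable.
FRAME_RE = re.compile(r"(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s*")

def summarize_oops(log_text):
    """Return (faulting_symbol, reliable_frames) for the first Oops in log_text."""
    rip = None
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        body = TIMESTAMP_RE.sub("", line).strip()
        if rip is None and body.startswith("RIP:"):
            match = FRAME_RE.search(body)
            if match:
                rip = match.group(2)
        elif body.startswith("Call Trace:"):
            in_trace = True
        elif in_trace:
            match = FRAME_RE.fullmatch(body)
            if match is None:
                break  # first non-frame line ends the trace
            if not match.group(1):
                frames.append(match.group(2))
    return rip, frames

SAMPLE = """\
[ 263.401042] CPU: 0 PID: 8087 Comm: systemd-sleep Tainted: P O 5.15.0-xspot #9
[ 263.401047] RIP: e030:xen_convert_trap_info+0x5b/0x90
[ 263.401115] Call Trace:
[ 263.401119] xen_load_idt+0x3f/0x80
[ 263.401125] restore_processor_state+0x7f/0x2d0
[ 263.401132] ? swsusp_save+0x30c/0x31a
[ 263.401139] hibernation_snapshot+0x13e/0x310
[ 263.401145] hibernate.cold+0x8b/0x205
[ 263.401200] RIP: 0033:0x7f02c9363ba0
"""

rip, frames = summarize_oops(SAMPLE)
xen_frames = [sym for sym in [rip] + frames if sym and sym.startswith("xen_")]
print("faulting symbol:", rip)
print("frames:", frames)
print("xen frames:", xen_frames)
```

Both the faulting instruction and its caller sit in Xen paravirtualization code reached from restore_processor_state, with no NVIDIA module symbols in the trace.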

Thanks for your input. I will check further on the Xen side.

Along similar lines, I have a question: is runtime PM (runtime_suspend(), runtime_resume()) supported by the NVIDIA driver?
Using the configuration described earlier, i.e.
NVreg_PreserveVideoMemoryAllocations=1
NVreg_TemporaryFilePath=/tmp-nvidia

I tried “nvidia-sleep.sh hibernate”, hoping that the NVIDIA device-specific hibernation would be triggered and the corresponding image file would be created in /tmp-nvidia, but there was no activity and no image file was created. I do see nvidia-hibernate.service enabled, as shown below:


● nvidia-hibernate.service - NVIDIA system hibernate actions
Loaded: loaded (/usr/lib/systemd/system/nvidia-hibernate.service; enabled; vendor preset: disabled)
Active: inactive (dead)
[root@localhost ~]# systemctl status nvidia-resume.service
● nvidia-resume.service - NVIDIA system resume actions
Loaded: loaded (/usr/lib/systemd/system/nvidia-resume.service; enabled; vendor preset: disabled)
Active: inactive (dead)


Is this a valid procedure to try? Please clarify.

Those systemd units are not meant to be called manually. They only save and restore the video memory contents on suspend/hibernate/resume and do not perform any suspend actions beyond that. They are only called as part of systemd’s suspend procedure.
Runtime PM is also supported by the driver, but only in conjunction with the X driver. It is usually not useful or wanted on compute servers, which run headless with nvidia-persistenced enabled. https://download.nvidia.com/XFree86/Linux-x86_64/495.44/README/dynamicpowermanagement.html
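
To see what the kernel currently reports for a device's runtime PM, here is a minimal Python sketch that reads the generic sysfs attributes power/control and power/runtime_status. The helper name read_runtime_pm is hypothetical, and the PCI address in the example is a placeholder; the real address can be taken from lspci.

```python
from pathlib import Path

def read_runtime_pm(device_dir):
    """Read runtime PM attributes from a device's sysfs directory.

    Returns a dict with 'control' ('auto' lets the kernel runtime-suspend the
    device, 'on' keeps it active) and 'runtime_status'. A value is None if the
    attribute cannot be read.
    """
    power = Path(device_dir) / "power"
    info = {}
    for name in ("control", "runtime_status"):
        try:
            info[name] = (power / name).read_text().strip()
        except OSError:
            info[name] = None
    return info

if __name__ == "__main__":
    # Placeholder PCI address (assumption); substitute the GPU's own address.
    print(read_runtime_pm("/sys/bus/pci/devices/0000:00:1e.0"))
```

A control value of "on" means the device is never runtime-suspended, regardless of driver support.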
