"BUG: unable to handle kernel paging request at 0000000000002b20"

Hello,

We have random crashes/reboots on our GPU-enabled RHEL 7 servers. I believe it is related to the NVIDIA driver, but I'm not 100% sure. I'm wondering if anyone else has experienced similar failures.

Thank you!

Kernel: 3.10.0-1160.21.1.el7.x86_64

Driver: nvidia-driver-latest-460.32.03-1.el7.x86_64

GPU: GeForce RTX 2080 Ti

        CPUS: 48
        DATE: Thu Apr  8 04:56:42 2021
      UPTIME: 4 days, 16:06:26
LOAD AVERAGE: 4.26, 4.50, 4.64
       TASKS: 856
    NODENAME: node916.blah
     RELEASE: 3.10.0-1160.21.1.el7.x86_64
     VERSION: #1 SMP Tue Mar 16 13:23:19 EDT 2021
     MACHINE: x86_64 (2200 Mhz)
      MEMORY: 382.6 GB
       PANIC: "BUG: unable to handle kernel paging request at 0000000000002b20"
         PID: 165239
     COMMAND: "python"
        TASK: ffff9318f43ad280 [THREAD_INFO: ffff93492fb90000]
         CPU: 38
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 165239 TASK: ffff9318f43ad280 CPU: 38 COMMAND: "python"
#0 [ffff9376fd7839f0] machine_kexec at ffffffffae0662c4
#1 [ffff9376fd783a50] kimage_load_segment at ffffffffae122732
#2 [ffff9376fd783b20] __crash_kexec at ffffffffae122820
#3 [ffff9376fd783b38] oops_end at ffffffffae78d798
#4 [ffff9376fd783b60] no_context at ffffffffae075d14
#5 [ffff9376fd783bb0] __bad_area_nosemaphore at ffffffffae075fe2
#6 [ffff9376fd783c00] bad_area_nosemaphore at ffffffffae076104
#7 [ffff9376fd783c10] __do_page_fault at ffffffffae790750
#8 [ffff9376fd783c80] do_page_fault at ffffffffae790975
#9 [ffff9376fd783cb0] page_fault at ffffffffae78c778
[exception RIP: _nv036002rm+4]
RIP: ffffffffc7153664 RSP: ffff9376fd783d68 RFLAGS: 00010092
RAX: ffff9347f23e6b28 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000002b20
RBP: ffff93386b13af00 R8: 0000000000000000 R9: 0000000000000020
R10: ffff9323b0978008 R11: ffff9323b0979098 R12: ffff9347f23e6b28
R13: 0000000000000000 R14: 00000000e9dfff5e R15: 0000000000000080
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffff9376fd783d68] os_get_current_tick at ffffffffc6cf415c [nvidia]
#11 [ffff9376fd783da0] _nv009219rm at ffffffffc6d25761 [nvidia]
#12 [ffff9376fd783dd0] _nv036101rm at ffffffffc6d2657c [nvidia]
#13 [ffff9376fd783df0] _nv032953rm at ffffffffc6d6f883 [nvidia]
#14 [ffff9376fd783e20] rm_run_rc_callback at ffffffffc75af4e6 [nvidia]
#15 [ffff9376fd783e40] nvidia_rc_timer_callback at ffffffffc6ce4fdc [nvidia]
#16 [ffff9376fd783e58] nv_timer_callback_typed_data at ffffffffc6ce447d [nvidia]
#17 [ffff9376fd783e68] call_timer_fn at ffffffffae0abcf8
#18 [ffff9376fd783ea0] run_timer_softirq at ffffffffae0ae30d
#19 [ffff9376fd783f18] __do_softirq at ffffffffae0a4b35
#20 [ffff9376fd783f88] call_softirq at ffffffffae7994ec
#21 [ffff9376fd783fa0] do_softirq at ffffffffae02f715
#22 [ffff9376fd783fd8] smp_apic_timer_interrupt at ffffffffae79aa88
#23 [ffff9376fd783ff0] apic_timer_interrupt at ffffffffae796fba
--- <IRQ stack> ---
#24 [ffff93492fb93e88] apic_timer_interrupt at ffffffffae796fba
[exception RIP: __audit_free+450]
RIP: ffffffffae13e5f2 RSP: ffff93492fb93f38 RFLAGS: 00000282
RAX: 00000000c000003e RBX: ffff93492fb94000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 00000000000000e4
RBP: ffff93492fb93f48 R8: ffffffff00000000 R9: ffff9318f43ad280
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000745d30
R13: 0000000000000293 R14: 000055a2e9e94588 R15: ffff93492fb93f48
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#25 [ffff93492fb93f50] auditsys at ffffffffae79612d
RIP: 00007ffd0ec626c2 RSP: 00007ffd0ec559d0 RFLAGS: 00000202
RAX: 00000000000000e4 RBX: 00007ffd0ec55bf0 RCX: 0000000000000004
RDX: 0000000000000000 RSI: 00007ffd0ec55ba0 RDI: 0000000000000004
RBP: 00007ffd0ec55b80 R8: 000055a2e9e94588 R9: 0000000100000000
R10: ffffffff00000000 R11: 0000000000000293 R12: 000055a2e3da7c30
R13: 0000000000000000 R14: 0000000000000001 R15: 00007ffd0ec55bf0
ORIG_RAX: 00000000000000e4 CS: 0033 SS: 002b
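
(For context: the output above comes from the crash utility run against the captured vmcore. A typical invocation on RHEL 7, assuming the matching kernel-debuginfo package is installed and kdump wrote the dump under /var/crash, looks like the following; the dump directory name is a placeholder.)

    crash /usr/lib/debug/lib/modules/3.10.0-1160.21.1.el7.x86_64/vmlinux /var/crash/<dump-dir>/vmcore
    crash> bt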

Thanks for reporting this. It’s being tracked in internal bug number 3279571. The bug tracker is not public but you can refer to this number in future correspondence.

Hi @aplattner, I have the same issue as described above on multiple CentOS 7 systems running the versions below:

NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3
kernel 3.10.0-1160.6.1.el7.x86_64

Would it be possible to have an update on bug 3279571? Is there any workaround, or is the fix in another release?

Thanks

I face the same problem on EulerOS V2R7 with Driver Version 460.80. @aplattner, is it fixed in another release?

Hi @aplattner, we see this also and have for some time, on at least RTX 2080 Ti, RTX 6000, RTX 8000, and A6000 GPUs. Various driver versions, most recently 460.73.01. CentOS kernel 3.10.0-1160.31.1.el7.x86_64 #1. Seen on different Supermicro systems.

Thanks,
Steve Nadas

Hi @aplattner, we are facing the same issue with driver ver. 460.73.01 or 460.32.03.

  1. RHEL 7.6 with Kernel v. 3.10.0-1160.31.1.el7.x86_64
  2. Driver ver. 460.73.01 or 460.32.03
  3. HPE XL270d
  4. Tesla V100 (8x GPU Cards)

We can't take the machines into production if we don't have a fix. Delaying projects is no fun. Thanks.

Solution for me: I upgraded the system to RHEL 7.9 with the 7.9 kernel (3.10.0-1160.31.1.el7.x86_64), and NVIDIA driver version 460.32.03 runs stable, 5 days now without a crash. It is interesting that I didn't get any help from NVIDIA (enterprise), because I don't run a GRID version, nor from HPE, because this configuration is not supported.

Update: problem not resolved; it crashed again after 5 days.

Any update on bug 3279571?

That bug should have been fixed in 470.94. Are you experiencing a crash? If so, please generate and attach a new bug report log.
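
(For reference, assuming a standard driver installation: the bug report log is generated by the nvidia-bug-report.sh script that ships with the driver; run it as root shortly after the crash and attach the resulting nvidia-bug-report.log.gz.)

    sudo nvidia-bug-report.sh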

We resolved our problem by enabling persistence mode (see the command sketch after this reply). It may seem trivial, but we had absolutely no crashes with persistence mode enabled (whether set manually or via the persistence daemon), and as soon as it was disabled, crashes showed up again. We re-enabled it and have had no crashes for months now.

EDIT: Just to clarify, we saw crashes for months on several systems, but once we enabled pm, the crashes were gone on all systems. We disabled pm just to confirm our findings.
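
(For reference, a minimal sketch of the workaround described above, assuming nvidia-smi and the nvidia-persistenced service from a standard driver installation; a setting made with nvidia-smi does not survive a reboot, so the persistence daemon is the usual way to keep it enabled.)

    # enable persistence mode on all GPUs (root required; resets at reboot)
    sudo nvidia-smi -pm 1

    # check the current setting
    nvidia-smi -q | grep -i "persistence mode"

    # if the driver package provides the nvidia-persistenced service,
    # keep the setting applied across reboots
    sudo systemctl enable nvidia-persistenced
    sudo systemctl start nvidia-persistenced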