"BUG: unable to handle kernel paging request at 0000000000002b20"

Hello,

We are seeing random crashes/reboots on our GPU-enabled RHEL 7 servers. I believe it's related to the NVIDIA driver, but I'm not 100% sure. Wondering if anyone else has experienced similar failures.

Thank you!

3.10.0-1160.21.1.el7.x86_64

nvidia-driver-latest-460.32.03-1.el7.x86_64

GeForce RTX 2080 Ti

        CPUS: 48
        DATE: Thu Apr  8 04:56:42 2021
      UPTIME: 4 days, 16:06:26
LOAD AVERAGE: 4.26, 4.50, 4.64
       TASKS: 856
    NODENAME: node916.blah
     RELEASE: 3.10.0-1160.21.1.el7.x86_64
     VERSION: #1 SMP Tue Mar 16 13:23:19 EDT 2021
     MACHINE: x86_64  (2200 MHz)
      MEMORY: 382.6 GB
       PANIC: "BUG: unable to handle kernel paging request at 0000000000002b20"
         PID: 165239
     COMMAND: "python"
        TASK: ffff9318f43ad280  [THREAD_INFO: ffff93492fb90000]
         CPU: 38
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 165239 TASK: ffff9318f43ad280 CPU: 38 COMMAND: "python"
#0 [ffff9376fd7839f0] machine_kexec at ffffffffae0662c4
#1 [ffff9376fd783a50] kimage_load_segment at ffffffffae122732
#2 [ffff9376fd783b20] __crash_kexec at ffffffffae122820
#3 [ffff9376fd783b38] oops_end at ffffffffae78d798
#4 [ffff9376fd783b60] no_context at ffffffffae075d14
#5 [ffff9376fd783bb0] __bad_area_nosemaphore at ffffffffae075fe2
#6 [ffff9376fd783c00] bad_area_nosemaphore at ffffffffae076104
#7 [ffff9376fd783c10] __do_page_fault at ffffffffae790750
#8 [ffff9376fd783c80] do_page_fault at ffffffffae790975
#9 [ffff9376fd783cb0] page_fault at ffffffffae78c778
[exception RIP: _nv036002rm+4]
RIP: ffffffffc7153664 RSP: ffff9376fd783d68 RFLAGS: 00010092
RAX: ffff9347f23e6b28 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000002b20
RBP: ffff93386b13af00 R8: 0000000000000000 R9: 0000000000000020
R10: ffff9323b0978008 R11: ffff9323b0979098 R12: ffff9347f23e6b28
R13: 0000000000000000 R14: 00000000e9dfff5e R15: 0000000000000080
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffff9376fd783d68] os_get_current_tick at ffffffffc6cf415c [nvidia]
#11 [ffff9376fd783da0] _nv009219rm at ffffffffc6d25761 [nvidia]
#12 [ffff9376fd783dd0] _nv036101rm at ffffffffc6d2657c [nvidia]
#13 [ffff9376fd783df0] _nv032953rm at ffffffffc6d6f883 [nvidia]
#14 [ffff9376fd783e20] rm_run_rc_callback at ffffffffc75af4e6 [nvidia]
#15 [ffff9376fd783e40] nvidia_rc_timer_callback at ffffffffc6ce4fdc [nvidia]
#16 [ffff9376fd783e58] nv_timer_callback_typed_data at ffffffffc6ce447d [nvidia]
#17 [ffff9376fd783e68] call_timer_fn at ffffffffae0abcf8
#18 [ffff9376fd783ea0] run_timer_softirq at ffffffffae0ae30d
#19 [ffff9376fd783f18] __do_softirq at ffffffffae0a4b35
#20 [ffff9376fd783f88] call_softirq at ffffffffae7994ec
#21 [ffff9376fd783fa0] do_softirq at ffffffffae02f715
#22 [ffff9376fd783fd8] smp_apic_timer_interrupt at ffffffffae79aa88
#23 [ffff9376fd783ff0] apic_timer_interrupt at ffffffffae796fba
--- <IRQ stack> ---
#24 [ffff93492fb93e88] apic_timer_interrupt at ffffffffae796fba
[exception RIP: __audit_free+450]
RIP: ffffffffae13e5f2 RSP: ffff93492fb93f38 RFLAGS: 00000282
RAX: 00000000c000003e RBX: ffff93492fb94000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 00000000000000e4
RBP: ffff93492fb93f48 R8: ffffffff00000000 R9: ffff9318f43ad280
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000745d30
R13: 0000000000000293 R14: 000055a2e9e94588 R15: ffff93492fb93f48
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#25 [ffff93492fb93f50] auditsys at ffffffffae79612d
RIP: 00007ffd0ec626c2 RSP: 00007ffd0ec559d0 RFLAGS: 00000202
RAX: 00000000000000e4 RBX: 00007ffd0ec55bf0 RCX: 0000000000000004
RDX: 0000000000000000 RSI: 00007ffd0ec55ba0 RDI: 0000000000000004
RBP: 00007ffd0ec55b80 R8: 000055a2e9e94588 R9: 0000000100000000
R10: ffffffff00000000 R11: 0000000000000293 R12: 000055a2e3da7c30
R13: 0000000000000000 R14: 0000000000000001 R15: 00007ffd0ec55bf0
ORIG_RAX: 00000000000000e4 CS: 0033 SS: 002b
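
For reference, the output above comes from opening the kdump vmcore with the crash utility against the matching kernel debuginfo. A rough session (paths are examples from our setup, adjust to your own dump and debuginfo locations) looks like:

    crash /usr/lib/debug/lib/modules/3.10.0-1160.21.1.el7.x86_64/vmlinux \
          /var/crash/<dump dir>/vmcore
    crash> sys                       # panic summary shown at the top
    crash> bt                        # backtrace of the panicking task (above)
    crash> log                       # kernel ring buffer leading up to the oops
    crash> mod -s nvidia             # load nvidia module symbols, if the .ko is available
    crash> dis ffffffffc7153664 1    # disassemble the faulting instruction (_nv036002rm+4)

The fault address 0000000000002b20 matches RDI at the exception RIP, so to us it looks like a small-offset dereference of a bad/NULL structure pointer inside the driver's RC timer callback, but the _nv* symbols are opaque so we can't take the analysis any further on our side.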

Thanks for reporting this. It’s being tracked in internal bug number 3279571. The bug tracker is not public but you can refer to this number in future correspondence.

Hi @aplattner, I have the same issue as described above on multiple CentOS 7 systems running the versions below:

NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3
kernel 3.10.0-1160.6.1.el7.x86_64

Would it be possible to get an update on bug 3279571? Is there a workaround, or is the fix in another release?

Thanks

I am facing the same problem on EulerOS V2R7 with driver version 460.80. @aplattner, is this fixed in another release?

Hi @aplattner, we see this also and have for some time, on at least 2080 Ti, RTX 6000, RTX 8000 and A6000 GPUs. Various driver versions, most recently 460.73.01. CentOS kernel 3.10.0-1160.31.1.el7.x86_64 #1. Seen on different Supermicro systems.

Thanks,
Steve Nadas

Hi @aplattner, we are facing the same issue with driver ver. 460.73.01 or 460.32.03.

  1. RHEL 7.6 with Kernel v. 3.10.0-1160.31.1.el7.x86_64
  2. Driver ver. 460.73.01 or 460.32.03
  3. HPE XL270d
  4. Tesla V100 (8x GPU Cards)

We can't put the machines into production if we don't have a fix. Delaying projects is no fun. Thanks.

Solution for me: I upgraded the system to RHEL 7.9 with the 7.9 kernel (3.10.0-1160.31.1.el7.x86_64), and NVIDIA driver version 460.32.03 now runs stable, 5 days so far without a crash. Interestingly, I got no help from NVIDIA (enterprise) because I don't run a GRID version, and none from HEP because this configuration is not supported.
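
For reference, roughly what the upgrade looked like on my side (commands are illustrative only; exact package names depend on how the driver was originally installed):

    yum clean all && yum update                # bring the box up to 7.9 / 3.10.0-1160.31.1.el7
    reboot
    yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
    yum reinstall nvidia-driver-latest-460.32.03   # or re-run the .run installer against the new kernel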

Update: Problem not resolved… crashed after 5 days…