"BUG: unable to handle kernel paging request at 0000000000002b20"

Hello,

We have random crashes/reboots on our GPU-enabled RHEL 7 servers. I believe it is related to the NVIDIA driver, but I'm not 100% sure. I'm wondering if anyone else has experienced similar failures.

Thank you!

Kernel: 3.10.0-1160.21.1.el7.x86_64

Driver: nvidia-driver-latest-460.32.03-1.el7.x86_64

GPU: GeForce RTX 2080 Ti

        CPUS: 48
        DATE: Thu Apr  8 04:56:42 2021
      UPTIME: 4 days, 16:06:26
LOAD AVERAGE: 4.26, 4.50, 4.64
       TASKS: 856
    NODENAME: node916.blah
     RELEASE: 3.10.0-1160.21.1.el7.x86_64
     VERSION: #1 SMP Tue Mar 16 13:23:19 EDT 2021
     MACHINE: x86_64 (2200 Mhz)
      MEMORY: 382.6 GB
       PANIC: "BUG: unable to handle kernel paging request at 0000000000002b20"
         PID: 165239
     COMMAND: "python"
        TASK: ffff9318f43ad280 [THREAD_INFO: ffff93492fb90000]
         CPU: 38
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 165239 TASK: ffff9318f43ad280 CPU: 38 COMMAND: "python"
#0 [ffff9376fd7839f0] machine_kexec at ffffffffae0662c4
#1 [ffff9376fd783a50] kimage_load_segment at ffffffffae122732
#2 [ffff9376fd783b20] __crash_kexec at ffffffffae122820
#3 [ffff9376fd783b38] oops_end at ffffffffae78d798
#4 [ffff9376fd783b60] no_context at ffffffffae075d14
#5 [ffff9376fd783bb0] __bad_area_nosemaphore at ffffffffae075fe2
#6 [ffff9376fd783c00] bad_area_nosemaphore at ffffffffae076104
#7 [ffff9376fd783c10] __do_page_fault at ffffffffae790750
#8 [ffff9376fd783c80] do_page_fault at ffffffffae790975
#9 [ffff9376fd783cb0] page_fault at ffffffffae78c778
[exception RIP: _nv036002rm+4]
RIP: ffffffffc7153664 RSP: ffff9376fd783d68 RFLAGS: 00010092
RAX: ffff9347f23e6b28 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000002b20
RBP: ffff93386b13af00 R8: 0000000000000000 R9: 0000000000000020
R10: ffff9323b0978008 R11: ffff9323b0979098 R12: ffff9347f23e6b28
R13: 0000000000000000 R14: 00000000e9dfff5e R15: 0000000000000080
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffff9376fd783d68] os_get_current_tick at ffffffffc6cf415c [nvidia]
#11 [ffff9376fd783da0] _nv009219rm at ffffffffc6d25761 [nvidia]
#12 [ffff9376fd783dd0] _nv036101rm at ffffffffc6d2657c [nvidia]
#13 [ffff9376fd783df0] _nv032953rm at ffffffffc6d6f883 [nvidia]
#14 [ffff9376fd783e20] rm_run_rc_callback at ffffffffc75af4e6 [nvidia]
#15 [ffff9376fd783e40] nvidia_rc_timer_callback at ffffffffc6ce4fdc [nvidia]
#16 [ffff9376fd783e58] nv_timer_callback_typed_data at ffffffffc6ce447d [nvidia]
#17 [ffff9376fd783e68] call_timer_fn at ffffffffae0abcf8
#18 [ffff9376fd783ea0] run_timer_softirq at ffffffffae0ae30d
#19 [ffff9376fd783f18] __do_softirq at ffffffffae0a4b35
#20 [ffff9376fd783f88] call_softirq at ffffffffae7994ec
#21 [ffff9376fd783fa0] do_softirq at ffffffffae02f715
#22 [ffff9376fd783fd8] smp_apic_timer_interrupt at ffffffffae79aa88
#23 [ffff9376fd783ff0] apic_timer_interrupt at ffffffffae796fba
--- <IRQ stack> ---
#24 [ffff93492fb93e88] apic_timer_interrupt at ffffffffae796fba
[exception RIP: __audit_free+450]
RIP: ffffffffae13e5f2 RSP: ffff93492fb93f38 RFLAGS: 00000282
RAX: 00000000c000003e RBX: ffff93492fb94000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 00000000000000e4
RBP: ffff93492fb93f48 R8: ffffffff00000000 R9: ffff9318f43ad280
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000745d30
R13: 0000000000000293 R14: 000055a2e9e94588 R15: ffff93492fb93f48
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#25 [ffff93492fb93f50] auditsys at ffffffffae79612d
RIP: 00007ffd0ec626c2 RSP: 00007ffd0ec559d0 RFLAGS: 00000202
RAX: 00000000000000e4 RBX: 00007ffd0ec55bf0 RCX: 0000000000000004
RDX: 0000000000000000 RSI: 00007ffd0ec55ba0 RDI: 0000000000000004
RBP: 00007ffd0ec55b80 R8: 000055a2e9e94588 R9: 0000000100000000
R10: ffffffff00000000 R11: 0000000000000293 R12: 000055a2e3da7c30
R13: 0000000000000000 R14: 0000000000000001 R15: 00007ffd0ec55bf0
ORIG_RAX: 00000000000000e4 CS: 0033 SS: 002b
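
(For context: the output above comes from the crash utility run against the captured vmcore. A typical invocation on RHEL 7, assuming the matching kernel-debuginfo package is installed and kdump wrote the dump under /var/crash, looks like the following; the dump directory name is a placeholder.)

    crash /usr/lib/debug/lib/modules/3.10.0-1160.21.1.el7.x86_64/vmlinux /var/crash/<dump-dir>/vmcore
    crash> bt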

Thanks for reporting this. It’s being tracked in internal bug number 3279571. The bug tracker is not public but you can refer to this number in future correspondence.

Hi @aplattner, I have the same issue as described above on multiple CentOS 7 systems running the versions below:

NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3
kernel 3.10.0-1160.6.1.el7.x86_64

Would it be possible to have an update on bug 3279571? Is there any workaround, or is the fix in another release?

Thanks

I face the same problem on EulerOS V2R7 with Driver Version 460.80. @aplattner, is it fixed in another release?

Hi @aplattner, we see this also and have for some time, on at least RTX 2080 Ti, RTX 6000, RTX 8000, and A6000 GPUs. Various driver versions, most recently 460.73.01. CentOS kernel 3.10.0-1160.31.1.el7.x86_64 #1. Seen on different Supermicro systems.

Thanks,
Steve Nadas

Hi @aplattner, we are facing the same issue with driver ver. 460.73.01 or 460.32.03.

  1. RHEL 7.6 with Kernel v. 3.10.0-1160.31.1.el7.x86_64
  2. Driver ver. 460.73.01 or 460.32.03
  3. HPE XL270d
  4. Tesla V100 (8x GPU Cards)

We can't take the machines into production if we don't have a fix. Delaying projects is no fun. Thanks.

Solution for me: I upgraded the system to RHEL 7.9 with the 7.9 kernel (3.10.0-1160.31.1.el7.x86_64), and NVIDIA driver version 460.32.03 runs stable, 5 days now without a crash. It is interesting that I didn't get any help from NVIDIA (enterprise), because I don't run a GRID version, nor from HPE, because this configuration is not supported.

Update: problem not resolved; it crashed again after 5 days.

Any update on bug 3279571?

That bug should have been fixed in 470.94. Are you experiencing a crash? If so, please generate and attach a new bug report log.
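
(For reference, assuming a standard driver installation: the bug report log is generated by the nvidia-bug-report.sh script that ships with the driver; run it as root shortly after the crash and attach the resulting nvidia-bug-report.log.gz.)

    sudo nvidia-bug-report.sh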

We resolved our problem by enabling persistence mode (see the command sketch after this reply). It may seem trivial, but we had absolutely no crashes with persistence mode enabled (whether set manually or via the persistence daemon), and as soon as it was disabled, crashes showed up again. We re-enabled it and have had no crashes for months now.

EDIT: Just to clarify, we saw crashes for months on several systems, but once we enabled pm, the crashes were gone on all systems. We disabled pm just to confirm our findings.
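
(For reference, a minimal sketch of the workaround described above, assuming nvidia-smi and the nvidia-persistenced service from a standard driver installation; a setting made with nvidia-smi does not survive a reboot, so the persistence daemon is the usual way to keep it enabled.)

    # enable persistence mode on all GPUs (root required; resets at reboot)
    sudo nvidia-smi -pm 1

    # check the current setting
    nvidia-smi -q | grep -i "persistence mode"

    # if the driver package provides the nvidia-persistenced service,
    # keep the setting applied across reboots
    sudo systemctl enable nvidia-persistenced
    sudo systemctl start nvidia-persistenced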