RHEL 7.7 + 430.52: random kernel crashes

Hi,

I’ve been using the following configuration:

  • Dell PowerEdge T440 with 2× Xeon Silver 4110
  • 256 GB RAM (160 GB in hugepages)
  • NVIDIA GTX 1050 Ti
  • RHEL 7.7 (currently running kernel 3.10.0-1062.3.2.el7)

I’ve seen several random kernel crashes in the past few weeks.
Most of the time, the crash happens after a few days of uptime, and the call trace from /var/crash//vmcore-dmesg.txt looks like this:

[658571.593178] X: page allocation failure: order:4, mode:0x1040d0
[658571.593189] CPU: 25 PID: 61292 Comm: X Kdump: loaded Tainted: P        W  OE  ------------ T 3.10.0-1062.2.1.el7.x86_64 #1
[658571.593193] Hardware name: Dell Inc. PowerEdge T440/00X7CK, BIOS 2.2.11 06/14/2019
[658571.593196] Call Trace:
[658571.593209]  [<ffffffffb23792c2>] dump_stack+0x19/0x1b
[658571.593218]  [<ffffffffb1dc23d0>] warn_alloc_failed+0x110/0x180
[658571.593227]  [<ffffffffb237485c>] __alloc_pages_slowpath+0x6b6/0x724
[658571.593234]  [<ffffffffb1dc6b84>] __alloc_pages_nodemask+0x404/0x420
[658571.593242]  [<ffffffffb1e14c68>] alloc_pages_current+0x98/0x110
[658571.593248]  [<ffffffffb1dc117e>] __get_free_pages+0xe/0x40
[658571.593253]  [<ffffffffb1e2052e>] kmalloc_order_trace+0x2e/0xa0
[658571.593257]  [<ffffffffb1e24571>] ? __kmalloc+0x211/0x230
[658571.593316]  [<ffffffffc0d67f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[658571.593320]  [<ffffffffb1e24571>] __kmalloc+0x211/0x230
[658571.593345]  [<ffffffffc0d67f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[658571.593368]  [<ffffffffc0d653f7>] nvkms_alloc+0x27/0x70 [nvidia_modeset]
[658571.593408]  [<ffffffffc0da2666>] _nv002494kms+0x16/0x30 [nvidia_modeset]
[658571.593444]  [<ffffffffc0d989a8>] ? _nv002596kms+0x68/0x1fe0 [nvidia_modeset]
[658571.593450]  [<ffffffffb1e14c68>] ? alloc_pages_current+0x98/0x110
[658571.593456]  [<ffffffffb1dc117e>] ? __get_free_pages+0xe/0x40
[658571.593475]  [<ffffffffb1e2052e>] ? kmalloc_order_trace+0x2e/0xa0
[658571.593479]  [<ffffffffb1e24571>] ? __kmalloc+0x211/0x230
[658571.593502]  [<ffffffffc0d67f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[658571.593529]  [<ffffffffc0d68481>] ? _nv000603kms+0x31/0xe0 [nvidia_modeset]
[658571.593558]  [<ffffffffc0d67f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[658571.593581]  [<ffffffffc0d69886>] ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[658571.593604]  [<ffffffffc0d66012>] ? nvkms_ioctl_common+0x42/0x80 [nvidia_modeset]
[658571.593626]  [<ffffffffc0d66113>] ? nvkms_ioctl+0xc3/0x110 [nvidia_modeset]
[658571.593839]  [<ffffffffc190d083>] ? nvidia_frontend_unlocked_ioctl+0x43/0x50 [nvidia]
[658571.593847]  [<ffffffffb1e5d9e0>] ? do_vfs_ioctl+0x3a0/0x5a0
[658571.593854]  [<ffffffffb2387678>] ? __do_page_fault+0x238/0x500
[658571.593860]  [<ffffffffb1e5dc81>] ? SyS_ioctl+0xa1/0xc0
[658571.593865]  [<ffffffffb238cede>] ? system_call_fastpath+0x25/0x2a
[658571.593868] Mem-Info:
[658571.593882] active_anon:1701807 inactive_anon:927131 isolated_anon:0
 active_file:9395666 inactive_file:7989392 isolated_file:0
 unevictable:216493 dirty:942 writeback:0 unstable:0
 slab_reclaimable:650398 slab_unreclaimable:548086
 mapped:115998 shmem:61319 pagetables:38309 bounce:0

nvidia-bug-report.log.gz (489 KB)
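For context on the failure mode: order:4 means the kernel needed 2^4 = 16 contiguous 4 KiB pages (64 KiB), which can fail under fragmentation even when plenty of memory is free overall. A quick way to check fragmentation on a running system (the awk field positions assume the standard /proc/buddyinfo layout, columns for order 0 through 10 in fields 5–15):

```shell
# /proc/buddyinfo lists, per zone, the number of free blocks of each
# order (columns are order 0 through order 10, i.e. fields 5-15).
cat /proc/buddyinfo

# Sum the free blocks of order >= 4 in each Normal zone; if this is
# near zero, order:4 allocations like the one above will fail.
awk '/Normal/ { s = 0; for (i = 9; i <= NF; i++) s += $i
                print "order>=4 free blocks:", s }' /proc/buddyinfo
```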

I had another one happen today:

[349452.766476] X: page allocation failure: order:4, mode:0x40d0
[349452.766481] CPU: 15 PID: 31950 Comm: X Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-1062.3.2.el7.x86_64 #1
[349452.766483] Hardware name: Dell Inc. PowerEdge T440/00X7CK, BIOS 2.2.11 06/14/2019
[349452.766485] Call Trace:
[349452.766494]  [<ffffffffa9d78ba4>] dump_stack+0x19/0x1b
[349452.766499]  [<ffffffffa97c24a0>] warn_alloc_failed+0x110/0x180
[349452.766502]  [<ffffffffa97c70af>] __alloc_pages_nodemask+0x9df/0xbe0
[349452.766507]  [<ffffffffa9815258>] alloc_pages_current+0x98/0x110
[349452.766552]  [<ffffffffc13b5f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[349452.766557]  [<ffffffffa97e2258>] kmalloc_order+0x18/0x40
[349452.766559]  [<ffffffffa9820786>] kmalloc_order_trace+0x26/0xa0
[349452.766562]  [<ffffffffa9824d41>] ? __kmalloc+0x211/0x230
[349452.766571]  [<ffffffffc13b5f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[349452.766573]  [<ffffffffa9824d41>] __kmalloc+0x211/0x230
[349452.766582]  [<ffffffffc13b5f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[349452.766590]  [<ffffffffc13b33f7>] nvkms_alloc+0x27/0x70 [nvidia_modeset]
[349452.766607]  [<ffffffffc13f0666>] _nv002495kms+0x16/0x30 [nvidia_modeset]
[349452.766620]  [<ffffffffc13e69a8>] ? _nv002597kms+0x68/0x1fe0 [nvidia_modeset]
[349452.766623]  [<ffffffffa97be7ee>] ? filemap_fault+0x17e/0x490
[349452.766625]  [<ffffffffa9815258>] ? alloc_pages_current+0x98/0x110
[349452.766634]  [<ffffffffc13b5f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[349452.766636]  [<ffffffffa97e2258>] ? kmalloc_order+0x18/0x40
[349452.766645]  [<ffffffffc13b6481>] ? _nv000603kms+0x31/0xe0 [nvidia_modeset]
[349452.766656]  [<ffffffffc13b5f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[349452.766665]  [<ffffffffc13b7886>] ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[349452.766674]  [<ffffffffc13b4012>] ? nvkms_ioctl_common+0x42/0x80 [nvidia_modeset]
[349452.766683]  [<ffffffffc13b4113>] ? nvkms_ioctl+0xc3/0x110 [nvidia_modeset]
[349452.766849]  [<ffffffffc2a73083>] ? nvidia_frontend_unlocked_ioctl+0x43/0x50 [nvidia]
[349452.766852]  [<ffffffffa985e270>] ? do_vfs_ioctl+0x3a0/0x5a0
[349452.766856]  [<ffffffffa9d86678>] ? __do_page_fault+0x238/0x500
[349452.766858]  [<ffffffffa985e511>] ? SyS_ioctl+0xa1/0xc0
[349452.766861]  [<ffffffffa9d8bede>] ? system_call_fastpath+0x25/0x2a
[349452.766862] Mem-Info:
[349452.766870] active_anon:2362712 inactive_anon:873251 isolated_anon:0
 active_file:8409517 inactive_file:8312840 isolated_file:0
 unevictable:216389 dirty:977 writeback:0 unstable:0
 slab_reclaimable:516471 slab_unreclaimable:545556
 mapped:137162 shmem:134830 pagetables:56169 bounce:0
 free:167111 free_pcp:799 free_cma:0

Run MemTest86 for a few hours.

Then check your GPU’s VRAM as well, e.g. with memtestCL (https://github.com/ihaque/memtestCL), an OpenCL memory tester for GPUs.

If both pass then I’ve no idea.

Hi Bridie,
The system’s memory is ECC RAM, so I would know if there had been errors. On the other hand, about 60% of the system’s memory is reserved as hugepages (for running VMs).
I will test the GPU’s VRAM using the linux64 binary of memtestCL:

$ ./memtestCL-1.00-linux64 -f -r 0 -c 0 3072
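Since roughly 60% of the RAM here is pinned in hugepages, it is also worth confirming how much is actually reserved: hugepages are taken out of the pool available to regular kernel allocations such as the failing kmalloc. A quick check (the page count below is just the arithmetic for 160 GB of default 2048 kB hugepages):

```shell
# Hugepage reservation and usage; HugePages_Total * Hugepagesize is
# memory that normal allocations can never use.
grep Huge /proc/meminfo

# 160 GB of 2048 kB hugepages corresponds to this many pages:
echo $((160 * 1024 * 1024 / 2048))
```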

Also, this looks a lot like this one:

https://www.nvidia.com/en-us/geforce/forums/game-ready-drivers/13/300440/kernel-panic-with-41874-on-el7-with-kernel-3100-95/

(even the stack trace is similar)

@NVidia dev team: I have several vmcores available if you’d like to see them:

# find  /var/crash -ls
1572865    4 drwxr-xr-x   6 root     root         4096 Sep 24 14:28 /var/crash
1572995    4 drwxr-xr-x   2 root     root         4096 Sep 14 10:16 /var/crash/127.0.0.1-2019-09-14-10:16:11
1572997 4276136 -rw-------   1 root     root     4378755730 Sep 14 10:16 /var/crash/127.0.0.1-2019-09-14-10:16:11/vmcore
1572996 1020 -rw-r--r--   1 root     root      1044049 Sep 14 10:16 /var/crash/127.0.0.1-2019-09-14-10:16:11/vmcore-dmesg.txt
1573003    4 drwxr-xr-x   2 root     root         4096 Sep 24 14:29 /var/crash/127.0.0.1-2019-09-24-14:28:31
1573005 4529220 -rw-------   1 root     root     4637916098 Sep 24 14:29 /var/crash/127.0.0.1-2019-09-24-14:28:31/vmcore
1573004 1024 -rw-r--r--   1 root     root      1045773 Sep 24 14:28 /var/crash/127.0.0.1-2019-09-24-14:28:31/vmcore-dmesg.txt
1573001    4 drwxr-xr-x   2 root     root         4096 Sep 16 22:27 /var/crash/127.0.0.1-2019-09-16-22:27:19
1573002 1020 -rw-r--r--   1 root     root      1044424 Sep 16 22:27 /var/crash/127.0.0.1-2019-09-16-22:27:19/vmcore-dmesg.txt
1572998    4 drwxr-xr-x   2 root     root         4096 Sep 15 11:42 /var/crash/127.0.0.1-2019-09-15-11:41:55
1573000 3983704 -rw-------   1 root     root     4079307200 Sep 15 11:42 /var/crash/127.0.0.1-2019-09-15-11:41:55/vmcore
1572999  628 -rw-r--r--   1 root     root       640135 Sep 15 11:41 /var/crash/127.0.0.1-2019-09-15-11:41:55/vmcore-dmesg.txt
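For whoever picks this up: the dumps should open with the crash(8) utility, given the kernel-debuginfo package matching the kernel that produced each dump (the paths below are illustrative, using the second dump from the listing above):

```shell
# The debuginfo vmlinux must match the crashed kernel
# (on RHEL: yum --enablerepo='*debug*' install kernel-debuginfo).
VMLINUX=/usr/lib/debug/lib/modules/3.10.0-1062.3.2.el7.x86_64/vmlinux
VMCORE=/var/crash/127.0.0.1-2019-09-24-14:28:31/vmcore

# Interactive session (needs root and the ~4 GB dump on disk):
#   crash "$VMLINUX" "$VMCORE"
# Useful commands inside crash: 'log' (kernel ring buffer), 'bt'
# (backtrace of the panicking task), 'kmem -i' (memory summary).
echo "crash $VMLINUX $VMCORE"
```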

Can you please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file?

Hi Aaron,
Here’s the requested file. (attached to the first message).
Thank you,
Vincent

nvidia-bug-report.log.gz (489 KB)

Is this an actual crash, or is it just a backtrace that the kernel prints when a memory allocation fails? The linked thread says “kernel panic” in its title but the message just looks like a warning rather than a panic.

If I’m decoding the backtrace correctly, the nvidia-modeset kernel module is trying to allocate a very small amount of memory and that is failing. The driver won’t be able to function if memory allocations are failing, but it should fail gracefully.

Does the problem still occur with the latest driver and/or kernel? The 3.10 kernel was originally released in June of 2013 so it’s pretty old at this point and it’s possible that this is a kernel bug that has since been fixed.
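To save a round trip, the versions in play can be reported with the commands below (the /proc file only exists while the NVIDIA module is loaded, and the rpm query obviously only applies on RPM-based systems):

```shell
uname -r                                     # running kernel
# Recently installed kernel packages (RHEL/Fedora only):
command -v rpm >/dev/null && rpm -q --last kernel | head -n 3
# Loaded NVIDIA driver build, if the module is loaded:
cat /proc/driver/nvidia/version 2>/dev/null || echo "nvidia module not loaded"
```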

Hi Aaron,
This is an actual kernel crash.
Also worth noting is this: two days ago, I replaced the ASUS GTX 1050 Ti I had in the machine with an EVGA GTX 1660 Ti (a brand new card).
Just a few moments ago, I got another kernel crash with a very similar backtrace:

[228014.829717] X: page allocation failure: order:4, mode:0x40d0
[228014.829723] CPU: 9 PID: 31338 Comm: X Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-1062.4.1.el7.x86_64 #1
[228014.829725] Hardware name: Dell Inc. PowerEdge T440/00X7CK, BIOS 2.2.11 06/14/2019
[228014.829726] Call Trace:
[228014.829736]  [<ffffffff93178ba4>] dump_stack+0x19/0x1b
[228014.829742]  [<ffffffff92bc24a0>] warn_alloc_failed+0x110/0x180
[228014.829745]  [<ffffffff92bc70af>] __alloc_pages_nodemask+0x9df/0xbe0
[228014.829750]  [<ffffffff92c15258>] alloc_pages_current+0x98/0x110
[228014.829792]  [<ffffffffc1237f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[228014.829797]  [<ffffffff92be2258>] kmalloc_order+0x18/0x40
[228014.829800]  [<ffffffff92c20786>] kmalloc_order_trace+0x26/0xa0
[228014.829802]  [<ffffffff92c24d41>] ? __kmalloc+0x211/0x230
[228014.829809]  [<ffffffffc1237f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[228014.829811]  [<ffffffff92c24d41>] __kmalloc+0x211/0x230
[228014.829818]  [<ffffffffc1237f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[228014.829826]  [<ffffffffc12353f7>] nvkms_alloc+0x27/0x70 [nvidia_modeset]
[228014.829840]  [<ffffffffc1272666>] _nv002495kms+0x16/0x30 [nvidia_modeset]
[228014.829851]  [<ffffffffc12689a8>] ? _nv002597kms+0x68/0x1fe0 [nvidia_modeset]
[228014.829853]  [<ffffffff92c15258>] ? alloc_pages_current+0x98/0x110
[228014.829861]  [<ffffffffc1237f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[228014.829863]  [<ffffffff92be2258>] ? kmalloc_order+0x18/0x40
[228014.829864]  [<ffffffff92c20786>] ? kmalloc_order_trace+0x26/0xa0
[228014.829865]  [<ffffffff92c24d41>] ? __kmalloc+0x211/0x230
[228014.829872]  [<ffffffffc1237f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[228014.829879]  [<ffffffffc1238481>] ? _nv000603kms+0x31/0xe0 [nvidia_modeset]
[228014.829888]  [<ffffffffc1237f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[228014.829896]  [<ffffffffc1239886>] ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[228014.829903]  [<ffffffffc1236012>] ? nvkms_ioctl_common+0x42/0x80 [nvidia_modeset]
[228014.829910]  [<ffffffffc1236113>] ? nvkms_ioctl+0xc3/0x110 [nvidia_modeset]
[228014.830078]  [<ffffffffc14be083>] ? nvidia_frontend_unlocked_ioctl+0x43/0x50 [nvidia]
[228014.830081]  [<ffffffff92c5e270>] ? do_vfs_ioctl+0x3a0/0x5a0
[228014.830085]  [<ffffffff93186678>] ? __do_page_fault+0x238/0x500
[228014.830087]  [<ffffffff92c5e511>] ? SyS_ioctl+0xa1/0xc0
[228014.830089]  [<ffffffff9318bede>] ? system_call_fastpath+0x25/0x2a
[228014.830090] Mem-Info:
[228014.830098] active_anon:2046793 inactive_anon:853419 isolated_anon:0
 active_file:7544223 inactive_file:7545881 isolated_file:0
 unevictable:216695 dirty:73083 writeback:0 unstable:0
 slab_reclaimable:718687 slab_unreclaimable:779984
 mapped:165988 shmem:92385 pagetables:43369 bounce:0
 free:359990 free_pcp:314 free_cma:0
[228014.830102] Node 0 DMA free:15864kB min:4kB low:4kB high:4kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15980kB managed:15896kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes

The fact that it happens with a different GPU, and with a similar backtrace, probably means it is a software issue.

I’m attaching the complete vmcore-dmesg.txt in which you’ll find the entire kernel panic message.
vmcore-dmesg.txt (1020 KB)

Here’s an archive with the 5 kernel crashes I’ve had so far (only the dmesg from the panics).

crashdumps.zip (369 KB)

The 3.10.0 version string in RHEL 7 is only there because Red Hat doesn’t rebase the kernel within a major RHEL release; most kernel bugs are still fixed regularly through backports. I am sure NVIDIA engineering is aware of this, since they work with Red Hat on RHEL support for the Tesla vGPUs.
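On that note, while waiting for a fix, a possible mitigation (my assumption, not a confirmed fix for this bug) is to keep a larger free-memory reserve and defragment it, so that order-4 requests are likelier to find 64 KiB of contiguous memory:

```shell
# Both knobs need root; the values are illustrative, not tuned.
if [ -w /proc/sys/vm/compact_memory ]; then
    echo 1 > /proc/sys/vm/compact_memory    # one-shot compaction pass
    sysctl -w vm.min_free_kbytes=1048576    # ~1 GiB free reserve
else
    echo "run as root to apply" >&2
fi
```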

Also, one more piece of information:
I am almost 100% sure that all of these crashes happened when I attempted to wake the monitor from sleep. Not every wake-up triggers it, but every crash occurred while I was trying to wake the X11 display by moving the mouse or using the keyboard.
Most of the time the system wakes up fine, but sometimes it doesn’t: it crashes, and the screen stays black until the BIOS screen appears several minutes later (probably because kdump doesn’t produce any video output).
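If the correlation with display wake holds, disabling DPMS blanking in the running X session would be a cheap way to test it (and a stopgap until the bug is fixed). This assumes a reachable X display:

```shell
# Only meaningful inside a running X session.
if command -v xset >/dev/null && [ -n "${DISPLAY:-}" ]; then
    xset -dpms       # stop X from powering the monitor down
    xset s off       # disable screensaver blanking as well
    dpms_state="dpms disabled"
else
    dpms_state="no X session available"
fi
echo "$dpms_state"
```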