Is the presence of hugepages on my RHEL7 system the reason why the nvidia driver crashes the entire system?

vincento4him · October 12, 2019, 12:34am

Hi All,

I’ve been chasing what’s causing my RHEL7 systems to crash on a frequent basis (every few days) and I’ve been wondering if it could be a side-effect of having enabled hugepages for my VMs on those systems.

Here’s more information:

Dell PowerEdge T440 (16C/32T) with a Xeon Scalable CPU and 256 GB of DDR4 RAM.
RHEL 7.7 x86_64
I first had a GTX 1050 Ti, which I then swapped for a GTX 1660 Ti (I run a 4K display on this box).

Out of those 256 GB of RAM, I had 160 GB reserved as hugepages (81920 hugepages of 2 MB each) for VMs. That left about 96 GB of normal pages for everything else.

Here’s what I noticed:
Running a normal workload (Chrome, lots of apps) with 81920 hugepages, I’d get crashdumps once or twice per week.
With the number of hugepages reduced to 61440 (that’s 120 GB of hugepages), the crashdumps happened less often but still happened.
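
For reference, the hugepage figures above can be double-checked with a short sketch. It assumes 2 MiB hugepages and binary units (GiB); the function name `hugepages_to_gib` is made up for illustration.

```python
HUGEPAGE_SIZE_MIB = 2   # assumption: 2 MiB hugepages, as stated in the post
TOTAL_RAM_GIB = 256     # installed RAM, from the system description


def hugepages_to_gib(count, page_mib=HUGEPAGE_SIZE_MIB):
    """Return the memory reserved by `count` hugepages, in GiB."""
    return count * page_mib / 1024


for count in (81920, 61440):
    reserved = hugepages_to_gib(count)
    print(f"{count} hugepages = {reserved:.0f} GiB reserved, "
          f"{TOTAL_RAM_GIB - reserved:.0f} GiB left for normal pages")
```

With 61440 hugepages, about 136 GiB remain as normal pages, compared with 96 GiB in the original configuration.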

# grep -i failure */*txt|grep X:
127.0.0.1-2019-09-14-10:16:11/vmcore-dmesg.txt:[658571.593178] X: page allocation failure: order:4, mode:0x1040d0
127.0.0.1-2019-09-15-11:41:55/vmcore-dmesg.txt:[91039.059458] X: page allocation failure: order:4, mode:0x1040d0
127.0.0.1-2019-09-24-14:28:31/vmcore-dmesg.txt:[349452.766476] X: page allocation failure: order:4, mode:0x40d0
127.0.0.1-2019-10-03-11:54:19/vmcore-dmesg.txt:[228014.829717] X: page allocation failure: order:4, mode:0x40d0
127.0.0.1-2019-10-10-22:14:54/vmcore-dmesg.txt:[512663.962305] X: page allocation failure: order:4, mode:0x40d0
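
To make these lines easier to compare, here is a small sketch that parses one of them. An order-N allocation asks the kernel for 2^N physically contiguous pages, so with 4 KiB base pages (an assumption, but standard on x86_64) order:4 is a 64 KiB contiguous request. The regular expression and the `parse_failure` helper are illustrative only, not part of any tool.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB base pages (standard on x86_64)

# Matches "[<uptime>] <comm>: page allocation failure: order:N, mode:0x..."
FAILURE_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\] (?P<comm>\S+): page allocation failure: "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def parse_failure(line):
    """Extract the fields of a 'page allocation failure' line, or None."""
    m = FAILURE_RE.search(line)
    if m is None:
        return None
    order = int(m.group("order"))
    return {
        "comm": m.group("comm"),
        "uptime_s": float(m.group("uptime")),
        "order": order,
        # 2**order contiguous pages are needed for an order-N request
        "contiguous_bytes": PAGE_SIZE << order,
        "mode": int(m.group("mode"), 16),
    }


line = ("127.0.0.1-2019-10-10-22:14:54/vmcore-dmesg.txt:[512663.962305] "
        "X: page allocation failure: order:4, mode:0x40d0")
info = parse_failure(line)
print(info["comm"], info["order"], info["contiguous_bytes"] // 1024, "KiB")
```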

Most of those times, I’d get a backtrace similar to this one:

[512664.201907] Hardware name: Dell Inc. PowerEdge T440/00X7CK, BIOS 2.2.11 06/14/2019
[512664.209543] task: ffff8e3c85a25230 ti: ffff8e24786bc000 task.ti: ffff8e24786bc000
[512664.217091] RIP: 0010:[<ffffffffc0e8c620>]  [<ffffffffc0e8c620>] _nv002454kms+0x60/0x100 [nvidia_modeset]
[512664.226757] RSP: 0018:ffff8e24786bfba0  EFLAGS: 00010202
[512664.232143] RAX: 0000000000000004 RBX: 0000000000006f80 RCX: 0000000000000004
[512664.239345] RDX: ffff8e3c52b92318 RSI: 0000000000006f80 RDI: ffff8e3c52b93008
[512664.246548] RBP: 0000000000000000 R08: 0000000000000400 R09: 0000000000000000
[512664.253749] R10: 0000000000000004 R11: ffff8e24786bf5d6 R12: 0000000000006f80
[512664.260954] R13: 0000000000006f80 R14: ffff8e3c52b93008 R15: 0000000000000001
[512664.268155] FS:  00007f79f5d36a00(0000) GS:ffff8e5b7d680000(0000) knlGS:0000000000000000
[512664.276309] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[512664.282129] CR2: 0000000000006f80 CR3: 0000000b48fcc000 CR4: 00000000007627e0
[512664.289329] PKRU: 55555554
[512664.292123] Call Trace:
[512664.294666]  [<ffffffffc0e453f7>] ? nvkms_alloc+0x27/0x70 [nvidia_modeset]
[512664.301621]  [<ffffffffc0e78cea>] ? _nv002597kms+0x3aa/0x1fe0 [nvidia_modeset]
[512664.308911]  [<ffffffffb6415258>] ? alloc_pages_current+0x98/0x110
[512664.315166]  [<ffffffffc0e47f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[512664.322198]  [<ffffffffb63e2258>] ? kmalloc_order+0x18/0x40
[512664.327842]  [<ffffffffb6420786>] ? kmalloc_order_trace+0x26/0xa0
[512664.334007]  [<ffffffffb6424d41>] ? __kmalloc+0x211/0x230
[512664.339488]  [<ffffffffc0e47f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[512664.346527]  [<ffffffffc0e48481>] ? _nv000603kms+0x31/0xe0 [nvidia_modeset]
[512664.353566]  [<ffffffffc0e47f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[512664.360601]  [<ffffffffc0e49886>] ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[512664.367553]  [<ffffffffc0e46012>] ? nvkms_ioctl_common+0x42/0x80 [nvidia_modeset]
[512664.375111]  [<ffffffffc0e46113>] ? nvkms_ioctl+0xc3/0x110 [nvidia_modeset]
[512664.382300]  [<ffffffffc1fe3083>] ? nvidia_frontend_unlocked_ioctl+0x43/0x50 [nvidia]
[512664.390192]  [<ffffffffb645e270>] ? do_vfs_ioctl+0x3a0/0x5a0
[512664.395924]  [<ffffffffb644bec1>] ? __sb_end_write+0x31/0x70
[512664.401658]  [<ffffffffb645e511>] ? SyS_ioctl+0xa1/0xc0
[512664.406958]  [<ffffffffb698bede>] ? system_call_fastpath+0x25/0x2a

Is there something in the NVIDIA driver (the latest version I used was 430.50) that could make it crash when hugepages are present on the system?
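
One way to gather more evidence would be to watch how many free blocks of order 4 or higher remain in normal memory, since the failures above are all order:4 requests. This is only a sketch: it assumes the failures are related to fragmentation of the non-hugepage memory, which nothing in the dumps proves. The helper name and the sample text are made up; on a live system the input would come from `/proc/buddyinfo`, which lists the number of free blocks per order (0 to 10) for each zone.

```python
def free_blocks_at_or_above(buddyinfo_text, min_order=4):
    """Count free blocks of order >= min_order per (node, zone).

    Expects text in the /proc/buddyinfo layout:
    'Node 0, zone   Normal  <order 0 count> <order 1 count> ...'
    """
    result = {}
    for line in buddyinfo_text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] != "Node":
            continue
        node = parts[1].rstrip(",")
        zone = parts[3]
        counts = [int(c) for c in parts[4:]]
        result[(node, zone)] = sum(counts[min_order:])
    return result


# Made-up sample in the /proc/buddyinfo layout (not taken from the crashed box).
SAMPLE = """\
Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
Node 0, zone   Normal  52340  11873    904     12      0      0      0      0      0      0      0
"""

for (node, zone), blocks in free_blocks_at_or_above(SAMPLE).items():
    print(f"node {node} zone {zone}: {blocks} free blocks of order >= 4")
```

In the made-up sample, the Normal zone has plenty of free small pages but no free block of order 4 or higher, which is the kind of state in which an order:4 request can fail even though memory is not exhausted.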
