Is the presence of hugepages on my RHEL7 system the reason why the nvidia driver crashes the entire system?

Hi All,

I’ve been chasing what’s causing my RHEL7 systems to crash on a frequent basis (every few days) and I’ve been wondering if it could be a side-effect of having enabled hugepages for my VMs on those systems.

Here’s more information:

  • PowerEdge T440 (16C/32T) with Xeon Scalable and 256 GB DDR4 RAM.
  • RHEL 7.7 x86_64
  • First I had a GTX 1050 Ti, which I then swapped for a GTX 1660 Ti (I run a 4k head on this box).

Out of those 256 GB of RAM, I had 160 GB reserved for hugepages (81920 hugepages of 2 MB each) for VMs. That left about 96 GB of normal pages for everything else (all non-hugepage processes).
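For anyone who wants to double-check those numbers, here is a minimal sketch of the arithmetic. It assumes 2 MB hugepages and binary units (1 GB = 1024 MB); the function names are just ones I made up for illustration.

```python
HUGEPAGE_MB = 2      # size of one hugepage, as configured on this box
TOTAL_RAM_GB = 256   # installed RAM

def hugepage_pool_gb(count, page_mb=HUGEPAGE_MB):
    """Size of a hugepage pool in GB (binary units, 1 GB = 1024 MB)."""
    return count * page_mb / 1024

def remaining_gb(count, total_gb=TOTAL_RAM_GB):
    """RAM left over for normal (4 KB) pages once the pool is reserved."""
    return total_gb - hugepage_pool_gb(count)

count = 81920
print(f"{count} hugepages: {hugepage_pool_gb(count):.0f} GB reserved, "
      f"{remaining_gb(count):.0f} GB left for normal pages")
```

Note that the leftover figure is an upper bound: the kernel and firmware also take their share of the non-hugepage memory.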

Here’s what I noticed:
Running a normal workload (Chrome, lots of apps) with 81920 hugepages, I’d get crashdumps once or twice per week.
With the number of hugepages reduced to 61440 (that’s 120 GB of hugepages), the crashdumps happened less often but still happened.

# grep -i failure */*txt|grep X:
[658571.593178] X: page allocation failure: order:4, mode:0x1040d0
[91039.059458] X: page allocation failure: order:4, mode:0x1040d0
[349452.766476] X: page allocation failure: order:4, mode:0x40d0
[228014.829717] X: page allocation failure: order:4, mode:0x40d0
[512663.962305] X: page allocation failure: order:4, mode:0x40d0
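As far as I understand it, an order:N failure means the kernel could not find 2^N physically contiguous pages, so order:4 would be 16 pages. Below is a rough sketch that pulls the order out of lines like the ones above and converts it to a size. It assumes the standard 4 KB base page on x86_64, and the regex and field names are my own.

```python
import re

PAGE_SIZE_KB = 4  # assumption: standard 4 KB base pages on x86_64

# Matches kernel log lines such as:
# [512663.962305] X: page allocation failure: order:4, mode:0x40d0
FAILURE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<comm>\S+): page allocation failure: "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)

def parse_failures(text):
    """Return one dict per page allocation failure found in text."""
    results = []
    for m in FAILURE_RE.finditer(text):
        order = int(m.group("order"))
        results.append({
            "timestamp": float(m.group("ts")),
            "process": m.group("comm"),
            "order": order,
            # An order-N request asks for 2**N contiguous base pages.
            "contiguous_kb": (2 ** order) * PAGE_SIZE_KB,
            "mode": m.group("mode"),
        })
    return results

sample = "[512663.962305] X: page allocation failure: order:4, mode:0x40d0"
for failure in parse_failures(sample):
    print(failure)
```

If that reading is right, every one of these failures is X asking for 64 KB of contiguous memory, which is not much on a box with this much RAM.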

Most of those times, I’d get a backtrace similar to this one:

[512664.201907] Hardware name: Dell Inc. PowerEdge T440/00X7CK, BIOS 2.2.11 06/14/2019
[512664.209543] task: ffff8e3c85a25230 ti: ffff8e24786bc000 task.ti: ffff8e24786bc000
[512664.217091] RIP: 0010:[<ffffffffc0e8c620>]  [<ffffffffc0e8c620>] _nv002454kms+0x60/0x100 [nvidia_modeset]
[512664.226757] RSP: 0018:ffff8e24786bfba0  EFLAGS: 00010202
[512664.232143] RAX: 0000000000000004 RBX: 0000000000006f80 RCX: 0000000000000004
[512664.239345] RDX: ffff8e3c52b92318 RSI: 0000000000006f80 RDI: ffff8e3c52b93008
[512664.246548] RBP: 0000000000000000 R08: 0000000000000400 R09: 0000000000000000
[512664.253749] R10: 0000000000000004 R11: ffff8e24786bf5d6 R12: 0000000000006f80
[512664.260954] R13: 0000000000006f80 R14: ffff8e3c52b93008 R15: 0000000000000001
[512664.268155] FS:  00007f79f5d36a00(0000) GS:ffff8e5b7d680000(0000) knlGS:0000000000000000
[512664.276309] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[512664.282129] CR2: 0000000000006f80 CR3: 0000000b48fcc000 CR4: 00000000007627e0
[512664.289329] PKRU: 55555554
[512664.292123] Call Trace:
[512664.294666]  [<ffffffffc0e453f7>] ? nvkms_alloc+0x27/0x70 [nvidia_modeset]
[512664.301621]  [<ffffffffc0e78cea>] ? _nv002597kms+0x3aa/0x1fe0 [nvidia_modeset]
[512664.308911]  [<ffffffffb6415258>] ? alloc_pages_current+0x98/0x110
[512664.315166]  [<ffffffffc0e47f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[512664.322198]  [<ffffffffb63e2258>] ? kmalloc_order+0x18/0x40
[512664.327842]  [<ffffffffb6420786>] ? kmalloc_order_trace+0x26/0xa0
[512664.334007]  [<ffffffffb6424d41>] ? __kmalloc+0x211/0x230
[512664.339488]  [<ffffffffc0e47f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[512664.346527]  [<ffffffffc0e48481>] ? _nv000603kms+0x31/0xe0 [nvidia_modeset]
[512664.353566]  [<ffffffffc0e47f70>] ? _nv000474kms+0x50/0x50 [nvidia_modeset]
[512664.360601]  [<ffffffffc0e49886>] ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[512664.367553]  [<ffffffffc0e46012>] ? nvkms_ioctl_common+0x42/0x80 [nvidia_modeset]
[512664.375111]  [<ffffffffc0e46113>] ? nvkms_ioctl+0xc3/0x110 [nvidia_modeset]
[512664.382300]  [<ffffffffc1fe3083>] ? nvidia_frontend_unlocked_ioctl+0x43/0x50 [nvidia]
[512664.390192]  [<ffffffffb645e270>] ? do_vfs_ioctl+0x3a0/0x5a0
[512664.395924]  [<ffffffffb644bec1>] ? __sb_end_write+0x31/0x70
[512664.401658]  [<ffffffffb645e511>] ? SyS_ioctl+0xa1/0xc0
[512664.406958]  [<ffffffffb698bede>] ? system_call_fastpath+0x25/0x2a

Is there something in the nvidia driver (latest I used was 430.50) that could make it crash when there are hugepages present in the system?