Still experiencing random crashdumps (RHEL7.7) : X: page allocation failure: order:4, mode:0x40d0 in nvkms_ioctl (nvidia_modeset)

Hi,

I’m still experiencing random crashdumps on RHEL7.7 due to page allocation failures in Xorg in the nvidia_modeset driver.
Here’s more information:

[983520.867208] Hardware name: Dell Inc. PowerEdge T440/00X7CK, BIOS 2.4.7 10/28/2019
[983520.867209] Call Trace:
[983520.867217]  [<ffffffff9797ac23>] dump_stack+0x19/0x1b
[983520.867223]  [<ffffffff973c3d70>] warn_alloc_failed+0x110/0x180
[983520.867226]  [<ffffffff973c897f>] __alloc_pages_nodemask+0x9df/0xbe0
[983520.867230]  [<ffffffff97416b28>] alloc_pages_current+0x98/0x110
[983520.867295]  [<ffffffffc1ddaf70>] ? _nv000489kms+0x50/0x50 [nvidia_modeset]
[983520.867299]  [<ffffffff973e3b28>] kmalloc_order+0x18/0x40
[983520.867302]  [<ffffffff97422056>] kmalloc_order_trace+0x26/0xa0
[983520.867304]  [<ffffffff97426611>] ? __kmalloc+0x211/0x230
[983520.867320]  [<ffffffffc1ddaf70>] ? _nv000489kms+0x50/0x50 [nvidia_modeset]
[983520.867322]  [<ffffffff97426611>] __kmalloc+0x211/0x230
[983520.867338]  [<ffffffffc1ddaf70>] ? _nv000489kms+0x50/0x50 [nvidia_modeset]
[983520.867353]  [<ffffffffc1dd83f7>] nvkms_alloc+0x27/0x70 [nvidia_modeset]
[983520.867374]  [<ffffffffc1e15866>] _nv002516kms+0x16/0x30 [nvidia_modeset]
[983520.867393]  [<ffffffffc1e0bbc8>] ? _nv002623kms+0x68/0x1f70 [nvidia_modeset]
[983520.867396]  [<ffffffff97416b28>] ? alloc_pages_current+0x98/0x110
[983520.867411]  [<ffffffffc1ddaf70>] ? _nv000489kms+0x50/0x50 [nvidia_modeset]
[983520.867415]  [<ffffffff973e3b28>] ? kmalloc_order+0x18/0x40
[983520.867417]  [<ffffffff97422056>] ? kmalloc_order_trace+0x26/0xa0
[983520.867419]  [<ffffffff97426611>] ? __kmalloc+0x211/0x230
[983520.867434]  [<ffffffffc1ddaf70>] ? _nv000489kms+0x50/0x50 [nvidia_modeset]
[983520.867450]  [<ffffffffc1ddb481>] ? _nv000618kms+0x31/0xe0 [nvidia_modeset]
[983520.867471]  [<ffffffffc1ddaf70>] ? _nv000489kms+0x50/0x50 [nvidia_modeset]
[983520.867488]  [<ffffffffc1ddc8c6>] ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[983520.867504]  [<ffffffffc1dd9012>] ? nvkms_ioctl_common+0x42/0x80 [nvidia_modeset]
[983520.867520]  [<ffffffffc1dd9113>] ? nvkms_ioctl+0xc3/0x110 [nvidia_modeset]
[983520.867737]  [<ffffffffc0760083>] ? nvidia_frontend_unlocked_ioctl+0x43/0x50 [nvidia]
[983520.867741]  [<ffffffff9745fb40>] ? do_vfs_ioctl+0x3a0/0x5a0
[983520.867745]  [<ffffffff97988678>] ? __do_page_fault+0x238/0x500
[983520.867748]  [<ffffffff9745fde1>] ? SyS_ioctl+0xa1/0xc0
[983520.867750]  [<ffffffff9798dede>] ? system_call_fastpath+0x25/0x2a
[983520.867752] Mem-Info:
[983520.867760] active_anon:3493200 inactive_anon:157792 isolated_anon:0
 active_file:15156414 inactive_file:14609790 isolated_file:0
 unevictable:176373 dirty:1216487 writeback:0 unstable:0
 slab_reclaimable:1069476 slab_unreclaimable:813394
 mapped:227242 shmem:161513 pagetables:51099 bounce:0

This has been happening for the past few months under various 430.x and 440.x drivers.
The system has 384 GB of memory (including 224 GB in hugepages, leaving about 160 GB in normal pages).
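
To put the numbers in the log above into perspective, here is a small Python sketch that converts them. It assumes the standard 4 KiB base page size on x86_64 and that the Mem-Info counters are page counts; the helper names are my own. An order:4 request is 2^4 contiguous pages, and the file cache alone is well over 100 GiB, so this looks more like fragmentation of the normal pages than a real shortage of memory (that is my reading, not something the log states).

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages (x86_64 default)


def order_to_bytes(order, page_size=PAGE_SIZE):
    """Size in bytes of one physically contiguous block of the given order."""
    return (1 << order) * page_size


def pages_to_gib(pages, page_size=PAGE_SIZE):
    """Convert a page count to GiB."""
    return pages * page_size / 2**30


# Values copied from the Mem-Info block in the log above.
mem_info_pages = {
    "active_anon": 3493200,
    "inactive_anon": 157792,
    "active_file": 15156414,
    "inactive_file": 14609790,
    "slab_reclaimable": 1069476,
    "slab_unreclaimable": 813394,
}

if __name__ == "__main__":
    print(f"order:4 request = {order_to_bytes(4) // 1024} KiB contiguous")
    for name, pages in mem_info_pages.items():
        print(f"{name:20s} {pages_to_gib(pages):8.1f} GiB")
```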

I’m attaching the vmcore-dmesg.txt and nvidia-bug-report.log.gz here…
nvidia-bug-report.log.gz (5.48 MB)
vmcore-dmesg.txt (1020 KB)

Please note that I’ve been experiencing these crashes with two types of GPUs:

  • GTX 1050 Ti (ASUS)
  • GTX 1660 Ti (EVGA)

(The first one was replaced by the second GPU after some time.)
I’ve also been experiencing these crashes on nvidia drivers 430.* and 440.*.

I seem to have noticed that running Google Chrome with H/W acceleration might be the culprit behind the crashes. On those systems, I have RHEL7.7 and an extensive set of CUDA libraries (listed below). So Chrome appears to be triggering a memory allocation failure in the nvkms driver, which results in a crashdump.
Any ideas?

nvidia-driver-cuda-libs-440.44-1.el7.x86_64
cuda-nvgraph-10.1.243-1.el7.x86_64
cuda-cufft-10.1.243-1.el7.x86_64
cuda-nvtx-devel-10.1.243-1.el7.x86_64
cuda-npp-devel-10.1.243-1.el7.x86_64
cuda-cusolver-10.1.243-1.el7.x86_64
cuda-nvml-devel-10.1.243-1.el7.x86_64
cuda-npp-10.1.243-1.el7.x86_64
cuda-nvvp-10.1.243-1.el7.x86_64
cuda-nvtx-10.1.243-1.el7.x86_64
cuda-cufft-devel-10.1.243-1.el7.x86_64
cuda-curand-devel-10.1.243-1.el7.x86_64
cuda-cusparse-devel-10.1.243-1.el7.x86_64
cuda-extra-libs-10.1.243-1.el7.x86_64
cuda-cudnn-7.6.4.38-2.el7.x86_64
cuda-cusolver-devel-10.1.243-1.el7.x86_64
nvidia-driver-cuda-libs-440.44-1.el7.i686
cuda-cublas-10.1.243-1.el7.x86_64
cuda-cli-tools-10.1.243-1.el7.x86_64
cuda-nvrtc-10.1.243-1.el7.x86_64
cuda-10.1.243-1.el7.x86_64
cuda-cupti-devel-10.1.243-1.el7.x86_64
cuda-cupti-10.1.243-1.el7.x86_64
cuda-nvgraph-devel-10.1.243-1.el7.x86_64
cuda-nvrtc-devel-10.1.243-1.el7.x86_64
nvidia-driver-cuda-440.44-1.el7.x86_64
cuda-cublas-devel-10.1.243-1.el7.x86_64
cuda-cudnn-devel-7.6.4.38-2.el7.x86_64
cuda-libs-10.1.243-1.el7.x86_64
cuda-cusparse-10.1.243-1.el7.x86_64
cuda-docs-10.1.243-1.el7.noarch
cuda-cudart-devel-10.1.243-1.el7.x86_64
cuda-nvjpeg-devel-10.1.243-1.el7.x86_64
cuda-devel-10.1.243-1.el7.x86_64
cuda-nsight-10.1.243-1.el7.x86_64
cuda-curand-10.1.243-1.el7.x86_64
cuda-nvjpeg-10.1.243-1.el7.x86_64
cuda-cudart-10.1.243-1.el7.x86_64

This issue looks similar to the one reported here:
https://askubuntu.com/questions/775644/system-freeze-on-monitor-wake-up-after-upgrade-to-16-04

In that report, when the system wakes up from sleep, nvkms experiences a memory allocation failure and crashes the system.
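
In case it helps with diagnosis, here is a minimal Python sketch for checking how many free contiguous blocks of order 4 or higher each zone still has, based on /proc/buddyinfo. It assumes the usual "Node N, zone NAME count0 count1 ..." line layout of that file; the function names are hypothetical helpers of my own. If the fragmentation guess is right, the counts at order 4 and above should be at or near zero shortly before a failure, but I have not confirmed that on the affected systems.

```python
def parse_buddyinfo(text):
    """Return {(node, zone): [free block count per order, from order 0 up]}."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: "Node 0, zone Normal 12 7 3 ..."
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zone = parts[3]
        result[(node, zone)] = [int(n) for n in parts[4:]]
    return result


def free_blocks_at_or_above(counts, order):
    """Number of free blocks large enough to satisfy a request of this order."""
    return sum(counts[order:])


if __name__ == "__main__":
    with open("/proc/buddyinfo") as f:
        zones = parse_buddyinfo(f.read())
    for (node, zone), counts in sorted(zones.items()):
        usable = free_blocks_at_or_above(counts, 4)
        print(f"node {node} zone {zone:8s} free blocks of order >= 4: {usable}")
```

Running this periodically (for example from cron) and comparing the output with the time of the next crash would show whether the normal zone runs out of order-4 blocks.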