RHEL7.8 + 450.57 nvkms crashdump on Quadro P2200 GPU

Hi There,
I got yet another crashdump in nvkms on RHEL 7.8 (latest kernel) using the latest stable driver (450.57).
This time the crash happened on a Quadro P2200 GPU (previous reports had been on GTX 1660 Ti GPUs).
There seem to be two aggravating factors:

  • Machine has 512 GB of RAM, of which 384 GB are set aside for hugepages.
  • Chrome was running on Xorg.

This issue is similar to earlier ones I reported on the NVIDIA forums:

[8] : https://devtalk.nvidia.com/default/topic/1064891/linux/rhel-7-7-430-50-random-kernel-panics-in-_nv002453kms-nvidia_modeset-/post/5435164/#5435164

Again, this is on a fully patched NUMA machine and the call stack trace is identical:

[235013.032911] X: page allocation failure: order:5, mode:0x40d0
[235013.032916] CPU: 50 PID: 9887 Comm: X Kdump: loaded Tainted: P           OE  ------------ T 3.10.0-1127.18.2.el7.x86_64 #1
[235013.032918] Hardware name: Dell Inc. PowerEdge T640/04WYPY, BIOS 2.8.1 06/29/2020
[235013.032919] Call Trace:
[235013.032930]  [<ffffffffaa97ffa5>] dump_stack+0x19/0x1b
[235013.032934]  [<ffffffffaa3c4b70>] warn_alloc_failed+0x110/0x180
[235013.032936]  [<ffffffffaa97b4c0>] __alloc_pages_slowpath+0x6bb/0x729
[235013.032939]  [<ffffffffaa3c91f6>] __alloc_pages_nodemask+0x436/0x450
[235013.032943]  [<ffffffffaa418ea8>] alloc_pages_current+0x98/0x110
[235013.032946]  [<ffffffffaa3e57c8>] kmalloc_order+0x18/0x40
[235013.032949]  [<ffffffffaa424466>] kmalloc_order_trace+0x26/0xa0
[235013.032951]  [<ffffffffaa4283f1>] ? __kmalloc+0x211/0x230
[235013.032952]  [<ffffffffaa4283f1>] __kmalloc+0x211/0x230
[235013.033006]  [<ffffffffc22ee3f7>] nvkms_alloc+0x27/0x70 [nvidia_modeset]
[235013.033021]  [<ffffffffc232ce86>] _nv002654kms+0x16/0x30 [nvidia_modeset]
[235013.033034]  [<ffffffffc2324066>] ? _nv002760kms+0x66/0x1470 [nvidia_modeset]
[235013.033045]  [<ffffffffc22f1090>] ? _nv000531kms+0x50/0x50 [nvidia_modeset]
[235013.033046]  [<ffffffffaa3e57c8>] ? kmalloc_order+0x18/0x40
[235013.033047]  [<ffffffffaa424466>] ? kmalloc_order_trace+0x26/0xa0
[235013.033048]  [<ffffffffaa4283f1>] ? __kmalloc+0x211/0x230
[235013.033058]  [<ffffffffc22f1090>] ? _nv000531kms+0x50/0x50 [nvidia_modeset]
[235013.033068]  [<ffffffffc22f15a1>] ? _nv000673kms+0x31/0xe0 [nvidia_modeset]
[235013.033082]  [<ffffffffc22f1090>] ? _nv000531kms+0x50/0x50 [nvidia_modeset]
[235013.033092]  [<ffffffffc22f29f6>] ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[235013.033102]  [<ffffffffc22ef022>] ? nvkms_ioctl_common+0x42/0x80 [nvidia_modeset]
[235013.033112]  [<ffffffffc22ef123>] ? nvkms_ioctl+0xc3/0x110 [nvidia_modeset]
[235013.033206]  [<ffffffffc068f083>] ? nvidia_frontend_unlocked_ioctl+0x43/0x50 [nvidia]
[235013.033210]  [<ffffffffaa462890>] ? do_vfs_ioctl+0x3a0/0x5b0
[235013.033213]  [<ffffffffaa98d678>] ? __do_page_fault+0x238/0x500
[235013.033214]  [<ffffffffaa462b41>] ? SyS_ioctl+0xa1/0xc0
[235013.033217]  [<ffffffffaa992ed2>] ? system_call_fastpath+0x25/0x2a
[235013.033218] Mem-Info:
[235013.033236] active_anon:3511544 inactive_anon:1120792 isolated_anon:0
 active_file:12973796 inactive_file:10215559 isolated_file:32
 unevictable:493405 dirty:156 writeback:0 unstable:0
 slab_reclaimable:1308485 slab_unreclaimable:1154922
 mapped:223515 shmem:318761 pagetables:65619 bounce:0
 free:1136004 free_pcp:516 free_cma:0
[235013.033242] Node 0 DMA free:15864kB min:60kB low:72kB high:88kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15980kB managed:15896kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[235013.033246] lowmem_reserve[]: 0 1333 257060 257060
[235013.033252] Node 0 DMA32 free:1023476kB min:5428kB low:6784kB high:8140kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1693292kB managed:1365580kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[235013.033255] lowmem_reserve[]: 0 0 255727 255727
[235013.033260] Node 0 Normal free:1899692kB min:1041188kB low:1301484kB high:1561780kB active_anon:6640212kB inactive_anon:2113984kB active_file:27761332kB inactive_file:18884208kB unevictable:86492kB isolated(anon):0kB isolated(file):0kB present:266076160kB managed:261864516kB mlocked:86084kB dirty:336kB writeback:0kB mapped:613196kB shmem:1095788kB slab_reclaimable:3068880kB slab_unreclaimable:2756384kB kernel_stack:46944kB pagetables:149248kB unstable:0kB bounce:0kB free_pcp:1312kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[235013.033264] lowmem_reserve[]: 0 0 0 0
[235013.033269] Node 1 Normal free:1604984kB min:1050468kB low:1313084kB high:1575700kB active_anon:7405964kB inactive_anon:2369184kB active_file:24133852kB inactive_file:21978028kB unevictable:1887128kB isolated(anon):0kB isolated(file):128kB present:268435456kB managed:264198980kB mlocked:1887128kB dirty:288kB writeback:0kB mapped:280864kB shmem:179256kB slab_reclaimable:2165060kB slab_unreclaimable:1863272kB kernel_stack:41536kB pagetables:113228kB unstable:0kB bounce:0kB free_pcp:752kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[235013.033272] lowmem_reserve[]: 0 0 0 0
[235013.033274] Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15864kB
[235013.033281] Node 0 DMA32: 5*4kB (UM) 6*8kB (UM) 5*16kB (UM) 5*32kB (UM) 5*64kB (UM) 7*128kB (UM) 2*256kB (UM) 3*512kB (M) 4*1024kB (UM) 2*2048kB (UM) 247*4096kB (M) = 1023476kB
[235013.033287] Node 0 Normal: 196675*4kB (UEM) 102907*8kB (UEM) 9099*16kB (UEM) 2016*32kB (UEM) 818*64kB (UEM) 137*128kB (UEM) 31*256kB (UEM) 6*512kB (UEM) 0*1024kB 0*2048kB 0*4096kB = 1900948kB

vmcore-dmesg.txt (1017.8 KB)

nvidia-bug-report.log.bz2.gz (3.1 MB)
(I bzip2’d the bug report file and then gzip’ed it again to get your site to accept it as an attachment.)

I’m working with Red Hat through a support case. Here is some information gathered from the case:

  • vm.overcommit_memory was set to 0. Memory was fragmented, which explains why the driver couldn’t allocate an order-5 block of contiguous memory (as in the log above).
  • Page allocation failures at order 3 and below result in an OOM kill; order 4 and above default to a panic (hence the crashdump).
  • RH support raised this question:
    “I’m not sure why the driver needs physical memory in this context. It might be possible for nvidia to change to driver to use vmalloc instead of kmalloc here.”
  • The page allocation failure happens when waking up from sleep, so the failing code path seems confined to resume.
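For anyone else hitting this, the order-5 failure above can be decoded with a quick sketch like this (my own illustration, not from the support case): an order-n allocation needs 2^n physically contiguous 4 KiB pages, and /proc/buddyinfo shows how many free blocks of each order each node still has.

```shell
# An order-n allocation is 2^n contiguous 4 KiB pages, so the
# failing order-5 request is for 128 KiB of contiguous memory.
order=5
echo "order-$order = $(( (1 << order) * 4 )) KiB"

# Free blocks per order (columns are order 0..10) for each node/zone;
# zeros in the rightmost columns mean the higher orders are fragmented away.
cat /proc/buddyinfo
```

In my crash log, Node 0 Normal shows 0 free blocks at orders 8 through 10 and only a handful at order 5 and up, which matches the fragmentation diagnosis.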

In the meantime, I’ve taken these mitigation measures:

  • Dropped the hugepage reservation from 384 GB to 320 GB.
  • Set vm.overcommit_memory to ‘1’ instead of ‘0’, something I never wanted to do because always overcommitting is not really good practice.
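For reference, the two mitigations correspond to a sysctl fragment like the one below. The filename and the 2 MiB hugepage-size assumption are mine, not from the support case; adjust the page math if your machine uses 1 GiB hugepages.

```shell
# /etc/sysctl.d/99-nvkms-mitigation.conf (hypothetical filename)
# 320 GiB reserved as 2 MiB hugepages: 320 * 1024 / 2 = 163840 pages
vm.nr_hugepages = 163840
# Always overcommit (1) instead of the heuristic default (0)
vm.overcommit_memory = 1
```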

I’ll keep posting updates here.