Still getting kernel crashdumps on RHEL7.8 + 440.82 in nvkms_alloc

Hi,
I’m still experiencing kernel crashdumps on RHEL 7.8 with the 440.82 NVIDIA driver.
System info: Dell PowerEdge T640, 512 GB RAM, 72 cores, NVIDIA GTX 1660 Ti.

The kernel crashdump shows this:

[
[679428.470206] X: page allocation failure: order:4, mode:0x40d0
[679428.470211] CPU: 8 PID: 14300 Comm: X Kdump: loaded Tainted: P W OE ------------ T 3.10.0-1127.8.2.el7.x86_64 #1
[679428.470212] Hardware name: Dell Inc. PowerEdge T640/04WYPY, BIOS 2.5.4 01/14/2020
[679428.470213] Call Trace:
[679428.470224] [] dump_stack+0x19/0x1b
[679428.470229] [] warn_alloc_failed+0x110/0x180
[679428.470231] [] __alloc_pages_slowpath+0x6bb/0x729
[679428.470234] [] __alloc_pages_nodemask+0x436/0x450
[679428.470238] [] alloc_pages_current+0x98/0x110
[679428.470293] [] ? _nv000491kms+0x50/0x50 [nvidia_modeset]
[679428.470295] [] kmalloc_order+0x18/0x40
[679428.470299] [] kmalloc_order_trace+0x26/0xa0
[679428.470301] [] ? __kmalloc+0x211/0x230
[679428.470309] [] ? _nv000491kms+0x50/0x50 [nvidia_modeset]
[679428.470310] [] __kmalloc+0x211/0x230
[679428.470317] [] ? _nv000491kms+0x50/0x50 [nvidia_modeset]
[679428.470325] [] nvkms_alloc+0x27/0x70 [nvidia_modeset]
[679428.470338] [] _nv002521kms+0x16/0x30 [nvidia_modeset]
[679428.470349] [] ? _nv002628kms+0x68/0x1f70 [nvidia_modeset]
[679428.470350] [] ? __alloc_pages_nodemask+0x90/0x450
[679428.470352] [] ? alloc_pages_current+0x98/0x110
[679428.470359] [] ? _nv000491kms+0x50/0x50 [nvidia_modeset]
[679428.470360] [] ? kmalloc_order+0x18/0x40
[679428.470361] [] ? kmalloc_order_trace+0x26/0xa0
[679428.470362] [] ? __kmalloc+0x211/0x230
[679428.470369] [] ? _nv000491kms+0x50/0x50 [nvidia_modeset]
[679428.470377] [] ? _nv000620kms+0x31/0xe0 [nvidia_modeset]
[679428.470387] [] ? _nv000491kms+0x50/0x50 [nvidia_modeset]
[679428.470395] [] ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[679428.470402] [] ? nvkms_ioctl_common+0x42/0x80 [nvidia_modeset]
[679428.470410] [] ? nvkms_ioctl+0xc3/0x110 [nvidia_modeset]
[679428.470560] [] ? nvidia_frontend_unlocked_ioctl+0x43/0x50 [nvidia]
[679428.470565] [] ? do_vfs_ioctl+0x3a0/0x5b0
[679428.470569] [] ? __do_page_fault+0x238/0x500
[679428.470570] [] ? SyS_ioctl+0xa1/0xc0
[679428.470572] [] ? system_call_fastpath+0x25/0x2a
]

The related mem-info is as follows:

[679428.470573] Mem-Info:
[679428.470588] active_anon:5634519 inactive_anon:2811465 isolated_anon:64
active_file:6801803 inactive_file:4778559 isolated_file:0
unevictable:352533 dirty:532 writeback:0 unstable:0
slab_reclaimable:1346419 slab_unreclaimable:921230
mapped:4523211 shmem:4444282 pagetables:62269 bounce:0
free:1728559 free_pcp:10125 free_cma:0
[679428.470593] Node 0 DMA free:14780kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15980kB managed:15896kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[679428.470598] lowmem_reserve: 0 1334 257063 257063
[679428.470602] Node 0 DMA32 free:1021324kB min:2716kB low:3392kB high:4072kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1694152kB managed:1366440kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[679428.470606] lowmem_reserve: 0 0 255729 255729
[679428.470610] Node 0 Normal free:778208kB min:520592kB low:650740kB high:780888kB active_anon:11358364kB inactive_anon:5102140kB active_file:15021592kB inactive_file:12824032kB unevictable:1362396kB isolated(anon):0kB isolated(file):0kB present:266076160kB managed:261866948kB mlocked:1362400kB dirty:1600kB writeback:0kB mapped:5404608kB shmem:5167296kB slab_reclaimable:2361348kB slab_unreclaimable:1840428kB kernel_stack:50112kB pagetables:160520kB unstable:0kB bounce:0kB free_pcp:21648kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[679428.470614] lowmem_reserve: 0 0 0 0
[679428.470618] Node 1 Normal free:5099924kB min:525232kB low:656540kB high:787848kB active_anon:11179712kB inactive_anon:6143720kB active_file:12185620kB inactive_file:6290204kB unevictable:47736kB isolated(anon):256kB isolated(file):0kB present:268435456kB managed:264201412kB mlocked:47736kB dirty:528kB writeback:0kB mapped:12688236kB shmem:12609832kB slab_reclaimable:3024328kB slab_unreclaimable:1844460kB kernel_stack:32448kB pagetables:88556kB unstable:0kB bounce:0kB free_pcp:18852kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[679428.470620] lowmem_reserve: 0 0 0 0
[679428.470622] Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 1*32kB (U) 0*64kB 1*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 1*2048kB (M) 3*4096kB (M) = 14780kB
[679428.470628] Node 0 DMA32: 7*4kB (UM) 6*8kB (UM) 4*16kB (UM) 4*32kB (UM) 6*64kB (UM) 4*128kB (UM) 3*256kB (M) 5*512kB (UM) 5*1024kB (UM) 2*2048kB (UM) 246*4096kB (M) = 1021324kB
[679428.470634] Node 0 Normal: 39125*4kB (UEM) 77562*8kB (UE) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 776996kB
[679428.470639] Node 1 Normal: 324973*4kB (UEM) 242527*8kB (UEM) 106010*16kB (UEM) 4405*32kB (UEM) 288*64kB (UEM) 25*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5099116kB
[679428.470645] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[679428.470646] Node 0 hugepages_total=98304 hugepages_free=82974 hugepages_surp=0 hugepages_size=2048kB
[679428.470647] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[679428.470648] Node 1 hugepages_total=98304 hugepages_free=90082 hugepages_surp=0 hugepages_size=2048kB
[679428.470649] 16053659 total pagecache pages
[679428.470652] 0 pages in swap cache
[679428.470653] Swap cache stats: add 0, delete 0, find 0/0
[679428.470653] Free swap = 8388604kB
[679428.470654] Total swap = 8388604kB
[679428.470655] 134055437 pages RAM
[679428.470656] 0 pages HighMem/MovableOnly
[679428.470656] 2192763 pages reserved
[679428.470679] BUG: unable to handle kernel paging request at 0000000000006f80
[679428.477752] IP: [] _nv002476kms+0x60/0x100 [nvidia_modeset]
[679428.484991] PGD 0
[679428.487108] Oops: 0000 [#1] SMP
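
For anyone trying to read the free-list lines in the Mem-Info dump: the failed request was order:4, i.e. 16 contiguous 4 kB pages (64 kB). The following is a rough Python sketch, not a diagnosis, that counts how many free blocks of at least that size each zone reports. It assumes the kernel's usual `count*sizekB` layout for those lines; the two sample strings are the Node 0 Normal and Node 1 Normal lines from the dump above, and the function names are made up for illustration. It ignores watermarks, migrate types and NUMA policy, so it only hints at fragmentation.

```python
import re

PAGE_KB = 4  # base page size on x86_64, in kB


def free_blocks_by_order(line):
    """Map allocation order -> free block count for one buddy free-list line."""
    blocks = {}
    for count, size_kb in re.findall(r"(\d+)\*(\d+)kB", line):
        # A block of 2**order pages: 4kB -> order 0, 64kB -> order 4, ...
        order = (int(size_kb) // PAGE_KB).bit_length() - 1
        blocks[order] = int(count)
    return blocks


def blocks_at_or_above(blocks, order):
    """Number of free blocks large enough to satisfy a request of `order`."""
    return sum(n for o, n in blocks.items() if o >= order)


def total_free_kb(blocks):
    """Total free memory described by the line, for a sanity check."""
    return sum(n * PAGE_KB * (1 << o) for o, n in blocks.items())


# Lines taken from the Mem-Info dump in this post (timestamps removed).
NODE0_NORMAL = (
    "Node 0 Normal: 39125*4kB (UEM) 77562*8kB (UE) 0*16kB 0*32kB 0*64kB "
    "0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 776996kB"
)
NODE1_NORMAL = (
    "Node 1 Normal: 324973*4kB (UEM) 242527*8kB (UEM) 106010*16kB (UEM) "
    "4405*32kB (UEM) 288*64kB (UEM) 25*128kB (UM) 1*256kB (M) 0*512kB "
    "0*1024kB 0*2048kB 0*4096kB = 5099116kB"
)

for name, line in (("Node 0 Normal", NODE0_NORMAL), ("Node 1 Normal", NODE1_NORMAL)):
    blocks = free_blocks_by_order(line)
    print(name, "free kB:", total_free_kb(blocks),
          "blocks >= order 4:", blocks_at_or_above(blocks, 4))
```

On these two lines, Node 0 Normal has no free block larger than 8 kB even though about 760 MB is free in total, which would be consistent with the order-4 allocation failing because of fragmentation rather than an overall lack of memory. The subsequent oops in `_nv002476kms` is a separate question that this sketch does not address.
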

Here’s the call trace:

[679428.705013]
[679428.705213] CPU: 8 PID: 14300 Comm: X Kdump: loaded Tainted: P W OE ------------ T 3.10.0-1127.8.2.el7.x86_64 #1
[679428.716217] Hardware name: Dell Inc. PowerEdge T640/04WYPY, BIOS 2.5.4 01/14/2020
[679428.723764] task: ffff95b590b262a0 ti: ffff95b58fa24000 task.ti: ffff95b58fa24000
[679428.731313] RIP: 0010:[] [] _nv002476kms+0x60/0x100 [nvidia_modeset]
[679428.740969] RSP: 0000:ffff95b58fa27ba0 EFLAGS: 00010202
[679428.746356] RAX: 0000000000000004 RBX: 0000000000006f80 RCX: 0000000000000004
[679428.753557] RDX: ffff9577b380d318 RSI: 0000000000006f80 RDI: ffff9577b380f008
[679428.760759] RBP: 0000000000000000 R08: 0000000000000400 R09: 0000000000000000
[679428.767962] R10: 0000000000000004 R11: ffff95b58fa2756e R12: 0000000000006f80
[679428.775164] R13: 0000000000006f80 R14: ffff9577b380f008 R15: 0000000000000001
[679428.782364] FS: 00007fe4110b1a00(0000) GS:ffff95a7ae100000(0000) knlGS:0000000000000000
[679428.790519] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[679428.796336] CR2: 0000000000006f80 CR3: 0000000e03b3a000 CR4: 00000000007627e0
[679428.803538] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[679428.810739] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[679428.817942] PKRU: 55555554
[679428.820732] Call Trace:
[679428.823269] [] ? nvkms_alloc+0x27/0x70 [nvidia_modeset]
[679428.830221] [] ? _nv002628kms+0x3aa/0x1f70 [nvidia_modeset]
[679428.837509] [] ? alloc_pages_current+0x98/0x110
[679428.843762] [] ? _nv000491kms+0x50/0x50 [nvidia_modeset]
[679428.850786] [] ? kmalloc_order+0x18/0x40
[679428.856434] [] ? kmalloc_order_trace+0x26/0xa0
[679428.862597] [] ? __kmalloc+0x211/0x230
[679428.868076] [] ? _nv000491kms+0x50/0x50 [nvidia_modeset]
[679428.875111] [] ? _nv000620kms+0x31/0xe0 [nvidia_modeset]
[679428.882145] [] ? _nv000491kms+0x50/0x50 [nvidia_modeset]
[679428.889180] [] ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[679428.896129] [] ? nvkms_ioctl_common+0x42/0x80 [nvidia_modeset]
[679428.903678] [] ? nvkms_ioctl+0xc3/0x110 [nvidia_modeset]
[679428.910785] [] ? nvidia_frontend_unlocked_ioctl+0x43/0x50 [nvidia]
[679428.918681] [] ? do_vfs_ioctl+0x3a0/0x5b0
[679428.924414] [] ? __do_page_fault+0x238/0x500
[679428.930404] [] ? SyS_ioctl+0xa1/0xc0
[679428.935704] [] ? system_call_fastpath+0x25/0x2a
[679428.941952] Code: 2a eb 40 0f 1f 84 00 00 00 00 00 48 c7 03 00 00 00 00 c6 43 08 00 41 8b 86 d8 00 00 00 83 c5 01 48 81 c3 d0 03 00 00 39 e8 76 18 <48> 8b 3b 48 85 ff 74 ea 80 7b 08 00 75 d2 e8 ed d2 ff ff eb cb
[679428.962297] RIP [] _nv002476kms+0x60/0x100 [nvidia_modeset]
[679428.969628] RSP
[679428.973199] CR2: 0000000000006f80

Here is the nvidia-bug-report.log…

I’m also seeing this same issue with the latest NVIDIA driver.

Hello,

We are having a similar problem and I’m wondering if you ever found a solution?

Thank you.

Hi,
I’m currently on EL7.9 with the NVIDIA stable driver 460.67, and I have not had crashes since December 2020.
What kernel/EL and driver version are you on?
Vincent
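
In case it helps with answering the version question, here is a small Python sketch that collects both values in one go. It assumes the NVIDIA kernel module is loaded, in which case the driver exposes its version string in `/proc/driver/nvidia/version`; if that file is missing, the driver entry is simply reported as `None`. The function names are only illustrative.

```python
import platform
from pathlib import Path


def driver_version_line(path="/proc/driver/nvidia/version"):
    """Return the first line of the NVIDIA version file, or None if unreadable."""
    try:
        return Path(path).read_text().splitlines()[0].strip()
    except (OSError, IndexError):
        # File absent (module not loaded) or empty.
        return None


def report():
    """Kernel release (same value as `uname -r`) plus the driver version line."""
    return {"kernel": platform.release(), "nvidia": driver_version_line()}


print(report())
```
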

Kernel is 3.10.0-1160.21.1.el7.x86_64

Thank you for responding. We are running 3.10.0-1160.21.1.el7.x86_64 and 460.32.03.