I’m jealous. I’ve been reporting nvkms crashdumps in the ‘stable’ driver for over a -year- and you guys got NVidia to fix the issue in less than 5 months!
Hi There,
I got yet another crashdump in nvkms on RHEL7.8 (latest kernel) using the latest stable driver (450.57).
This time the crash happened on a Quadro P2200 GPU (previous reports had been on GTX 1660 Ti GPUs).
There seem to be aggravating factors:
Machine has 512 GB RAM, of which 384 GB are set aside for hugepages.
Chrome was running on Xorg.
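For anyone trying to reproduce this: the `order:4` in the page allocation failures quoted below means the kernel was asked for 2^4 = 16 physically contiguous pages, which is 64 KiB assuming the usual 4 KiB base page size on x86_64. With most of the RAM reserved for hugepages, the remaining memory may fragment more easily, so it can help to watch how many free blocks of order 4 or higher are left. Here is a rough sketch of how that could be checked from `/proc/buddyinfo`. The helper names are my own, and the parser assumes the standard `Node N, zone NAME c0 c1 ...` layout of that file:

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages (typical for x86_64)


def order_bytes(order, page_size=PAGE_SIZE):
    """Size in bytes of one physically contiguous block of the given order."""
    return page_size << order


def free_blocks_at_least(buddyinfo_text, min_order):
    """Count free blocks of order >= min_order per (node, zone).

    buddyinfo_text is the content of /proc/buddyinfo, where each line
    lists the number of free blocks for order 0, 1, 2, ... in that zone.
    """
    result = {}
    for line in buddyinfo_text.splitlines():
        parts = line.split()
        # Expected shape: Node 0, zone Normal c0 c1 c2 ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = parts[1].rstrip(",")
        zone = parts[3]
        counts = [int(c) for c in parts[4:]]
        result[(node, zone)] = sum(counts[min_order:])
    return result
```

On the affected machine this would be called as `free_blocks_at_least(open("/proc/buddyinfo").read(), 4)`. A count that is at or near zero for the Normal zone shortly before a crash would be consistent with fragmentation, though it would not prove that fragmentation is the cause.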
This issue is similar to:
Here are some of my reports on the NVidia website:
[8] : RHEL 7.7 + 430.50 : random kernel panics in _nv002453…
Hi,
I’m still experiencing crashdumps on RHEL7.8 + 440.82 nvidia driver.
System info: Dell PowerEdge T640, 512 GB RAM, 72 cores, NVidia GTX 1660 Ti.
The kernel crashdump shows this:
[679428.470206] X: page allocation failure: order:4, mode:0x40d0
[679428.470211] CPU: 8 PID: 14300 Comm: X Kdump: loaded Tainted: P W OE ------------ T 3.10.0-1127.8.2.el7.x86_64 #1
[679428.470212] Hardware name: Dell Inc. PowerEdge T640/04WYPY, BIOS 2.5.4 01/14/2020
[679428.470213] Call Trace:
[679…
Hi NVidia,
I just got another crashdump on RHEL7.7 (latest patches) + driver 440.59.
The backtrace shows:
[1171872.727617] CPU: 28 PID: 201742 Comm: X Kdump: loaded Tainted: P W OE ------------ T 3.10.0-1062.12.1.el7.x86_64 #1
[1171872.727621] Hardware name: Dell Inc. PowerEdge T440/00X7CK, BIOS 2.4.8 11/27/2019
[1171872.727624] Call Trace:
[1171872.727637] [<ffffffff8997ac43>] dump_stack+0x19/0x1b
[1171872.727647] [<ffffffff893c3d90>] warn_alloc_failed+0x110/0x180
[1171872.727653]…
Here is yet another crashdump (I have been reporting these since September 2019) on RHEL7 with the NVidia driver:
[810353.231443] X: page allocation failure: order:4, mode:0x40d0
[810353.231456] CPU: 0 PID: 29465 Comm: X Kdump: loaded Tainted: P OE ------------ T 3.10.0-1062.9.1.el7.x86_64 #1
[810353.231461] Hardware name: Dell Inc. PowerEdge T440/00X7CK, BIOS 2.4.8 11/27/2019
[810353.231464] Call Trace:
[810353.231483] [<ffffffff84d7ac23>] dump_stack+0x19/0x1b
[810353.231496] [<fffff…
Hi,
I’m still experiencing random crashdumps on RHEL7.7 due to page allocation failures in Xorg in the nvidia_modeset driver.
Here’s more information:
[983520.867208] Hardware name: Dell Inc. PowerEdge T440/00X7CK, BIOS 2.4.7 10/28/2019
[983520.867209] Call Trace:
[983520.867217] [<ffffffff9797ac23>] dump_stack+0x19/0x1b
[983520.867223] [<ffffffff973c3d70>] warn_alloc_failed+0x110/0x180
[983520.867226] [<ffffffff973c897f>] __alloc_pages_nodemask+0x9df/0xbe0
[983520.867230] [<ffffffff97416…
The kernel version string in RHEL7 still reads 3.10.0 only because Red Hat doesn’t rebase the kernel within a major RHEL version; most kernel bug fixes are regularly backported into it. I am sure NVidia engineering is aware of this, since they are working with Red Hat on RHEL support for the Tesla vGPUs.
@amrits @aplattner