Gfx crash using libreoffice-impress gives Xid errors 32 and 69 on 440.36-1 on Fedora 31

I’m having issues with libreoffice-impress crashing Xorg and the NVIDIA driver on Fedora 31.

Hardware: Dell Precision 5540 with Nvidia Quadro P2000
BIOS firmware 1.14.0 (updated using fwupdmgr upgrade, then rebooted)

Kernel 5.4.2-300.fc31.x86_64 (tested 5.3.15 as well, same thing)
xorg-x11-server-Xorg-1.20.6-1.fc31.x86_64
xorg-x11-drv-nvidia-kmodsrc-440.36-1.fc31.x86_64
libreoffice-impress-6.3.3.2-7.fc31.x86_64

When browsing through a pptx presentation, the graphics lock up with the errors below.
I thought it was related to the BIOS upgrade, but the system is stable until I run LibreOffice-Impress.

Dec 06 12:50:33 srl-torel01 kernel: NVRM: GPU at PCI:0000:01:00: GPU-a191f561-bd9f-34c4-531c-32f9e11d7474
Dec 06 12:50:33 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=2092, Channel ID 00000018 intr 00040000
Dec 06 12:50:33 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=14685, Channel ID 00000018 intr 00040000
Dec 06 12:50:33 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=2187, Channel ID 0000001b intr 00040000
Dec 06 12:50:33 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=14685, Channel ID 0000001b intr 00040000
Dec 06 12:50:33 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 69, pid=14685, Class Error: ChId 001b, Class 0000902d, Offset 00000250, Data 00005643, ErrorCode 0000000c
Dec 06 12:50:36 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=14685, Channel ID 0000001b intr 00040000
Dec 06 12:50:36 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 69, pid=14685, Class Error: ChId 001b, Class 0000902d, Offset 00000250, Data 00005643, ErrorCode 0000000c
Dec 06 12:50:39 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=14685, Channel ID 0000001b intr 00040000

Sometimes the Intel wireless driver iwlwifi (AC-9260 rev 29) crashes around the same time. Probably due to the graphics crash?

iwlwifi 0000:3b:00.0: loaded firmware version 46.6bf1df06.0 op_mode iwlmv

fwupdmgr get-updates

No upgrades for Thunderbolt controller in Dell dock, current is 40.00: 40.00=same
No upgrades for Package level of Dell dock, current is 01.00.04.01: 01.00.04.01=same
No upgrades for RTS5413 in Dell dock, current is 01.21: 01.21=same
No upgrades for RTS5487 in Dell dock, current is 01.47: 01.47=same
No upgrades for VMM5331 in Dell dock, current is 05.03.10: 05.03.10=same
No upgrades for WD19TB, current is 01.00.00.00: 01.00.00.00=same
No upgrades for System Firmware, current is 0.1.14.0: 0.1.14.0=same, 0.1.13.0=older, 0.1.12.0=older, 0.1.11.2=older, 0.1.10.1=older

Nvidia
[ 40.488] (II) NVIDIA GLX Module 440.36 Tue Nov 12 08:15:30 UTC 2019
[ 40.494] (II) NVIDIA: The X server supports PRIME Render Offload.
[ 40.495] (II) NVIDIA(0): NVIDIA GPU Quadro P2000 (GP107GL-A) at PCI:1:0:0 (GPU-0)
[ 40.495] (--) NVIDIA(0): Memory: 4194304 kBytes
[ 40.495] (--) NVIDIA(0): VideoBIOS: 86.07.63.00.24
[ 40.495] (II) NVIDIA(0): Detected PCI Express Link width: 16X

Known issue?  

Brgds,
Tor
nvidia-bug-report.log.gz (1.87 MB)

nvidia-bug-report.sh attached.

The previous crash logged a bit more information, so I am adding it here.

Dec 10 16:25:37 srl-torel01 kernel: NVRM: GPU at PCI:0000:01:00: GPU-a191f561-bd9f-34c4-531c-32f9e11d7474
Dec 10 16:25:37 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=2488, Channel ID 00000018 intr 00040000
Dec 10 16:25:37 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=16351, Channel ID 00000018 intr 00040000
Dec 10 16:25:37 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=16351, Ch 00000019, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_PROP_0 faulted @ 0x1_01320000. Fault is of type FAULT_PTE ACCESS_TYPE_WRITE
Dec 10 16:25:37 srl-torel01 kernel: show_signal_msg: 35 callbacks suppressed

I have had 3 Xid errors, 13, 31 and 32.
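
To keep track of which Xid codes show up and which process they are attributed to, the kernel log lines can be tallied with a few lines of Python. This is only a rough sketch: the regular expression is an assumption based on the line format in the logs pasted in this thread, and the helper name tally_xids is made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape, taken from the logs in this thread:
#   "... kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=2488, ..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): (\d+), pid=(\d+)")

def tally_xids(lines):
    """Count (xid_code, pid) pairs in an iterable of kernel log lines."""
    counts = Counter()
    for line in lines:
        m = XID_RE.search(line)
        if m:
            counts[(int(m.group(1)), int(m.group(2)))] += 1
    return counts

# Lines copied from the Dec 10 crash above.
sample = [
    "Dec 10 16:25:37 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=2488, Channel ID 00000018 intr 00040000",
    "Dec 10 16:25:37 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=16351, Channel ID 00000018 intr 00040000",
    "Dec 10 16:25:37 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 31, pid=16351, Ch 00000019, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_PROP_0 faulted @ 0x1_01320000. Fault is of type FAULT_PTE ACCESS_TYPE_WRITE",
    "Dec 10 16:25:37 srl-torel01 kernel: show_signal_msg: 35 callbacks suppressed",
]

for (xid, pid), n in sorted(tally_xids(sample).items()):
    print(f"Xid {xid} pid {pid}: {n}")
```

In practice the lines would come from journalctl -k output saved to a file instead of the hard-coded sample list.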

It is a bit strange that the problem coincided with the BIOS upgrade from 1.13.0 to 1.14.0, but I have to admit that I haven’t run libreoffice-impress in a while. I had no issues with Nvidia 440.31 on older kernels and BIOS 1.13.0.

Any suggestions?
nvidia-bug-report.log2.gz (1.86 MB)

That looks like defective system memory, which would also explain why your wifi driver is crashing. Please check your memory, either by removing memory modules or by running memtest86.

Don’t think it is the memory. The machine passes the free memtest86 v8.3 from Passmark. https://www.memtest86.com/ The Intel wireless AC-9260 crash is probably a separate firmware issue; I have had issues before with intel-wireless-firmware.
Anyway, I used an external 4K monitor in addition to the internal one, and the system ran fine all day with many applications except libreoffice-impress, including multiple libreoffice-calc sessions. Memory was pretty much full, no issues. It definitely seems like a driver issue. Version 440.31 worked fine.

Could it be related to bug number 2432712? https://devtalk.nvidia.com/default/topic/1058707/linux/will-the-fault_pde-access_type_read-bug-in-the-nvidia-driver-ever-be-fixed-/

No. That bug was caused by video memory being full. The XIDs 31, 13 and 69 you’re getting are all just subsequent errors of the XID 32, meaning corrupt data is sent to the GPU. This is most often caused by defective memory or a 1st gen Ryzen.
In case this is software/driver related, did you check if reverting to the 440.31 driver fixes the issue?
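
One way to check which Xid comes first in each crash is to group the log lines into bursts by timestamp and look at the leading code. The sketch below is a hypothetical helper, not an NVIDIA tool: the function name xid_bursts, the 10 second gap, and the year (syslog timestamps omit it, so 2019 is assumed) are all illustrative choices.

```python
import re
from datetime import datetime, timedelta

# Assumed syslog-style prefix ("Dec 06 12:50:33 ...") followed by an NVRM Xid entry.
LINE_RE = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) .*NVRM: Xid \([^)]*\): (\d+),"
)

def xid_bursts(lines, year=2019, gap=timedelta(seconds=10)):
    """Group Xid codes into bursts; a new burst starts after a pause longer than `gap`."""
    bursts = []
    last = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        # The year is not in the log line, so it is supplied by the caller (assumption).
        when = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        if last is None or when - last > gap:
            bursts.append([])
        bursts[-1].append(int(m.group(2)))
        last = when
    return bursts

# Lines copied from the logs posted in this thread.
sample = [
    "Dec 06 12:50:33 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=2092, Channel ID 00000018 intr 00040000",
    "Dec 06 12:50:33 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 69, pid=14685, Class Error: ChId 001b, Class 0000902d, Offset 00000250, Data 00005643, ErrorCode 0000000c",
    "Dec 06 12:50:36 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=14685, Channel ID 0000001b intr 00040000",
    "Dec 13 19:42:44 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 69, pid=2391, Class Error: ChId 0018, Class 0000c197, Offset 00001614, Data 00000000, ErrorCode 0000000d",
]

for burst in xid_bursts(sample):
    print("first Xid:", burst[0], "all:", burst)
```

The input has to be in chronological order for the grouping to make sense, so sort-ed or filtered output should be checked before feeding it in.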

Had another slightly different error msg in bold. Still think I should replace SODIMMs?

Dec 12 11:16:40 srl-torel01 kernel: mce: CPU0: Core temperature/speed normal
Dec 12 11:21:13 srl-torel01 kernel: [drm:intel_pipe_update_end [i915]] ERROR Atomic update failure on pipe A (start=280398 end=280399) time 655 us, min 2146, max 2159, scanline start 2079, end 2080
Dec 12 11:21:45 srl-torel01 kernel: NVRM: GPU at PCI:0000:01:00: GPU-a191f561-bd9f-34c4-531c-32f9e11d7474
Dec 12 11:21:45 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=2354, Channel ID 00000018 intr 00040000
Dec 12 11:21:45 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=16474, Channel ID 00000018 intr 00040000
Dec 12 11:21:45 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=16474, Channel ID 00000018 intr1 00000008 HCE_DBG0 00002a28 HCE_DBG1 0000158a

Installed Nvidia driver 440.44. Tested LibreOffice-Impress and got Xid 69, a Graphics Engine class error. Strange that it is always one app triggering the issue. I will order two new SODIMMs, replace the entire memory, and see if it helps. Thanks.

journalctl -k -b -4 | egrep -e "NVIDIA|NVRM|Xid|intel_pipe" | sort

Dec 13 19:42:04 srl-torel01 kernel: [drm:intel_pipe_update_end [i915]] ERROR Atomic update failure on pipe B (start=7253 end=7254) time 655 us, min 2146, max 2159, scanline start 2080, end 2167
Dec 13 19:42:44 srl-torel01 kernel: NVRM: GPU at PCI:0000:01:00: GPU-a191f561-bd9f-34c4-531c-32f9e11d7474
Dec 13 19:42:44 srl-torel01 kernel: NVRM: Xid (PCI:0000:01:00): 69, pid=2391, Class Error: ChId 0018, Class 0000c197, Offset 00001614, Data 00000000, ErrorCode 0000000d
Dec 13 19:44:33 srl-torel01 kernel: NVRM: Persistence mode is deprecated and will be removed in a future release. Please use nvidia-persistenced instead.
Dec 13 20:39:36 srl-torel01 kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 440.44 Sun Dec 8 03:29:48 UTC 2019