Sudden black screen, massive number of Xid errors on new GeForce 3090 GPU

Hello,

I’m having problems with my NVIDIA GeForce 3090 FE. My monitor suddenly goes dark and nothing but a hard reset fixes the problem. This has happened twice in the last two weeks during normal web browsing; the GPU was not under any stress. The most recent occurrence was today, and I’m starting to get a bit nervous because the problem keeps repeating.

Upon rebooting and running “nvidia-bug-report.sh” and extracting the log file, I’m getting a massive “nvidia-bug-report.log” file with over one million lines of this Xid error repeated:
“Jul 04 00:03:42 rockio kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=173078, Ch 0000003f”

In particular, I’m not sure what “Xid (PCI:0000:01:00): 45” is telling me. Is this a critical error? Why so many of them?

This being a relatively new and expensive product, I’m wondering if this is a hardware issue that merits an RMA. Any feedback would be greatly appreciated.

Thanks for reading!

My specs are:
Product Name : NVIDIA GeForce RTX 3090
Driver Version : 465.27
CUDA Version : 11.3
Memory : 32 GB RAM
SSD Drive : 2 TB, over 1 TB free
CPU : Intel Core i7 10700k
OS : Ubuntu 20.04

nvidia-bug-report.log.gz (3.5 MB)

XID 45 is only a subsequent error; the real errors that trigger this are XID 31,62 and 32. This points to something memory-related, but which source it comes from is pure guesswork. First of all, before the crash there were a lot of suspend/resume cycles so maybe gpu memory got corrupted on one of those. Did the earlier crash also happen after a suspend?
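
To see which Xid codes are in a log of this size, and which one shows up first, a short tally script can help. This is only a minimal sketch: the regular expression is inferred from the three log lines quoted in this thread, the function name `summarize_xids` is made up for illustration, and the order of the sample lines is arbitrary rather than taken from the real log.

```python
import re
from collections import Counter

# Pattern inferred from lines quoted in this thread, e.g.
#   NVRM: Xid (PCI:0000:01:00): 45, pid=173078, Ch 0000003f
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+),")


def summarize_xids(lines):
    """Return (counts, first_seen) for the Xid codes found in log lines.

    counts     -- Counter mapping Xid code -> number of occurrences
    first_seen -- list of Xid codes in order of first appearance
    """
    counts = Counter()
    first_seen = []
    for line in lines:
        match = XID_RE.search(line)
        if match is None:
            continue  # not an Xid line
        code = int(match.group("code"))
        if code not in counts:
            first_seen.append(code)
        counts[code] += 1
    return counts, first_seen


# Sample built from the lines quoted in this thread (order is illustrative).
sample = [
    "NVRM: Xid (PCI:0000:01:00): 62, pid=0, 0000(0000) 00000000 00000000",
    "NVRM: Xid (PCI:0000:01:00): 31, pid=126815, Ch 00000063, intr 00000000.",
    "Jul 04 00:03:42 rockio kernel: NVRM: Xid (PCI:0000:01:00): 45, pid=173078, Ch 0000003f",
]
counts, first_seen = summarize_xids(sample)
print(dict(counts), first_seen)

# On the real file, something like this should work:
#   with open("nvidia-bug-report.log", errors="replace") as f:
#       counts, first_seen = summarize_xids(f)
```

If Xid 45 dominates the counts but other codes appear first, that would fit the idea that 45 is a follow-on error.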

“First of all, before the crash there were a lot of suspend/resume cycles so maybe gpu memory got corrupted on one of those”

This is possible. I generally suspend/resume my computer and only reboot about once a month for software updates. The crashes do happen after a suspend, but not immediately after. I resume the PC in the morning and suspend it at night. The crashes so far have happened at night, after the computer has been on for the entire day.

“the real errors that trigger this are XID 31,62 and 32”

Oh, good catch! I’m seeing Xid 31 and 32 errors in the log file as well:

NVRM: Xid (PCI:0000:01:00): 62, pid=0, 0000(0000) 00000000 00000000

NVRM: Xid (PCI:0000:01:00): 31, pid=126815, Ch 00000063, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x1_b0fde000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

I’m seeing an actual error description: “FAULT_PDE ACCESS_TYPE_VIRT_READ”. Could this indicate a hardware error? Or is it more likely to be a driver issue?
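
For anyone who wants to group these faults by engine or fault type, the Xid 31 line can be split into named fields. This is a rough sketch based only on the single line quoted above: the field names and the helper `parse_mmu_fault` are hypothetical, and treating the underscore in the address as a separator between the high and low 32 bits is an assumption. Judging by the names alone, FAULT_PDE looks like a page directory entry fault and ACCESS_TYPE_VIRT_READ like a read through a virtual address, but that is a reading of the labels, not something confirmed here.

```python
import re

# Pattern inferred from the one Xid 31 line quoted in this thread.
MMU_RE = re.compile(
    r"MMU Fault: ENGINE (?P<engine>\S+) (?P<client>\S+) "
    r"faulted @ (?P<addr>0x[0-9a-fA-F_]+)\. "
    r"Fault is of type (?P<fault_type>\S+) (?P<access_type>\S+)"
)


def parse_mmu_fault(line):
    """Return a dict of MMU fault fields, or None if the line does not match."""
    match = MMU_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    # Assumption: "0x1_b0fde000" is one address with "_" as a visual separator.
    info["address"] = int(info.pop("addr").replace("_", ""), 16)
    return info


line = (
    "NVRM: Xid (PCI:0000:01:00): 31, pid=126815, Ch 00000063, intr 00000000. "
    "MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x1_b0fde000. "
    "Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ"
)
info = parse_mmu_fault(line)
print(info["engine"], info["client"], hex(info["address"]),
      info["fault_type"], info["access_type"])
```

Tallying the parsed `engine` and `fault_type` values over the whole log would show whether the faults always come from the same client or are spread around.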

Can be a hardware issue but from the errors you’re getting, I’d say very unlikely.
Please check for a BIOS update first.
Since you say it happened while web browsing, this might just be the known driver bug involving Chrome with hardware acceleration enabled. Please check.

“Can be a hardware issue but from the errors you’re getting, I’d say very unlikely.”

It’s good to hear this.

Upon further recollection, I may have been running a Gazebo simulator in the background. It wasn’t too heavy on the GPU, but it was most likely using a significant amount of memory. I’m starting to lean towards some CUDA-related issue.

I think I’ll upgrade my BIOS, check drivers, and keep closer notes on when/if this happens again.

Thanks so much for your help, generix.
