Reproducible Xid 31 error

CPU: Ryzen 5800X
MB: X570 chipset
GPU: GTX 1660 Ti

OS: Fedora 36/Linux 5.19.4/X.org/NVIDIA 515.65.01/XFCE 4 without compositing

Steps to reproduce:

  1. Open Google Chrome
  2. Switch to a Linux console (Ctrl + Alt + F2), then switch between consoles for a while (Alt + F2/F3/F4/etc.)
  3. Return to X.org (Alt + F1/F7)

Result:

NVRM: GPU at PCI:0000:08:00: GPU-UID
NVRM: Xid (PCI:0000:08:00): 31, pid=169800, name=chrome, Ch 00000023, intr 00000000. MMU Fault: ENGINE CE0 HUBCLIENT_HSCE0 faulted @ 0x0_0abe0000. Fault is of type FAULT_PTE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:08:00): 31, pid=174421, name=chrome, Ch 00000023, intr 00000000. MMU Fault: ENGINE CE0 HUBCLIENT_HSCE0 faulted @ 0x0_0abe0000. Fault is of type FAULT_PTE ACCESS_TYPE_VIRT_READ

Nothing breaks as a result of this error, but I’d love not to get it anyway.

P.S. Mozilla Firefox is also running. Not sure if they are both needed to trigger the bug.

I have filed bug 3785853 internally for tracking purposes.
I will try to reproduce this locally and will get back to you if additional information is required.

Hi birdie. Can you please attach a bug report log? Also, what specific versions of Chrome and Firefox are you running?

nvidia-bug-report.log.bz2 (445.0 KB)

Firefox 104.0.2 (Official build)
Chrome Version 105.0.5195.102 (Official Build) (64-bit)

There’s an extra thing I’ve noticed.

After switching from console back to X.org, WebGL in Google Chrome crashes, e.g. this demo.

It looks like the NVreg_PreserveVideoMemoryAllocations setting is disabled, which is currently expected to result in errors in Chrome and Firefox when VT switching. Can you please try enabling vidmem preservation and trying again?

I know it seems counterintuitive, but having working vidmem preservation for suspend & resume allows the driver to use a lighter-weight mechanism for VT switching too.

I’ll try this, thanks. Will report back if it’s helped or not. I presume it’s an option for the kernel module and it must be NVreg_PreserveVideoMemoryAllocations=1?

Also, it looks like this option ought to be enabled by default. Why isn’t it?

It’s an nvidia.ko option. You can set it via a file in /etc/modprobe.d with

options nvidia NVreg_PreserveVideoMemoryAllocations=1

or (I think) by setting nvidia.NVreg_PreserveVideoMemoryAllocations=1 on the kernel command line.

It’s not enabled by default because it requires S3 suspend & resume to go through the nvidia-sleep.sh script. That works if the nvidia-suspend, nvidia-resume, and nvidia-hibernate systemd units are installed and enabled and the system is suspended through systemd, but not all Linux distributions use systemd and enabling vidmem preservation by default on those would cause suspend to fail and be perceived as a regression.

Distributions that install the driver through a package are in a better position to ensure that the appropriate systemd units are enabled and can be more confident about enabling it by default. Looking at your bug report log, it does look like the systemd units are already enabled and activating during suspend, so enabling NVreg_PreserveVideoMemoryAllocations=1 should work fine.
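
For anyone who wants to check this themselves rather than read it out of a bug report log, here is a minimal sketch. It assumes the three units named above are installed with a ".service" suffix; the function name check_nvidia_sleep_units is made up for this example.

```shell
# Sketch: report whether the NVIDIA sleep-related systemd units are enabled.
# Assumption: the units are named nvidia-suspend.service, nvidia-resume.service
# and nvidia-hibernate.service. The function name is hypothetical.
check_nvidia_sleep_units() {
    missing=0
    for unit in nvidia-suspend nvidia-resume nvidia-hibernate; do
        # "systemctl is-enabled" prints the state and exits non-zero
        # when the unit is not enabled, hence the "|| true".
        state=$(systemctl is-enabled "$unit.service" 2>/dev/null) || true
        if [ "$state" = "enabled" ]; then
            echo "$unit.service: enabled"
        else
            echo "$unit.service: ${state:-not found}"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling check_nvidia_sleep_units prints one line per unit and returns non-zero if any of them is not enabled, in which case turning on NVreg_PreserveVideoMemoryAllocations=1 is likely to break suspend.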

On a first attempt the error is gone but WebGL demos in Google Chrome still crash.

This is easy to reproduce:

  1. In Google Chrome open Shader - Shadertoy BETA
  2. Switch to Linux console
  3. Switch back to Xorg (XFCE, no compositing): Chrome pops up: WebGL has crashed.

Oh, and all Firefox client (HTML) areas are painted black and require you to minimize/maximize the web browser.

Can you please grep PreserveVideoMemoryAllocations /proc/driver/nvidia/params and check that it’s actually set to 1 now?

grep PreserveVideoMemoryAllocations /proc/driver/nvidia/params
PreserveVideoMemoryAllocations: 0

And that’s weird because I have this:

cat /etc/modprobe.d/nvidia.conf
#WRONG options nvidia-drm modeset=1 NVreg_PreserveVideoMemoryAllocations=1

Damn, I’m stupid, this is wrong.

It should have been:

options nvidia-drm modeset=1
options nvidia NVreg_PreserveVideoMemoryAllocations=1
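
A mistake like this can be caught mechanically. The sketch below assumes that every NVreg_* parameter belongs to the nvidia module and flags any "options" line that hands one to a different module; the function name find_misplaced_nvreg is made up for this example.

```shell
# Sketch: flag NVreg_* parameters passed to a module other than "nvidia".
# Assumption: all NVreg_* options belong to nvidia.ko. Commented-out lines
# are skipped because their first field is not "options".
find_misplaced_nvreg() {
    awk '
        $1 == "options" && $2 != "nvidia" {
            for (i = 3; i <= NF; i++)
                if ($i ~ /^NVreg_/)
                    printf "%s:%d: %s is set on module %s, expected nvidia\n",
                           FILENAME, FNR, $i, $2
        }
    ' "$@"
}

# Example: find_misplaced_nvreg /etc/modprobe.d/*.conf
```

No output means nothing suspicious was found. The check in /proc/driver/nvidia/params after reloading the module or rebooting is still the authoritative one.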

I will retest ASAP. Sorry :-(

Too many modules and options nowadays, Aaron :-) I remember back in the day there was just nvidia.ko and no options at all.

Edit: everything seemingly works. Thanks a ton!

Haha, the sad part is I’ve done that twice and I should know better. I’m glad to hear it’s working now, thanks for confirming.

My final questions:

  1. I don’t think Windows drivers use any storage for suspend/resume, yet the Linux drivers need this feature. How come Windows doesn’t fall apart on suspend/resume?

  2. My GPU supports Video Memory Self Refresh, which means I can enable NVreg_EnableS0ixPowerManagement=1. What are the pitfalls of using it if I don’t care about the increased power consumption in suspend? I don’t think it’s more than 1W for my GTX 1660 Ti.

I’m not familiar with the details so take this with a grain of salt, but my understanding is that on Windows, the WDDM core code takes care of saving and restoring video memory data itself.

This vidmem preservation code migrates all video memory contents to system memory that is (depending on the filesystem specified by NVreg_TemporaryFilePath) backed by physical storage pages. So if you have a lot more vidmem allocated than you have free system RAM to store it, and your filesystem is backed by a slow disk, it could make suspend & resume significantly slower. On the other hand, this mechanism also works with system hibernate since the vidmem contents are copied to non-volatile storage.
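
As a rough sanity check of the space concern, the sketch below compares an amount of video memory (in MiB) against the free space on the filesystem that would hold the temporary file. The function name vidmem_fits is made up, /tmp as the location is an assumption (use whatever NVreg_TemporaryFilePath points at), and the "used" figure from nvidia-smi is only an approximation of what the driver actually has to save.

```shell
# Sketch: does USED_MIB of video memory fit in the free space under PATH?
# The function name is hypothetical.
vidmem_fits() {
    used_mib=$1
    path=$2
    # "df -Pk" prints one POSIX-format line per filesystem in 1024-byte
    # blocks; the fourth column is the available space.
    avail_kib=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
    [ "$((used_mib * 1024))" -le "$avail_kib" ]
}

# Example (needs the NVIDIA driver; /tmp is an assumed location):
#   used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1)
#   vidmem_fits "$used" /tmp && echo "should fit" || echo "may not fit"
```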

S0ix should achieve the same end result without having to transfer anything by keeping the GPU’s video RAM contents where they are. So if the extra power consumption is okay with you, then you should be able to achieve faster suspend / resume times.
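
Before switching to S0ix, it may be worth checking the prerequisites. The sketch below reflects my reading of the driver README, so treat the details as assumptions: the GPU's power file under /proc/driver/nvidia/gpus/ should report Video Memory Self Refresh as supported, and the kernel's selected suspend mode in /sys/power/mem_sleep should be s2idle. The function name s0ix_ready is made up for this example.

```shell
# Sketch: check two assumed prerequisites for NVreg_EnableS0ixPowerManagement=1.
#   $1: the GPU's power file, e.g. /proc/driver/nvidia/gpus/<pci address>/power
#   $2: the mem_sleep file (defaults to /sys/power/mem_sleep)
# The function name is hypothetical.
s0ix_ready() {
    power_file=$1
    mem_sleep_file=${2:-/sys/power/mem_sleep}
    # The GPU must advertise video memory self refresh ...
    grep -q 'Video Memory Self Refresh:[[:space:]]*Supported' "$power_file" &&
        # ... and s2idle must be the selected (bracketed) suspend mode.
        grep -q '\[s2idle\]' "$mem_sleep_file"
}
```

If s0ix_ready returns non-zero, sticking with NVreg_PreserveVideoMemoryAllocations=1 alone is probably the safer choice.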
