Will the FAULT_PDE ACCESS_TYPE_READ bug in the Nvidia driver ever be fixed?

I’m not even sure that Nvidia is aware of this, but there is a big community behind Valve’s new Proton software, which lets you play Windows games on Linux. Sadly, the Nvidia driver on Linux is still not stable enough for many of those games that would work easily on AMD. Neither the author of DXVK (a DirectX 11 compatibility layer for Linux) nor other community members will ever be able to fix the bugs in Nvidia’s driver. The community has been waiting for over a year now, but as far as I know there has never been an actual acknowledgement from Nvidia. Many bug reports have been opened on the GitHub issue tracker, but there is nothing we can do as long as Nvidia is not willing to help.

So my questions are: Is Nvidia aware of this? If yes, why has there been no progress over the last year? Is there even someone working on fixing the segmentation faults in the driver? Or does Nvidia simply not care enough about the gaming/Linux community?

I hope that some day Nvidia will be as stable as AMD is on Linux, since I’d like to continue buying your hardware. However, if there is no response or progress after more than a year, I guess it’s my own fault for thinking I can use your hardware on Linux in the first place.

Github issues for reference:

And here’s the error message that the driver writes into dmesg:
NVRM: Xid (PCI:0000:09:00): 31, Ch 0000004b, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x0_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

Note that this error freezes the whole system. The only way to get back to a working desktop is to SSH into the PC and kill the process that uses the Nvidia driver.

We’ve been dealing with this issue for over two years now on our ffmpeg transcoding servers running Linux. It’s to the point that we’ve scripted a way to monitor /var/log/dmesg for the kernel fault and do a hard reset on the whole server as soon as it happens. It’s forced us to migrate to multiple nodes in the swarm to ‘somewhat’ tolerate this crash and the data loss associated with it. Pretty ridiculous when we run a small GPU cluster with a mix of Quadro P4000s and P5000s. If AMD had any sort of decent support in ffmpeg we would’ve moved over, but currently we’re stuck. I pray for a driver fix daily when I get alerts that a driver fault occurred and the server has been force restarted.

NVRM: Xid (PCI:0000:03:00): 31, Ch 00000018, engmask 00008100, intr 10000000. MMU Fault: ENGINE NVDEC HUBCLIENT_NVDEC faulted @ 0xff_fffff000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

There’s been another report here as well - https://devtalk.nvidia.com/default/topic/1042835/linux/nvidia-docker-based-host-hangs-when-gpu-memory-exceeded-with-ffmpeg-transcodes/post/5289719/
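
For anyone who wants to automate the kind of log monitoring described above, here is a minimal Python sketch. It is not the script used on those servers; the regular expression is modeled only on the Xid 31 lines quoted in this thread, and the names `XID_MMU_FAULT`, `parse_xid_fault`, and `watch` are made up for illustration. What to do on a hit (alert, kill a process, reset the machine) is left to a callback you supply.

```python
import re
from typing import Callable, Iterable, Optional

# Pattern modeled on the Xid 31 lines quoted in this thread. The pid= field
# is optional because only some of the quoted lines contain it.
XID_MMU_FAULT = re.compile(
    r"Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),"
    r"(?: pid=(?P<pid>\d+),)?"
    r" Ch (?P<channel>[0-9a-fA-F]+),"
    r".*?MMU Fault: ENGINE (?P<engine>\S+) (?P<client>\S+)"
    r" faulted @ (?P<address>0x[0-9a-fA-F_]+)\."
    r" Fault is of type (?P<fault>FAULT_\w+) (?P<access>ACCESS_TYPE_\w+)"
)


def parse_xid_fault(line: str) -> Optional[dict]:
    """Return the fields of an Xid MMU-fault line, or None if it is not one."""
    match = XID_MMU_FAULT.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["xid"] = int(fields["xid"])
    if fields["pid"] is not None:
        fields["pid"] = int(fields["pid"])
    # The log prints the address as 0x<high>_<low>; drop the underscore.
    fields["address"] = int(fields["address"].replace("_", ""), 16)
    return fields


def watch(lines: Iterable[str], on_fault: Callable[[dict], None]) -> None:
    """Call on_fault(fields) for every MMU-fault line seen in lines."""
    for line in lines:
        fault = parse_xid_fault(line)
        if fault is not None:
            on_fault(fault)
```

One way to use it is to pass `watch` the standard output of a `dmesg --follow` subprocess, or the lines of a saved log file, together with a callback that logs the PCI address and engine before taking whatever recovery action suits your setup.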

It has been almost two months and no developer has answered.
Is this really the kind of support we should expect?
Anyway, I join the question.

nvidia-bug-report.log.gz (1.13 MB)

We’re not shoveling millions of dollars to Nvidia, so they couldn’t care less. Pretty unfortunate, as we’ve spent over $50k on GPUs during the last two years…

Currently in development for moving to AMD.

I asked around and it sounds like this issue is being investigated and tracked in bug number 2432712.

tugohugo, your issue sounds different. Do you have a bug number associated with your problem? If not, please file one through the partner site. (If you’re not set up to file bugs I can put you in touch with the developer relations folks)

Great to hear that this bug is already tracked! Is there a public bug tracker where the progress is listed? I couldn’t find one.

The bug tracker is not public, sorry. This particular bug is still open for investigation.

Hi,
I found this bug on a number of our Linux workstations using Nvidia cards.
The error does not seem to be triggered by any program or operation in particular, although we run several OpenGL applications.

At the first occurrence of the error in the syslog (i.e., it would appear in dmesg) the Xorg server is in an unstable state, and all unsaved work is basically lost.

We tried switching cards (three so far), suspecting hardware issues, but the problem persisted.
The latest driver we tested was 418.74.

The Nvidia drivers under Linux are in a terribly sad state, after being rather reliable for some years.

Please let us know if we can provide more feedback to speed up the solution of the problem.

I encounter the same issue in VLC.
In other players like QMPlay2 and Kodi there is no issue.

Turn off triple buffering and paste this into xorg.conf: Option "metamodes" "nvidia-auto-select +0+0 {ForceCompositionPipeline=On, ForceFullCompositionPipeline=On}"
Another point: I disabled the serial port in the BIOS.
No more freezes, no more headaches.

Hi there, I am having the same random problems, with no particular app triggering them. This is with the latest Linux kernel v5.9.10 on Ubuntu 20.04 with Xorg and Ubuntu Budgie, and the 455.45.01 driver. It’s been happening ever since upgrading from 19.10 to 20.04, and with these newer Nvidia driver versions.

My hardware is a GT 1030 (Pascal) + 8700K. I have not been able to get into my BIOS to check the serial port setting yet, because my Apple keyboard isn’t recognized at boot time.

Please keep us updated on this issue, and if you can, tell us what hardware and software you are trying to reproduce it with. Can you reproduce this bug reliably internally, and narrow it down by regression testing previous versions? Thanks.

Similar problem exists in Cyberpunk 2077 running under Steam Proton on 455.46.02 drivers (also occurred on 455.45.01).

dmesg:

[ 3198.971541] NVRM: Xid (PCI:0000:01:00): 31, pid=86684, Ch 00000046, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x1_f4fd5000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
# and another ...
 Xid (PCI:0000:01:00): 31, pid=291362, Ch 0000004e, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x2_0db70000. Fault is of type FAULT_PTE ACCESS_TYPE_READ

I am happy to provide further info if it will help, but I’m not sure if I can get the game to run with debug instrumentation (e.g. cuda-memcheck suggested in documentation).

Same problem here, running nvCaffe on a pair of 2080 Ti cards, Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-118-generic x86_64).

Problem only started when I updated to NVIDIA-Linux-x86_64-450.80.02 driver.

Kubuntu 20.04 LTS, using nvidia proprietary driver 460.32.03

I was watching videos on Plex, using the web player in the Brave web browser, when my system locked up completely. I was unable to SSH into it from another system, so had to press the reset button.

When I logged back in, I checked the crash logs, and found this…
NVRM: Xid (PCI:0000:01:00): 31, pid=278, Ch 00000002, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_HOST faulted @ 0x10_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
NVRM: Xid (PCI:0000:01:00): 31, pid=5096, Ch 00000050, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_HOST faulted @ 0x21_02b07000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

I’ve been messing with Linux for a good while now (a few years), and have never seen this before. A Google search brought me here. I’m using a GeForce RTX 2080 SUPER. I was trying to migrate to Linux from Windows, but if my expensive hardware won’t work there because of bad drivers, I’m forced to stick with Windows.

Same problem here on a Lenovo ThinkPad P53, RHEL8.3
NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2

Most videos in Firefox cause a system hang. Sometimes they cause a full lockup requiring me to kill X. Crash logs show the same error.

NVRM: Xid (PCI:0000:01:00): 31, pid=119156, Ch 00000079, intr 00000000. MMU Fault: ENGINE NVDEC0 HUBCLIENT_NVDEC0 faulted @ 0x1_04c43000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ