Multiple CUDA/RTX/Vulkan applications crashing with Xid (13,109) errors

Unfortunately, the only open-source part of the driver is the kernel module, which is already used to some extent by nvk (not fully ready afaik, but it shows high potential).
According to your bug report we have the same GPU, so it is very bizarre that the Xid triggers without a logical explanation, hmm.
It might not help, and it will also prevent you from using DLSS, but have you tried hiding the NVIDIA GPU with this Proton variable: PROTON_HIDE_NVIDIA_GPU=1

I agree that it's not logical. For me this problem occurs with any game that slightly taxes my system. It feels like it should be more widespread than it is. It honestly surprises me that this thread isn't packed with "me too" responses.

At any rate, I've found no combination of Proton environment variables, or lack thereof, that helps with this issue.

nvidia-bug-report.log.gz (897.2 KB)

It's happening on Xorg directly for me; it doesn't seem linked to Proton at all.

I have about 50 hours played on WRC 23. It always worked fine with no problems (at least as far as this issue goes, and it worked better after the 1.4.0 patch regarding precompiled shaders). But that specific track triggers the Xid error (I got it about three times retrying that track). I can try to reproduce it and even record it externally, just in case. I'm not sure what triggers it; maybe it happens when rendering a specific frame, so a recording would show which frame it is and why it is different, and increase the chances of reproducing it consistently.

I think the hardware details should be in the nvidia log.

I'm on Exherbo (a Gentoo-like distro). It's a 1660 Ti (mobile) on an Acer Predator Helios 300 PH315-52-78VL, kernel 6.6.4, driver 545.29.06, 16 GB RAM, i7-9750H.

Not sure which details you want.

I'm using external devices: a Thrustmaster T300RS, the TH8A shifter and a handbrake from a local vendor. I can test without them connected too.

Edit: I've just reproduced it again. I did many tracks and can keep playing with no problems, except for this track, which triggers the Xid.

Steps to reproduce:

  • Create a custom rally
  • Select RALLYE MONTE-CARLO
  • Season: Spring
  • Add Stage
  • Select Les Borels 8,6km, and all stock options
  • Confirm, Confirm
  • Start
  • Select Subaru Impreza 1995
  • Play until XID happens

I'm not sure if it's a specific frame. I also think performance on this specific track is relatively poor.

Can you try Forza Horizon 5 instead? I think it will crash much more with this issue.

I don't have Forza Horizon 5. As for WRC 23, I'll try the stage and update later with the results.

I've just added clearer repro steps. I have two videos:

Game config

Gameplay until XID error:

This last video ends when OBS reports that the HVENC codec is taking too long.

Tried the stage with the same car etc. No issues; I can't replicate it.
I don't see any major performance fluctuations either. All stages have pretty similar framerates for me: crowded areas have slightly lower fps, forests and fields slightly higher.

Been playing the game for nearly 60 hours now with zero crashes/freezes.

Ryzen 5800X3D, RTX 3080 ... currently 535.43.20 Vulkan dev drivers, but I played with 545.29 as well.

Edit: what happens if you cap the power limit of the GPU a bit lower?
Maybe that stage also uses more CPU, you hit the limit of the laptop's power brick, and the driver then just gives up?
Random thought.

Managed to get it to trigger on arch with the latest Xorg and stock DWM with nothing but the latest firefox running. This cannot get more basic.

Dec 17 17:46:11 arch kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=544, name=Xorg, Ch 0000000a, errorString CTX SWITCH TIMEOUT, Info 0x34007
Dec 17 17:45:57 arch kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=544, name=Xorg, Ch 0000000a, errorString CTX SWITCH TIMEOUT, Info 0x34007
Dec 17 17:45:44 arch kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=544, name=Xorg, Ch 0000000a, errorString CTX SWITCH TIMEOUT, Info 0x34007
Dec 17 17:45:31 arch kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=544, name=Xorg, Ch 0000000a, errorString CTX SWITCH TIMEOUT, Info 0x34007
Dec 17 17:45:17 arch kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=544, name=Xorg, Ch 0000000a, errorString CTX SWITCH TIMEOUT, Info 0x34007
Dec 17 17:45:04 arch kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=544, name=Xorg, Ch 0000000a, errorString CTX SWITCH TIMEOUT, Info 0x34007
Dec 17 17:44:51 arch kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=544, name=Xorg, Ch 0000000a, errorString CTX SWITCH TIMEOUT, Info 0x34007
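
For anyone comparing logs across posts, here is a minimal Python sketch that pulls the fields out of Xid lines like the ones above and measures the time between repeats. The regular expression is an assumption based only on the line format quoted in this thread, and the helper names (parse_xid_lines, journal_gaps_seconds) are made up for illustration; they are not part of any NVIDIA tool.

```python
import re
from datetime import datetime

# Pattern inferred from the lines quoted in this thread (an assumption,
# not an official description of the NVRM log format).
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>\d+), name=(?P<name>[^,]+), Ch (?P<channel>[0-9a-fA-F]+)"
)


def parse_xid_lines(lines):
    """Return one dict per line that contains an Xid report."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match is None:
            continue  # not an Xid line
        event = match.groupdict()
        event["xid"] = int(event["xid"])
        event["pid"] = int(event["pid"])
        events.append(event)
    return events


def journal_gaps_seconds(lines):
    """Seconds between consecutive Xid lines with a 'Dec 17 17:46:11' style stamp.

    Lines without such a stamp (for example raw dmesg output) are skipped.
    """
    stamps = []
    for line in lines:
        if XID_RE.search(line) is None:
            continue
        try:
            # The stamp has no year; 2000 is only a placeholder so that
            # the date parses. Gaps are meaningful within one year.
            stamps.append(datetime.strptime("2000 " + line[:15], "%Y %b %d %H:%M:%S"))
        except ValueError:
            continue
    stamps.sort()
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]
```

Fed the seven Xorg lines above, journal_gaps_seconds returns gaps of 13 or 14 seconds, so the timeout repeats at a steady interval rather than at random moments.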

Thanks for your testing

Which Proton version?
Is it the latest WRC?
It should say 1.4.0 somewhere at game start.

It seems setting the power limit is locked on some driver versions:

sudo nvidia-smi --power-limit 75
Changing power management limit is not supported for GPU: 00000000:01:00.0.
Treating as warning and moving on.
All done.
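
Whether the limit is adjustable at all can be checked before trying to set it. A rough Python sketch, assuming nvidia-smi is on PATH and that the --query-gpu fields power.limit, power.min_limit and power.max_limit are supported by the installed driver; parse_power_csv and query_power_limits are illustrative names, not an existing API.

```python
import subprocess

# Fields requested from nvidia-smi; the last three are reported in watts.
QUERY_FIELDS = ["index", "name", "power.limit", "power.min_limit", "power.max_limit"]


def parse_power_csv(text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU.

    Values that are not numbers (a placeholder printed when the driver
    does not expose the field) become None.
    """
    rows = []
    for line in text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # unexpected line, skip it
        row = dict(zip(QUERY_FIELDS, values))
        for key in QUERY_FIELDS[2:]:
            try:
                row[key] = float(row[key])
            except ValueError:
                row[key] = None
        rows.append(row)
    return rows


def query_power_limits():
    """Run nvidia-smi and return the parsed power limits for every GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_power_csv(result.stdout)


# Usage on a machine with the driver installed:
#   for gpu in query_power_limits():
#       print(gpu)
```

If the min and max limits come back as None, or are equal, that would be consistent with the "not supported" message above, though this is only a guess about how the driver reports a locked limit.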

Related issues:

I'll see what I can do.

Would you mind sharing the result of running nvidia-bug-report.sh?

I can repro the Xid 109 error with the game Pioneers of Pagonia on an NVIDIA GeForce RTX 3070 with driver 535.146.02.
4425951 has been filed locally for tracking purposes.

I assume it has to be run in the same session that caused the crash? Since the crash makes my computer shut down, I'm not sure that's possible. If I reboot and run it then, is it still useful? I'll see what I can do next time it happens.
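
Separately from what the bug report script itself collects, the kernel messages from the crashed session can often be read back after a reboot. A small Python sketch, assuming a systemd system with a persistent journal (otherwise the previous boot is not kept) and enough permissions to read it; nvrm_lines and previous_boot_kernel_log are illustrative helper names.

```python
import subprocess


def nvrm_lines(kernel_log_text):
    """Keep only the kernel log lines tagged NVRM (NVIDIA kernel module)."""
    return [line for line in kernel_log_text.splitlines() if "NVRM:" in line]


def previous_boot_kernel_log():
    """Return kernel messages from the boot before the current one.

    'journalctl -k -b -1' selects kernel messages of the previous boot.
    This fails if the journal is not persistent or cannot be read.
    """
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Usage after rebooting from a crash:
#   for line in nvrm_lines(previous_boot_kernel_log()):
#       print(line)
```

Any Xid lines found this way can be pasted next to the bug report so the two can be matched up.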

I'm experiencing this issue on almost any access to the GPU. It started only this month, after some updates to my Ubuntu 22.04.1 (kernel 6.2.0-39). I don't play games but do AI work. My desktop has 2x 3090. With any Python access through conda, or even while starting PyCharm, Ubuntu (GNOME) freezes. Here is the snippet from dmesg:

[  884.191814] [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00005c00] Failed to grab modeset ownership
[  884.191876] [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00002100] Failed to grab modeset ownership
[  884.766884] retire_capture_urb: 43 callbacks suppressed
[ 2236.090323] NVRM: GPU at PCI:0000:5c:00: GPU-bf881eec-e206-0714-7afe-17c8cb11520c
[ 2236.090331] NVRM: Xid (PCI:0000:5c:00): 109, pid=11593, name=gnome-shell, Ch 00000010, errorString CTX SWITCH TIMEOUT, Info 0x3c007

[ 5257.450801] NVRM: Xid (PCI:0000:5c:00): 109, pid=11438, name=Xorg, Ch 00000018, errorString CTX SWITCH TIMEOUT, Info 0x11c003

[ 5499.487154] NVRM: Xid (PCI:0000:5c:00): 109, pid=11438, name=Xorg, Ch 00000008, errorString CTX SWITCH TIMEOUT, Info 0x11c002

[ 5768.720340] [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00005c00] Failed to grab modeset ownership
[ 5768.720457] [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00002100] Failed to grab modeset ownership
[ 5768.720546] [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00005c00] Failed to grab modeset ownership

My system is unusable now. I set CONDA_OVERRIDE_CUDA=12.2 to make conda work, but I am unable to make PyCharm work.

[40226.813612] NVRM: GPU at PCI:0000:01:00: GPU-3440fcd9-ad72-5684-052e-87619260bcbf
[40226.813615] NVRM: Xid (PCI:0000:01:00): 109, pid=55550, name=Warframe.x64.ex, Ch 0000003e, errorString CTX SWITCH TIMEOUT, Info 0x3c01b

Xid 109 is back for me: kernel 6.6.7, NVIDIA driver 545.29.06, on an RTX 2060 Super. This happens reliably after 10-20 minutes in games.

nvidia-bug-report.log.gz (929.8 KB)

Solved my problem, and it was not the NVIDIA card or driver. As some people mentioned, I did a clean OS install multiple times, but nothing worked; conda info still made the system freeze. The only change I had made to my system this month was adding a 10G dual-port NIC. I removed it from the system and everything works fine with no issues. The NIC installed in a PCIe x8 slot caused the system to freeze, and for some reason NVIDIA reported the Xid 109 error.

I have only an Nvidia card installed as PCIe, unless the NVMe SSD counts. I am, however, having better luck with the nvidia-open-dkms open-source kernel modules than with the proprietary driver. I haven't experienced a crash in over a day.

+1 on this issue.

Affects all graphics/CUDA workloads; I only recently realized this is what caused everything to crash. I can reliably trigger it with RE Village:
NVRM: Xid (PCI:0000:01:00): 109, pid=307016, name=re8.exe, Ch 00000046, errorString CTX SWITCH TIMEOUT, Info 0x2c022

Debian 12 bookworm
Linux 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux
Tested drivers: 525.147.05-4~deb12u1 from bookworm, 535.43.02-1 from experimental, and 545.23.08-1 installed from NVIDIA. All hang in exactly the same way; 545 seems to be the worst.
RTX 2070 Super
Solved by downgrading to the 470-tesla driver on kernel 6.1. It seems stable for now, but it is missing a lot of needed driver functionality...

I snapped this bug report during a big freeze moment, when the GPU locked up for a good ~20 seconds then recovered. Kernel 6.6.8, Nvidia driver 545.29.06, RTX 2060 Super

nvidia-bug-report.log.gz (423.0 KB)

Found that someone else (not me) has the same issue with the 520.56.06 driver, but on an RTX 4090. I can't confirm, as I don't have the hardware. It crashes the same way as the CUDA workloads here, though: Random CUBLAS_STATUS_INTERNAL_ERROR crashes during training with RTX 4090 - PyTorch Forums