Multiple CUDA/RTX/Vulkan applications crashing with Xid (13,109) errors

kodatarule · December 17, 2023, 7:56am

Unfortunately, the only open-source part is the kernel module of the driver, which is already somewhat utilized by NVK (not fully ready afaik, but it shows high potential).
According to your bug report we have the same GPU, so it is very bizarre that the Xid triggers without a logical explanation, hmm.
It might not help, and it will also prevent you from using DLSS, but have you tried hiding the NVIDIA GPU with this variable for Proton: PROTON_HIDE_NVIDIA_GPU=1

oceand · December 17, 2023, 2:20pm

I agree that it’s not logical. For me this problem occurs with any game that slightly taxes my system. It feels like it should be more widespread than it is. It honestly surprises me that this thread isn’t packed with “me too” responses.

At any rate, I’ve found no combination of Proton environment variables, or lack thereof, that helps with this issue.

nvidia-bug-report.log.gz (897.2 KB)

rektek249 · December 17, 2023, 4:28pm

It’s happening on Xorg directly for me; it doesn’t seem linked to Proton at all.

snorkellingcactus · December 17, 2023, 6:40pm

I have about 50 hours played on WRC 23. It always worked fine with no problems (well, restricting the scope to this issue, and it worked better after the 1.4.0 patch regarding precompiled shaders). But that specific track triggers the Xid error (I got it about three times retrying that track). I can try to reproduce it and even record it externally, just in case. I’m not sure what triggers it; maybe it happens when rendering a specific frame, so a recording would show which frame it is and why it is different, and increase the chances of reproducing it consistently.

I think the hardware details should be in the nvidia log.

I’m on Exherbo (a Gentoo-like distro). It’s a 1660 Ti (mobile) on an Acer Predator Helios 300 PH315-52-78VL, kernel 6.6.4, driver 545.29.06, 16 GB RAM, i7-9750H.

Not sure which details you want.

I’m using external devices like a Thrustmaster T300RS, the TH8A shifter and a handbrake from a local provider. I can test without them connected too.

Edit: I’ve just reproduced it again. I did many tracks and kept playing with no problems at all, except for this track, which triggers the Xid.

Steps to reproduce:

1. Create a custom rally
2. Select RALLYE MONTE-CARLO
3. Season: Spring
4. Add Stage
5. Select Les Borels 8,6km, and all stock options
6. Confirm, Confirm
7. Start
8. Select Subaru Impreza 1995
9. Play until XID happens

I’m not sure if it’s a specific frame. I also think performance on this specific track is relatively poor

snorkellingcactus · December 17, 2023, 6:43pm

Can you try Forza Horizon 5 instead? I think it will crash much more often with this issue.

kodatarule · December 17, 2023, 7:33pm

I don’t have Forza Horizon 5, as for the WRC 23, I’ll try the stage and update later with the results.

snorkellingcactus · December 17, 2023, 7:39pm

I’ve just added clearer repro steps. I have two videos:

Game config

Gameplay until XID error:

This last video ends when OBS reports that the HVENC codec is taking too long.

Xpander · December 17, 2023, 10:09pm

Tried the stage with the same car etc… No issues. Can’t replicate it.
I don’t see any major performance fluctuations either. All stages have pretty similar framerates for me. Crowded areas with a bit less fps and forests/fields more.

Been playing the game for nearly 60 hours now with zero crashes/freezes.

Ryzen 5800X3D, RTX 3080 … currently 535.43.20 vulkan dev drivers but played with 545.29 also

Edit: what happens if you cap the power limit of the GPU a bit lower?
Maybe that stage also uses more CPU, so you hit the laptop power brick’s limit and the driver then just gives up?
Random thought.

rektek249 · December 17, 2023, 11:08pm

Managed to get it to trigger on Arch with the latest Xorg and stock DWM, with nothing but the latest Firefox running. This cannot get more basic.

Dec 17 17:46:11 arch kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=544, name=Xorg, Ch 0000000a, errorString CTX SWITCH TIMEOUT, Info 0x34007
Dec 17 17:45:57 arch kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=544, name=Xorg, Ch 0000000a, errorString CTX SWITCH TIMEOUT, Info 0x34007
Dec 17 17:45:44 arch kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=544, name=Xorg, Ch 0000000a, errorString CTX SWITCH TIMEOUT, Info 0x34007
Dec 17 17:45:31 arch kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=544, name=Xorg, Ch 0000000a, errorString CTX SWITCH TIMEOUT, Info 0x34007
Dec 17 17:45:17 arch kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=544, name=Xorg, Ch 0000000a, errorString CTX SWITCH TIMEOUT, Info 0x34007
Dec 17 17:45:04 arch kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=544, name=Xorg, Ch 0000000a, errorString CTX SWITCH TIMEOUT, Info 0x34007
Dec 17 17:44:51 arch kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=544, name=Xorg, Ch 0000000a, errorString CTX SWITCH TIMEOUT, Info 0x34007
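
The log lines above all share one layout, so they can be summarized with a short script. What follows is a minimal sketch, assuming the journal format shown in this post. The regular expression, the `parse_xid_line` and `syslog_time` helpers, and the hard-coded year are illustrative assumptions and not part of any NVIDIA tool; other Xid codes may use a different tail than the one matched here.

```python
import re
from datetime import datetime

# Pattern for "NVRM: Xid" lines as they appear in this thread (assumption:
# the optional tail matches Xid 109 style lines; other codes may differ).
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>\d+), name=(?P<name>[^,]+), Ch (?P<channel>[0-9a-fA-F]+)"
    r"(?:, errorString (?P<error>.+?), Info (?P<info>0x[0-9a-fA-F]+))?"
)


def parse_xid_line(line):
    """Return a dict of fields for an NVRM Xid line, or None if no match."""
    m = XID_RE.search(line)
    if m is None:
        return None
    event = m.groupdict()
    event["xid"] = int(event["xid"])
    event["pid"] = int(event["pid"])
    return event


def syslog_time(line, year=2023):
    """Parse the leading 'Dec 17 17:46:11' stamp; syslog omits the year."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


# The seven timestamps from the post; the rest of each line is identical.
STAMPS = [
    "Dec 17 17:46:11", "Dec 17 17:45:57", "Dec 17 17:45:44",
    "Dec 17 17:45:31", "Dec 17 17:45:17", "Dec 17 17:45:04",
    "Dec 17 17:44:51",
]
TAIL = (
    " arch kernel: NVRM: Xid (PCI:0000:01:00): 109, pid=544, name=Xorg, "
    "Ch 0000000a, errorString CTX SWITCH TIMEOUT, Info 0x34007"
)
lines = [stamp + TAIL for stamp in STAMPS]

# Keep only lines that parse as Xid events, oldest first.
times = sorted(syslog_time(l) for l in lines if parse_xid_line(l))
# Seconds between consecutive events.
gaps = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
print(parse_xid_line(lines[0]))
print(gaps)
```

For the lines in this post, the gaps come out at 13 to 14 seconds each, so the timeout recurs at a steady interval on the same channel rather than at random.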

snorkellingcactus · December 18, 2023, 12:17am

Thanks for your testing

Which Proton version?
Is it the latest WRC?
It should say 1.4.0 somewhere at game start.

It seems that setting the power limit is locked on some driver versions:

sudo nvidia-smi --power-limit 75
Changing power management limit is not supported for GPU: 00000000:01:00.0.
Treating as warning and moving on.
All done.
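
Before retrying, it may help to check whether the driver reports any adjustable power range at all. This is a minimal sketch, assuming `nvidia-smi` is on the PATH and accepts the `power.*` query fields listed below. `parse_power_row` and `query_power` are hypothetical helper names, and the sample rows are made up for illustration rather than captured from a real system. Any non-numeric placeholder in the output is treated as "not reported".

```python
import subprocess

# Query fields passed to nvidia-smi --query-gpu (values are in watts).
QUERY_FIELDS = ["power.draw", "power.limit", "power.min_limit", "power.max_limit"]


def parse_power_row(row):
    """Turn one CSV row into {field: float or None}.

    Non-numeric cells (placeholders for unsupported fields) become None.
    """
    values = {}
    for field, raw in zip(QUERY_FIELDS, row.split(",")):
        try:
            values[field] = float(raw.strip())
        except ValueError:
            values[field] = None
    return values


def query_power(gpu_index=0):
    """Run nvidia-smi for one GPU and parse the first output row."""
    out = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            f"--query-gpu={','.join(QUERY_FIELDS)}",
            "--format=csv,noheader,nounits",
        ],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_power_row(out.strip().splitlines()[0])


def limit_is_adjustable(values):
    """True only if the driver reports a real min/max range."""
    lo, hi = values["power.min_limit"], values["power.max_limit"]
    return lo is not None and hi is not None and lo < hi


# Hypothetical rows, for illustration only (not real measurements).
locked = parse_power_row("45.12, [N/A], [N/A], [N/A]")
unlocked = parse_power_row("120.50, 250.00, 100.00, 300.00")
print(limit_is_adjustable(locked), limit_is_adjustable(unlocked))
```

On a machine with the driver installed, `limit_is_adjustable(query_power())` gives a quick yes/no before attempting `--power-limit`.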

Related issues:

github.com/NVIDIA/open-gpu-kernel-modules

Unable to change power limit with nvidia-smi

opened 03:52PM - 01 Apr 23 UTC

machinedgod

bug

### NVIDIA Open GPU Kernel Modules Version
530.41.03-1

### Does this happen with the proprietary driver (of the same version) as well?
Yes

### Operating System and Version
Arch Linux

### Kernel Release
Linux 6.2.8-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 22 Mar 2023 22:52:35 +0000 x86_64 GNU/Linux

### Hardware: GPU
GPU 0: NVIDIA GeForce RTX 2060 (UUID: GPU-2f685ce6-33f4-db75-ce05-81d1723a6ddb)

### Describe the bug
Before recent update, I was able to execute:
```
~$ sudo nvidia-smi --power-limit 60
```
and have it work as expected. After update, this is the output:
```
Changing power management limit is not supported for GPU: 00000000:01:00.0.
Treating as warning and moving on.
All done.
```
Changing power limits caused observable changes both in temperature and in performance, so I am pretty sure my GPU supports it.

For context: I found that default power limit of 80W tends to heat up the GPU enough that it starts throttling itself and cause stuttering - 60W seemed to work perfectly and keep it under 63C during everything, without boosting my fan. The computer itself is a laptop (which explains issues with heat dissipation), a Lenovo Legion Y740, and I have a pretty good cooling pad to help out.

### To Reproduce
```
~# nvidia-smi --power-limit 60
```

### Bug Incidence
Always

### nvidia-bug-report.log.gz
[nvidia-bug-report.log.gz](https://github.com/NVIDIA/open-gpu-kernel-modules/files/11129980/nvidia-bug-report.log.gz)

### More Info
Just a bit more context: you may notice that in my Xorg config, I use option to skip EDID check for HDMI-0 output - the reason why, is because either driver doesn't recognize my (about 7y old) monitor, or the checksum that monitor outputs is invalid, and then it wouldn't let me use FullHD resolution on that monitor. No idea if this is in any way related (I presume its) not, but this setup worked for over 5-6 months, so I doubt its related.

I’ll see what I can do.

snorkellingcactus · December 18, 2023, 12:19am

Would you mind sharing the result of running nvidia-bug-report.sh?

amrits · December 18, 2023, 3:25pm

I can repro the Xid 109 error with the game Pioneers of Pagonia on an NVIDIA GeForce RTX 3070 with driver 535.146.02.
4425951 has been filed locally for tracking purposes.

rektek249 · December 18, 2023, 9:15pm

I assume it has to be run in the same session that caused the crash? Since the crash causes my computer to shut down, I’m not sure that’s possible. If I reboot and then run it, is it still useful? I’ll see what I can do next time it happens.

rajabm · December 19, 2023, 3:23pm

I’m experiencing this issue on almost any access to the GPU. It started only this month, after some updates to my Ubuntu 22.04.1 (kernel 6.2.0-39). I don’t play games but do AI work. My desktop has 2x 3090. With any Python access through conda, or even while starting PyCharm, Ubuntu (GNOME) freezes. Here is the snippet from dmesg:

[  884.191814] [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00005c00] Failed to grab modeset ownership
[  884.191876] [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00002100] Failed to grab modeset ownership
[  884.766884] retire_capture_urb: 43 callbacks suppressed
[ 2236.090323] NVRM: GPU at PCI:0000:5c:00: GPU-bf881eec-e206-0714-7afe-17c8cb11520c
[ 2236.090331] NVRM: Xid (PCI:0000:5c:00): 109, pid=11593, name=gnome-shell, Ch 00000010, errorString CTX SWITCH TIMEOUT, Info 0x3c007

[ 5257.450801] NVRM: Xid (PCI:0000:5c:00): 109, pid=11438, name=Xorg, Ch 00000018, errorString CTX SWITCH TIMEOUT, Info 0x11c003

[ 5499.487154] NVRM: Xid (PCI:0000:5c:00): 109, pid=11438, name=Xorg, Ch 00000008, errorString CTX SWITCH TIMEOUT, Info 0x11c002

[ 5768.720340] [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00005c00] Failed to grab modeset ownership
[ 5768.720457] [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00002100] Failed to grab modeset ownership
[ 5768.720546] [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00005c00] Failed to grab modeset ownership

My system is unusable now. I set CONDA_OVERRIDE_CUDA=12.2 to make conda work, but I am unable to make PyCharm work.
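
The dmesg excerpt in this post mixes Xid events with unrelated messages, and the Xid events come from more than one process. Here is a minimal sketch of filtering and grouping them, assuming the bracketed uptime format shown above. `DMESG_XID_RE` and `xid_events` are hypothetical names, and only a subset of the posted lines is included.

```python
import re
from collections import Counter

# Matches dmesg-style lines: "[ uptime] NVRM: Xid (PCI:...): N, pid=..., name=...,"
DMESG_XID_RE = re.compile(
    r"^\[\s*(?P<uptime>\d+\.\d+)\] NVRM: Xid \(PCI:[^)]+\): (?P<xid>\d+), "
    r"pid=(?P<pid>\d+), name=(?P<name>[^,]+),"
)

# Lines copied from the excerpt in this post.
dmesg_lines = [
    "[  884.191814] [drm:nv_drm_master_set [nvidia_drm]] *ERROR* [nvidia-drm] "
    "[GPU ID 0x00005c00] Failed to grab modeset ownership",
    "[  884.766884] retire_capture_urb: 43 callbacks suppressed",
    "[ 2236.090331] NVRM: Xid (PCI:0000:5c:00): 109, pid=11593, name=gnome-shell, "
    "Ch 00000010, errorString CTX SWITCH TIMEOUT, Info 0x3c007",
    "[ 5257.450801] NVRM: Xid (PCI:0000:5c:00): 109, pid=11438, name=Xorg, "
    "Ch 00000018, errorString CTX SWITCH TIMEOUT, Info 0x11c003",
    "[ 5499.487154] NVRM: Xid (PCI:0000:5c:00): 109, pid=11438, name=Xorg, "
    "Ch 00000008, errorString CTX SWITCH TIMEOUT, Info 0x11c002",
]


def xid_events(lines):
    """Yield (uptime_seconds, xid, process_name) for each Xid line."""
    for line in lines:
        m = DMESG_XID_RE.match(line)
        if m:
            yield float(m["uptime"]), int(m["xid"]), m["name"]


events = list(xid_events(dmesg_lines))
# Count how many Xid events each process name was blamed for.
by_process = Counter(name for _, _, name in events)
print(events)
print(by_process)
```

Grouping by process name shows that the timeout is attributed to both the compositor and the X server here, not to a single application.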

catfishk · December 20, 2023, 5:22pm

[40226.813612] NVRM: GPU at PCI:0000:01:00: GPU-3440fcd9-ad72-5684-052e-87619260bcbf
[40226.813615] NVRM: Xid (PCI:0000:01:00): 109, pid=55550, name=Warframe.x64.ex, Ch 0000003e, errorString CTX SWITCH TIMEOUT, Info 0x3c01b

Xid 109 is back for me: kernel 6.6.7, Nvidia driver 545.29.06, using an RTX 2060 Super. This happens reliably after 10-20 minutes in games.

nvidia-bug-report.log.gz (929.8 KB)

rajabm · December 21, 2023, 2:49pm

Solved my problem, and it is not the NVIDIA card or driver. As some people suggested, I did a clean OS install multiple times, but nothing worked; conda info still made the system freeze. The only change I made to my system this month was adding a 10G dual-port NIC. I removed it from the system and everything is working fine with no issues. The NIC installed on the PCIe x8 lane caused the system to freeze, and for some reason NVIDIA reported the error Xid 109.

catfishk · December 21, 2023, 4:30pm

I have only an Nvidia card installed as PCIe, unless the NVMe SSD counts. I am, however, having better luck with the nvidia-open-dkms open-source kernel modules over the proprietary driver. I haven’t experienced a crash in over a day.

vlinuz · December 24, 2023, 1:07am

+1 on this issue.

It affects all graphics/CUDA workloads; I only recently realized this is what caused everything to crash. I can reliably trigger it with RE Village:
NVRM: Xid (PCI:0000:01:00): 109, pid=307016, name=re8.exe, Ch 00000046, errorString CTX SWITCH TIMEOUT, Info 0x2c022

Debian 12 bookworm
Linux 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux
Tested drivers: 525.147.05-4~deb12u1 from bookworm, 535.43.02-1 from experimental, and NVIDIA-installed 545.23.08-1. All hang in the exact same way; 545 seems to be the worst.
RTX 2070 Super
Solved by downgrading to the 470-tesla driver on kernel 6.1. It seems stable for now, but it is missing a lot of needed driver functionality…

catfishk · December 25, 2023, 7:20pm

I snapped this bug report during a big freeze moment, when the GPU locked up for a good ~20 seconds then recovered. Kernel 6.6.8, Nvidia driver 545.29.06, RTX 2060 Super

nvidia-bug-report.log.gz (423.0 KB)

vlinuz · December 26, 2023, 8:27am

I found someone (not me) who has the same issue with the 520.56.06 driver, but on an RTX 4090. I can’t confirm, as I don’t have the hardware. It crashes the same way as the CUDA workloads here, though: Random CUBLAS_STATUS_INTERNAL_ERROR crashes during training with RTX 4090 - PyTorch Forums
