Multiple CUDA/RTX/Vulkan applications crashing with Xid (13, 109) errors

@mattm458 @PeterWhidden
It looks like running PyTorch training results in the same Xid errors, but in the background it points to a different issue.
Could you please share reliable repro steps, so that we have the exact same repro and can use it for debugging?

Hi amrits,

This thread is the only mention of the Xid 109 error I could find online; it doesn’t appear to be listed in NVIDIA’s documentation.

The pytorch code runs fine in a loop for a random amount of time before crashing with:
CUBLAS_STATUS_INTERNAL_ERROR

Unfortunately, I have not yet been able to reproduce the error quickly or simply; it occurs at random, anywhere from 10 minutes to 10 hours into the run.
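Since the crash can take anywhere from 10 minutes to 10 hours to appear, one way to work toward reliable repro steps is to run the training script under a small harness that records how long each run survives. This is only a minimal sketch, not code from this thread: `run_until_failure` is a hypothetical helper, and the stand-in command at the bottom would be replaced by the real training command.

```python
import subprocess
import sys
import time


def run_until_failure(cmd, max_runs=10):
    """Run cmd repeatedly and report the first run that exits non-zero.

    Returns (run_index, seconds_elapsed, returncode, last_stderr_lines),
    or None if every run exits cleanly.
    """
    for run_index in range(1, max_runs + 1):
        start = time.monotonic()
        # Output is buffered in memory here; for very chatty jobs,
        # redirecting to a log file would be a better choice.
        proc = subprocess.run(cmd, capture_output=True, text=True)
        elapsed = time.monotonic() - start
        if proc.returncode != 0:
            tail = proc.stderr.strip().splitlines()[-5:]
            return run_index, elapsed, proc.returncode, tail
    return None


# Stand-in command that exits cleanly; replace with the real training command.
print(run_until_failure([sys.executable, "-c", "pass"], max_runs=1))
```

The elapsed time and the last stderr lines (for example the CUBLAS_STATUS_INTERNAL_ERROR message) can then be posted together with the matching dmesg output.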

I have tried drivers 520.56 and 525.89, CUDA 11.8 and 12, and several versions of PyTorch.
Running dmesg after the error shows Xid error 109:

NVRM: Xid (PCI:0000:01:00): 109, pid=4124, name=python, Ch 00000028, errorString CTX SWITCH TIMEOUT, Info 0x2c014
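For comparing reports, the fields of an Xid line like the one above can be pulled out with a small parser. This is a rough sketch based only on the line formats quoted in this thread, not on any official specification of the log format; `parse_xid_line` is a hypothetical helper name.

```python
import re

# Matches the "NVRM: Xid (PCI:...): <number>, ..." lines quoted in this thread.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<rest>.*)"
)


def parse_xid_line(line):
    """Return a dict with pci, xid, pid, name and detail, or None if the
    line does not contain an Xid report."""
    match = XID_RE.search(line)
    if match is None:
        return None
    rest = match.group("rest")
    # pid may appear as pid=4124 or as pid='<unknown>'
    pid = re.search(r"pid='?([^',]+)'?", rest)
    name = re.search(r"name=([^,]+)", rest)
    return {
        "pci": match.group("pci"),
        "xid": int(match.group("xid")),
        "pid": pid.group(1) if pid else None,
        "name": name.group(1) if name else None,
        "detail": rest,
    }
```

Feeding it the lines from `dmesg` or the kernel journal gives one record per Xid event, which makes it easier to see which process and Xid number were involved.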

Any insight on how I might narrow down or debug this issue would be greatly appreciated, thanks!

@PeterWhidden
I would need the sample code or repro steps in order to reproduce the issue locally, which will help to root-cause it.

Yeah, it's very consistent: both the modded Diablo 2 and Jedi: Fallen Order hit an Xid instantly on launch. After dropping back to 520.56.06 they run, but with that nv_drm_fence_context_create_ioctl error on launch. The only performance drops I've found so far have been with Hogwarts Legacy, and they look like what is described in "VRAM Allocation Issues - #11 by an9949an": once VRAM usage gets too high, the game slows down until it is rather unplayable, until I reboot or restart the game and get a few more hours out of it.

It seems that with 525.89.02 I'm getting this when running Hogwarts Legacy as well, so from what I can gather, games using vkd3d cause it. Perhaps some Vulkan extension that is being used triggers it? I can't get this to happen with native applications like vkcube, the Unigine Heaven benchmark, etc.

feb 27 17:19:16 tom-acer kernel: NVRM: GPU at PCI:0000:01:00: GPU-58e586ab-a95c-b7fb-4f87-143605fb6aa2
feb 27 17:19:16 tom-acer kernel: NVRM: GPU Board Serial Number: 0
feb 27 17:19:16 tom-acer kernel: NVRM: Xid (PCI:0000:01:00): 56, pid='<unknown>', name=<unknown>, CMDre 00000001 00000200 00000001 00000005 0000001d

Okay, so I did some driver version bisecting, since the Xid errors are consistent. This is with Hogwarts Legacy running through Steam and Proton.

525.89.02 Xid 56 on launch. always.

525.85.05 Xid 56 on launch. always.

525.78.01 Hogwarts Legacy launches but crashes on shader compilation; a Wine/game engine(?) window appears saying "Not enough video memory to allocate a render" on second launch. Xid 56.

525.60.11 gives a different Xid on launch.
NVRM: Xid (PCI:0000:01:00): 32, pid=2724, name=HogwartsLegacy., Channel ID 00000028 intr1 00000008 HCE_DBG0 00001b00 HCE_DBG1 00000001
NVRM: Xid (PCI:0000:01:00): 32, pid=2724, name=HogwartsLegacy., Channel ID 00000028 intr1 00000008 HCE_DBG0 00001b04 HCE_DBG1 00ce8010

520.56.06 runs the game, with no Xid errors in Hogwarts Legacy, Diablo 2, or Jedi: Fallen Order.
However, this appears in dmesg on launch:
[drm:nv_drm_fence_context_create_ioctl [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00000100] Failed to allocate fence signaling event

But the games do run on 520.56.06.

nvidia-bug-report from 520.56.06:
nvidia-bug-report.log.gz (286.1 KB)
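The bisect results above can be written down as data to narrow the window where the behavior changed. This is a small sketch: the table is copied from the post, while `version_key` and `regression_window` are hypothetical helper names, and the approach assumes the behavior changes only once across the tested versions.

```python
# Observations from the bisect above: driver version -> Xid seen with
# Hogwarts Legacy (None means the game ran without an Xid).
RESULTS = {
    "525.89.02": 56,
    "525.85.05": 56,
    "525.78.01": 56,
    "525.60.11": 32,
    "520.56.06": None,
}


def version_key(version):
    """Turn '525.60.11' into (525, 60, 11) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def regression_window(results):
    """Return (newest good version, oldest bad version) among tested drivers."""
    ordered = sorted(results, key=version_key)
    good = [v for v in ordered if results[v] is None]
    bad = [v for v in ordered if results[v] is not None]
    newest_good = good[-1] if good else None
    oldest_bad = bad[0] if bad else None
    return newest_good, oldest_bad


print(regression_window(RESULTS))  # ('520.56.06', '525.60.11')
```

Any driver released between those two versions would be the next candidate to test.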

Hello. I regularly get these errors, sometimes after about 10 minutes of playtime, sometimes after an hour or so. They are exactly the CTX SWITCH TIMEOUT errors mentioned above, Xid 109 and Xid 13.
The games run with every driver version; the crashes only occur after some playtime when using VKD3D (tried 2.6 to 2.8). Any version of DXVK is fine.
This is on 525.89.02, the latest version.
I tried older 520.xx and 515.xx driver versions; the games still crashed the same way, but then I got Xid 31 errors instead, for example:
NVRM: Xid (PCI:0000:01:00): 31, pid=5273, name=Renderer, Ch 00000040, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_ESC faulted @ 0x0_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

I can gather logs using the bug report tool if still necessary.
Every VKD3D game has this problem: Forza Horizon 5, Hogwarts Legacy, etc.

GTX1660

Tried the 530.30.02 beta that was released today, since it has PRIME/Wayland fixes for setups with an AMD iGPU. No dice: Xid 56.

Hello,
I have the same problem with Metro Exodus (Linux native). The game just crashes after the intro.

Distro: openSUSE Tumbleweed
Kernel: 6.1.12-1-default (64-bit)
DE: Plasma 5.27.1 (X11)
NVIDIA Driver Version: 525.89.02
NVIDIA GeForce RTX 3060 Laptop GPU

Here is my dmesg:

[   68.807051] NVRM: Xid (PCI:0000:01:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 1, TPC 0, SM 1): Illegal Instruction Parameter
[   68.807065] NVRM: Xid (PCI:0000:01:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x50c7b0=0x1e000b 0x50c7b4=0x0 0x50c7a8=0xf812b60 0x50c7ac=0x1104
[   76.324243] NVRM: Xid (PCI:0000:01:00): 109, pid=3531, name=MetroExodus, Ch 00000016, errorString CTX SWITCH TIMEOUT, Info 0x1c00e

Tested with 530.30.02, same issues. Attaching the log archive.
nvidia-bug-report.log.gz (1.5 MB)
dmesg:
[ 6192.440687] NVRM: GPU at PCI:0000:01:00: GPU-50ea39f8-76d4-57dd-9d58-004667e5725b
[ 6192.440690] NVRM: Xid (PCI:0000:01:00): 109, pid=4447, name=ForzaHorizon5.e, Ch 000000a6, errorString CTX SWITCH TIMEOUT, Info 0x3dc05e

Distro: Arch
Kernel: 6.2.1-zen1-1-zen
DE: Plasma 5.27 (X11)
GTX1660

No idea why, but when I run the games through gamescope, as in gamescope -f -h 1440 -w 2560 -r 144 -- prime-run %command%, they don't Xid for me anymore. “prime-run” is just a bash script that sets the environment variables to run on the dGPU. This is with the 530.30.02 beta driver.
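For anyone trying to replicate this setup without the wrapper script, here is a rough Python equivalent. The actual prime-run script was not shown in the thread, so this assumes it sets the three PRIME render offload variables that the commonly distributed prime-run wrapper uses; `prime_env` and `gamescope_command` are hypothetical helper names.

```python
import os

# Assumption: these are the variables the poster's prime-run script sets.
# They are the PRIME render offload variables described in NVIDIA's
# driver README.
PRIME_OFFLOAD_VARS = {
    "__NV_PRIME_RENDER_OFFLOAD": "1",
    "__GLX_VENDOR_LIBRARY_NAME": "nvidia",
    "__VK_LAYER_NV_optimus": "NVIDIA_only",
}


def prime_env(base=None):
    """Return a copy of the environment with the offload variables added."""
    env = dict(os.environ if base is None else base)
    env.update(PRIME_OFFLOAD_VARS)
    return env


def gamescope_command(game_cmd, width=2560, height=1440, refresh=144):
    """Build the gamescope invocation from the post around a game command."""
    return ["gamescope", "-f", "-h", str(height), "-w", str(width),
            "-r", str(refresh), "--", *game_cmd]
```

Note that passing `env=prime_env()` to `subprocess.run(gamescope_command(...))` would set the variables for gamescope itself as well as for the game, which differs slightly from the post, where only the inner command is wrapped by prime-run.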

Must be prime-run or one of those lucky occasions where things do work.
Tried running Metro Exodus with gamescope as well but the issue still appears.

I also found that Horizon Zero Dawn suffers from a similar issue to Metro, but with Xid 31:

NVRM: Xid (PCI:0000:26:00): 31, pid=2548, name=HorizonZeroDawn, Ch 00000036, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_GCC faulted @ 0x0_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

Attached the nvidia-bug report after the Horizon freeze as well.

nvidia-bug-report.log.gz (323.2 KB)

Mar 04 15:15:09 okay kernel: NVRM: Xid (PCI:0000:05:00): 109, pid=12195, name=eldenring.exe, Ch 0000002b, errorString CTX SWITCH TIMEOUT, Info 0x37c02a
Mar 04 15:15:09 okay kernel: NVRM: GPU at PCI:0000:05:00: GPU-ba73bc75-4c91-6012-1365-c8e673737f6b

Just had my first crash, which seems to be the same issue as mentioned here.
OBS was running with an NVENC replay buffer in the background.
I don’t remember having this kind of crash (sometimes just very long hangs, 30s or more) at all before the kernel 6.2 update.

Arch, 525.89.02 (open module)
4K screen, game in a window at 1440p; VRAM, GPU, and encoder usage are all under 80%.
(Uploading the log shows an error for some reason.)

Same issue with Forza Horizon 5 on Arch Linux, driver version 525.89.02. It always happens after jumping off the plane and taking a few corners, so it is very easy to reproduce.

[119051.285397] NVRM: GPU at PCI:0000:2b:00: GPU-9eda0c23-be23-45e0-c970-a7bba9e143d3
[119051.285402] NVRM: Xid (PCI:0000:2b:00): 109, pid=883196, name=ForzaHorizon5.e, Ch 0000000e, errorString CTX SWITCH TIMEOUT, Info 0x22c010

I can confirm the crash still persists for Metro Exodus Enhanced.

[775.063140] NVRM: Xid (PCI:0000:0a:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 3, TPC 1, SM 0): Illegal Instruction Parameter
[775.063152] NVRM: Xid (PCI:0000:0a:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x51cf30=0xb 0x51cf34=0x0 0x51cf28=0xf812b60 0x51cf2c=0x1104
[779.303335] NVRM: Xid (PCI:0000:0a:00): 109, pid=4740, name=MetroExodus.exe, Ch 000000ae, errorString CTX SWITCH TIMEOUT, Info 0x43c053

Distro: Manjaro
Kernel: 6.1.12-1
Nvidia Driver: 525.89.02
Proton: Experimental
Game: Metro Exodus Enhanced
GPU: RTX 3070
nvidia-bug-report.log (301.1 KB)

I removed VKD3D_CONFIG=no_upload_hvv and haven’t had this issue for more than a week now. I don’t remember having this issue prior to adding this line either (Elden Ring performs better without it by the way).

Note: I have ReBar enabled.
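If other VKD3D_CONFIG options are in use, only the one option can be dropped while the rest are kept. The snippet below is a minimal sketch that assumes VKD3D_CONFIG is a comma-separated list of option names; `drop_vkd3d_option` is a hypothetical helper and `other_option` is just a placeholder, not a real option.

```python
def drop_vkd3d_option(config, option="no_upload_hvv"):
    """Remove one option from a comma-separated VKD3D_CONFIG value.

    Assumes options are separated by commas; surrounding whitespace and
    empty entries are discarded.
    """
    kept = []
    for entry in config.split(","):
        entry = entry.strip()
        if entry and entry != option:
            kept.append(entry)
    return ",".join(kept)


# 'other_option' is a placeholder used only for illustration.
print(drop_vkd3d_option("other_option,no_upload_hvv"))  # other_option
```

An empty result means the variable can be removed from the launch options entirely.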

This issue seems to affect VKD3D titles. One that consistently gets the Xid error (whether loading just the first stage, 5-6 stages after that, or going back to the menu and loading a different stage) is WRC Generations, which was only just made to work with Proton Experimental.

EDIT: Forgot to mention that the game uses a different input method and requires the launch command below (the NVAPI part is there if you want to make use of DLSS): PROTON_ENABLE_NVAPI=1 WINEDLLOVERRIDES="xinput1_3=n,b" %command%

No, the Linux native version of Metro Exodus, which uses Vulkan, also suffers from this issue.

But I agree on one point: I haven’t found any DXVK titles affected by this.

Having the same issue with WRC Generations (currently requires the proton-experimental bleeding-edge branch).

The WINEDLLOVERRIDES="xinput1_3=n,b" %command% launch option is also needed for input.

[Tue Mar 21 20:27:43 2023] NVRM: Xid (PCI:0000:0a:00): 109, pid=1897252, name=Kt-Main, Ch 0000002e, errorString CTX SWITCH TIMEOUT, Info 0x2c01a

And when using PROTON_NO_FSYNC=1, I get:

[Wed Mar 22 11:45:34 2023] NVRM: Xid (PCI:0000:0a:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: Shader Program Header 11 Error
[Wed Mar 22 11:45:34 2023] NVRM: Xid (PCI:0000:0a:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: Shader Program Header 18 Error
[Wed Mar 22 11:45:34 2023] NVRM: Xid (PCI:0000:0a:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x405840=0xa0040800
[Wed Mar 22 11:45:34 2023] NVRM: Xid (PCI:0000:0a:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x405848=0x80000000
[Wed Mar 22 11:45:34 2023] NVRM: Xid (PCI:0000:0a:00): 13, pid=2885445, name=Kt-Main, Graphics Exception: ChID 0036, Class 0000c797, Offset 00000000, Data 00000000

Tested with drivers 525.47.13 and 530.30.02.

edit: Seems this PR fixes the issue for WRC Generations:

no more Xid 109 hangs.

The fix is only available in driver 520.56.06 so far.
Current releases in the 525 and 530 branches do not incorporate it, hence the issue is still observed.
I will update this thread once it is incorporated in future drivers.
