Multiple CUDA/RTX/Vulkan applications crashing with Xid (13,109) errors

amrits · February 27, 2023, 4:36am

@mattm458 @PeterWhidden
It looks like running PyTorch training results in the same Xid errors, but underneath it points to a different issue.
Could you please share reliable repro steps, so that we have the exact same repro to use for debugging?

PeterWhidden · February 27, 2023, 6:19am

Hi amrits,

This thread is the only mention of the Xid 109 error I could find online; it doesn’t appear to be listed in NVIDIA’s documentation.

The PyTorch code runs fine in a loop for a random amount of time before crashing with:
CUBLAS_STATUS_INTERNAL_ERROR

Unfortunately, I have not yet been able to reproduce the error quickly or simply; it occurs randomly, anywhere from 10 minutes to 10 hours into the program running.

I have tried drivers 520.56 and 525.89, and CUDA 11.8 and 12, as well as different versions of PyTorch.
Running dmesg after the error shows Xid error 109:

NVRM: Xid (PCI:0000:01:00): 109, pid=4124, name=python, Ch 00000028, errorString CTX SWITCH TIMEOUT, Info 0x2c014

Any insight on how I might narrow down or debug this issue would be greatly appreciated, thanks!
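
For anyone comparing logs across posts, a minimal Python sketch for pulling Xid codes out of dmesg output could look like the following. The regular expression is an assumption based only on the line shapes quoted in this thread, and the function names are illustrative, not part of any NVIDIA tool.

```python
import re
from collections import Counter

# Assumed shape, taken from the lines quoted in this thread:
#   NVRM: Xid (PCI:0000:01:00): 109, pid=4124, name=python, ...
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+), (?P<detail>.*)"
)

def parse_xid_lines(text):
    """Return (pci, code, detail) for every Xid line found in text."""
    events = []
    for line in text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append(
                (match.group("pci"), int(match.group("code")), match.group("detail"))
            )
    return events

def count_xid_codes(text):
    """Count how often each Xid code appears in text."""
    return Counter(code for _, code, _ in parse_xid_lines(text))

sample = (
    "NVRM: Xid (PCI:0000:01:00): 109, pid=4124, name=python, Ch 00000028, "
    "errorString CTX SWITCH TIMEOUT, Info 0x2c014"
)
print(count_xid_codes(sample))
```

Feeding it the saved output of dmesg after a crash gives a quick tally of which Xid codes were logged, which makes it easier to compare one driver version against another.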

amrits · February 27, 2023, 12:31pm

@PeterWhidden
I would need the sample code or repro steps in order to reproduce the issue locally, which will help to root-cause it.

gulafaran · February 27, 2023, 1:52pm

Yeah, it’s very consistent: both the modded Diablo 2 and Jedi Fallen Order hit an Xid instantly on launch. Dropping back to 520.56.06, they run, but with that nv_drm_fence_context_create_ioctl error on launch. Any performance drops I’ve found so far have been with Hogwarts Legacy, and they seem to be something like the VRAM Allocation Issues thread (#11 by an9949an): once VRAM usage gets too high, the game begins slowing down until it’s rather unplayable, until I reboot/restart the game and get a few more hours out of it.

gulafaran · February 27, 2023, 4:23pm

It seems that with 525.89.02 I’m getting this when running Hogwarts Legacy as well, so from what I can gather, games using vkd3d cause it. Perhaps some Vulkan extension that’s being used triggers it? I can’t get this to happen with native things like vkcube, the Unigine Heaven benchmark, etc.

feb 27 17:19:16 tom-acer kernel: NVRM: GPU at PCI:0000:01:00: GPU-58e586ab-a95c-b7fb-4f87-143605fb6aa2
feb 27 17:19:16 tom-acer kernel: NVRM: GPU Board Serial Number: 0
feb 27 17:19:16 tom-acer kernel: NVRM: Xid (PCI:0000:01:00): 56, pid='<unknown>', name=<unknown>, CMDre 00000001 00000200 00000001 00000005 0000001d

gulafaran · February 27, 2023, 5:35pm

OK, so I did some driver version bisecting, since the Xid errors are consistent. This is with Hogwarts Legacy running through Steam and Proton.

525.89.02: Xid 56 on launch, always.

525.85.05: Xid 56 on launch, always.

525.78.01: Hogwarts launches but crashes on shader compilation; a window (from Wine or the game engine?) appears saying "Not enough video memory to allocate a render" on second launch. Xid 56.

525.60.11: gives a different Xid on launch.
NVRM: Xid (PCI:0000:01:00): 32, pid=2724, name=HogwartsLegacy., Channel ID 00000028 intr1 00000008 HCE_DBG0 00001b00 HCE_DBG1 00000001
NVRM: Xid (PCI:0000:01:00): 32, pid=2724, name=HogwartsLegacy., Channel ID 00000028 intr1 00000008 HCE_DBG0 00001b04 HCE_DBG1 00ce8010

520.56.06: runs the game, with no Xid errors on Hogwarts, Diablo 2, or Jedi Fallen Order.
However, this appears in dmesg on launch:
[drm:nv_drm_fence_context_create_ioctl [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00000100] Failed to allocate fence signaling event

But the games do run on 520.56.06.

nvidia-bug-report from 520.56.06:
nvidia-bug-report.log.gz (286.1 KB)

groove6j · February 27, 2023, 10:04pm

Hello. I usually get those errors, sometimes after about 10 minutes of playtime, sometimes after an hour or so. These are exactly the CTX SWITCH errors mentioned above, Xid 109 and Xid 13.
The games run with every driver version; crashes only occur after some playtime when using VKD3D (tried 2.6 to 2.8). Any version of DXVK is fine.
This is on 525.89.02, the latest version.
I tried older 520.xx and 515.xx driver versions and the games still crashed the same way, but then I got Xid 31 errors instead, for example:
NVRM: Xid (PCI:0000:01:00): 31, pid=5273, name=Renderer, Ch 00000040, intr 00000000. MMU Fault: ENGINE HOST0 HUBCLIENT_ESC faulted @ 0x0_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

I can gather logs using the bug report tool if still necessary.
Any VKD3D game has this problem: Forza Horizon 5, Hogwarts Legacy, etc.
GTX1660

gulafaran · February 28, 2023, 4:22pm

Tried the 530.30.02 beta that was released today, seeing as it had PRIME/Wayland fixes for use with an AMD iGPU. No dice: Xid 56.

beef · February 28, 2023, 7:23pm

Hello, I’ve got the same problems with Metro Exodus (Linux native). The game just crashes after the intro.

Distro: openSUSE Tumbleweed
Kernel: 6.1.12-1-default (64-bit)
DE: Plasma 5.27.1 (X11)
NVIDIA Driver Version: 525.89.02
NVIDIA GeForce RTX 3060 Laptop GPU

Here is my dmesg:

[   68.807051] NVRM: Xid (PCI:0000:01:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 1, TPC 0, SM 1): Illegal Instruction Parameter
[   68.807065] NVRM: Xid (PCI:0000:01:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x50c7b0=0x1e000b 0x50c7b4=0x0 0x50c7a8=0xf812b60 0x50c7ac=0x1104
[   76.324243] NVRM: Xid (PCI:0000:01:00): 109, pid=3531, name=MetroExodus, Ch 00000016, errorString CTX SWITCH TIMEOUT, Info 0x1c00e
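
Several logs in this thread show an Xid 13 followed a few seconds later by an Xid 109. As a rough sketch for measuring that gap from dmesg timestamps, assuming the bracketed seconds-since-boot prefix shown above (it will not match the wall-clock prefixes used in some other posts), something like this could be used. The helper names are made up for illustration, and the ordering says nothing by itself about cause and effect.

```python
import re

# Assumes the "[  seconds.micros] NVRM: Xid (PCI:...): <code>," layout quoted above.
TS_XID_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+NVRM: Xid \([^)]*\): (\d+),")

def xid_timeline(text):
    """Return (timestamp_seconds, xid_code) pairs in log order."""
    return [(float(ts), int(code)) for ts, code in TS_XID_RE.findall(text)]

def gap_seconds(text, first_code, later_code):
    """Seconds from the first first_code event to the next later_code event, or None."""
    start = None
    for ts, code in xid_timeline(text):
        if start is None and code == first_code:
            start = ts
        elif start is not None and code == later_code:
            return ts - start
    return None

log = """\
[   68.807051] NVRM: Xid (PCI:0000:01:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 1, TPC 0, SM 1): Illegal Instruction Parameter
[   76.324243] NVRM: Xid (PCI:0000:01:00): 109, pid=3531, name=MetroExodus, Ch 00000016, errorString CTX SWITCH TIMEOUT, Info 0x1c00e
"""
print(gap_seconds(log, 13, 109))  # roughly 7.5 seconds for the excerpt above
```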

groove6j · March 1, 2023, 5:22pm

Tested with 530.30.02, same issues. Attaching the log archive.
nvidia-bug-report.log.gz (1.5 MB)
dmesg:
[ 6192.440687] NVRM: GPU at PCI:0000:01:00: GPU-50ea39f8-76d4-57dd-9d58-004667e5725b
[ 6192.440690] NVRM: Xid (PCI:0000:01:00): 109, pid=4447, name=ForzaHorizon5.e, Ch 000000a6, errorString CTX SWITCH TIMEOUT, Info 0x3dc05e

Distro: Arch
Kernel: 6.2.1-zen1-1-zen
DE: Plasma 5.27 (X11)
GTX1660

gulafaran · March 3, 2023, 6:27pm

No idea why, but when I run the games with gamescope, as in gamescope -f -h 1440 -w 2560 -r 144 -- prime-run %command%, they don’t Xid for me anymore. “prime-run” is just a bash script that sets the environment variables to run on the dGPU. This is with the 530.30.02 beta driver.
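
The contents of that prime-run script are not shown in the post. As a hedged illustration, here is the same idea sketched in Python, assuming the script sets the three PRIME render offload variables described in NVIDIA's driver README. The function names are hypothetical.

```python
import os
import subprocess

# Assumption: these are the variables the poster's script sets. They are the
# PRIME render offload variables from NVIDIA's driver README.
PRIME_OFFLOAD_ENV = {
    "__NV_PRIME_RENDER_OFFLOAD": "1",
    "__GLX_VENDOR_LIBRARY_NAME": "nvidia",
    "__VK_LAYER_NV_optimus": "NVIDIA_only",
}

def prime_offload_env(base_env=None):
    """Return a copy of the environment with the offload variables added."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(PRIME_OFFLOAD_ENV)
    return env

def prime_run(argv):
    """Run argv on the dGPU (per the assumed variables) and return its exit code."""
    return subprocess.run(argv, env=prime_offload_env()).returncode
```

Calling prime_run with a command list such as ["vkcube"] would then start that program with the offload variables set, leaving the rest of the environment untouched.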

Vortex_Acherontic · March 3, 2023, 8:30pm

Must be prime-run or one of those lucky occasions where things do work.
Tried running Metro Exodus with gamescope as well but the issue still appears.

Also, I found that Horizon: Zero Dawn suffers from a similar issue to Metro, but with Xid 31:

NVRM: Xid (PCI:0000:26:00): 31, pid=2548, name=HorizonZeroDawn, Ch 00000036, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_GCC faulted @ 0x0_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ

Attached is the nvidia-bug-report taken after the Horizon freeze as well.

nvidia-bug-report.log.gz (323.2 KB)

ewbteewbte · March 4, 2023, 2:06pm

Mar 04 15:15:09 okay kernel: NVRM: Xid (PCI:0000:05:00): 109, pid=12195, name=eldenring.exe, Ch 0000002b, errorString CTX SWITCH TIMEOUT, Info 0x37c02a
Mar 04 15:15:09 okay kernel: NVRM: GPU at PCI:0000:05:00: GPU-ba73bc75-4c91-6012-1365-c8e673737f6b

Just had my first crash with what seems to be the same issue as mentioned here.
OBS was running with an NVENC replay buffer in the background.
I don’t remember having these kinds of crashes (sometimes just very long hangs, like 30s+) at all before the kernel 6.2 update.

Arch, 525.89.02 (open module)
4K screen, game in a window at 1440p; VRAM, GPU, and encoder usage are all under 80%.
(Uploading the log shows an error for some reason.)

IvanV · March 10, 2023, 3:45pm

Same issue with Forza Horizon 5 on Arch Linux, driver version 525.89.02. It always happens after jumping off the plane and taking a few corners, so it is very easy to reproduce.

[119051.285397] NVRM: GPU at PCI:0000:2b:00: GPU-9eda0c23-be23-45e0-c970-a7bba9e143d3
[119051.285402] NVRM: Xid (PCI:0000:2b:00): 109, pid=883196, name=ForzaHorizon5.e, Ch 0000000e, errorString CTX SWITCH TIMEOUT, Info 0x22c010

BlackEye · March 16, 2023, 1:45am

I can confirm the crash still persists for Metro Exodus Enhanced

[775.063140] NVRM: Xid (PCI:0000:0a:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 3, TPC 1, SM 0): Illegal Instruction Parameter
[775.063152] NVRM: Xid (PCI:0000:0a:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x51cf30=0xb 0x51cf34=0x0 0x51cf28=0xf812b60 0x51cf2c=0x1104
[779.303335] NVRM: Xid (PCI:0000:0a:00): 109, pid=4740, name=MetroExodus.exe, Ch 000000ae, errorString CTX SWITCH TIMEOUT, Info 0x43c053

Distro: Manjaro
Kernel: 6.1.12-1
Nvidia Driver: 525.89.02
Proton: Experimental
Game: Metro Exodus Enhanced
GPU: RTX 3070
nvidia-bug-report.log (301.1 KB)

ewbteewbte · March 16, 2023, 5:30pm

I removed VKD3D_CONFIG=no_upload_hvv and haven’t had this issue for more than a week now. I don’t remember having this issue prior to adding this line either (Elden Ring performs better without it by the way).

Note: I have ReBar enabled.
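
For anyone who sets several VKD3D_CONFIG options at once and wants to drop only this one, a small Python sketch follows. It assumes the value is a comma-separated list; the helper names and the placeholder option other_flag are hypothetical.

```python
def drop_config_flag(value, flag):
    """Remove flag from a comma-separated option string, keeping the rest in order."""
    items = [item.strip() for item in value.split(",")]
    return ",".join(item for item in items if item and item != flag)

def without_vkd3d_flag(env, flag):
    """Return a copy of env with flag removed from VKD3D_CONFIG (unset if nothing is left)."""
    new_env = dict(env)
    remaining = drop_config_flag(new_env.get("VKD3D_CONFIG", ""), flag)
    if remaining:
        new_env["VKD3D_CONFIG"] = remaining
    else:
        new_env.pop("VKD3D_CONFIG", None)
    return new_env

# "other_flag" is a placeholder, not a real vkd3d-proton option.
before = {"VKD3D_CONFIG": "other_flag,no_upload_hvv"}
print(without_vkd3d_flag(before, "no_upload_hvv"))
```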

kodatarule · March 22, 2023, 7:46am

This issue seems to affect VKD3D titles. One that consistently gets the Xid error (whether loading just the first stage, 5-6 stages after that, or going back to the menu and loading a different stage) is WRC Generations, which was just made to work with Proton Experimental.

EDIT: Forgot to mention that the game uses a different input method and, if you also want to use DLSS in it, requires this launch command: PROTON_ENABLE_NVAPI=1 WINEDLLOVERRIDES="xinput1_3=n,b" %command%

Vortex_Acherontic · March 22, 2023, 8:29am

No, the Linux native version of Metro Exodus, which uses Vulkan, also suffers from this issue.

But I agree on this point: I haven’t found any DXVK titles affected by this.

Xpander · March 22, 2023, 10:06am

Having the same issue with WRC Generations (currently requires the proton-experimental bleeding-edge branch).

The WINEDLLOVERRIDES="xinput1_3=n,b" %command% launch option is also needed for input.

[Tue Mar 21 20:27:43 2023] NVRM: Xid (PCI:0000:0a:00): 109, pid=1897252, name=Kt-Main, Ch 0000002e, errorString CTX SWITCH TIMEOUT, Info 0x2c01a

And when using PROTON_NO_FSYNC=1, I get:

[Wed Mar 22 11:45:34 2023] NVRM: Xid (PCI:0000:0a:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: Shader Program Header 11 Error
[Wed Mar 22 11:45:34 2023] NVRM: Xid (PCI:0000:0a:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: Shader Program Header 18 Error
[Wed Mar 22 11:45:34 2023] NVRM: Xid (PCI:0000:0a:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x405840=0xa0040800
[Wed Mar 22 11:45:34 2023] NVRM: Xid (PCI:0000:0a:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x405848=0x80000000
[Wed Mar 22 11:45:34 2023] NVRM: Xid (PCI:0000:0a:00): 13, pid=2885445, name=Kt-Main, Graphics Exception: ChID 0036, Class 0000c797, Offset 00000000, Data 00000000

525.47.13 and 530.30.02 drivers tested

edit: Seems this PR fixes the issue for WRC Generations:

github.com/HansKristian-Work/vkd3d-proton

Workaround spurious GPU hangs on NV with concurrent submissions to different queues

HansKristian-Work:master ← HansKristian-Work:nv-concurrent-signal-workarounds

opened 06:47PM - 22 Mar 23 UTC

HansKristian-Work

+118 -11

`VKD3D_TEST_FILTER=test_concurrent_signal_stress VKD3D_CONFIG=skip_driver_workarounds ./tests/d3d12`

```
test_concurrent_signal_stress:1662:Test 2: Test failed: Failed to wait for event. GPU likely hung.

On driver 530.30.02 with RTX 3070 I get:

[21600.846408] NVRM: Xid (PCI:0000:08:00): 32, pid=225396, name=d3d12, Channel ID 00000036 intr0 00040000
[21600.846870] NVRM: Xid (PCI:0000:08:00): 32, pid=225396, name=d3d12, Channel ID 00000036 intr0 00040000
```

The test completes fine with workaround. Similar issue can be observed in test_fence_wait_multiple.

no more Xid 109 hangs.

amrits · March 24, 2023, 6:48pm

The fix is only available in driver 520.56.06 so far.
Current releases in the 525 and 530 branches do not have the fix incorporated, hence the issue is still observed.
I shall update once it is incorporated in future drivers.
