Multiple CUDA/RTX/Vulkan applications crashing with Xid (13,109) errors

The issue is present with 525.85.05 as well:

Jan 27 10:16:19 z004 kernel: NVRM: GPU at PCI:0000:26:00: GPU-6f98b267-20cc-5347-51dc-8bad07fd2ad0
Jan 27 10:16:19 z004 kernel: NVRM: Xid (PCI:0000:26:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 0): Illegal Instruction Parameter
Jan 27 10:16:19 z004 kernel: NVRM: Xid (PCI:0000:26:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x504730=0x1b000b 0x504734=0x0 0x504728=0xf812b60 0x50472c=0x1104
Jan 27 10:16:23 z004 kernel: NVRM: Xid (PCI:0000:26:00): 109, pid=14282, name=MetroExodus.exe, Ch 00000066, errorString CTX SWITCH TIMEOUT, Info 0x43c040
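For anyone comparing logs across machines, here is a minimal sketch for pulling the Xid codes out of kernel log lines like the ones above. The function names (`parse_xid`, `xid_codes`) are made up for illustration, and the regular expression only assumes the `NVRM: Xid (PCI:...): <code>, <details>` shape visible in this thread, not any documented format.

```python
import re

# Assumed line shape, taken from the logs in this thread:
#   NVRM: Xid (PCI:0000:26:00): 13, pid=..., name=..., <details>
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+), (?P<detail>.*)"
)

def parse_xid(line):
    """Return a dict with pci, code and detail, or None if the line has no Xid."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return {
        "pci": m.group("pci"),
        "code": int(m.group("code")),
        "detail": m.group("detail"),
    }

def xid_codes(lines):
    """Return the distinct Xid codes in order of first appearance."""
    seen = []
    for line in lines:
        event = parse_xid(line)
        if event is not None and event["code"] not in seen:
            seen.append(event["code"])
    return seen
```

Feeding it the four log lines above would give `[13, 109]`; the `NVRM: GPU at PCI:...` line is skipped because it carries no Xid.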

nvidia-bug-report.log.gz (334.6 KB)

I have filed bug 3959156 internally for tracking purposes.
I will try to reproduce the issue locally and will get back to you if any additional information is needed.

Hi All,
I tried playing Metro Exodus (the Linux native game) for around 30 minutes on a couple of notebooks with an RTX 3070 Ti and an RTX 2060, but could not observe any Xid errors.
I would like to know the repro frequency at your end, and whether there is any other way to reproduce the issue consistently.

The issue happens with the Windows version of Metro Exodus when it’s run through Proton (the log says “name=MetroExodus.exe”). The Windows version runs much smoother, so it’s preferable to the native one. Previously it worked almost fine, except that you had to disable Hairworks (otherwise it froze in the intro); the rest was okay. Now it freezes on the title screen right before showing the main menu, and the log reports the Xid errors stated in the posts above.

My game settings are everything at max except Hairworks, which I disabled.

What I observed though is that this issue does not happen if you start Metro Exodus in safe-mode (after crash) or the first time post install and then set everything to max (except hairworks) and start playing without rebooting the game.

It happens on the 2nd start after all settings have been turned up and the game was shut-down entirely.

This, however, happens on both the native Metro Exodus and Metro Exodus PC Enhanced Edition via Proton and VKD3D.

PC Enhanced Edition Settings I get the crash with:

  • Resolution: 1920x1080
  • Quality: Extreme
  • VSync: Full
  • Motion blur: High
  • Raytracing: Ultra
  • NVidia DLSS: Quality
  • Reflections: Raytraced
  • VRS: 4X
  • Hairworks: Off
  • Advanced PhysX: On
  • Tessellation: On
  • Field of View:

Alright … I think I found the issue. For some unknown reason it’s the resolution.
Running the above settings but on 720p all is fine, setting my resolution to 1080p makes the game crash before the main menu on the next game start.

My desktop config is two 1920x1080 (60Hz) displays, which makes my primary resolution 1080p, and it can’t go higher.
So the issue may be that setting the game resolution to the primary desktop resolution crashes it?

Thanks for sharing the information. I am able to reproduce the issue locally now and will keep you posted.

Hi All,
Can you please try with driver 520.56.06 and share the test results?

Redoing the same steps with 520.56.06 worked fine.

  • Started game in safe mode
  • Made settings as outlined above
  • Restarted the game
  • Loaded a save game and walked a few meters

In case it holds any valuable information I also attached the “bug-report” archive for 520.56.06 even though no bug seems to have happened:

nvidia-bug-report_520.56.06.log.gz (294.4 KB)

Not sure if I’m hitting the same issue, but on a PRIME setup I get:

tom-acer kernel: NVRM: GPU at PCI:0000:01:00: GPU-58e586ab-a95c-b7fb-4f87-143605fb6aa2
tom-acer kernel: NVRM: GPU Board Serial Number: 0
tom-acer kernel: NVRM: Xid (PCI:0000:01:00): 56, pid='<unknown>', name=<unknown>, CMDre 00000001 00000200 00000001 00000005 0000001d

when I try to run Diablo II with the Median XL patches and D2DX (GitHub - bolrog/d2dx: “D2DX is a complete solution to make Diablo II run well on modern PCs, with high fps and better resolutions”), so it is in turn a DX11 title, fullscreen on an external monitor. Windowed, or even just running on the internal display, it works. But as soon as I try to run it fullscreen on the external monitor, this Xid happens and a reboot is required. This is on KWin 5.27 Wayland with NVIDIA 525.89.02. I tried downgrading various things, since I think this was working before, but didn’t go as far back as 520.56.06. It can also occur with various other titles when trying to run them fullscreen on the external monitor in Wine.

Yep, I managed to find an old archive of 520.56.06, and it runs the games just fine as well: no Xid 56. But at the point where it usually froze, it prints this to dmesg: [drm:nv_drm_fence_context_create_ioctl [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00000100] Failed to allocate fence signaling event. I don’t know whether that’s related or just another issue that was simply fixed later.

I’m experiencing a similar issue using an RTX 4090. Training runs with PyTorch start fine but randomly fail anywhere from 1 to 10 hours into training, with the Xid 109 CTX SWITCH TIMEOUT error.
The difficult part is that I haven’t found a way to quickly reproduce the issue; it only occurs randomly, usually after an hour or so.

Various configurations I’ve tested:
WSL
Native Ubuntu
Power Limiting GPU to 50%
Limiting memory usage to 50%

Were you able to find a fix for this?

Thanks for sharing the test results. It looks like you are no longer facing the same issue with driver 520.56.06.

Thanks @gulafaran for sharing the test results. You are no longer experiencing the original issue with driver 520.56.06.
However, you are seeing different error messages. Can you please confirm whether this is consistent, and whether you are seeing any performance drop, application crashes, or any other functional issue?

@mattm458 @PeterWhidden
It looks like running PyTorch training results in the same Xid errors, but in the background it points to a different issue.
Can you please share reliable repro steps, so that we have exactly the same repro and can use it for debugging?

Hi amrits,

This thread is the only mention of the Xid 109 error I could find online; it doesn’t appear to be listed in NVIDIA’s documentation.

The PyTorch code runs fine in a loop for a random amount of time before crashing with:
CUBLAS_STATUS_INTERNAL_ERROR

Unfortunately I have not been able to reproduce the error quickly or simply yet, it occurs randomly anywhere from 10 minutes to 10 hours into the program running.

I have tried drivers 520.56 and 525.89, and CUDA 11.8 and 12, as well as different versions of PyTorch.
Running dmesg after the error shows Xid error 109:

NVRM: Xid (PCI:0000:01:00): 109, pid=4124, name=python, Ch 00000028, errorString CTX SWITCH TIMEOUT, Info 0x2c014

Any insight on how I might narrow down or debug this issue would be greatly appreciated, thanks!
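One way to at least pin down when an intermittent failure like this strikes is to wrap the loop so that the first exception is logged with its step index and elapsed time, which can then be lined up against the timestamp of the Xid in dmesg. This is only a generic sketch: `run_until_failure` and the `step` callback are hypothetical names, not part of PyTorch, and the real training step would be passed in as `step`.

```python
import time
import traceback

def run_until_failure(step, max_steps, log=print):
    """Call step(i) repeatedly; on the first exception, report when it happened.

    `step` is a placeholder for whatever one training iteration does.
    Returns a small dict describing the failure, or None if all steps passed.
    """
    start = time.monotonic()
    for i in range(max_steps):
        try:
            step(i)
        except Exception as exc:
            elapsed = time.monotonic() - start
            log(f"step {i} failed after {elapsed:.1f}s: {type(exc).__name__}: {exc}")
            log(traceback.format_exc())
            return {"step": i, "elapsed_s": elapsed, "error": repr(exc)}
    return None
```

Logging the wall-clock time alongside the step number makes it easier to tell whether the crash follows a fixed amount of work or is random in time, as it appears to be here.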

@PeterWhidden
I would need the sample code or repro steps in order to reproduce the issue locally, which will help to root-cause it.

Yeah, it’s very consistent: both the modded Diablo 2 and Jedi: Fallen Order trigger the Xid instantly on launch. Dropping back to 520.56.06, they run, but with that nv_drm_fence_context_create_ioctl error upon launch. The only performance drops I’ve found so far have been with Hogwarts Legacy, and that seems to be something like “VRAM Allocation Issues - #11 by an9949an”: once it reaches too-high VRAM usage it begins slowing down until it’s rather unplayable, until I reboot/restart the game and get a few more hours out of it.

It seems that with 525.89.02 I’m getting this when running Hogwarts Legacy as well, so from what I can gather, games using vkd3d cause it. Perhaps some Vulkan extension that’s being used triggers it? I can’t get this to happen with native things like vkcube, the Unigine Heaven benchmark, etc.

feb 27 17:19:16 tom-acer kernel: NVRM: GPU at PCI:0000:01:00: GPU-58e586ab-a95c-b7fb-4f87-143605fb6aa2
feb 27 17:19:16 tom-acer kernel: NVRM: GPU Board Serial Number: 0
feb 27 17:19:16 tom-acer kernel: NVRM: Xid (PCI:0000:01:00): 56, pid='<unknown>', name=<unknown>, CMDre 00000001 00000200 00000001 00000005 0000001d

Okay, so I did some driver version bisecting, since the Xid errors are consistent. This is with Hogwarts Legacy running through Steam and Proton.

525.89.02: Xid 56 on launch, always.

525.85.05: Xid 56 on launch, always.

525.78.01: Hogwarts launches but crashes on shader compilation; a window (Wine or game engine?) appears saying "Not enough video memory to allocate a render". On the second launch, Xid 56.

525.60.11: gives a different Xid on launch.
NVRM: Xid (PCI:0000:01:00): 32, pid=2724, name=HogwartsLegacy., Channel ID 00000028 intr1 00000008 HCE_DBG0 00001b00 HCE_DBG1 00000001
NVRM: Xid (PCI:0000:01:00): 32, pid=2724, name=HogwartsLegacy., Channel ID 00000028 intr1 00000008 HCE_DBG0 00001b04 HCE_DBG1 00ce8010

520.56.06: runs the game, with no Xid errors on Hogwarts, Diablo 2, or Jedi: Fallen Order.
However, this appears in dmesg on launch:
[drm:nv_drm_fence_context_create_ioctl [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00000100] Failed to allocate fence signaling event

But the games do run on 520.56.06.
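The bisect results above can be summarised with a small sketch that sorts the tested driver versions numerically and reports the last good and first bad one. The `results` table just restates what was reported in this post; the helper names are made up, and the logic assumes that once a version is bad, all newer ones are too, which matches these results but is not guaranteed in general.

```python
def version_key(version):
    """Turn '525.60.11' into (525, 60, 11) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))

# Results as reported above (Hogwarts Legacy through Steam and Proton).
# None means no Xid was observed.
results = {
    "525.89.02": "Xid 56",
    "525.85.05": "Xid 56",
    "525.78.01": "Xid 56",  # first launch crashed on shader compilation, Xid 56 on second launch
    "525.60.11": "Xid 32",
    "520.56.06": None,
}

def last_good_first_bad(results):
    """Walk versions oldest to newest; return (last good, first bad)."""
    last_good = None
    for version in sorted(results, key=version_key):
        if results[version] is None:
            last_good = version
        else:
            return last_good, version
    return last_good, None
```

With the table above, `last_good_first_bad(results)` returns `("520.56.06", "525.60.11")`, so the regression would have landed somewhere between those two releases.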

nvidia bugreport from 520.56.06.
nvidia-bug-report.log.gz (286.1 KB)