Xid 109 CTX SWITCH TIMEOUT Driver Crashes in Many Applications

I have not been able to use Linux reliably for anything GPU-heavy in the last month or so. I have met many other users facing the same issue and would like to bring it to light.

Example of errors, always Xid 109:
NVRM: Xid (PCI:0000:01:00): 109, pid=168149, name=r5apex_dx12.exe, Ch 00000076, errorString CTX SWITCH TIMEOUT, Info 0x3c046
NVRM: Xid (PCI:0000:01:00): 109, pid=23382, name=cs2, Ch 000000b6, errorString CTX SWITCH TIMEOUT, Info 0x25c05d
NVRM: Xid (PCI:0000:01:00): 109, pid='<unknown>', name=<unknown>, Ch 000000a6, errorString CTX SWITCH TIMEOUT, Info 0x26c058
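
For anyone comparing logs across machines, the Xid lines above are regular enough to pull apart with a small script. This is only a rough sketch: `parse_xid_line` and its field names are made up for illustration, the pattern is derived purely from the lines quoted in this thread, and the `Info` value is kept as an opaque string because its meaning is not known here.

```python
import re

# Pattern based only on the Xid lines quoted in this thread (an assumption,
# not an official format description).
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>[^,]*), name=(?P<name>[^,]*), (?P<rest>.*)"
)

def parse_xid_line(line):
    """Return a dict of fields for one Xid line, or None if it does not match."""
    m = XID_RE.search(line)
    if m is None:
        return None
    rec = {
        "pci": m.group("pci"),
        "xid": int(m.group("xid")),
        "pid": m.group("pid"),    # kept as text: it can be '<unknown>'
        "name": m.group("name"),
    }
    rest = m.group("rest")
    ch = re.search(r"\bCh ([0-9a-fA-F]+)", rest)
    err = re.search(r"errorString ([^,]+)", rest)
    info = re.search(r"\bInfo (0x[0-9a-fA-F]+)", rest)
    rec["channel"] = ch.group(1) if ch else None
    rec["error"] = err.group(1) if err else None
    rec["info"] = info.group(1) if info else None
    return rec

line = ("NVRM: Xid (PCI:0000:01:00): 109, pid=23382, name=cs2, "
        "Ch 000000b6, errorString CTX SWITCH TIMEOUT, Info 0x25c05d")
print(parse_xid_line(line))
```

Feeding it the output of `dmesg` line by line and keeping the non-`None` results gives a list that is easy to sort by process name or channel.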

I can consistently reproduce this by playing ~1-2 games of CS2 Arms Race; the map Baggage crashes about 90% of the time, mid-game, after a few minutes. It has also occurred in compute-heavy AI workloads and in games like Apex Legends running through Proton (interestingly, once Apex crashes after 10-45 minutes, the game will not run for longer than 5 minutes without another Xid 109 happening). Occasionally X11/KDE Plasma won't recover from the crash and a full hard reboot is required. This is so consistent that I can reboot, open nothing but Steam and Counter-Strike 2, and have the game crash with Xid 109 within 10 minutes, so testing fixes is easy.

Attempts to Debug:
-Went back to various kernel versions that were stable for GPU usage when I last used them
-Tried 545.29.06, the beta 550.40.07, and the latest Vulkan dev driver (535.43.09)
-Ensured things like power management, ReBar, etc. had no effect on reproducing the issue
-Had a friend with a 3060 Ti and a nearly identical Arch install (apart from their Ryzen vs. my Intel, everything including driver version, graphics settings, resolution, Vulkan/Mesa packages, and kernel was the same between us) try to reproduce, and they could not
-Discussed with others who also have the issue; they have tried countless other kernels and run a variety of platforms that are also affected (AMD Ryzen, 40xx series cards, etc.), so my specific hardware is not the culprit
-Ensured my GPU is stable and fully functional (passed a GPU memory stress test with flying colors, can run heavy loads all night in Windows, ran stress tests, etc.)

Description of Crash
When the crash happens, the screen freezes but audio etc. continues to play in the background. Most of the time it takes ~15 seconds for the system to recover enough to alt-tab or switch terminals; occasionally a hard restart (reset button) is required. Sometimes in Proton apps the screen will freeze, render a few frames after a few seconds, then freeze again, always with Xid 109 in dmesg after the crash. This happens regardless of whether an app is run with DX11 or DX12 in Proton (all translated to Vulkan in the end), and also with native Vulkan games like CS2. I have only had it happen during CUDA loads a few times, but I have not done any compute work lately.

Bug report attached! I ran the bug tool immediately after reproducing the crash issue.
nvidia-bug-report.log.gz (937.6 KB)

I would really like to use my GPU again, so anything else I can do to help solve this would be greatly appreciated. I know there is a similar thread for this; however, it is two years old and has no updates on this issue, which renders Linux useless for the majority of my work and leisure activities.

Because I can consistently and quickly reproduce the crash, I hope I can help pinpoint this issue. I am experienced with low-level debugging, so please let me know if there are any dumps etc. I can collect that might help.

System info:

Arch Linux, kernel 6.7.5 (other 6.6.x kernels also cause the issue)
NVIDIA driver 545.29.06 (other drivers also cause the issue)
Plasma 5.27.10 through KWin
RTX 3090
MSI Z690A, 32 GB DDR5

cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-linux root=UUID=c1c6146b-63dc-46ff-84f3-e7661fed204d rw quiet loglevel=3 ibt=off split_lock_detect=off nvidia_drm.modeset=1

cat /proc/driver/nvidia/params
ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 1
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
TCEBypassMode: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 1
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 1
DmaRemapPeerMmio: 1
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""
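
If you want to diff these parameters against a machine that does not crash, the `Key: value` layout above is easy to load into a dictionary. The sketch below is only an illustration: `parse_nvidia_params` is a hypothetical helper, and the sample is a few lines copied from the dump above. It also shows that two of the odd-looking decimal values are just familiar numbers in another base: 438 is octal 666 (which looks like an ordinary read/write file mode) and 4294967295 is 0xffffffff (all 32 bits set).

```python
def parse_nvidia_params(text):
    """Parse 'Key: value' lines into a dict, converting plain integers."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        key, value = key.strip(), value.strip()
        if len(value) >= 2 and value[0] == value[-1] == '"':
            params[key] = value[1:-1]      # quoted string, possibly empty
        elif value.isdigit():
            params[key] = int(value)
        else:
            params[key] = value
    return params

# A few lines copied from the dump above.
sample = """\
ResmanDebugLevel: 4294967295
DeviceFileMode: 438
EnableResizableBar: 1
RegistryDwords: ""
"""

params = parse_nvidia_params(sample)
print(oct(params["DeviceFileMode"]))    # 0o666
print(hex(params["ResmanDebugLevel"]))  # 0xffffffff
```

On a real system the text would come from reading `/proc/driver/nvidia/params`; comparing two such dictionaries key by key shows quickly whether the module settings differ between an affected and an unaffected machine.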

Thank you for any assistance, this is becoming incredibly frustrating.

Tried updated driver 545.29.06-20.
Can reproduce issue within 5 minutes of playing CS2.

NVRM: Xid (PCI:0000:01:00): 109, pid=5408, name=cs2, Ch 00000096, errorString CTX SWITCH TIMEOUT, Info 0x56c05f

Bug report from immediately after crash attached.
nvidia-bug-report.log.gz (742.7 KB)

Because I can reproduce this issue reliably, I was hoping to hear some potential solutions or versions to try, as I can easily confirm whether they remedy these Xid 109 driver crashes.

And on the latest driver, 550.54.14, I can reproduce it just as easily. Kernel 6.7.6-arch1-1.

Xid (PCI:0000:01:00): 109, pid='<unknown>', name=<unknown>, Ch 0000008e, errorString CTX SWITCH TIMEOUT, Info 0x26c047

This time I ran the bug report tool before killing the offending GPU-using app (CS2).
nvidia-bug-report.log.gz (795.3 KB)

I just experienced the same crash here in CS2. I am running the 550 driver on Ubuntu 23.10.

My card is a brand-new 4070 Super that will be used mostly for OpenCL work related to photo editing, but so far all heavy GPU tasks have caused failures.

When OpenCL fails I see errors like this:
[ 266.228441] NVRM: GPU at PCI:0000:0a:00: GPU-617ca489-a0c6-4820-a5d8-bb47f1f232bf
[ 266.228448] NVRM: Xid (PCI:0000:0a:00): 31, pid=8469, name=worker 3, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x500_00233000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_READ
[36272.668229] NVRM: Xid (PCI:0000:0a:00): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 3, TPC 1, SM 0): Out Of Range Address
[36272.668249] NVRM: Xid (PCI:0000:0a:00): 13, pid='<unknown>', name=<unknown>, Graphics Exception: ESR 0x51cf30=0x101000e 0x51cf34=0x20 0x51cf28=0xf81eb60 0x51cf2c=0x1174
[36272.668882] NVRM: Xid (PCI:0000:0a:00): 43, pid=20472, name=test_basic, Ch 00000030
[38704.375178] NVRM: Xid (PCI:0000:0a:00): 31, pid='<unknown>', name=<unknown>, Ch 00000038, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_2 faulted @ 0x7fba_1cac2000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
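
Since this log mixes several different Xid codes, a quick tally by code and by kernel timestamp can show which errors come first and which keep recurring. A minimal sketch, assuming timestamped dmesg lines like the ones above; `tally_xids` is a hypothetical helper name, and the pattern deliberately does not require the opening bracket so that lines with a clipped timestamp still match.

```python
import re
from collections import Counter

# Matches '<seconds>] NVRM: Xid (<pci>): <code>,' as seen in the lines above.
STAMPED_XID_RE = re.compile(
    r"(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \([^)]*\): (?P<xid>\d+),"
)

def tally_xids(lines):
    """Count Xid codes and record the first timestamp (seconds) of each."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        m = STAMPED_XID_RE.search(line)
        if m is None:
            continue  # not an Xid line
        xid = int(m.group("xid"))
        counts[xid] += 1
        first_seen.setdefault(xid, float(m.group("ts")))
    return counts, first_seen

# Shortened copies of the lines above.
log = [
    "[ 266.228448] NVRM: Xid (PCI:0000:0a:00): 31, pid=8469, name=worker 3, Ch 00000008",
    "[36272.668229] NVRM: Xid (PCI:0000:0a:00): 13, Graphics SM Warp Exception",
    "[36272.668882] NVRM: Xid (PCI:0000:0a:00): 43, pid=20472, name=test_basic, Ch 00000030",
]
counts, first_seen = tally_xids(log)
for xid, n in sorted(counts.items()):
    print(f"Xid {xid}: {n} time(s), first at {first_seen[xid]:.3f}s")
```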
