I have not been able to use Linux reliably for anything GPU-heavy for the last month or so. I have met many other users facing the same issue and would like to bring it to light.
Examples of the errors, always Xid 109:
NVRM: Xid (PCI:0000:01:00): 109, pid=168149, name=r5apex_dx12.exe, Ch 00000076, errorString CTX SWITCH TIMEOUT, Info 0x3c046
NVRM: Xid (PCI:0000:01:00): 109, pid=23382, name=cs2, Ch 000000b6, errorString CTX SWITCH TIMEOUT, Info 0x25c05d
NVRM: Xid (PCI:0000:01:00): 109, pid=‘’, name=, Ch 000000a6, errorString CTX SWITCH TIMEOUT, Info 0x26c058
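For anyone wanting to compare their own logs against these, here is a minimal Python sketch that splits such a line into fields. The pattern is inferred only from the three sample lines above, so it is an assumption that other driver versions print the same layout; `parse_xid` is just an illustrative name.

```python
import re

# Pattern inferred from the sample lines in this post (assumption: other
# driver versions may format Xid messages differently).
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>[^,]*), name=(?P<name>[^,]*), "
    r"Ch (?P<channel>[0-9a-fA-F]+), "
    r"errorString (?P<error>.+?), Info (?P<info>0x[0-9a-fA-F]+)"
)


def parse_xid(line):
    """Return the fields of one Xid line as a dict, or None if it does not match."""
    m = XID_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["xid"] = int(fields["xid"])
    # Channel and Info are printed in hex; pid stays a string because it can
    # be empty or non-numeric, as in the third sample line.
    fields["channel"] = int(fields["channel"], 16)
    fields["info"] = int(fields["info"], 16)
    return fields
```

Grouping the parsed results by `name` or `channel` makes it easy to see whether the timeouts follow one process or hit whatever happens to be rendering.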
I can consistently reproduce this by playing ~1-2 games of CS2 Arms Race; the map Baggage crashes about 90% of the time, a few minutes into the game. It has also occurred in compute-heavy AI workloads and in games like Apex Legends running through Proton (interestingly, once Apex crashes after 10-45 minutes, the game will not run for longer than 5 minutes without another Xid 109). Occasionally X11/KDE Plasma does not recover from the crash and a full hard reboot is required. The issue is so consistent that I can reboot, open nothing but Steam and Counter-Strike 2, and have the game crash with Xid 109 within 10 minutes, so testing fixes is easy.
Attempts to Debug:
-Went back to various kernel versions that were stable for GPU usage when I last used them
-Tried 545.29.06, the beta 550.40.07, and the latest Vulkan dev driver (535.43.09)
-Confirmed that things like power management, ReBAR, etc. had no effect on reproducing the issue
-Had a friend with a 3060 Ti and a near-identical Arch install (apart from their Ryzen vs. my Intel, everything such as driver version, graphics settings, resolution, Vulkan/Mesa packages, and kernel was the same between us) try to reproduce it, and they could not
-Discussed with others who also have the issue; they have tried countless other kernels and are affected on a variety of platforms (AMD Ryzen, 40xx series GPUs, etc.), so my specific hardware is not the culprit
-Confirmed my GPU is stable and fully functional (passed a GPU memory stress test with flying colors, can run heavy loads all night in Windows, passed other stress tests, etc.)
Description of Crash
When the crash happens, the screen freezes but audio etc. continues to play in the background. Most of the time it takes ~15 seconds for the system to recover enough to alt-tab or switch terminals; occasionally a hard (reset button) restart is required. Sometimes in Proton apps the screen freezes, renders a few frames after a few seconds, then freezes again, always with Xid 109 in dmesg after the crash. This happens regardless of whether an app is run with DX11 or DX12 in Proton (both are translated to Vulkan in the end, via DXVK or VKD3D-Proton), and also with native Vulkan games like CS2. I have only had it happen a few times during CUDA loads, but I have not done any compute work recently.
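To put numbers on how quickly the timeouts recur, one option is to pull the timestamps of the Xid 109 lines out of `dmesg` and look at the gaps between them. The sketch below is Python and assumes the default `dmesg` prefix of boot-relative seconds in square brackets (for example `[ 1234.567890]`); the function names are only illustrative.

```python
import re

# Assumes the default dmesg prefix: seconds since boot in square brackets.
STAMPED_XID_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \([^)]*\): (?P<xid>\d+),"
)


def xid_times(lines, xid=109):
    """Return boot-relative timestamps (seconds) of every line reporting `xid`."""
    times = []
    for line in lines:
        m = STAMPED_XID_RE.match(line)
        if m and int(m.group("xid")) == xid:
            times.append(float(m.group("ts")))
    return times


def gaps(times):
    """Seconds between each event and the one before it."""
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

Feeding it the lines of a saved `dmesg` capture, for example `gaps(xid_times(open("dmesg.txt")))` with a file name of your choosing, shows whether the interval between timeouts shrinks after the first one.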
Bug report attached! I ran the bug tool immediately after reproducing the crash issue.
nvidia-bug-report.log.gz (937.6 KB)
I would really like to use my GPU again, so anything else I can do to help solve this would be greatly appreciated. I know there is a similar thread for this; however, it is two years old and lacks any updates on this issue, which renders Linux useless for the majority of my work and leisure activities.
Because I can consistently and quickly reproduce the crash, I can hopefully be of assistance in pinpointing this issue. I am experienced with low-level debugging, so are there any dumps etc. I could collect that might help?
System info:
Arch Linux, kernel 6.7.5 (other 6.6.x kernels also show the issue)
NVIDIA driver 545.29.06 (other drivers also show the issue)
Plasma 5.27.10 with KWin
i7-12700K
RTX 3090
MSI Z690A, 32 GB DDR5
cat /proc/cmdline ~
BOOT_IMAGE=/boot/vmlinuz-linux root=UUID=c1c6146b-63dc-46ff-84f3-e7661fed204d rw quiet loglevel=3 ibt=off split_lock_detect=off nvidia_drm.modeset=1
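If you want to check these options programmatically on another machine (for example that `nvidia_drm.modeset=1` is really active), a small Python sketch like this splits the command line into name/value pairs. `parse_cmdline` is an illustrative helper, not part of any tool.

```python
import shlex


def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to True."""
    params = {}
    for token in shlex.split(text):
        # Split on the first '=' only, so root=UUID=... keeps its full value.
        name, sep, value = token.partition("=")
        params[name] = value if sep else True
    return params
```

With the line above, `parse_cmdline(...)["nvidia_drm.modeset"]` gives `"1"` and `"quiet"` maps to `True`.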
cat /proc/driver/nvidia/params ~
ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 1
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
TCEBypassMode: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 1
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 1
DmaRemapPeerMmio: 1
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""
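Since an otherwise near-identical machine could not reproduce the crash, diffing this dump against the same file from the unaffected system might surface a differing module parameter. A minimal Python sketch, assuming both dumps use the `Key: value` layout shown above (the helper names are illustrative, and any second dump you compare against is your own data, not something from this post):

```python
def parse_params(text):
    """Parse 'Key: value' lines into a dict of strings, dropping surrounding quotes."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a 'Key: value' shape
        params[key.strip()] = value.strip().strip('"')
    return params


def diff_params(mine, theirs):
    """Return {key: (mine, theirs)} for keys whose values differ or are missing."""
    keys = sorted(set(mine) | set(theirs))
    return {
        k: (mine.get(k), theirs.get(k))
        for k in keys
        if mine.get(k) != theirs.get(k)
    }
```

Note that the values are printed in decimal: `oct(438)` is `0o666` for `DeviceFileMode`, and `hex(4294967295)` is `0xffffffff` for `ResmanDebugLevel`.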
Thank you for any assistance; this is becoming incredibly frustrating.