Hi amrits,
This thread is the only mention of the Xid 109 error I could find online; it doesn’t appear to be listed in NVIDIA's documentation.
My PyTorch code runs fine in a loop for a random amount of time before crashing with:
CUBLAS_STATUS_INTERNAL_ERROR
Unfortunately, I have not yet been able to reproduce the error quickly or with a simple test case. It occurs at a random point anywhere from 10 minutes to 10 hours into the run.
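In case it helps with narrowing this down, here is a minimal sketch of how the loop could be wrapped to record how far it got before failing. This is not my actual code: `run_until_failure` and `step` are hypothetical names, and it assumes the cuBLAS failure surfaces in Python as a `RuntimeError`, which is how PyTorch usually reports CUDA errors.

```python
import time


def run_until_failure(step, max_iters=None):
    """Call step() repeatedly until it raises RuntimeError.

    Returns (iterations_completed, elapsed_seconds, error), where error is
    None if max_iters was reached without a failure.
    `step` is a hypothetical callable standing in for one loop iteration.
    """
    start = time.monotonic()
    completed = 0
    while max_iters is None or completed < max_iters:
        try:
            step()
        except RuntimeError as exc:
            # Record where the failure happened instead of just crashing.
            return completed, time.monotonic() - start, exc
        completed += 1
    return completed, time.monotonic() - start, None
```

Logging the returned iteration count and elapsed time across several runs would at least show whether the failure point is truly random or correlates with something like run time.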
I have tried drivers 520.56 and 525.89, CUDA 11.8 and 12, and several different versions of PyTorch.
Running dmesg after the error shows Xid error 109:
NVRM: Xid (PCI:0000:01:00): 109, pid=4124, name=python, Ch 00000028, errorString CTX SWITCH TIMEOUT, Info 0x2c014
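To track these over long runs, a line like the one above can be pulled apart with a small script. The sketch below is only an illustration: `parse_xid_line` is a hypothetical helper, and the regular expression assumes the exact layout of the line quoted above; other driver versions or other Xid codes may format the line differently.

```python
import re

# Matches the layout of the NVRM Xid line quoted above (an assumption;
# other Xid messages may not follow this exact format).
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<xid>\d+), "
    r"pid=(?P<pid>\d+), name=(?P<name>[^,]+), (?P<detail>.*)"
)


def parse_xid_line(line):
    """Return a dict of fields from an Xid line, or None if it does not match."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return {
        "pci": match.group("pci"),
        "xid": int(match.group("xid")),
        "pid": int(match.group("pid")),
        "name": match.group("name"),
        "detail": match.group("detail"),
    }


if __name__ == "__main__":
    sample = (
        "NVRM: Xid (PCI:0000:01:00): 109, pid=4124, name=python, "
        "Ch 00000028, errorString CTX SWITCH TIMEOUT, Info 0x2c014"
    )
    print(parse_xid_line(sample))
```

Feeding each line of the dmesg output through this would make it easy to timestamp and count Xid events alongside the application log.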
Any insight on how I might narrow down or debug this issue would be greatly appreciated, thanks!