Problem Symptoms
I’m getting an Xid error when running various quick-start deep learning examples. Training starts and then crashes after some time; the system freezes for about a minute and GPU usage is pushed to 100%.
The exceptions vary a bit, but look mostly like this:
RuntimeError: cuda runtime error (6) : the launch timed out and was terminated at /pytorch/aten/src/THC/generic/THCStorage.cpp:36
At the same time, an Xid is logged by the kernel driver:
[ 647.295636] NVRM: Xid (PCI:0000:65:00): 8, Channel 00000010
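To make it easier to compare Xid events across runs, here is a minimal sketch that pulls the fields out of kernel log lines shaped like the one above. The names `XID_RE`, `parse_xid_line` and `find_xids` are my own, not part of any NVIDIA tool, and the pattern assumes the exact layout shown in this post (timestamp, PCI address, Xid number, trailing detail); other driver versions may format the line differently.

```python
import re

# Assumed layout, taken from the log line in this post:
# [ 647.295636] NVRM: Xid (PCI:0000:65:00): 8, Channel 00000010
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid "
    r"\((?P<pci>PCI:[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+),\s*(?P<detail>.*)"
)


def parse_xid_line(line):
    """Return a dict describing one Xid log line, or None if it does not match."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return {
        "timestamp": float(match.group("ts")),  # seconds since boot
        "pci": match.group("pci"),              # PCI address of the GPU
        "xid": int(match.group("xid")),         # numeric Xid code
        "detail": match.group("detail").strip(),
    }


def find_xids(log_text):
    """Collect every Xid event found in a block of kernel log text."""
    events = (parse_xid_line(line) for line in log_text.splitlines())
    return [event for event in events if event is not None]


if __name__ == "__main__":
    sample = "[ 647.295636] NVRM: Xid (PCI:0000:65:00): 8, Channel 00000010"
    print(parse_xid_line(sample))
```

The text passed to `find_xids` could be the saved output of `dmesg` captured right after a crash, so that the Xid number and PCI address can be checked against the time each training run failed.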
What I already tried
- (Driver Error) I tested driver versions from 387 to 396.51 and CUDA versions from 8.0 to 9.2, all with the same error. If it is a driver error, it is still present in the newest versions
- (Thermal Issue) The error appears shortly after training starts, at a GPU temperature of 60 °C
- (Bus Error) I ran several stability tests, from gpu_burn to Unigine Heaven. All were stable. I also sent my graphics card to MSI, who found no device failure
- (User App Error) I tried running several quick-start examples from different deep learning frameworks (e.g. TensorFlow and Torch), all of which yield this error. The same code runs fine on the CPU
- (Power Supply) I tried a much more powerful power supply unit (Corsair) than my current one; the error stays the same
- (RAM) memtest shows no errors after several hours
- (BIOS) Flashed to the most recent version
- (GPU BIOS) The updater says there is no newer version for my GTX 1080 Ti
- (Intel Microcode) I flashed it manually to the most recent version
Any help on spotting the cause of this Xid is highly appreciated.
nvidia-bug-report.log.gz (141 KB)