Xid 8 in various CUDA deep learning applications for Nvidia GTX 1080 Ti

Problem Symptoms
I’m getting an Xid error when running various deep learning quick-start examples. Training starts and then crashes after some time, freezing the system for about a minute and pushing GPU usage to 100%.

The exceptions vary a bit, but look mostly like this:
RuntimeError: cuda runtime error (6) : the launch timed out and was terminated at /pytorch/aten/src/THC/generic/THCStorage.cpp:36

At the same time, a Xid is generated:
[ 647.295636] NVRM: Xid (PCI:0000:65:00): 8, Channel 00000010
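In case it helps anyone reproduce this: the Xid messages land in the kernel ring buffer, so they can be pulled out with a simple grep. A small sketch, using the log line above as sample input written to a hypothetical temp file (on a live system you would grep `dmesg` directly):

```shell
# On a live system: sudo dmesg | grep "NVRM: Xid"
# Here we grep a saved copy of the message instead (hypothetical path):
printf '[  647.295636] NVRM: Xid (PCI:0000:65:00): 8, Channel 00000010\n' > /tmp/kern.log
grep -o 'Xid (PCI:[^)]*): [0-9]*' /tmp/kern.log
# prints: Xid (PCI:0000:65:00): 8
```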

What I already tried

  • (Driver Error) I tested different driver versions from 387 to 396.51 and CUDA versions from 8.0 to 9.2, all with the same error. If it is a driver bug, it still persists in the newest versions
  • (Thermal Issue) The error appears shortly after training starts, at a GPU temperature of about 60 °C
  • (Bus Error) I ran different stability tests, from gpu_burn to Unigine Heaven, all stable. I also sent my graphics card to MSI, who found no device failure
  • (User App Error) I ran several quick-start examples from different deep learning frameworks (e.g. TensorFlow and Torch); they all trigger this error. The same code runs fine on CPU
  • (Power Supply) I tried a much more powerful (Corsair) power supply unit than my current one; the error stays the same
  • (RAM) memtest shows no errors after several hours
  • (BIOS) Flashed to the most recent version
  • (GPU BIOS) The updater says there is no newer version for my GTX 1080 Ti
  • (Intel Microcode) I updated it manually to the most recent version

Any help on spotting the cause of this Xid is highly appreciated.
nvidia-bug-report.log.gz (141 KB)

You have pretty much ruled out everything, so the only advice I can give is that memtest is not a reliable way to detect a system memory fault. Please remove all but one memory module and check whether the issue reappears, then repeat with the next memory module.

Thanks for your answer. I forgot to mention that I also tried the GPU in another PC, and the crash happened there too. I think this narrows it down to:

  1. Driver error
  2. User App Error across multiple frameworks
  3. Still a GPU hardware error

I believe 3 is not very likely, because I assume/hope MSI has the appropriate tools to detect hardware defects. 2 seems possible, but I cannot assess how likely a bug spanning multiple frameworks is. My current favorite is 1.

Do you have an idea how to distinguish a driver bug from another cause, or how to get an Nvidia developer to look at this?

Ok, I have a suspicion: https://devtalk.nvidia.com/default/topic/483643/cuda-the-launch-timed-out-and-was-terminated/
The problem might be that the GPU is also driving the display, i.e. running X. A CUDA kernel that runs too long then gets killed by the driver’s watchdog so the display can be updated. Try stopping X and then running the samples.
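For anyone who wants to test this: one way is to stop the display manager so nothing drives the GPU while the job runs. A sketch assuming systemd with the generic `display-manager` alias (the actual service behind it — gdm, lightdm, sddm — varies by distro):

```shell
# Run from a text console (Ctrl+Alt+F3) or over SSH, not from inside X:
#   sudo systemctl stop display-manager
#   ...run the CUDA training job...
#   sudo systemctl start display-manager
# Quick check whether an X server currently holds the GPU:
pgrep -x Xorg > /dev/null && echo "X is running" || echo "no X server running"
```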

This sounds promising. I added

Option "Interactive" "0"

to the Device section of my xorg.conf. I will test stability for a few days, both with and without X running, and report back.
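For completeness, the resulting section looks roughly like this (the Identifier is illustrative; yours may differ). Per the Nvidia driver README, setting "Interactive" to "0" tells the driver the GPU is not used interactively, which relaxes the watchdog that kills long-running kernels:

```
Section "Device"
    Identifier "Device0"
    Driver     "nvidia"
    Option     "Interactive" "0"
EndSection
```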

Thanks, the kernel timeout was indeed the problem. I’ll try training in smaller chunks.