Xid 8 in various CUDA deep learning applications for Nvidia GTX 1080 Ti

gordie.acree · October 20, 2018, 9:45am

Problem Symptoms
I’m getting an Xid error when running various quick start deep learning examples. Training starts and then crashes after some time, freezing the system for about a minute and pushing GPU usage up to 100%.

The exceptions vary a bit, but look mostly like this:
RuntimeError: cuda runtime error (6) : the launch timed out and was terminated at /pytorch/aten/src/THC/generic/THCStorage.cpp:36

At the same time, an Xid is logged:
[ 647.295636] NVRM: Xid (PCI:0000:65:00): 8, Channel 00000010
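
To watch for this message while experimenting, the kernel log line can be parsed into its parts. The following is a minimal sketch that assumes lines have the same shape as the one above; the names `XID_RE`, `XidEvent` and `parse_xid` are made up for illustration and are not part of any NVIDIA tool.

```python
import re
from typing import NamedTuple, Optional

# Assumed line shape, based on the single log line quoted above:
#   [ 647.295636] NVRM: Xid (PCI:0000:65:00): 8, Channel 00000010
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+)(?:, (?P<detail>.*))?"
)


class XidEvent(NamedTuple):
    pci: str     # PCI address of the GPU that reported the event
    code: int    # numeric Xid code
    detail: str  # remaining text after the code, may be empty


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the Xid event found in a kernel log line, or None if there is none."""
    match = XID_RE.search(line)
    if match is None:
        return None
    detail = (match.group("detail") or "").strip()
    return XidEvent(match.group("pci"), int(match.group("code")), detail)


if __name__ == "__main__":
    sample = "[ 647.295636] NVRM: Xid (PCI:0000:65:00): 8, Channel 00000010"
    print(parse_xid(sample))
```

Feeding it the output of `dmesg` line by line would give a list of events to compare against the times of the crashes.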

What I already tried

(Driver Error) I tested different driver versions from 387 to 396.51 and CUDA versions from 8.0 to 9.2, all with the same error. If it is a driver error, it is still present in the newest versions
(Thermal Issue) The error appears shortly after training starts, at a GPU temperature of 60 degrees Celsius
(Bus Error) I ran several stability tests, from gpu_burn to Unigine Heaven. All were stable. I also sent my graphics card to MSI, who found no device failure
(User App Error) I tried running several quick start examples from different deep learning frameworks (e.g. from TensorFlow and Torch), and all of them yield this error. The code runs fine on CPU
(Power Supply) I tried a much more powerful power supply unit (Corsair) than the one I have now; the error stays the same
(RAM) memtest shows no errors after several hours
(BIOS) Is flashed to the most recent version
(GPU BIOS) The updater says there is no newer version for my GTX 1080 Ti
(Intel Microcode) I flashed it manually to the most recent version

Any help on spotting the cause of this Xid is highly appreciated.
nvidia-bug-report.log.gz (141 KB)

generix · October 20, 2018, 12:16pm

You pretty much ruled out everything, so the only advice I can give is that memtest is not an effective way to check for a system memory fault. Please remove all but one memory module and check whether the issue reappears, then repeat with the next memory module.

gordie.acree · October 20, 2018, 2:17pm

Thanks for your answer. I forgot to mention that I also tried putting the GPU in another PC and the crash happened there too. I think this narrows it down to

1. Driver error
2. User App Error across multiple frameworks
3. Still a GPU hardware error

I believe 3. is not very likely, because I assume/hope MSI has appropriate tools to detect hardware defects. 2. seems possible, but I cannot assess how likely a bug across multiple frameworks is. My current favorite is 1.

Do you have an idea how to tell a driver bug apart from another issue, or how to contact an Nvidia developer to look at it?

generix · October 20, 2018, 6:23pm

OK, I have a suspicion: https://devtalk.nvidia.com/default/topic/483643/cuda-the-launch-timed-out-and-was-terminated/
The problem might be that the GPU is also used to drive the display, running X. The CUDA kernel then takes too much time and gets killed by the driver so the display can be updated. Try stopping X and then running the samples.
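
One way to check whether the GPU is driving a display is to ask `nvidia-smi` for its `display_active` field. The sketch below assumes `nvidia-smi` is on the PATH and that the field prints as `Enabled` or `Disabled`; the helper names and the sample text in the usage note are illustrative only and do not come from this thread.

```python
import subprocess

# Query one CSV row per GPU: index, product name, and whether a display
# is currently initialized on it.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,display_active",
    "--format=csv,noheader",
]


def parse_display_status(csv_text):
    """Turn nvidia-smi CSV rows into a list of dicts, one per GPU."""
    gpus = []
    for row in csv_text.strip().splitlines():
        index, rest = row.split(",", 1)      # first field: GPU index
        name, active = rest.rsplit(",", 1)   # last field: display_active
        gpus.append(
            {
                "index": int(index.strip()),
                "name": name.strip(),
                "display_active": active.strip() == "Enabled",
            }
        )
    return gpus


def query_display_status():
    """Run nvidia-smi and parse its output (needs an NVIDIA driver installed)."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_display_status(result.stdout)
```

For example, `parse_display_status("0, GeForce GTX 1080 Ti, Enabled")` reports `display_active` as `True` for GPU 0, which would be consistent with the display watchdog applying to kernels on that card. CUDA also exposes a `kernelExecTimeoutEnabled` device property that reports whether a run time limit applies.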

gordie.acree · October 21, 2018, 1:51pm

This sounds promising. I added

Option "Interactive" "0"

to the Device section of my xorg.conf and will test stability for a few days, with and without X running, then report back.

gordie.acree · October 24, 2018, 3:37pm

Thanks, the kernel timeout was the problem. I’ll try to train in smaller chunks.
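
The post does not say how the work would be split, so the following is only one possible reading: break each large batch into smaller pieces so that every step submitted to the GPU does less work. The helpers `iter_chunks` and `process_in_chunks` are hypothetical, and `step` stands in for whatever per-chunk training or inference call the framework provides.

```python
def iter_chunks(items, chunk_size):
    """Yield consecutive slices of at most chunk_size elements."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def process_in_chunks(items, chunk_size, step):
    """Call step() once per chunk and collect the results in order."""
    return [step(chunk) for chunk in iter_chunks(items, chunk_size)]


if __name__ == "__main__":
    batch = list(range(10))
    # sum() is a placeholder for a real per-chunk GPU step.
    print(process_in_chunks(batch, 4, sum))
```

Whether smaller chunks actually keep each kernel launch under the watchdog limit depends on the framework and the model, so the chunk size would have to be found by trial.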
