2080Ti got ERR soon after starting DL training

My 2080Ti shows ERR in the nvidia-smi status soon after I start DL training.
Sometimes it is the first GPU, and at other times the second one.

Do you think this comes from the Micron GDDR6 VRAM issue?

I'm using the latest driver, CUDA 10, and cuDNN 7.4.2, with PyTorch, and with Keras on a TensorFlow backend.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.25       Driver Version: 415.25       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:01:00.0 Off |                  N/A |
|ERR!  44C    P2    ERR! / 260W |     10MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:02:00.0 Off |                  N/A |
| 40%  31C    P8    11W / 260W  |     10MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.25       Driver Version: 415.25       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:01:00.0 Off |                  N/A |
| 41%  41C    P2    51W / 260W  |   2605MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:02:00.0 Off |                  N/A |
|ERR!  42C    P2    ERR! / 260W |    905MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
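For unattended runs, output like the tables above can be checked programmatically. A minimal Python sketch (the sample rows below are copied from the nvidia-smi output in this post; the function name is my own) that flags rows reporting ERR! for fan or power:

```python
def find_err_rows(smi_text: str) -> list[str]:
    """Return the nvidia-smi table rows that contain an 'ERR!' field."""
    return [line for line in smi_text.splitlines() if "ERR!" in line]

# Sample rows captured from nvidia-smi output (GPU 0 faulty, GPU 1 healthy).
sample = """\
|   0  GeForce RTX 208...  On   | 00000000:01:00.0 Off |                  N/A |
|ERR!  44C    P2    ERR! / 260W |     10MiB / 10989MiB |      0%      Default |
|   1  GeForce RTX 208...  On   | 00000000:02:00.0 Off |                  N/A |
| 40%  31C    P8    11W / 260W  |     10MiB / 10989MiB |      0%      Default |
"""

bad = find_err_rows(sample)
print(len(bad))  # 1 row contains ERR!
```

In practice you would feed it the output of `nvidia-smi` captured periodically during training, and alert as soon as the list is non-empty.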

Use cuda-memtest and gpu-burn to test them.
Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/

Thanks for the comment.

I ran gpu-burn.

I checked my four 2080 Tis one by one.

In addition, I attached nvidia-bug-report.sh.gz to this thread.
(The first and second logs were recorded with the two-GPU setup, and the third one was recorded after
booting with only the No.4 GPU.)

I hope this information helps in investigating the cause of the GPU problem.


For 2080Ti No.1
(Errors appeared after the GPU reached 70 C.)

GPU 0: GeForce RTX 2080 Ti (UUID: GPU-e63b2eba-1c12-2cc8-8203-cccb8c25ee44)
10.8% proc’d: 9030 (12497 Gflop/s) errors: 0 temps: 56 C
Summary at: Fri Jan 25 18:10:37 JST 2019

21.7% proc’d: 18662 (12408 Gflop/s) errors: 0 temps: 60 C
Summary at: Fri Jan 25 18:10:50 JST 2019

32.5% proc’d: 27692 (12402 Gflop/s) errors: 0 temps: 62 C
Summary at: Fri Jan 25 18:11:03 JST 2019

43.3% proc’d: 37324 (12322 Gflop/s) errors: 0 temps: 65 C
Summary at: Fri Jan 25 18:11:16 JST 2019

53.3% proc’d: 45752 (12282 Gflop/s) errors: 0 temps: 66 C
Summary at: Fri Jan 25 18:11:28 JST 2019

64.2% proc’d: 55384 (12275 Gflop/s) errors: 0 temps: 68 C
Summary at: Fri Jan 25 18:11:41 JST 2019

75.0% proc’d: 64414 (12258 Gflop/s) errors: 0 temps: 69 C
Summary at: Fri Jan 25 18:11:54 JST 2019

85.8% proc’d: 73444 (12187 Gflop/s) errors: 36 (WARNING!) temps: 70 C
Summary at: Fri Jan 25 18:12:07 JST 2019

96.7% proc’d: 83076 (12180 Gflop/s) errors: 22 (WARNING!) temps: 71 C
Summary at: Fri Jan 25 18:12:20 JST 2019

100.0% proc’d: 86086 (12165 Gflop/s) errors: 41 (WARNING!) temps: 71 C
Killing processes… done

Tested 1 GPUs:
GPU 0: FAULTY

For 2080Ti No.2
(Errors appeared after the GPU reached 60 C.)

GPU 0: GeForce RTX 2080 Ti (UUID: GPU-650a6f70-3c29-0fa8-6837-e5adcad6a0b9)
10.8% proc’d: 9030 (12450 Gflop/s) errors: 0 temps: 45 C
Summary at: Fri Jan 25 18:42:02 JST 2019

21.7% proc’d: 18662 (12360 Gflop/s) errors: 0 temps: 51 C
Summary at: Fri Jan 25 18:42:15 JST 2019

32.5% proc’d: 27692 (12346 Gflop/s) errors: 0 temps: 54 C
Summary at: Fri Jan 25 18:42:28 JST 2019

43.3% proc’d: 37324 (12287 Gflop/s) errors: 0 temps: 57 C
Summary at: Fri Jan 25 18:42:41 JST 2019

53.3% proc’d: 45752 (13642 Gflop/s) errors: 80488701 (WARNING!) temps: 60 C
Summary at: Fri Jan 25 18:42:53 JST 2019

64.2% proc’d: 56588 (13650 Gflop/s) errors: 117371138 (WARNING!) temps: 60 C
Summary at: Fri Jan 25 18:43:06 JST 2019

75.0% proc’d: 66822 (13648 Gflop/s) errors: 2910964 (WARNING!) temps: 60 C
Summary at: Fri Jan 25 18:43:19 JST 2019

85.8% proc’d: 77056 (13641 Gflop/s) errors: 5792098 (WARNING!) temps: 72 C
Summary at: Fri Jan 25 18:43:32 JST 2019

96.7% proc’d: 87290 (13629 Gflop/s) errors: 37369313 (WARNING!) temps: 72 C
Summary at: Fri Jan 25 18:43:45 JST 2019

100.0% proc’d: 91504 (13629 Gflop/s) errors: 17788159 (WARNING!) temps: 72 C

For 2080Ti No.3
(When the temperature went above 66 C, errors appeared and the reported temperature suddenly dropped. Sensor error?)

GPU 0: GeForce RTX 2080 Ti (UUID: GPU-a44a70ed-c3c6-a2d1-a727-bfe251278934)
10.8% proc’d: 9030 (12423 Gflop/s) errors: 0 temps: 49 C
Summary at: Fri Jan 25 21:02:25 JST 2019

21.7% proc’d: 18060 (12320 Gflop/s) errors: 0 temps: 54 C
Summary at: Fri Jan 25 21:02:38 JST 2019

32.5% proc’d: 27692 (12272 Gflop/s) errors: 0 temps: 57 C
Summary at: Fri Jan 25 21:02:51 JST 2019

43.3% proc’d: 36722 (12208 Gflop/s) errors: 0 temps: 60 C
Summary at: Fri Jan 25 21:03:04 JST 2019

53.3% proc’d: 45150 (12165 Gflop/s) errors: 0 temps: 62 C
Summary at: Fri Jan 25 21:03:16 JST 2019

64.2% proc’d: 54782 (12121 Gflop/s) errors: 0 temps: 65 C
Summary at: Fri Jan 25 21:03:29 JST 2019

75.0% proc’d: 63210 (12131 Gflop/s) errors: 0 temps: 66 C
Summary at: Fri Jan 25 21:03:42 JST 2019

100.0% proc’d: 70434 (12600 Gflop/s) errors: 56450387 (WARNING!) temps: 57 C
Summary at: Fri Jan 25 21:04:21 JST 2019

For 2080Ti No.4
(The GPU was not recognized by CUDA. It seems to be completely broken.)

No devices found.
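Per-GPU logs like the ones above can be summarized automatically. A small sketch (the regex is written against the gpu-burn progress-line format shown in this post, and the sample lines are copied from the No.1 log; the function name is my own) that reports the first temperature at which errors appear:

```python
import re

# Matches gpu-burn progress lines such as:
# "85.8% proc'd: 73444 (12187 Gflop/s) errors: 36 (WARNING!) temps: 70 C"
LINE = re.compile(r"errors:\s*(\d+).*temps:\s*(\d+)\s*C")

def first_failing_temp(log: str):
    """Return the temperature (C) on the first line with errors > 0, or None."""
    for line in log.splitlines():
        m = LINE.search(line)
        if m and int(m.group(1)) > 0:
            return int(m.group(2))
    return None

# Excerpt from the 2080Ti No.1 run above.
log = """\
75.0% proc'd: 64414 (12258 Gflop/s) errors: 0 temps: 69 C
85.8% proc'd: 73444 (12187 Gflop/s) errors: 36 (WARNING!) temps: 70 C
96.7% proc'd: 83076 (12180 Gflop/s) errors: 22 (WARNING!) temps: 71 C
"""

print(first_failing_temp(log))  # 70
```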

nvidia-bug-report.log.gz (484 KB)
nvidia-bug-report.log.old.gz (1.56 MB)
nvidia-bug-report.log.no4.gz (458 KB)

That doesn't look good. The fourth, missing card is falling off the bus with an Xid 79, which means either overheating or an insufficient power supply. Since the temperatures look normal, the flakiness of the other GPUs might also come from an unstable power source that can't sustain prolonged high power draw. Of course, there's also the chance that you got four duds out of four. So before filling out an RMA form, you should check them one by one in another system with a large enough, stable power supply.

Thank you for the suggestion.

Next week I will check them one by one in the system with the CORSAIR 1600W power supply, which should be
the most stable of the power supplies I have.

What's the brand of the PSU in your current system?

It is a Cooler Master 1200V Platinum.

I bought it two years ago and have used this system intensively
with dual TITAN X (Pascal) cards.

The CORSAIR 1600W, on the other hand, is a PSU I bought last month,
so it should be more stable.

Haven't heard about any specific issues with that one, so happy testing.

I have just checked them with the CORSAIR 1600W, one by one.

The results were the same. In fact, the situation got worse, probably because of the gpu-burn testing:
one more 2080Ti now gets an error during booting.

2080Ti No.1 -> errors above 67C while gpu-burn is running
2080Ti No.2 -> error during booting
2080Ti No.3 -> errors above 70C while gpu-burn is running
2080Ti No.4 -> error during booting

Two of the 2080 Tis went unrecognized (i.e., the error happened during the boot process), as follows:

[ 25.256152] nvidia: module license ‘NVIDIA’ taints kernel.
[ 25.256154] Disabling lock debugging due to kernel taint
[ 25.265790] nvidia-nvlink: Nvlink Core is being initialized, major device number 248
[ 25.266030] vgaarb: device changed decodes: PCI:0000:03:00.0,olddecodes=io+mem,decodes=none:owns=none
[ 25.283127] init: failsafe main process (787) killed by TERM signal
[ 25.367512] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 415.25 Wed Dec 12 10:22:08 CST 2018 (using threaded interrupts)
[ 25.576514] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 415.25 Wed Dec 12 10:02:42 CST 2018
[ 25.580708] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[ 26.142920] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 246
[ 26.180250] nvidia 0000:03:00.0: irq 87 for MSI/MSI-X
[ 26.974762] type=1400 audit(1548916214.748:11): apparmor=“STATUS” operation=“profile_replace” profile=“unconfined” name="/sbin/dhclient" pid=1091 comm=“apparmor_parser”
[ 27.450661] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[ 27.450725] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 28.280434] init: cups-browsed pre-start process (1178) terminated with status 1
[ 28.449272] ixgbe 0000:09:00.1: registered PHC device on eth5
[ 28.462982] init: alsa-restore main process (1232) terminated with status 99
[ 28.738053] init: nvidia-prime main process (1193) terminated with status 127
[ 29.158630] init: plymouth-upstart-bridge main process ended, respawning
[ 29.652301] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[ 29.653424] NFSD: starting 90-second grace period (net ffffffff81cdad40)
[ 30.181068] nvidia 0000:03:00.0: irq 87 for MSI/MSI-X
[ 34.454569] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[ 34.454614] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 34.591211] init: plymouth-stop pre-start process (1590) terminated with status 1
[ 114.871538] NFS: Registering the id_resolver key type
[ 114.871542] Key type id_resolver registered
[ 114.871543] Key type id_legacy registered
[ 114.880415] RPC: AUTH_GSS upcall failed. Please check user daemon is running.
[ 114.928727] cgroup: systemd-logind (605) created nested cgroup for controller “memory” which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
[ 114.928728] cgroup: “memory” requires setting use_hierarchy to 1 on the root.
[ 124.580989] nvidia 0000:03:00.0: irq 87 for MSI/MSI-X
[ 136.827948] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[ 136.827986] NVRM: rm_init_adapter failed for device bearing minor number 0
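To pull just the driver failures out of a long dmesg dump like the one above, a short sketch (the sample lines are copied from this log; the function name and keyword list are my own, with "Xid" included since that is the error generix mentioned):

```python
def nvrm_failures(dmesg: str) -> list[str]:
    """Return kernel log lines indicating NVIDIA driver init failures or Xid errors."""
    keys = ("RmInitAdapter failed", "rm_init_adapter failed", "Xid")
    return [line for line in dmesg.splitlines()
            if any(k in line for k in keys)]

# Excerpt from the boot log above: two driver failures plus an unrelated NIC line.
dmesg = """\
[   27.450661] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[   27.450725] NVRM: rm_init_adapter failed for device bearing minor number 0
[   28.449272] ixgbe 0000:09:00.1: registered PHC device on eth5
"""

for line in nvrm_failures(dmesg):
    print(line)
```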

For the remaining ones, on the other hand, the following messages were shown (meaning the 2080Ti was recognized as a GPU successfully):
[ 20.819679] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 246
[ 20.862886] nvidia 0000:03:00.0: irq 87 for MSI/MSI-X
[ 20.935471] type=1400 audit(1548917674.712:8): apparmor=“STATUS” operation=“profile_replace” profile=“unconfined” name="/sbin/dhclient" pid=921 comm=“apparmor_parser”
[ 20.935475] type=1400 audit(1548917674.712:9): apparmor=“STATUS” operation=“profile_replace” profile=“unconfined” name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=921 comm=“apparmor_parser”
[ 20.935476] type=1400 audit(1548917674.712:10): apparmor=“STATUS” operation=“profile_replace” profile=“unconfined” name="/usr/lib/connman/scripts/dhclient-script" pid=921 comm=“apparmor_parser”
[ 20.940604] type=1400 audit(1548917674.720:11): apparmor=“STATUS” operation=“profile_load” profile=“unconfined” name="/usr/sbin/cups-browsed" pid=925 comm=“apparmor_parser”
[ 21.092958] ixgbe 0000:09:00.1: registered PHC device on eth4
[ 21.196890] init: cups-browsed pre-start process (1004) terminated with status 1
[ 21.227381] init: nvidia-prime main process (1016) terminated with status 127
[ 21.238883] init: alsa-restore main process (1061) terminated with status 19
[ 21.249032] input: HDA Intel Front Headphone as /devices/pci0000:00/0000:00:1f.3/sound/card0/input13
[ 21.249381] input: HDA Intel Line Out CLFE as /devices/pci0000:00/0000:00:1f.3/sound/card0/input12
[ 21.249639] input: HDA Intel Line Out Surround as /devices/pci0000:00/0000:00:1f.3/sound/card0/input11
[ 21.249824] input: HDA Intel Line Out Front as /devices/pci0000:00/0000:00:1f.3/sound/card0/input10
[ 21.250035] input: HDA Intel Line as /devices/pci0000:00/0000:00:1f.3/sound/card0/input9
[ 21.250189] input: HDA Intel Rear Mic as /devices/pci0000:00/0000:00:1f.3/sound/card0/input8
[ 21.250382] input: HDA Intel Front Mic as /devices/pci0000:00/0000:00:1f.3/sound/card0/input7
[ 21.252229] hda_intel: Disabling MSI
[ 21.252234] hda-intel 0000:03:00.1: Handle VGA-switcheroo audio client
[ 21.252283] hda-intel 0000:03:00.1: Disabling 64bit DMA
[ 21.256349] hda-intel 0000:03:00.1: Enable delay in RIRB handling
[ 21.532582] autoconfig: line_outs=0 (0x0/0x0/0x0/0x0/0x0) type:line
[ 21.532585] speaker_outs=0 (0x0/0x0/0x0/0x0/0x0)
[ 21.532586] hp_outs=0 (0x0/0x0/0x0/0x0/0x0)
[ 21.532586] mono: mono_out=0x0
[ 21.532587] dig-out=0x4/0x5
[ 21.532588] inputs:
[ 21.616303] input: HDA NVidia HDMI as /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:10.0/0000:03:00.1/sound/card1/input15
[ 21.616445] input: HDA NVidia HDMI as /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:10.0/0000:03:00.1/sound/card1/input14

2080Ti No.1
GPU BURN test
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-e63b2eba-1c12-2cc8-8203-cccb8c25ee44)
10.8% proc’d: 9030 (12517 Gflop/s) errors: 0 temps: 52 C
Summary at: Thu Jan 31 15:57:25 JST 2019

21.7% proc’d: 18060 (12413 Gflop/s) errors: 0 temps: 58 C
Summary at: Thu Jan 31 15:57:38 JST 2019

32.5% proc’d: 27692 (12373 Gflop/s) errors: 0 temps: 62 C
Summary at: Thu Jan 31 15:57:51 JST 2019

43.3% proc’d: 37324 (12264 Gflop/s) errors: 1 (WARNING!) temps: 67 C
Summary at: Thu Jan 31 15:58:04 JST 2019

53.3% proc’d: 45752 (12196 Gflop/s) errors: 649 (WARNING!) temps: 69 C
Summary at: Thu Jan 31 15:58:16 JST 2019

64.2% proc’d: 54782 (12145 Gflop/s) errors: 155850 (WARNING!) temps: 72 C
Summary at: Thu Jan 31 15:58:29 JST 2019

75.0% proc’d: 63812 (12056 Gflop/s) errors: 2313700 (WARNING!) temps: 74 C
Summary at: Thu Jan 31 15:58:42 JST 2019

85.8% proc’d: 72842 (12043 Gflop/s) errors: 19263364 (WARNING!) temps: 77 C
Summary at: Thu Jan 31 15:58:55 JST 2019

96.7% proc’d: 81872 (11982 Gflop/s) errors: 68848666 (WARNING!) temps: 79 C
Summary at: Thu Jan 31 15:59:08 JST 2019

100.0% proc’d: 85484 (11779 Gflop/s) errors: 94160506 (WARNING!) temps: 79 C

After GPU BURN
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.25       Driver Version: 415.25       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:03:00.0 Off |                  N/A |
|ERR!  52C    P2    ERR! / 260W |     51MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+


2080Ti No.3
GPU BURN test
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-a44a70ed-c3c6-a2d1-a727-bfe251278934)
10.8% proc’d: 6622 (12375 Gflop/s) errors: 0 temps: 52 C
Summary at: Thu Jan 31 15:40:38 JST 2019

21.7% proc’d: 16254 (12237 Gflop/s) errors: 0 temps: 59 C
Summary at: Thu Jan 31 15:40:51 JST 2019

32.5% proc’d: 25284 (12190 Gflop/s) errors: 0 temps: 62 C
Summary at: Thu Jan 31 15:41:04 JST 2019

43.3% proc’d: 34314 (12104 Gflop/s) errors: 0 temps: 66 C
Summary at: Thu Jan 31 15:41:17 JST 2019

53.3% proc’d: 42742 (11994 Gflop/s) errors: 0 temps: 69 C
Summary at: Thu Jan 31 15:41:29 JST 2019

64.2% proc’d: 52374 (13577 Gflop/s) errors: 4386181 (WARNING!) temps: 72 C
Summary at: Thu Jan 31 15:41:42 JST 2019

75.0% proc’d: 61404 (12607 Gflop/s) errors: 0 temps: 73 C
Summary at: Thu Jan 31 15:41:55 JST 2019

85.8% proc’d: 71638 (12608 Gflop/s) errors: 0 temps: 76 C
Summary at: Thu Jan 31 15:42:08 JST 2019

96.7% proc’d: 81872 (13296 Gflop/s) errors: 2345380 (WARNING!) temps: 79 C
Summary at: Thu Jan 31 15:42:21 JST 2019

100.0% proc’d: 85484 (13393 Gflop/s) errors: 0 temps: 79 C

After GPU BURN
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.25       Driver Version: 415.25       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:03:00.0 Off |                  N/A |
|ERR!  52C    P2    ERR! / 260W |     51MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Then it's time to RMA them; they're obviously degrading more and more. Four duds out of four must really have been a bad batch. How long did you use them?

In fact, I only used them for several days. On first use, training a deep learning model with PyTorch and cuDNN, one of the 2080 Tis got lost during training. After that, I removed the lost GPU and tried training several more times, but the results were the same: the other one got lost as well.

Besides the NVIDIA reference-model 2080 Tis, I also have two MSI RTX 2080 Tis. They have no problems at all and are working well.

Generix, thank you for the helpful suggestions and advice. I really appreciate your help.

I have this problem too: 4 of my 6 2080 Tis got ERR and then became damaged (the PC cannot even boot).