2080Ti got ERR soon after starting DL training

My 2080Ti shows ERR in the nvidia-smi status soon after I start DL training.
Sometimes it is the first GPU, and at other times the second one.

Do you think this comes from the Micron GDDR6 VRAM issue?

I'm using the latest driver, CUDA 10, and cuDNN 7.4.2, with PyTorch, and with Keras on a TensorFlow backend.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.25       Driver Version: 415.25       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:01:00.0 Off |                  N/A |
|ERR!  44C    P2    ERR! / 260W |     10MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:02:00.0 Off |                  N/A |
| 40%  31C    P8    11W / 260W  |     10MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.25       Driver Version: 415.25       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:01:00.0 Off |                  N/A |
| 41%  41C    P2    51W / 260W  |   2605MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:02:00.0 Off |                  N/A |
|ERR!  42C    P2    ERR! / 260W |    905MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
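For unattended runs, output like the tables above can be checked programmatically. A minimal Python sketch (the sample rows below are copied from the nvidia-smi output in this post; the function name is my own) that flags rows reporting ERR! for fan or power:

```python
def find_err_rows(smi_text: str) -> list[str]:
    """Return the nvidia-smi table rows that contain an 'ERR!' field."""
    return [line for line in smi_text.splitlines() if "ERR!" in line]

# Sample rows captured from nvidia-smi output (GPU 0 faulty, GPU 1 healthy).
sample = """\
|   0  GeForce RTX 208...  On   | 00000000:01:00.0 Off |                  N/A |
|ERR!  44C    P2    ERR! / 260W |     10MiB / 10989MiB |      0%      Default |
|   1  GeForce RTX 208...  On   | 00000000:02:00.0 Off |                  N/A |
| 40%  31C    P8    11W / 260W  |     10MiB / 10989MiB |      0%      Default |
"""

bad = find_err_rows(sample)
print(len(bad))  # 1 row contains ERR!
```

In practice you would feed it the output of `nvidia-smi` captured periodically during training, and alert as soon as the list is non-empty.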

Use cuda-memtest and gpu-burn to test them.
Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/

Thanks for the comment.

I ran gpu-burn.

I checked my four 2080 Tis one by one.

In addition, I attached nvidia-bug-report.sh.gz to this thread.
(The first and second logs were recorded with the two-GPU setup, and the third one was recorded after
booting with only the No.4 GPU.)

I hope this information helps in investigating the cause of the GPU problem.


For 2080Ti No.1
(Errors appeared after the GPU reached 70 C.)

GPU 0: GeForce RTX 2080 Ti (UUID: GPU-e63b2eba-1c12-2cc8-8203-cccb8c25ee44)
10.8% proc’d: 9030 (12497 Gflop/s) errors: 0 temps: 56 C
Summary at: Fri Jan 25 18:10:37 JST 2019

21.7% proc’d: 18662 (12408 Gflop/s) errors: 0 temps: 60 C
Summary at: Fri Jan 25 18:10:50 JST 2019

32.5% proc’d: 27692 (12402 Gflop/s) errors: 0 temps: 62 C
Summary at: Fri Jan 25 18:11:03 JST 2019

43.3% proc’d: 37324 (12322 Gflop/s) errors: 0 temps: 65 C
Summary at: Fri Jan 25 18:11:16 JST 2019

53.3% proc’d: 45752 (12282 Gflop/s) errors: 0 temps: 66 C
Summary at: Fri Jan 25 18:11:28 JST 2019

64.2% proc’d: 55384 (12275 Gflop/s) errors: 0 temps: 68 C
Summary at: Fri Jan 25 18:11:41 JST 2019

75.0% proc’d: 64414 (12258 Gflop/s) errors: 0 temps: 69 C
Summary at: Fri Jan 25 18:11:54 JST 2019

85.8% proc’d: 73444 (12187 Gflop/s) errors: 36 (WARNING!) temps: 70 C
Summary at: Fri Jan 25 18:12:07 JST 2019

96.7% proc’d: 83076 (12180 Gflop/s) errors: 22 (WARNING!) temps: 71 C
Summary at: Fri Jan 25 18:12:20 JST 2019

100.0% proc’d: 86086 (12165 Gflop/s) errors: 41 (WARNING!) temps: 71 C
Killing processes… done

Tested 1 GPUs:
GPU 0: FAULTY

For 2080Ti No.2
(Errors appeared after the GPU reached 60 C.)

GPU 0: GeForce RTX 2080 Ti (UUID: GPU-650a6f70-3c29-0fa8-6837-e5adcad6a0b9)
10.8% proc’d: 9030 (12450 Gflop/s) errors: 0 temps: 45 C
Summary at: Fri Jan 25 18:42:02 JST 2019

21.7% proc’d: 18662 (12360 Gflop/s) errors: 0 temps: 51 C
Summary at: Fri Jan 25 18:42:15 JST 2019

32.5% proc’d: 27692 (12346 Gflop/s) errors: 0 temps: 54 C
Summary at: Fri Jan 25 18:42:28 JST 2019

43.3% proc’d: 37324 (12287 Gflop/s) errors: 0 temps: 57 C
Summary at: Fri Jan 25 18:42:41 JST 2019

53.3% proc’d: 45752 (13642 Gflop/s) errors: 80488701 (WARNING!) temps: 60 C
Summary at: Fri Jan 25 18:42:53 JST 2019

64.2% proc’d: 56588 (13650 Gflop/s) errors: 117371138 (WARNING!) temps: 60 C
Summary at: Fri Jan 25 18:43:06 JST 2019

75.0% proc’d: 66822 (13648 Gflop/s) errors: 2910964 (WARNING!) temps: 60 C
Summary at: Fri Jan 25 18:43:19 JST 2019

85.8% proc’d: 77056 (13641 Gflop/s) errors: 5792098 (WARNING!) temps: 72 C
Summary at: Fri Jan 25 18:43:32 JST 2019

96.7% proc’d: 87290 (13629 Gflop/s) errors: 37369313 (WARNING!) temps: 72 C
Summary at: Fri Jan 25 18:43:45 JST 2019

100.0% proc’d: 91504 (13629 Gflop/s) errors: 17788159 (WARNING!) temps: 72 C

For 2080Ti No.3
(When the temperature went above 66 C, errors appeared and the reported temperature suddenly dropped. Sensor error?)

GPU 0: GeForce RTX 2080 Ti (UUID: GPU-a44a70ed-c3c6-a2d1-a727-bfe251278934)
10.8% proc’d: 9030 (12423 Gflop/s) errors: 0 temps: 49 C
Summary at: Fri Jan 25 21:02:25 JST 2019

21.7% proc’d: 18060 (12320 Gflop/s) errors: 0 temps: 54 C
Summary at: Fri Jan 25 21:02:38 JST 2019

32.5% proc’d: 27692 (12272 Gflop/s) errors: 0 temps: 57 C
Summary at: Fri Jan 25 21:02:51 JST 2019

43.3% proc’d: 36722 (12208 Gflop/s) errors: 0 temps: 60 C
Summary at: Fri Jan 25 21:03:04 JST 2019

53.3% proc’d: 45150 (12165 Gflop/s) errors: 0 temps: 62 C
Summary at: Fri Jan 25 21:03:16 JST 2019

64.2% proc’d: 54782 (12121 Gflop/s) errors: 0 temps: 65 C
Summary at: Fri Jan 25 21:03:29 JST 2019

75.0% proc’d: 63210 (12131 Gflop/s) errors: 0 temps: 66 C
Summary at: Fri Jan 25 21:03:42 JST 2019

100.0% proc’d: 70434 (12600 Gflop/s) errors: 56450387 (WARNING!) temps: 57 C
Summary at: Fri Jan 25 21:04:21 JST 2019

For 2080Ti No.4
(The GPU was not recognized by CUDA. It seems to be completely broken.)

No devices found.
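Per-GPU logs like the ones above can be summarized automatically. A small sketch (the regex is written against the gpu-burn progress-line format shown in this post, and the sample lines are copied from the No.1 log; the function name is my own) that reports the first temperature at which errors appear:

```python
import re

# Matches gpu-burn progress lines such as:
# "85.8% proc'd: 73444 (12187 Gflop/s) errors: 36 (WARNING!) temps: 70 C"
LINE = re.compile(r"errors:\s*(\d+).*temps:\s*(\d+)\s*C")

def first_failing_temp(log: str):
    """Return the temperature (C) on the first line with errors > 0, or None."""
    for line in log.splitlines():
        m = LINE.search(line)
        if m and int(m.group(1)) > 0:
            return int(m.group(2))
    return None

# Excerpt from the 2080Ti No.1 run above.
log = """\
75.0% proc'd: 64414 (12258 Gflop/s) errors: 0 temps: 69 C
85.8% proc'd: 73444 (12187 Gflop/s) errors: 36 (WARNING!) temps: 70 C
96.7% proc'd: 83076 (12180 Gflop/s) errors: 22 (WARNING!) temps: 71 C
"""

print(first_failing_temp(log))  # 70
```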

nvidia-bug-report.log.gz (484 KB)
nvidia-bug-report.log.old.gz (1.56 MB)
nvidia-bug-report.log.no4.gz (458 KB)

That doesn't look good. The fourth, missing card is falling off the bus with an Xid 79, which means either overheating or an insufficient power supply. Since the temperatures look normal, the flakiness of the other GPUs might also come from an unstable power source that can't sustain prolonged high power draw. Of course, there's also the chance that you got four duds out of four. So before filling out an RMA form, you should check them one by one in another system with a large enough, stable power supply.

Thank you for the suggestion.

Next week I will check them one by one in the system with the CORSAIR 1600W power supply, which should be
the most stable of the power supplies I have.

What's the brand of the PSU in your current system?

It is a Cooler Master 1200V Platinum.

I bought it two years ago and have used this system intensively
with dual TITAN X (Pascal) cards.

The CORSAIR 1600W, on the other hand, is a PSU I bought last month,
so it should be more stable.

Haven't heard about any specific issues with that one, so happy testing.

I have just checked them with the CORSAIR 1600W, one by one.

The results were the same. In fact, the situation got worse, probably because of the gpu-burn testing:
one more 2080Ti now gets an error during booting.

2080Ti No.1 -> errors above 67C while gpu-burn is running
2080Ti No.2 -> error during booting
2080Ti No.3 -> errors above 70C while gpu-burn is running
2080Ti No.4 -> error during booting

Two of the 2080 Tis went unrecognized (i.e., the error happened during the boot process), as follows:

[ 25.256152] nvidia: module license ‘NVIDIA’ taints kernel.
[ 25.256154] Disabling lock debugging due to kernel taint
[ 25.265790] nvidia-nvlink: Nvlink Core is being initialized, major device number 248
[ 25.266030] vgaarb: device changed decodes: PCI:0000:03:00.0,olddecodes=io+mem,decodes=none:owns=none
[ 25.283127] init: failsafe main process (787) killed by TERM signal
[ 25.367512] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 415.25 Wed Dec 12 10:22:08 CST 2018 (using threaded interrupts)
[ 25.576514] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 415.25 Wed Dec 12 10:02:42 CST 2018
[ 25.580708] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[ 26.142920] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 246
[ 26.180250] nvidia 0000:03:00.0: irq 87 for MSI/MSI-X
[ 26.974762] type=1400 audit(1548916214.748:11): apparmor=“STATUS” operation=“profile_replace” profile=“unconfined” name="/sbin/dhclient" pid=1091 comm=“apparmor_parser”
[ 27.450661] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[ 27.450725] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 28.280434] init: cups-browsed pre-start process (1178) terminated with status 1
[ 28.449272] ixgbe 0000:09:00.1: registered PHC device on eth5
[ 28.462982] init: alsa-restore main process (1232) terminated with status 99
[ 28.738053] init: nvidia-prime main process (1193) terminated with status 127
[ 29.158630] init: plymouth-upstart-bridge main process ended, respawning
[ 29.652301] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[ 29.653424] NFSD: starting 90-second grace period (net ffffffff81cdad40)
[ 30.181068] nvidia 0000:03:00.0: irq 87 for MSI/MSI-X
[ 34.454569] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[ 34.454614] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 34.591211] init: plymouth-stop pre-start process (1590) terminated with status 1
[ 114.871538] NFS: Registering the id_resolver key type
[ 114.871542] Key type id_resolver registered
[ 114.871543] Key type id_legacy registered
[ 114.880415] RPC: AUTH_GSS upcall failed. Please check user daemon is running.
[ 114.928727] cgroup: systemd-logind (605) created nested cgroup for controller “memory” which has incomplete hierarchy support. Nested cgroups may change behavior in the future.
[ 114.928728] cgroup: “memory” requires setting use_hierarchy to 1 on the root.
[ 124.580989] nvidia 0000:03:00.0: irq 87 for MSI/MSI-X
[ 136.827948] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[ 136.827986] NVRM: rm_init_adapter failed for device bearing minor number 0
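To pull just the driver failures out of a long dmesg dump like the one above, a short sketch (the sample lines are copied from this log; the function name and keyword list are my own, with "Xid" included since that is the error generix mentioned):

```python
def nvrm_failures(dmesg: str) -> list[str]:
    """Return kernel log lines indicating NVIDIA driver init failures or Xid errors."""
    keys = ("RmInitAdapter failed", "rm_init_adapter failed", "Xid")
    return [line for line in dmesg.splitlines()
            if any(k in line for k in keys)]

# Excerpt from the boot log above: two driver failures plus an unrelated NIC line.
dmesg = """\
[   27.450661] NVRM: RmInitAdapter failed! (0x26:0xffff:1098)
[   27.450725] NVRM: rm_init_adapter failed for device bearing minor number 0
[   28.449272] ixgbe 0000:09:00.1: registered PHC device on eth5
"""

for line in nvrm_failures(dmesg):
    print(line)
```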

For the remaining ones, on the other hand, the following messages were shown (meaning the 2080Ti was recognized as a GPU successfully):
[ 20.819679] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 246
[ 20.862886] nvidia 0000:03:00.0: irq 87 for MSI/MSI-X
[ 20.935471] type=1400 audit(1548917674.712:8): apparmor=“STATUS” operation=“profile_replace” profile=“unconfined” name="/sbin/dhclient" pid=921 comm=“apparmor_parser”
[ 20.935475] type=1400 audit(1548917674.712:9): apparmor=“STATUS” operation=“profile_replace” profile=“unconfined” name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=921 comm=“apparmor_parser”
[ 20.935476] type=1400 audit(1548917674.712:10): apparmor=“STATUS” operation=“profile_replace” profile=“unconfined” name="/usr/lib/connman/scripts/dhclient-script" pid=921 comm=“apparmor_parser”
[ 20.940604] type=1400 audit(1548917674.720:11): apparmor=“STATUS” operation=“profile_load” profile=“unconfined” name="/usr/sbin/cups-browsed" pid=925 comm=“apparmor_parser”
[ 21.092958] ixgbe 0000:09:00.1: registered PHC device on eth4
[ 21.196890] init: cups-browsed pre-start process (1004) terminated with status 1
[ 21.227381] init: nvidia-prime main process (1016) terminated with status 127
[ 21.238883] init: alsa-restore main process (1061) terminated with status 19
[ 21.249032] input: HDA Intel Front Headphone as /devices/pci0000:00/0000:00:1f.3/sound/card0/input13
[ 21.249381] input: HDA Intel Line Out CLFE as /devices/pci0000:00/0000:00:1f.3/sound/card0/input12
[ 21.249639] input: HDA Intel Line Out Surround as /devices/pci0000:00/0000:00:1f.3/sound/card0/input11
[ 21.249824] input: HDA Intel Line Out Front as /devices/pci0000:00/0000:00:1f.3/sound/card0/input10
[ 21.250035] input: HDA Intel Line as /devices/pci0000:00/0000:00:1f.3/sound/card0/input9
[ 21.250189] input: HDA Intel Rear Mic as /devices/pci0000:00/0000:00:1f.3/sound/card0/input8
[ 21.250382] input: HDA Intel Front Mic as /devices/pci0000:00/0000:00:1f.3/sound/card0/input7
[ 21.252229] hda_intel: Disabling MSI
[ 21.252234] hda-intel 0000:03:00.1: Handle VGA-switcheroo audio client
[ 21.252283] hda-intel 0000:03:00.1: Disabling 64bit DMA
[ 21.256349] hda-intel 0000:03:00.1: Enable delay in RIRB handling
[ 21.532582] autoconfig: line_outs=0 (0x0/0x0/0x0/0x0/0x0) type:line
[ 21.532585] speaker_outs=0 (0x0/0x0/0x0/0x0/0x0)
[ 21.532586] hp_outs=0 (0x0/0x0/0x0/0x0/0x0)
[ 21.532586] mono: mono_out=0x0
[ 21.532587] dig-out=0x4/0x5
[ 21.532588] inputs:
[ 21.616303] input: HDA NVidia HDMI as /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:10.0/0000:03:00.1/sound/card1/input15
[ 21.616445] input: HDA NVidia HDMI as /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:10.0/0000:03:00.1/sound/card1/input14

2080Ti No.1
GPU BURN test
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-e63b2eba-1c12-2cc8-8203-cccb8c25ee44)
10.8% proc’d: 9030 (12517 Gflop/s) errors: 0 temps: 52 C
Summary at: Thu Jan 31 15:57:25 JST 2019

21.7% proc’d: 18060 (12413 Gflop/s) errors: 0 temps: 58 C
Summary at: Thu Jan 31 15:57:38 JST 2019

32.5% proc’d: 27692 (12373 Gflop/s) errors: 0 temps: 62 C
Summary at: Thu Jan 31 15:57:51 JST 2019

43.3% proc’d: 37324 (12264 Gflop/s) errors: 1 (WARNING!) temps: 67 C
Summary at: Thu Jan 31 15:58:04 JST 2019

53.3% proc’d: 45752 (12196 Gflop/s) errors: 649 (WARNING!) temps: 69 C
Summary at: Thu Jan 31 15:58:16 JST 2019

64.2% proc’d: 54782 (12145 Gflop/s) errors: 155850 (WARNING!) temps: 72 C
Summary at: Thu Jan 31 15:58:29 JST 2019

75.0% proc’d: 63812 (12056 Gflop/s) errors: 2313700 (WARNING!) temps: 74 C
Summary at: Thu Jan 31 15:58:42 JST 2019

85.8% proc’d: 72842 (12043 Gflop/s) errors: 19263364 (WARNING!) temps: 77 C
Summary at: Thu Jan 31 15:58:55 JST 2019

96.7% proc’d: 81872 (11982 Gflop/s) errors: 68848666 (WARNING!) temps: 79 C
Summary at: Thu Jan 31 15:59:08 JST 2019

100.0% proc’d: 85484 (11779 Gflop/s) errors: 94160506 (WARNING!) temps: 79 C

After GPU BURN
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.25       Driver Version: 415.25       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:03:00.0 Off |                  N/A |
|ERR!  52C    P2    ERR! / 260W |     51MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+


2080Ti No.3
GPU BURN test
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-a44a70ed-c3c6-a2d1-a727-bfe251278934)
10.8% proc’d: 6622 (12375 Gflop/s) errors: 0 temps: 52 C
Summary at: Thu Jan 31 15:40:38 JST 2019

21.7% proc’d: 16254 (12237 Gflop/s) errors: 0 temps: 59 C
Summary at: Thu Jan 31 15:40:51 JST 2019

32.5% proc’d: 25284 (12190 Gflop/s) errors: 0 temps: 62 C
Summary at: Thu Jan 31 15:41:04 JST 2019

43.3% proc’d: 34314 (12104 Gflop/s) errors: 0 temps: 66 C
Summary at: Thu Jan 31 15:41:17 JST 2019

53.3% proc’d: 42742 (11994 Gflop/s) errors: 0 temps: 69 C
Summary at: Thu Jan 31 15:41:29 JST 2019

64.2% proc’d: 52374 (13577 Gflop/s) errors: 4386181 (WARNING!) temps: 72 C
Summary at: Thu Jan 31 15:41:42 JST 2019

75.0% proc’d: 61404 (12607 Gflop/s) errors: 0 temps: 73 C
Summary at: Thu Jan 31 15:41:55 JST 2019

85.8% proc’d: 71638 (12608 Gflop/s) errors: 0 temps: 76 C
Summary at: Thu Jan 31 15:42:08 JST 2019

96.7% proc’d: 81872 (13296 Gflop/s) errors: 2345380 (WARNING!) temps: 79 C
Summary at: Thu Jan 31 15:42:21 JST 2019

100.0% proc’d: 85484 (13393 Gflop/s) errors: 0 temps: 79 C

After GPU BURN
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.25       Driver Version: 415.25       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:03:00.0 Off |                  N/A |
|ERR!  52C    P2    ERR! / 260W |     51MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Then it's time to RMA them; they're obviously degrading more and more. Four duds out of four must really have been a bad batch. How long did you use them?

In fact, I only used them for several days. On first use, training a deep learning model with PyTorch and cuDNN, one of the 2080 Tis got lost during training. After that, I removed the lost GPU and tried training several more times, but the results were the same: the other one got lost as well.

Besides the NVIDIA reference-model 2080 Tis, I also have two MSI RTX 2080 Tis. They have no problems at all and are working well.

Generix, thank you for the helpful suggestions and advice. I really appreciate your help.

I have this problem too: 4 of my 6 2080 Tis got ERR and then became damaged (the PC cannot even boot).