Huge loss on RTX 2080 Ti issue

Hi,

I have 2 RTX 2080 Ti cards used mostly for training DNNs. After two weeks, one of them started returning a huge loss during training. At first I thought it was a network architecture issue, but everything is fine on the other card. I also ran the official Docker image nvcr.io/nvidia/tensorflow:18.03-py2 for some tests. Here are my results:

Input (RTX with no issue observed):

export CUDA_VISIBLE_DEVICES=0
python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=1

Output:

Training
  Step Epoch Img/sec   Loss   LR
     1     1    15.5   9.865 0.10000
     2     1    35.6  13.123 0.10000
     3     1    58.3  14.424 0.10000
     4     1    84.0  15.467 0.10000
     5     1    91.6  15.365 0.10000
     6     1   110.1  15.450 0.10000
     7     1   111.4  15.062 0.10000
     8     1   120.9  14.706 0.10000
     9     1   131.8  14.827 0.10000
    10     1   147.5  14.493 0.10000

Input (RTX with issue observed):

export CUDA_VISIBLE_DEVICES=1
python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=1

Output:

Training
  Step Epoch Img/sec   Loss   LR
     1     1    15.9   9.414 0.10000
     2     1    36.4  14.372 0.10000
     3     1    55.5  17.723 0.10000
     4     1    64.9 1052631.125 0.10000
     5     1    89.4 2593921.500 0.10000
     6     1    89.1 4573577.000 0.10000
     7     1   109.1 6924434.500 0.10000
     8     1   109.0 4866526720.000 0.10000
     9     1   135.3     inf 0.10000
    10     1   153.8     nan 0.10000
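
To separate a hardware fault from a framework or model problem, another quick check is to run one deterministic computation per card and compare it against the CPU. This is just a sketch (PyTorch, only because it is short), run once with CUDA_VISIBLE_DEVICES=0 and once with CUDA_VISIBLE_DEVICES=1:

# Per-card numerical sanity check: compute the same matrix product on the
# visible GPU and on the CPU, then compare. Run once per card by setting
# CUDA_VISIBLE_DEVICES before launching.
import torch

torch.manual_seed(0)
a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

cpu_result = a @ b
gpu_result = (a.cuda() @ b.cuda()).cpu()

print("max abs difference vs CPU:", (cpu_result - gpu_result).abs().max().item())
# A healthy card should show only a small floating-point rounding difference;
# a faulty one tends to produce huge values, inf, or NaN here as well.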

Does anyone have the same problem or know what the cause is?


Same issue here. Any ideas?

Same issue here. I have tested the MNIST examples: huge loss and accuracy stuck at 0.09 (tested with Keras and PyTorch on both Windows and Ubuntu).
It looks like the 2080 Ti has a serious hardware issue. Sometimes it shows weird lines on screen and freezes both Windows and Ubuntu.
Can the warranty do anything?
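
For anyone who wants to reproduce this kind of test without downloading MNIST, here is a rough stand-in (PyTorch, synthetic 10-class data, so the exact numbers are only illustrative): on a healthy card the loss drops and accuracy climbs well above chance, while a card with this problem should show the same kind of explosion as in the logs above.

# Single-GPU training smoke test on synthetic "MNIST-like" data (784
# features, 10 classes), so no dataset download is needed. Pin the card to
# test with CUDA_VISIBLE_DEVICES before running.
import torch
import torch.nn as nn

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

n, d, classes = 10000, 784, 10
x = torch.randn(n, d)
y = (x @ torch.randn(d, classes)).argmax(dim=1)  # labels from a fixed random projection

model = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, classes)).to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x, y = x.to(device), y.to(device)
for step in range(1, 501):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    if step % 100 == 0:
        acc = (model(x).argmax(dim=1) == y).float().mean().item()
        print(f"step {step:4d}  loss {loss.item():.4f}  acc {acc:.3f}")
# Healthy GPU: loss falls steadily and accuracy ends well above chance (0.10).
# A faulty card typically shows exploding loss / NaN and roughly chance accuracy.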