Huge loss on RTX 2080 Ti issue

Hi,

I have 2 RTX 2080 Ti cards used mostly for training DNNs. After two weeks, one of them started returning a huge loss during training. At first I thought it was a network architecture issue, but everything is fine on the other card. I also ran the official Docker image nvcr.io/nvidia/tensorflow:18.03-py2 for some tests. Here are my results:

Input (RTX with no issue observed):

export CUDA_VISIBLE_DEVICES=0
python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=1

Output:

Training
  Step Epoch Img/sec   Loss   LR
     1     1    15.5   9.865 0.10000
     2     1    35.6  13.123 0.10000
     3     1    58.3  14.424 0.10000
     4     1    84.0  15.467 0.10000
     5     1    91.6  15.365 0.10000
     6     1   110.1  15.450 0.10000
     7     1   111.4  15.062 0.10000
     8     1   120.9  14.706 0.10000
     9     1   131.8  14.827 0.10000
    10     1   147.5  14.493 0.10000

Input (RTX with issue observed):

export CUDA_VISIBLE_DEVICES=1
python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=1

Output:

Training
  Step Epoch Img/sec   Loss   LR
     1     1    15.9   9.414 0.10000
     2     1    36.4  14.372 0.10000
     3     1    55.5  17.723 0.10000
     4     1    64.9 1052631.125 0.10000
     5     1    89.4 2593921.500 0.10000
     6     1    89.1 4573577.000 0.10000
     7     1   109.1 6924434.500 0.10000
     8     1   109.0 4866526720.000 0.10000
     9     1   135.3     inf 0.10000
    10     1   153.8     nan 0.10000
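
To separate a hardware fault from a framework or model problem, another quick check is to run one deterministic computation per card and compare it against the CPU. This is just a sketch (PyTorch, only because it is short), run once with CUDA_VISIBLE_DEVICES=0 and once with CUDA_VISIBLE_DEVICES=1:

# Per-card numerical sanity check: compute the same matrix product on the
# visible GPU and on the CPU, then compare. Run once per card by setting
# CUDA_VISIBLE_DEVICES before launching.
import torch

torch.manual_seed(0)
a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

cpu_result = a @ b
gpu_result = (a.cuda() @ b.cuda()).cpu()

print("max abs difference vs CPU:", (cpu_result - gpu_result).abs().max().item())
# A healthy card should show only a small floating-point rounding difference;
# a faulty one tends to produce huge values, inf, or NaN here as well.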

Does anyone have the same problem or know what the cause is?


Same issue here. Any ideas?

Same issue here. I have tested the MNIST examples: huge loss and accuracy stuck at 0.09 (tested with Keras and PyTorch on both Windows and Ubuntu).
It looks like the 2080 Ti has a serious hardware issue. Sometimes it shows weird lines on screen and freezes both Windows and Ubuntu.
Can the warranty do anything?
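
For anyone who wants to reproduce this kind of test without downloading MNIST, here is a rough stand-in (PyTorch, synthetic 10-class data, so the exact numbers are only illustrative): on a healthy card the loss drops and accuracy climbs well above chance, while a card with this problem should show the same kind of explosion as in the logs above.

# Single-GPU training smoke test on synthetic "MNIST-like" data (784
# features, 10 classes), so no dataset download is needed. Pin the card to
# test with CUDA_VISIBLE_DEVICES before running.
import torch
import torch.nn as nn

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

n, d, classes = 10000, 784, 10
x = torch.randn(n, d)
y = (x @ torch.randn(d, classes)).argmax(dim=1)  # labels from a fixed random projection

model = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, classes)).to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x, y = x.to(device), y.to(device)
for step in range(1, 501):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    if step % 100 == 0:
        acc = (model(x).argmax(dim=1) == y).float().mean().item()
        print(f"step {step:4d}  loss {loss.item():.4f}  acc {acc:.3f}")
# Healthy GPU: loss falls steadily and accuracy ends well above chance (0.10).
# A faulty card typically shows exploding loss / NaN and roughly chance accuracy.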