Huge loss on 2080 Ti

Hi,

I have two RTX 2080 Ti cards used mostly for training DNNs. After two weeks, one of them started returning huge loss values during training. At first, I thought it was a network architecture issue, but everything is fine on the other card. I also ran the official Docker image nvcr.io/nvidia/tensorflow:18.03-py2 for some tests.
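For anyone wanting to reproduce, the container can be launched roughly like this (a minimal sketch; the exact flags and the nvidia-examples path to nvcnn.py inside the container may vary on your setup):

# start the 18.03 TensorFlow container with GPU access (nvidia-docker v1/v2 wrapper)
nvidia-docker run --rm -it nvcr.io/nvidia/tensorflow:18.03-py2
# nvcnn.py usually ships with the CNN examples inside the image; verify the path locally
cd /workspace/nvidia-examples/cnn

Here are my results from running nvcnn.py on each card: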

Input (RTX with no issue observed):

export CUDA_VISIBLE_DEVICES=0
python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=1

Output:

Training
      Step Epoch Img/sec   Loss   LR
         1     1    15.5   9.865 0.10000
         2     1    35.6  13.123 0.10000
         3     1    58.3  14.424 0.10000
         4     1    84.0  15.467 0.10000
         5     1    91.6  15.365 0.10000
         6     1   110.1  15.450 0.10000
         7     1   111.4  15.062 0.10000
         8     1   120.9  14.706 0.10000
         9     1   131.8  14.827 0.10000
        10     1   147.5  14.493 0.10000

Input (RTX with issue observed):

export CUDA_VISIBLE_DEVICES=1
python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=1

Output:

Training
      Step Epoch Img/sec   Loss   LR
         1     1    15.9   9.414 0.10000
         2     1    36.4  14.372 0.10000
         3     1    55.5  17.723 0.10000
         4     1    64.9 1052631.125 0.10000
         5     1    89.4 2593921.500 0.10000
         6     1    89.1 4573577.000 0.10000
         7     1   109.1 6924434.500 0.10000
         8     1   109.0 4866526720.000 0.10000
         9     1   135.3     inf 0.10000
        10     1   153.8     nan 0.10000

Does anyone have the same problem, or know what the cause might be?