Hi,
I have two RTX 2080 Ti cards, used mostly for training DNNs. After two weeks, one of them started returning huge loss values during training. At first I thought it was a network architecture issue, but the same model trains fine on the other card. To rule out my own code, I also ran some tests inside the official Docker image nvcr.io/nvidia/tensorflow:18.03-py2; the container launch and the nvcnn.py results for each card are below.
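Container launch (a minimal sketch, assuming nvidia-docker 2 is installed; the exact invocation may differ on other setups):
docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tensorflow:18.03-py2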
Input (the RTX 2080 Ti without the issue):
export CUDA_VISIBLE_DEVICES=0
python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=1
Output:
Training
Step Epoch Img/sec Loss LR
1 1 15.5 9.865 0.10000
2 1 35.6 13.123 0.10000
3 1 58.3 14.424 0.10000
4 1 84.0 15.467 0.10000
5 1 91.6 15.365 0.10000
6 1 110.1 15.450 0.10000
7 1 111.4 15.062 0.10000
8 1 120.9 14.706 0.10000
9 1 131.8 14.827 0.10000
10 1 147.5 14.493 0.10000
Input (the RTX 2080 Ti with the issue):
export CUDA_VISIBLE_DEVICES=1
python nvcnn.py --model=resnet50 --batch_size=64 --num_gpus=1
Output:
Training
Step Epoch Img/sec Loss LR
1 1 15.9 9.414 0.10000
2 1 36.4 14.372 0.10000
3 1 55.5 17.723 0.10000
4 1 64.9 1052631.125 0.10000
5 1 89.4 2593921.500 0.10000
6 1 89.1 4573577.000 0.10000
7 1 109.1 6924434.500 0.10000
8 1 109.0 4866526720.000 0.10000
9 1 135.3 inf 0.10000
10 1 153.8 nan 0.10000
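A simpler way to cross-check a single card, independent of the benchmark, could be something like the sketch below (hypothetical script, assuming the TensorFlow 1.x that ships in this container): multiply two fixed matrices on the GPU and compare against the CPU result. Run it once per card by setting CUDA_VISIBLE_DEVICES as above.
import numpy as np
import tensorflow as tf

# Fixed inputs so both devices compute exactly the same product.
np.random.seed(0)
a = np.random.rand(2048, 2048).astype(np.float32)
b = np.random.rand(2048, 2048).astype(np.float32)

# Placeholders so the matmul really executes on the assigned device at run time.
x = tf.placeholder(tf.float32, shape=a.shape)
y = tf.placeholder(tf.float32, shape=b.shape)

with tf.device('/gpu:0'):  # the card exposed via CUDA_VISIBLE_DEVICES
    gpu_prod = tf.matmul(x, y)
with tf.device('/cpu:0'):
    cpu_prod = tf.matmul(x, y)

config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    gpu_out, cpu_out = sess.run([gpu_prod, cpu_prod], feed_dict={x: a, y: b})

# A healthy card should agree with the CPU up to normal float32 rounding;
# huge or non-finite differences would point at the GPU itself.
print('max abs difference: %g' % np.abs(gpu_out - cpu_out).max())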
Has anyone seen the same problem, or does anyone know what might be causing it?