Ubuntu 18.04, RTX 2080 Ti, TensorFlow: NaN values consistently appearing during training of all networks

I have recently set up a new machine at home for deep learning:

  • Ubuntu 18.04
  • RTX 2080 Ti graphics card

After installing the drivers and CUDA/cuDNN, I have found a consistent issue when training many (completely different) models: the loss becomes NaN after some number of epochs (which varies randomly). I created a minimal working example of a model with only two layers which reproduces the problem. I have attempted to fix this in the TensorFlow/Keras code in at least 50 different ways, having scoured Google for a day or two. Up until the NaN values appear, the training goes smoothly and the loss/metrics all look sensible and stable.
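
For reference, the minimal working example is roughly along these lines (the data and layer sizes here are purely illustrative, with a TerminateOnNaN callback thrown in so the run stops at the first NaN loss):

# Sketch of the 2-layer MWE (random data, sizes illustrative); TerminateOnNaN
# stops training as soon as the batch loss becomes NaN.
import numpy as np
import tensorflow as tf

x = np.random.rand(10000, 32).astype("float32")
y = np.random.randint(0, 10, size=(10000,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x, y,
          epochs=20,
          batch_size=256,
          callbacks=[tf.keras.callbacks.TerminateOnNaN()])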

I am convinced that the source of the issue is not the TensorFlow code, but perhaps something related to the GPU drivers and Ubuntu (I have run identical code on a Windows machine with a different GPU but the same TensorFlow, CUDA, etc. versions, with no issues).

I am finding it incredibly hard to track down the source of this and keep coming back to the hardware (as this is the part I understand least).

A few things I’ve tried:

  • Different TensorFlow versions (1.x and 2.x)
  • Different driver versions (435 and 440, checked via nvidia-smi), with reboots in between
  • A different OS (as mentioned, Windows has had no problems)
  • Different CNN models (including ones from official TensorFlow tutorials)

Please run nvidia-bug-report.sh as root after you hit the issue and attach the resulting nvidia-bug-report.log.gz file to your post. You will have to rename the file extension to something else since the forum software doesn’t accept .gz files (nifty!).

Thanks for the reply; ah, I didn’t know about this script. Attached below is the log, captured after a KeyboardInterrupt following the appearance of a NaN value in the first epoch (after 40 or so batches of data).

It may be obvious that there are certain aspects of working with the GPU I am not familiar with; any links to articles/reading material are more than welcome!

nvidia-bug-report.log.txt (264.5 KB)

Software setup seems fine.
There were no errors logged from this run, but you were previously getting Xid 31 errors, which might point to a problem with the GPU memory. Since 2080 Tis are often sensitive to heat, please monitor temperatures while training until you run into the issue. Also, please run gpu-burn for 10 minutes to thoroughly test the GPU for hardware failures.
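
For the temperature monitoring, something along these lines works, polling nvidia-smi every few seconds while the training script runs in another terminal (just a sketch):

# Poll nvidia-smi every 5 seconds and print a timestamped temperature/utilisation/fan reading.
import subprocess
import time
from datetime import datetime

while True:
    reading = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=temperature.gpu,utilization.gpu,fan.speed",
         "--format=csv,noheader,nounits"],
        text=True,
    ).strip()
    print(f"{datetime.now().isoformat()}  temp/util/fan: {reading}")
    time.sleep(5)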

Just for a bit of finality: the source of the problems seemed to be the batch size used when training the network. If I lowered it enough, the problem went away. The errors just didn’t suggest that the memory consumption of a training iteration had anything to do with it, and I never tracked down the true source of the error beyond batch size.
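
For anyone hitting the same thing, the workaround really was just lowering the batch size; a rough sketch of the retry pattern (build_model, x and y are placeholders, not my actual code):

# Illustrative only: retry training with a smaller batch size whenever the loss goes NaN.
# build_model(), x and y stand in for whatever model/data you are actually training.
import math
import tensorflow as tf

def train_with_smaller_batches(build_model, x, y, batch_size=256, min_batch_size=16):
    while batch_size >= min_batch_size:
        model = build_model()
        history = model.fit(x, y, epochs=20, batch_size=batch_size,
                            callbacks=[tf.keras.callbacks.TerminateOnNaN()])
        if not any(math.isnan(loss) for loss in history.history["loss"]):
            return model  # finished without NaNs
        print(f"NaN loss at batch_size={batch_size}, retrying with {batch_size // 2}")
        batch_size //= 2
    raise RuntimeError("NaNs persisted even at the smallest batch size")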

You likely ran into this:
https://nvidia.custhelp.com/app/answers/detail/a_id/3029/~/using-cuda-and-x

Funnily enough, I have the same NaN errors appearing again, this time with an entirely different framework (PyTorch). I can’t rule out that this is related to the framework rather than the hardware, but the behaviour is suspiciously similar.

To test your idea, I have switched the display to the integrated graphics (by editing /etc/X11/xorg.conf). nvidia-smi no longer shows any X11 processes on my GPU, so I assume it worked. I still get the NaN issues. I’m on driver 450.102.04 and CUDA 11.0 now. The only process on the GPU while the code runs is Python, but the NaNs still appear after a few training epochs.
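
For context, the check that catches the NaNs on the PyTorch side is nothing exotic; a simplified, self-contained sketch (model and data here are illustrative):

# Tiny PyTorch loop with an explicit finite-loss check and anomaly detection,
# so the first NaN/inf is caught immediately. Model/data are illustrative.
import torch
import torch.nn as nn

torch.autograd.set_detect_anomaly(True)  # report the op that first produces NaN/inf in backward

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

x = torch.randn(10000, 32)
y = torch.randint(0, 10, (10000,))

for epoch in range(20):
    for i in range(0, len(x), 256):
        xb, yb = x[i:i + 256].to(device), y[i:i + 256].to(device)
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        if not torch.isfinite(loss):
            raise RuntimeError(f"non-finite loss at epoch {epoch}, batch {i // 256}: {loss.item()}")
        loss.backward()
        optimizer.step()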

Any other suggestions on what to try are welcome.

Sounds like a thermal defect of the GPU board. Did you run the gpu-burn test?


Which burn test is this specifically? I don’t see a single obvious gpu-burn; Google turns up various options.

I did monitor the behaviour of my minimal working example with nvidia-smi. A couple of points (a sketch of per-batch temperature logging follows the list):

  • The performance state (perf) is P2 at all times.
  • As I train the network, the temperature and fan speed increase from 37 °C (fan below 50%) to 84 °C (fan at 74%). GPU utilisation sits around 90% throughout, and memory usage is 8.8 GB out of the 11 GB total.
  • The NaN (or sometimes inf) values appeared at 75 °C and 63% fan speed.
  • I then immediately restarted training and got the NaN values within the first 30 seconds each time, without the temperature ever going above 75 °C (…interesting).
  • But then, after the fifth restart, everything went smoothly again, going all the way to 84 °C without issue.
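
The readings above came straight from nvidia-smi; to line the loss up against the temperature automatically, a small Keras callback along these lines would do it (assuming the Keras version of the MWE; illustrative only):

# Illustrative: log the GPU temperature reported by nvidia-smi every N batches,
# alongside the batch loss, so NaN events can be correlated with temperature.
import subprocess
import tensorflow as tf

class GpuTempLogger(tf.keras.callbacks.Callback):
    def __init__(self, every_n_batches=10):
        super().__init__()
        self.every_n_batches = every_n_batches

    def on_train_batch_end(self, batch, logs=None):
        if batch % self.every_n_batches == 0:
            temp = subprocess.check_output(
                ["nvidia-smi", "--query-gpu=temperature.gpu",
                 "--format=csv,noheader,nounits"],
                text=True,
            ).strip()
            loss = (logs or {}).get("loss")
            print(f"batch {batch}: loss={loss}, gpu_temp={temp} C")

# usage: model.fit(x, y, callbacks=[GpuTempLogger(), tf.keras.callbacks.TerminateOnNaN()])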

I’m beginning to think it’s just haunted. I will run a burn test once I know which one. Thanks for all the continued support on this.

Edit: I had a go with https://github.com/wilicc/gpu-burn

Then

make COMPUTE=7.5
./gpu_burn 3600

gives

./gpu_burn 3600
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-101efb24-d631-6e6d-b502-a14124d81dfc)
Initialized device 0 with 11019 MB of memory (10798 MB available, using 9718 MB of it), using FLOATS
3.3%  proc'd: 81675 (11740 Gflop/s)   errors: 13505  (WARNING!)  temps: 75 C 

and the errors only began stacking up at around 75 °C; below that, the error count was zero.

So it’s broken.


It is broken.

Checking the temperature at which the problems occurred was the suggestion that led to finding this out, so thanks. Luckily it’s just within warranty.