Ubuntu 18.04, RTX 2080 Ti, Tensorflow; NaN values consistently appearing during training of all networks

I have recently set up a new machine at home for deep learning:

  • Ubuntu 18.04
  • RTX 2080 Ti graphics card

After installing the drivers and CUDA/cuDNN, I have found a consistent issue when training many (completely different) models: the loss becomes NaN after some number of epochs, and that number varies randomly. I created a “minimal working example” of a model with only 2 layers that reproduces the problem. After scouring Google for a day or two, I have attempted to fix this in the TensorFlow/Keras code in at least 50 different ways. Up until the NaN values appear, training goes smoothly and the loss, metrics, etc. all look sensible and stable.
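To pin down exactly when the loss goes bad, it can help to record the per-batch losses and find the first non-finite one. The following is a minimal sketch in plain Python, not the original training code; `first_nonfinite` and the sample loss values are hypothetical.

```python
import math

def first_nonfinite(losses):
    """Return the index of the first NaN or inf loss, or None if all are finite."""
    for i, loss in enumerate(losses):
        if not math.isfinite(loss):
            return i
    return None

# Hypothetical per-batch losses, for illustration only.
history = [0.93, 0.71, 0.64, float("nan"), float("nan")]
print(first_nonfinite(history))
```

In Keras, the built-in `tf.keras.callbacks.TerminateOnNaN` callback stops training as soon as a NaN loss is seen, which serves a similar purpose without extra code.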

I am convinced that the source of the issue is not the TensorFlow code but is perhaps related to the GPU drivers and Ubuntu. I have run identical code on a Windows machine with a different GPU but the same TensorFlow, CUDA, etc. versions, and had no issues.

I am finding it incredibly hard to track down the source of this and keep coming back to the hardware (as this is the part I understand least).

A few things I’ve tried:

  • Different tensorflow versions (version 1 and 2)
  • Different driver versions through nvidia-smi (435 and 440) with reboots between.
  • Different OS (as I’ve said Windows has had no problem)
  • Different CNN models (including ones from official tensorflow tutorials)

Please run nvidia-bug-report.sh as root after you hit the issue and attach the resulting nvidia-bug-report.log.gz file to your post. You will have to rename the file ending to something else since the forum software doesn’t accept .gz files (nifty!).

Thanks for the reply; ah I didn’t know about this script… attached below is the log after a KeyboardInterrupt following the occurrence of a NaN value in the first epoch (after 40 or so batches of data).

It may be obvious that there are certain aspects of working with the GPU I am not familiar with; any links to articles or reading material are more than welcome!

nvidia-bug-report.log.txt (264.5 KB)

Software setup seems fine.
There were no errors logged from this run, but you were previously getting XID 31 errors, which might point to a problem with the GPU memory. Since 2080 Tis are often sensitive to heat, please monitor temperatures while training until you run into the issue. Also, please run gpu-burn for 10 minutes to thoroughly test the GPU for hardware failures.

Just for a bit of finality: the source of the problems seemed to be the batch size used when training the network. If I lowered it enough, the problem went away. The errors just didn’t suggest that the memory consumption of a training iteration had anything to do with it, and I never tracked down the true source of the error beyond batch size.

You likely ran into this:

Funnily enough, I have the same NaN errors appearing again, this time with an entirely different framework (PyTorch). I can’t rule out that this is related to the framework rather than the hardware, but the behaviour is suspiciously similar.

To test your idea, I have switched to the internal graphics for the display (by editing /etc/X11/xorg.conf). nvidia-smi no longer shows any X11 processes on my GPU, so I assume it worked. I still get the NaN issues. I’m on driver 450.102.04 and CUDA 11.0 now. The only process on the GPU is python while the code runs, but I still get NaNs after a few training epochs.

Any other suggestions on what to try are welcome.

Sounds like a thermal defect of the GPU board. Did you run the gpu-burn test?

Which burn test is this specifically? I don’t see a single gpu-burn, just various options via Google.

I did monitor the behaviour of my minimal working example with nvidia-smi. A couple of points:

  • The performance state (Perf) is at P2 at all times.
  • As I train the network, the temperature and fan speed increase from 37C (<50%) to 84C (74%). GPU utilization is at around 90% consistently and the memory usage is 8.8 GB out of 11 GB total.
  • The NaN (or sometimes inf) values appeared at 75C and 63% fan speed.
  • I then immediately restarted training and got the NaN values within the first 30 seconds each time, without the temperature ever going above 75C (…interesting).
  • Then, after the fifth restart, everything went smoothly again, going all the way to 84C without issue.
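For anyone wanting to log these readings alongside the training loss rather than watching nvidia-smi by hand, here is a rough sketch that polls `nvidia-smi --query-gpu` and parses the result. The helper names (`parse_sample`, `read_gpu`) are hypothetical, and the sketch assumes the `--format=csv,noheader,nounits` output with exactly the fields listed in `QUERY`.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY = "temperature.gpu,fan.speed,utilization.gpu,memory.used,pstate"

def _to_int(text):
    """Convert a numeric field to int; return None for values such as '[N/A]'."""
    text = text.strip()
    return int(text) if text.isdigit() else None

def parse_sample(line):
    """Parse one CSV line produced with --format=csv,noheader,nounits."""
    temp, fan, util, mem, pstate = line.split(",")
    return {
        "temp_c": _to_int(temp),
        "fan_pct": _to_int(fan),
        "util_pct": _to_int(util),
        "mem_mib": _to_int(mem),
        "pstate": pstate.strip(),
    }

def read_gpu(index=0):
    """Query GPU `index` once via nvidia-smi (requires the NVIDIA driver)."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}",
         "--format=csv,noheader,nounits", "-i", str(index)],
        text=True,
    )
    return parse_sample(out.strip().splitlines()[0])
```

Calling `read_gpu()` every few batches and storing the result next to the loss makes it possible to check afterwards what the temperature was when the first NaN appeared.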

I’m beginning to think it’s just haunted. I will run a burn test once I know which one. Thanks for all the continued support on this.

Edit: I had a go with https://github.com/wilicc/gpu-burn


make COMPUTE=7.5
./gpu_burn 3600


./gpu_burn 3600
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-101efb24-d631-6e6d-b502-a14124d81dfc)
Initialized device 0 with 11019 MB of memory (10798 MB available, using 9718 MB of it), using FLOATS
3.3%  proc'd: 81675 (11740 Gflop/s)   errors: 13505  (WARNING!)  temps: 75 C 

The errors only began stacking up at around 75C; below that, the error count was zero.
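If the gpu_burn output is saved to a file, the temperature at which errors first appear can be pulled out automatically. This is a small sketch that assumes single-GPU progress lines in the format quoted above; `error_onset_temp` is a hypothetical helper, and the first sample line below is made up for illustration (only the second is from the actual run).

```python
import re

# Matches the error count and temperature on a single-GPU progress line.
LINE_RE = re.compile(r"errors:\s*(\d+).*?temps:\s*(\d+)\s*C")

def error_onset_temp(lines):
    """Return the temperature (C) on the first line reporting errors > 0, else None."""
    for line in lines:
        match = LINE_RE.search(line)
        if match and int(match.group(1)) > 0:
            return int(match.group(2))
    return None

sample = [
    # Hypothetical earlier line with no errors yet.
    "1.0%  proc'd: 24000 (11700 Gflop/s)   errors: 0   temps: 60 C",
    # Line taken from the run above.
    "3.3%  proc'd: 81675 (11740 Gflop/s)   errors: 13505  (WARNING!)  temps: 75 C",
]
print(error_onset_temp(sample))
```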

So it’s broken.

It is broken.

Checking the temperature at which the problems occurred was the suggestion that led to finding this out, so thanks. Luckily, it is just within warranty.