Ubuntu 18.04, RTX 2080 Ti, TensorFlow: NaN values consistently appearing during training of all networks

I have recently set up a new machine at home for deep learning:

  • Ubuntu 18.04
  • RTX 2080 Ti graphics card

After installing the drivers and CUDA/cuDNN, I have found a consistent issue when training many (completely different) models: the loss becomes NaN after some number of epochs (which varies randomly). I created a minimal working example of a model with only two layers which reproduces the problem. I have attempted to fix this in the TensorFlow/Keras code in at least 50 different ways, having scoured Google for a day or two. Up until the NaN values appear, the training goes smoothly and the loss/metrics all look sensible and stable.
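
For reference, the minimal working example is roughly along these lines (the data and layer sizes here are purely illustrative, with a TerminateOnNaN callback thrown in so the run stops at the first NaN loss):

# Sketch of the 2-layer MWE (random data, sizes illustrative); TerminateOnNaN
# stops training as soon as the batch loss becomes NaN.
import numpy as np
import tensorflow as tf

x = np.random.rand(10000, 32).astype("float32")
y = np.random.randint(0, 10, size=(10000,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x, y,
          epochs=20,
          batch_size=256,
          callbacks=[tf.keras.callbacks.TerminateOnNaN()])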

I am convinced that the source of the issue is not the TensorFlow code, but perhaps something related to the GPU drivers and Ubuntu (I have run identical code on a Windows machine with a different GPU but the same TensorFlow, CUDA, etc. versions, with no issues).

I am finding it incredibly hard to track down the source of this and keep coming back to the hardware (as this is the part I understand least).

A few things I’ve tried:

  • Different TensorFlow versions (1.x and 2.x)
  • Different driver versions (435 and 440, checked via nvidia-smi), with reboots in between
  • A different OS (as mentioned, Windows has had no problems)
  • Different CNN models (including ones from official TensorFlow tutorials)

Please run nvidia-bug-report.sh as root after you hit the issue and attach the resulting nvidia-bug-report.log.gz file to your post. You will have to rename the file extension to something else since the forum software doesn’t accept .gz files (nifty!).

Thanks for the reply; ah, I didn’t know about this script. Attached below is the log, captured after a KeyboardInterrupt following the appearance of a NaN value in the first epoch (after 40 or so batches of data).

It may be obvious that there are certain aspects of working with the GPU I am not familiar with; any links to articles/reading material are more than welcome!

nvidia-bug-report.log.txt (264.5 KB)

Software setup seems fine.
There were no errors logged from this run, but you were previously getting Xid 31 errors, which might point to a problem with the GPU memory. Since 2080 Tis are often sensitive to heat, please monitor temperatures while training until you run into the issue. Also, please run gpu-burn for 10 minutes to thoroughly test the GPU for hardware failures.
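
For the temperature monitoring, something along these lines works, polling nvidia-smi every few seconds while the training script runs in another terminal (just a sketch):

# Poll nvidia-smi every 5 seconds and print a timestamped temperature/utilisation/fan reading.
import subprocess
import time
from datetime import datetime

while True:
    reading = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=temperature.gpu,utilization.gpu,fan.speed",
         "--format=csv,noheader,nounits"],
        text=True,
    ).strip()
    print(f"{datetime.now().isoformat()}  temp/util/fan: {reading}")
    time.sleep(5)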

Just for a bit of finality: the source of the problems seemed to be the batch size used when training the network. If I lowered it enough, the problem went away. The errors just didn’t suggest that the memory consumption of a training iteration had anything to do with it, and I never tracked down the true source of the error beyond batch size.
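
For anyone hitting the same thing, the workaround really was just lowering the batch size; a rough sketch of the retry pattern (build_model, x and y are placeholders, not my actual code):

# Illustrative only: retry training with a smaller batch size whenever the loss goes NaN.
# build_model(), x and y stand in for whatever model/data you are actually training.
import math
import tensorflow as tf

def train_with_smaller_batches(build_model, x, y, batch_size=256, min_batch_size=16):
    while batch_size >= min_batch_size:
        model = build_model()
        history = model.fit(x, y, epochs=20, batch_size=batch_size,
                            callbacks=[tf.keras.callbacks.TerminateOnNaN()])
        if not any(math.isnan(loss) for loss in history.history["loss"]):
            return model  # finished without NaNs
        print(f"NaN loss at batch_size={batch_size}, retrying with {batch_size // 2}")
        batch_size //= 2
    raise RuntimeError("NaNs persisted even at the smallest batch size")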

You likely ran into this:
https://nvidia.custhelp.com/app/answers/detail/a_id/3029/~/using-cuda-and-x

Funnily enough, I have the same NaN errors appearing again, this time with an entirely different framework (PyTorch). I can’t rule out that this is related to the framework rather than the hardware, but the behaviour is suspiciously similar.

To test your idea, I have switched the display to the integrated graphics (by editing /etc/X11/xorg.conf). nvidia-smi no longer shows any X11 processes on my GPU, so I assume it worked. I still get the NaN issues. I’m on driver 450.102.04 and CUDA 11.0 now. The only process on the GPU while the code runs is Python, but the NaNs still appear after a few training epochs.
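
For context, the check that catches the NaNs on the PyTorch side is nothing exotic; a simplified, self-contained sketch (model and data here are illustrative):

# Tiny PyTorch loop with an explicit finite-loss check and anomaly detection,
# so the first NaN/inf is caught immediately. Model/data are illustrative.
import torch
import torch.nn as nn

torch.autograd.set_detect_anomaly(True)  # report the op that first produces NaN/inf in backward

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

x = torch.randn(10000, 32)
y = torch.randint(0, 10, (10000,))

for epoch in range(20):
    for i in range(0, len(x), 256):
        xb, yb = x[i:i + 256].to(device), y[i:i + 256].to(device)
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        if not torch.isfinite(loss):
            raise RuntimeError(f"non-finite loss at epoch {epoch}, batch {i // 256}: {loss.item()}")
        loss.backward()
        optimizer.step()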

Any other suggestions on what to try are welcome.

Sounds like a thermal defect of the GPU board. Did you run the gpu-burn test?


Which burn test is this specifically? I don’t see a single obvious gpu-burn; Google turns up various options.

I did monitor the behaviour of my minimal working example with nvidia-smi. A couple of points (a sketch of per-batch temperature logging follows the list):

  • The performance state (perf) is P2 at all times.
  • As I train the network, the temperature and fan speed increase from 37 °C (fan below 50%) to 84 °C (fan at 74%). GPU utilisation sits around 90% throughout, and memory usage is 8.8 GB out of the 11 GB total.
  • The NaN (or sometimes inf) values appeared at 75 °C and 63% fan speed.
  • I then immediately restarted training and got the NaN values within the first 30 seconds each time, without the temperature ever going above 75 °C (…interesting).
  • But then, after the fifth restart, everything went smoothly again, going all the way to 84 °C without issue.
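
The readings above came straight from nvidia-smi; to line the loss up against the temperature automatically, a small Keras callback along these lines would do it (assuming the Keras version of the MWE; illustrative only):

# Illustrative: log the GPU temperature reported by nvidia-smi every N batches,
# alongside the batch loss, so NaN events can be correlated with temperature.
import subprocess
import tensorflow as tf

class GpuTempLogger(tf.keras.callbacks.Callback):
    def __init__(self, every_n_batches=10):
        super().__init__()
        self.every_n_batches = every_n_batches

    def on_train_batch_end(self, batch, logs=None):
        if batch % self.every_n_batches == 0:
            temp = subprocess.check_output(
                ["nvidia-smi", "--query-gpu=temperature.gpu",
                 "--format=csv,noheader,nounits"],
                text=True,
            ).strip()
            loss = (logs or {}).get("loss")
            print(f"batch {batch}: loss={loss}, gpu_temp={temp} C")

# usage: model.fit(x, y, callbacks=[GpuTempLogger(), tf.keras.callbacks.TerminateOnNaN()])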

I’m beginning to think it’s just haunted. I will run a burn test once I know which one. Thanks for all the continued support on this.

Edit: I had a go with https://github.com/wilicc/gpu-burn

Then

make COMPUTE=7.5
./gpu_burn 3600

gives

./gpu_burn 3600
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-101efb24-d631-6e6d-b502-a14124d81dfc)
Initialized device 0 with 11019 MB of memory (10798 MB available, using 9718 MB of it), using FLOATS
3.3%  proc'd: 81675 (11740 Gflop/s)   errors: 13505  (WARNING!)  temps: 75 C 

and the errors only began stacking up at around 75 °C; below that, the error count was zero.

So it’s broken.


It is broken.

Checking the temperature at which the problems occurred was the suggestion that led to finding this out, so thanks. Luckily it’s just within warranty.