Ubuntu 20.04, RTX 3090, nvidia-tensorflow: NaN values consistently appearing when training RNNs

I have recently set up a new machine at home for deep learning:

  • Ubuntu 20.04
  • RTX 3090 graphics card

After installing the drivers and CUDA/cuDNN, I have found a consistent issue when training many (completely different) models: the loss becomes NaN after some number of epochs, and that number varies randomly from run to run. I created a “minimal working example” of a model with only 2 layers that reproduces the problem. I have tried to fix this in the TensorFlow/Keras code in at least 50 different ways, having scoured Google for a day or two. Until the NaN values appear, training goes smoothly and the loss and metrics all look sensible and stable.
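To compare runs, it can help to record exactly which epoch first produced a non-finite loss. The following is a minimal sketch in plain Python; the helper name `first_non_finite` and the loss values are made up for illustration and are not taken from the actual model.

```python
import math

def first_non_finite(losses):
    """Return the index of the first NaN/inf loss, or None if all are finite."""
    for epoch, loss in enumerate(losses):
        if not math.isfinite(loss):
            return epoch
    return None

# Hypothetical per-epoch losses from two runs of the same model.
run_a = [0.92, 0.71, 0.60, float("nan"), float("nan")]
run_b = [0.93, 0.70, 0.61, 0.55, 0.52]

print(first_non_finite(run_a))  # 3
print(first_non_finite(run_b))  # None
```

With Keras, the list could come from `history.history["loss"]` as returned by `model.fit`. The built-in `tf.keras.callbacks.TerminateOnNaN` callback can also stop training as soon as the loss becomes NaN, which saves time when repeating the experiment.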

I am convinced that the source of the issue is not the TensorFlow code but is perhaps related to the GPU drivers and Ubuntu: I have run identical code on a Windows machine with a different GPU but the same TensorFlow, CUDA, etc. versions, with no issues.

I am finding it incredibly hard to track down the source of this, and I keep coming back to the hardware (as this is the part I understand least).
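One rough way to probe the hardware hypothesis is to run the same fixed computation many times and check whether the results are bit-identical. The sketch below shows only the comparison logic in plain Python. `reference_workload` is a hypothetical CPU stand-in that would need to be replaced by a fixed-seed GPU computation copied back to the host, and the helper names are assumptions for illustration.

```python
import hashlib
import struct

def checksum(values):
    """Hash the exact bit patterns of a sequence of floats."""
    digest = hashlib.sha256()
    for v in values:
        digest.update(struct.pack("<d", v))
    return digest.hexdigest()

def count_mismatches(compute, repeats=20):
    """Run `compute` repeatedly and count runs whose output differs from the first."""
    reference = checksum(compute())
    return sum(checksum(compute()) != reference for _ in range(repeats - 1))

def reference_workload():
    # Hypothetical stand-in: replace with a fixed-seed GPU computation
    # whose result is copied back to host memory as a list of floats.
    return [(i * 0.001) ** 2 for i in range(10_000)]

print(count_mismatches(reference_workload))
```

A non-zero count is only suggestive, not proof of a fault, because some GPU kernels are not bit-deterministic by design. A count that stays at zero on one card and not on another, for an operation expected to be deterministic, would at least point away from the model code.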

A few things I’ve tried:

  • Different TensorFlow versions (versions 1 and 2)
  • Different driver versions (455 and 465, checked with nvidia-smi), with reboots in between
  • A different OS (as I’ve said, Windows has had no problem)
  • Different RNN (LSTM, GRU) models, including ones from the official TensorFlow tutorials

Hi … just to say I’m seeing a similar problem, but with PyTorch and on Windows. It’s an FE 3090 that seems to cause random NaNs when training is otherwise fine. Running the exact same model on other GPUs seems to be fine. I haven’t figured it out. Sometimes it keeps running fine, but sometimes it NaNs after a few epochs. I assumed it might have been Windows power saving on the display adapter, but that has never been a problem before.

Hello, did you find out the cause of the problem?
I’m seeing exactly the same behaviour with a GTX 1660 Ti, while everything runs fine on a GTX 1050.
I’m on Windows and I’ve tried installing multiple driver versions (both the studio driver variant and the gaming driver variant), but it didn’t help.
Any help would be appreciated.
Thank you.