I have recently set up a new machine at home for deep learning:
- Ubuntu 20.04
- RTX 3090 graphics card
After installing the drivers and CUDA/cuDNN, I have found a consistent issue when training many (completely different) models: the loss becomes NaN after some number of epochs, and that number varies randomly between runs. I created a “minimal working example” of a model with only 2 layers which still causes the problem. Having scoured Google for a day or two, I have attempted to fix this in the tensorflow/keras code in at least 50 different ways. Up until the NaN values appear, training goes smoothly and the loss/metrics all look sensible and stable.
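For illustration, the minimal example is roughly of this shape (the layer sizes, optimizer and synthetic data below are placeholders rather than my actual model/data):

```python
# Rough sketch of the 2-layer "minimal working example" -- layer sizes,
# optimizer and the synthetic data are placeholders, not my real setup.
import numpy as np
import tensorflow as tf

timesteps, features = 20, 8
x = np.random.rand(1024, timesteps, features).astype("float32")
y = np.random.rand(1024, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(timesteps, features)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Loss looks fine for a while, then turns NaN after a random number of epochs.
model.fit(x, y, epochs=200, batch_size=32)
```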
I am convinced that the source of the issue is not the tensorflow code but something related to the GPU drivers and Ubuntu: I have run identical code on a Windows machine with a different GPU (but the same tensorflow, CUDA etc. versions) with no issues.
I am finding it incredibly hard to track down the source of this and keep coming back to the hardware (as this is the part I understand least).
A few things I’ve tried:
- Different tensorflow versions (both 1 and 2)
- Different driver versions (455 and 465) via `nvidia-smi`, with reboots in between
- Different OS (as I’ve said, Windows has had no problem)
- Different RNN (LSTM, GRU) models (including ones from the official tensorflow tutorials)
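If it's useful for anyone trying to reproduce this, a minimal way to catch the first non-finite value (assuming TF 2.x) looks roughly like this, rather than waiting for the loss itself to turn NaN:

```python
# Sketch only: stop at the first non-finite tensor (assumes TF 2.x).
import tensorflow as tf

# Errors out, identifying the op/tensor, as soon as any intermediate value
# becomes Inf or NaN, instead of silently propagating it into the loss.
tf.debugging.enable_check_numerics()

# TerminateOnNaN additionally halts model.fit() on the first batch with a NaN loss.
nan_guard = tf.keras.callbacks.TerminateOnNaN()
# model.fit(x, y, epochs=..., callbacks=[nan_guard])
```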