I attempted to train a model using the Noise2Void plugin in ImageJ (a bioimage analysis software) with GPU support on an RTX 4000 Ada GPU. Training initially started correctly and proceeded for a few epochs, but then it became numerically unstable: after 1-5 epochs the validation loss, mean squared error (MSE), and absolute difference all turned to NaN (Not a Number), and they stayed NaN for every subsequent epoch. The training and validation losses also showed high variability and did not converge.
I have trained a model with the same plugin and the same data using CPU only without running into this problem, so I am wondering whether this behaviour is caused by an incompatibility between the GPU and the CUDA, cuDNN, or TensorFlow versions. The plugin requires TensorFlow 1.15, CUDA 10.0, and cuDNN >= 7.4.1. I have all of those installed and the path variables set, and it should be working, as I can see GPU resources being used during training. The results just look like nonsense.
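To check whether plain TensorFlow 1.15 operations already misbehave on this GPU outside the plugin, I am thinking of running a quick sanity test along these lines (a minimal sketch of my own, not part of the plugin, assuming the same TensorFlow 1.15 Python environment the plugin uses is accessible):

```python
# Compare a simple matmul on CPU vs GPU under the same TensorFlow 1.15 install
# the plugin uses. If the GPU result contains NaN or differs wildly from the
# CPU result, the problem is in the TF/CUDA/GPU stack, not in Noise2Void itself.
import numpy as np
import tensorflow as tf  # TensorFlow 1.15

a = np.random.rand(1024, 1024).astype(np.float32)
b = np.random.rand(1024, 1024).astype(np.float32)

results = {}
for device in ['/cpu:0', '/gpu:0']:
    tf.reset_default_graph()
    with tf.device(device):
        x = tf.constant(a)
        y = tf.constant(b)
        z = tf.matmul(x, y)
    # allow_soft_placement=False forces an error if the op cannot actually
    # run on the requested device; log_device_placement shows where it ran.
    config = tf.ConfigProto(allow_soft_placement=False,
                            log_device_placement=True)
    with tf.Session(config=config) as sess:
        results[device] = sess.run(z)

print("Any NaN in GPU result:", np.isnan(results['/gpu:0']).any())
print("Max abs difference CPU vs GPU:",
      np.abs(results['/cpu:0'] - results['/gpu:0']).max())
```

If even this simple check produces NaN or large differences on the GPU, that would point to the TF 1.15 / CUDA 10.0 build not supporting the Ada-generation card rather than to anything specific in my training data.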
I also posted about the problem on the bioimage analysis forum, where there are a few pictures of the training progress: Noise2Void model training with GPU RTX 4000 Ada issue - Usage & Issues - Image.sc Forum
Any help would be appreciated, as I am still a novice in machine learning.