I am writing to report a correctness issue I encountered while using the RTX 4090 GPU with TensorFlow.
I have a TensorFlow model that produces very close results on my RTX 3090 GPU and on the CPU. However, when I run it on the RTX 4090 with NVIDIA TensorFlow (version nv22.11), the results differ significantly from the CPU results.
OS: Ubuntu 20.04.5 LTS
CPU: Intel(R) Core™ i9-13900K
GPU: NVIDIA GeForce RTX 4090
Docker image: NVIDIA TensorFlow, tag: 22.11-tf1-py3
My model is a regression U-Net. Switching to the RTX 4090 and this version of TensorFlow changes the results enough to affect performance. The output range of the model is about -3 to 3, and the difference in results between devices can be up to 0.08. When running on the RTX 3090 with older versions of CUDA (11.1) and cuDNN (8.0), the difference is < 0.01.
Difference of results between RTX 4090 and CPU:
Maximum difference of output channel 2: 0.042948246002197266 (values of different runs at pixel (160, 219): 2.384153366088867 vs 2.4271016120910645)
Maximum difference of output channel 3: 0.0736396312713623 (values of different runs at pixel (164, 175): 1.8166956901550293 vs 1.8903353214263916)
Maximum difference of output channel 4: 0.05873298645019531 (values of different runs at pixel (137, 225): 2.382530927658081 vs 2.4412639141082764)
Maximum difference of output channel 5: 0.07898902893066406 (values of different runs at pixel (170, 234): 2.7947018146514893 vs 2.715712785720825)
sample_program.zip (1.9 MB)
I have attached a sample program that reproduces the problem. The script run.sh runs inference on the GPU and on the CPU and compares the results.
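For reference, the per-channel comparison that produced the numbers above can be sketched as follows. This is my own illustration with NumPy, not the attached script; the function name `max_channel_diffs` and the (H, W, C) output shape are assumptions.

```python
import numpy as np

def max_channel_diffs(out_a, out_b):
    """For each output channel, find the largest absolute difference
    between two inference runs and the pixel where it occurs.

    out_a, out_b: arrays of shape (H, W, C), e.g. the GPU and CPU outputs.
    Returns {channel: (max_diff, (row, col))}.
    """
    diffs = np.abs(out_a.astype(np.float64) - out_b.astype(np.float64))
    results = {}
    for c in range(diffs.shape[-1]):
        channel = diffs[..., c]
        # (row, col) of the pixel where the two runs disagree the most.
        idx = np.unravel_index(np.argmax(channel), channel.shape)
        results[c] = (channel[idx], idx)
    return results

# Synthetic data standing in for the two inference runs.
rng = np.random.default_rng(0)
cpu_out = rng.uniform(-3.0, 3.0, size=(256, 256, 6)).astype(np.float32)
gpu_out = cpu_out + rng.normal(0.0, 0.01, size=cpu_out.shape).astype(np.float32)

for ch, (d, (row, col)) in max_channel_diffs(cpu_out, gpu_out).items():
    print(f"Maximum difference of output channel {ch}: {d} at pixel ({row}, {col})")
```

In the real script the two arrays would come from running the same saved model once on each device with identical input.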
Since I will soon upgrade the GPU cards to newer models with newer libraries, I am afraid that performance will degrade. I would like to know how I can get consistent results with the new GPU card. I look forward to hearing back from you soon.