RTX 3090 NaN Values

Hello everyone,

I am seeing weird behavior with my 3090 GPU. I have run different AI models with different datasets, and the output is almost always NaN. Exactly the same models with the same datasets work perfectly on another computer with the same GPU (3090) and the same configuration (CUDA version, NVIDIA drivers, Python version, etc.).
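
To rule out the frameworks, even a check as simple as the one below should show whether plain GPU arithmetic already misbehaves (just a sketch assuming PyTorch; my actual models are much larger):

```python
import torch

# Minimal sanity check (sketch, assuming PyTorch; sizes are arbitrary):
# run the same matmul on CPU and GPU and compare the results.
# If the GPU result already contains NaNs here, the problem sits below
# the model/framework level.
a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

cpu_result = a @ b
gpu_result = (a.cuda() @ b.cuda()).cpu()

print("NaNs in GPU result:", torch.isnan(gpu_result).any().item())
print("max abs diff vs CPU:", (cpu_result - gpu_result).abs().max().item())
```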

What I tried so far:

  • Installed different NVIDIA driver/CUDA/Python versions
  • Reinstalled Ubuntu
  • Swapped the GPUs between the computers. The non-working GPU was still not working in the other computer, and the working GPU kept working in the computer with the problem.

My guess after these tests: the problem is the hardware; the GPU could be broken.

I am going to try some stress tests. Do you know an official tool to verify my hypothesis? Or… do you have another hypothesis?

Thank you in advance!

You could check your logs for Xid errors (maybe run nvidia-bug-report.sh as root to collect them).
There is also the CUDA memtest tool.

https://docs.nvidia.com/deploy/xid-errors/index.html#topic_5_2
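
In case it's handy, a quick way to scan the kernel log for Xid messages (a rough sketch; dmesg may need root depending on the system, and journalctl -k is an alternative):

```python
import re
import subprocess

# Rough sketch: scan the kernel log for Xid messages emitted by the NVIDIA driver.
# On some systems dmesg requires root; /var/log/kern.log or journalctl -k work too.
log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
xid_lines = [line for line in log.splitlines() if re.search(r"NVRM: Xid", line)]

if xid_lines:
    for line in xid_lines:
        print(line)
else:
    print("No Xid messages found in dmesg.")
```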

Hello,
I’ve observed exactly the same behaviour with my PNY NVIDIA GTX 1660 Ti 6GB, which I bought second-hand and for which the seller guaranteed it was never used for mining or intensive gaming.
Concerning the issue, I’ve run extensive stress tests and everything seems to be OK. However, any machine learning Python app I ran suffered from NaN values popping up everywhere.
In your case, did you manage to trace what is causing this problem?
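In case it helps with narrowing it down, something along these lines (a sketch assuming PyTorch, which may not match your apps) flags the first layer whose output contains NaNs:

```python
import torch
import torch.nn as nn

# Sketch (assuming PyTorch): register a forward hook on every layer so the first
# layer whose output contains NaNs raises immediately, instead of the NaNs only
# showing up in the final output or loss.
def nan_hook(module, inputs, output):
    if isinstance(output, torch.Tensor) and torch.isnan(output).any():
        raise RuntimeError(f"NaN in output of {module.__class__.__name__}")

# Dummy model just to show the wiring; replace with the real one.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10)).cuda()
for m in model.modules():
    m.register_forward_hook(nan_hook)

x = torch.randn(8, 64, device="cuda")
out = model(x)  # raises as soon as any layer produces NaNs
print("forward pass finished without NaNs")
```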
I tried installing different driver versions (both the studio drivers and the gaming drivers), but it didn’t help.
If you have found anything, please let me know as soon as possible: I’m in the middle of returning the GPU to the seller, but if there’s a workaround I would gladly keep it.
Thanks a bunch in advance for your help!

Alf