RTX 3090 NaN Values

Hello everyone,

I am seeing weird behavior on my 3090 GPU. I ran different AI models with different datasets, and the output is almost always NaN. Exactly the same models with the same datasets work perfectly on another computer with the same GPU (3090) and the same configuration (CUDA version, NVIDIA drivers, Python version, etc.).

What I tried so far:

  • Installed different NVIDIA driver, CUDA, and Python versions
  • Reinstalled Ubuntu
  • Swapped the GPUs between the two computers. The non-working GPU still failed in the other computer, and the working GPU still worked in the computer that had the problem.

My guess after these tests: the problem is the hardware, and the GPU itself may be broken.

I am going to try some stress tests. Do you know of an official tool to verify my hypothesis? Or do you have other hypotheses?

Thank you in advance!

You could check your logs for Xid errors (maybe run nvidia-bug-report.sh as root to collect them).
There is also the CUDA memtest tool.

https://docs.nvidia.com/deploy/xid-errors/index.html#topic_5_2
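
If you just want a quick look before reading the full bug report, here is a minimal Python sketch that pulls Xid events out of kernel log text. It assumes the log lines contain the usual "NVRM: Xid (<device>): <code>" prefix. The function names and the sample log lines are made up for illustration, so treat it as a starting point rather than an official tool.

```python
import re
from collections import Counter

# Assumed line shape: "... NVRM: Xid (PCI:0000:01:00): 79, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)")

# Illustrative sample only, not output from a real machine.
SAMPLE_LOG = """\
[  100.000000] NVRM: Xid (PCI:0000:01:00): 79, pid=1234, GPU has fallen off the bus.
[  101.000000] usb 1-1: new high-speed USB device number 2 using xhci_hcd
[  250.000000] NVRM: Xid (PCI:0000:01:00): 79, pid=1234, GPU has fallen off the bus.
[  300.000000] NVRM: Xid (PCI:0000:01:00): 31, pid=5678, Ch 00000010
"""


def find_xid_events(log_text):
    """Return (device, xid_code) tuples in the order they appear in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group("device"), int(match.group("code"))))
    return events


def count_xid_events(events):
    """Count how often each (device, xid_code) pair occurs."""
    return Counter(events)


# Demo on the sample text above.
for (device, code), count in sorted(count_xid_events(find_xid_events(SAMPLE_LOG)).items()):
    print(f"{device}: Xid {code} seen {count} time(s)")
```

To use it on a real machine, replace SAMPLE_LOG with the text of your kernel log (for example the output of dmesg or journalctl -k saved to a file), then look up each reported code in the Xid table linked above.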