GPU: 2 × 3090
Hi! I run a machine learning workflow separately on two 3090s, with identical code and configuration on both cards. I had run it several times without any problem until last week: GPU0 fell off the bus after about 80 hours of work, while GPU1 kept running fine. At that point I could no longer use ‘nvidia-smi’ to check the GPU state. After rebooting the problem seemed fixed, but it appeared again on the same GPU when I tried another training run.
Since GPU1 works fine under the same configuration while GPU0 does not, it does not appear to be a PSU problem.
So I logged the temperature before the fall-off occurred. It stays around 75 °C, far below the shutdown temperature, so it does not seem to be an overheating issue either. The logging is just a simple polling loop around nvidia-smi, roughly like the sketch below (the interval and log path there are illustrative, not the exact values I use).
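```python
#!/usr/bin/env python3
"""Minimal sketch of the temperature logging: poll nvidia-smi and append
per-GPU temperature/power/utilization to a file. The 5-second interval and
the log file name are placeholders."""
import subprocess
import time
from datetime import datetime

LOG_FILE = "gpu_temps.log"   # illustrative path
INTERVAL_S = 5               # illustrative polling interval

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,power.draw,utilization.gpu",
    "--format=csv,noheader",
]

def poll_once() -> str:
    """Run one nvidia-smi query; raises if the driver has lost the GPU."""
    return subprocess.check_output(QUERY, text=True, timeout=10).strip()

if __name__ == "__main__":
    with open(LOG_FILE, "a") as log:
        while True:
            stamp = datetime.now().isoformat(timespec="seconds")
            try:
                for line in poll_once().splitlines():
                    log.write(f"{stamp} {line}\n")
            except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as exc:
                # After the card falls off the bus, nvidia-smi hangs or errors,
                # so the failure time also ends up recorded in the log.
                log.write(f"{stamp} nvidia-smi failed: {exc}\n")
            log.flush()
            time.sleep(INTERVAL_S)
```
The last entries before the crash still show GPU0 at roughly 75 °C.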
I would like your suggestions on:
- how I can track down the root cause and get rid of it.
Here is the log file.
nvidia-bug-report.log.gz (367.6 KB)