I frequently encounter GPU disconnection during training

As shown in the figure, my GPUs frequently disconnect during training. What could be causing this? My setup: two RTX 3090 GPUs, Ubuntu 18.04, driver 515.65.01, and CUDA 11.7. Is there any additional information I should provide to help diagnose the issue?

Disconnections under heavy load are most commonly caused by insufficient power delivery from the PSU, so that is probably worth checking first.
Also have a look at this post.
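
If it helps, here is a rough sketch for logging per-GPU power draw and temperature while training runs, so a disconnect can be lined up against the last readings before it happened. It assumes `nvidia-smi` is on the PATH; the function names, the dictionary keys, and the 5-second interval are my own choices for illustration. Note that this only shows what the GPUs report drawing. It cannot prove that the PSU is undersized.

```python
import csv
import io
import subprocess
import time

# Fields requested from nvidia-smi, in this order.
QUERY_FIELDS = ["index", "temperature.gpu", "power.draw", "power.limit"]


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != len(QUERY_FIELDS):
            continue  # blank or malformed line
        try:
            rows.append({
                "index": int(rec[0]),
                "temp_c": float(rec[1]),
                "power_w": float(rec[2]),
                "limit_w": float(rec[3]),
            })
        except ValueError:
            continue  # a field was not numeric, e.g. "[N/A]"
    return rows


def sample_gpus():
    """Run nvidia-smi once and return one dict per GPU that reported values."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


def log_forever(interval_s=5.0):
    """Print a timestamped reading for every GPU until interrupted."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for gpu in sample_gpus():
            print(f"{stamp} gpu{gpu['index']} "
                  f"{gpu['temp_c']:.0f}C "
                  f"{gpu['power_w']:.1f}W / {gpu['limit_w']:.1f}W")
        time.sleep(interval_s)
```

Call `log_forever()` in a separate terminal and redirect the output to a file before starting the training job. Once a GPU has dropped off, `nvidia-smi` itself may start failing, in which case the last lines written before the error are the interesting ones.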

It’s always a good idea to provide nvidia-bug-report.log.gz, but given that the Nvidia Linux driver team apparently has 3 engineers and the looooong list of outstanding critical bugs, the chances that someone will actually look into this are not very high…

Thank you very much for your reply.

I went through many similar cases on the forum and NVIDIA's official XID Errors documentation (r555). In the end, I traced the issue to excessive GPU temperature. After replacing the GPUs with two units that have better cooling, the problem disappeared.
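
For anyone landing here with the same symptom: the Xid documentation is organized by numeric code, so the first step is getting the codes out of the kernel log. Below is a minimal sketch that does this, assuming the usual `NVRM: Xid (PCI:...): <code>, ...` line format written by the driver. The function name and the regular expression are my own, not something taken from NVIDIA's tooling.

```python
import re

# Matches the PCI address and numeric code in a driver Xid line.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return (pci_address, xid_code, full_line) for every Xid line found."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2)), line.strip()))
    return events
```

Feed it the text of `dmesg` or `journalctl -k`, then look up each code in the Xid documentation. The codes alone may not point to temperature or power directly, so it is worth reading them together with temperature readings taken under load.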
