GPU doesn't respond after training with PyTorch

Current system:
OS: Debian GNU/Linux bookworm/sid x86_64
Kernel: 5.18.0-2-amd64
CPU: 11th Gen Intel i9-11900K (16) @ 5.100GHz
GPU: NVIDIA GeForce RTX 3080 Ti
Desktop env.: plasmashell 5.27.5

Nvidia driver: 535.104.05
CUDA version: 12.2

lib versions
pytorch: 2.0.0
pytorch-lightning: 2.0.3

After a few minutes of CNN training with torch, the program hangs without raising any error. Running "nvidia-smi" afterwards returns the following error: "Unable to determine the device handle for GPU 0000:01:00.0".
In addition, the desktop environment stops responding when this happens. However, most of the time the system remains accessible over SSH.
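Since the system stays reachable over SSH, the kernel log can be inspected right after a hang. A minimal check (a sketch, assuming standard Debian tooling; the grep patterns are just the usual NVIDIA driver markers):

```
# Inspect the kernel ring buffer for NVIDIA Xid / driver messages after the hang
sudo dmesg --ctime | grep -i -E 'xid|nvrm'

# With persistent journald logging, the previous boot can be checked as well
journalctl -k -b -1 | grep -i xid
```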

The training has been carried out with several architectures and configurations; eventually, all of them ended up halting.

The only workaround to get the GPU working again has been rebooting the machine after the error occurs.

I attach the nvidia log:
nvidia-bug-report.log.gz (395.5 KB)

I have tried CUDA's debugging tools, running compute-sanitizer with the memcheck tool, but it did not report any errors.
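For reference, this is roughly how the sanitizer was invoked (`train.py` is a placeholder for the actual training script):

```
# Run the training script under compute-sanitizer's memcheck tool
compute-sanitizer --tool memcheck python train.py
```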

You're getting an Xid 79, "GPU has fallen off the bus". The most common causes are overheating or insufficient power. Monitor temperatures, and check or replace the PSU.
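Temperature and power draw can be logged during training with nvidia-smi's query mode, e.g. (a sketch; adjust the sampling interval and log path as needed):

```
# Sample temperature, power draw and clocks every 5 s, appending to a CSV log
nvidia-smi \
  --query-gpu=timestamp,temperature.gpu,power.draw,clocks.sm,utilization.gpu \
  --format=csv -l 5 >> gpu_monitor.csv
```

Leave this running in a second SSH session while training; the last lines before the hang show whether the card was running hot or spiking in power.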


After replacing the PSU with a similar one, the error did not happen again. I'm going to stress-test the GPU intensively with a brand-new PSU and will post the results when finished.
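For the stress test, a sustained compute load can be generated with gpu_burn (an assumption on tooling: the binary comes from the open-source gpu-burn project and must be built first), while logging nvidia-smi in parallel:

```
# Run a sustained GPU compute load for one hour (3600 s);
# gpu_burn is built with `make` from the gpu-burn repository
./gpu_burn 3600
```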

It's advisable to switch to a different PSU brand/model.