OS: Debian GNU/Linux bookworm/sid x86_64
CPU: 11th Gen Intel i9-11900K (16) @ 5.100GHz
GPU: NVIDIA GeForce RTX 3080 Ti
Nvidia driver: 510.73.08
CUDA version: 11.6
After a few minutes of CNN training with torch, the program hangs without raising any error or exception. Running “nvidia-smi” at that point returns the following error: “Unable to determine the device handle for GPU 0000:01:00.0”
I have tried training with several architectures and configurations, and all of them eventually hang in the same way.
The only workaround I have found to get the GPU working again is to reboot the machine after the error.
I have attached the NVIDIA bug report log:
nvidia-bug-report.log.gz (395.5 KB)
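
In case it helps anyone reproduce the hang, below is a minimal sketch of how the GPU state could be logged in a second terminal while training runs, so that the last readings before nvidia-smi stops responding are kept on disk. This is only an illustration and is not taken from the attached log. It assumes nvidia-smi is on the PATH, and the names parse_gpu_line, query_gpu, monitor and the file gpu_monitor.log are hypothetical.

```python
import subprocess
import time

# Fields requested from nvidia-smi; all are standard --query-gpu properties.
QUERY_FIELDS = ["timestamp", "temperature.gpu", "power.draw", "utilization.gpu"]


def parse_gpu_line(line):
    """Split one CSV line of nvidia-smi output into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(QUERY_FIELDS):
        raise ValueError(f"unexpected nvidia-smi output: {line!r}")
    return dict(zip(QUERY_FIELDS, values))


def query_gpu(timeout_s=10):
    """Run nvidia-smi once and return one dict per GPU.

    Raises RuntimeError if nvidia-smi exits with an error, and
    subprocess.TimeoutExpired if it does not answer within timeout_s.
    """
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    if result.returncode != 0:
        raise RuntimeError(result.stdout.strip() or result.stderr.strip())
    return [parse_gpu_line(l) for l in result.stdout.splitlines() if l.strip()]


def monitor(log_path="gpu_monitor.log", interval_s=5):
    """Append GPU readings to log_path until nvidia-smi fails, then stop."""
    with open(log_path, "a") as log:
        while True:
            try:
                for row in query_gpu():
                    log.write(f"{row}\n")
            except (RuntimeError, subprocess.TimeoutExpired, FileNotFoundError) as exc:
                # Record the failure message, e.g. the device handle error.
                log.write(f"nvidia-smi failed: {exc!r}\n")
                break
            log.flush()  # keep readings on disk in case the machine needs a reboot
            time.sleep(interval_s)
```

Calling monitor() before starting the training run would leave the temperature, power draw and utilization readings from just before the failure in gpu_monitor.log, which could then be posted alongside the bug report.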