Hi, I am training a deep learning model on an RTX 4070 (12 GB). After roughly 150 epochs, training crashes with the following error:
“RuntimeError: min(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the ‘dim’ argument.”
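As far as I can tell, PyTorch raises this message when `min()` is called without a `dim` on a tensor that has zero elements, so presumably some tensor in my pipeline ends up empty at that point. Below is a minimal sketch that reproduces the message outside my training code; `safe_min` is a hypothetical helper of my own (not part of PyTorch) showing how I could guard the call, assuming a floating-point tensor:

```python
import torch


def safe_min(t: torch.Tensor, default: float = float("inf")) -> torch.Tensor:
    """Hypothetical guard: return `default` instead of raising on an empty tensor.

    Assumes a floating-point tensor, since the default value is a float.
    """
    if t.numel() == 0:
        return torch.tensor(default, dtype=t.dtype, device=t.device)
    return t.min()


# Reproduce the failure: a full reduction over a zero-element tensor.
empty = torch.empty(0)
try:
    empty.min()
    raised = False
    message = ""
except RuntimeError as err:
    raised = True
    message = str(err)

print("raised:", raised)
print(message)
```

This only explains the Python exception; it does not by itself explain why the GPU disappears afterwards.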
After the crash, the PC does not recognize the GPU and shows the following error:
“NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.”
I have uninstalled and reinstalled the drivers, but the issue persists. The GPU is only detected again if I wait a few hours, unplug it, and plug it back in. My machine is running Ubuntu 22.04 LTS. Is there any way to fix this? Thanks in advance.
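If it helps with diagnosis, I could log the GPU temperature and power draw during training to see what they look like just before the crash. A rough sketch of what I have in mind, assuming `nvidia-smi` is on the PATH and only the first GPU matters; `parse_gpu_line` and `read_gpu_status` are hypothetical helper names of my own:

```python
import subprocess

# Fields requested from nvidia-smi's --query-gpu option.
QUERY_FIELDS = ["timestamp", "temperature.gpu", "power.draw", "utilization.gpu"]


def parse_gpu_line(line):
    """Split one CSV line from nvidia-smi into a field -> string mapping."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(QUERY_FIELDS, values))


def read_gpu_status():
    """Return the status of the first GPU, or None if nvidia-smi fails."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        # Covers a missing binary, a non-zero exit code, and a timeout,
        # e.g. once the driver can no longer talk to the GPU.
        return None
    lines = result.stdout.strip().splitlines()
    return parse_gpu_line(lines[0]) if lines else None
```

Calling `read_gpu_status()` once per epoch and appending the result to a log file would show whether temperature or power draw climbs before the GPU drops out.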