nvidia-bug-report.log.gz (1.2 MB)
OS: Ubuntu Server 22.04.5
Driver Version: 550.127.05
GPUs: 3 NVIDIA A100 80GB PCIe
CUDA Version according to NVIDIA-SMI: 12.4
After starting machine learning model training with YOLO across all 3 GPUs simultaneously, GPU 0 remains unused while GPUs 1 and 2 freeze at 100% Volatile GPU-Util during the initial data validation step. The process never progresses to the actual training, the server eventually hangs, and running `nvidia-smi` returns the error:
unable to determine the device handle for gpu0000:34:00.0: unknown error
After this, the server automatically reboots and everything returns to normal.
However, this does not happen when training a custom CNN.
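For reference, the training is launched roughly like the sketch below, assuming the ultralytics YOLO Python API; the model weights, dataset config, and hyperparameters are placeholders, not the exact values used:

```python
# Minimal sketch of the multi-GPU training call (ultralytics YOLO API).
# Model weights, dataset path, and hyperparameters are placeholders.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")      # placeholder pretrained weights
model.train(
    data="dataset.yaml",        # placeholder dataset config
    epochs=100,
    imgsz=640,
    device=[0, 1, 2],           # all three A100s; the hang appears during the
                                # pre-training dataset validation, with GPUs 1 and 2
                                # stuck at 100% and GPU 0 idle
)
```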
Can someone help?