GPU keeps shutting down when training with CUDA and NeMo

Hello there,

I’m experiencing an issue these days with a shared workstation in my lab running Ubuntu Server 22.04 (two RTX 3090s, 24 GB each). In particular, when a collaborator starts fine-tuning NeMo’s FastConformer, the training proceeds normally until about half of the epochs (sometimes more) and then suddenly stops without reporting any error.
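
I don’t have my colleague’s exact script at hand, but the workload is roughly of this shape (the checkpoint name, manifest paths, batch size, and trainer settings below are placeholders, not our real configuration):

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# Placeholder FastConformer checkpoint from NGC; not necessarily the one being used
model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_fastconformer_ctc_large")

# Our own manifests (paths and batch size are placeholders)
train_cfg = OmegaConf.create({
    "manifest_filepath": "/data/train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
val_cfg = OmegaConf.create({
    "manifest_filepath": "/data/val_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
})
model.setup_training_data(train_data_config=train_cfg)
model.setup_validation_data(val_data_config=val_cfg)

# Both 3090s with DDP; the crash happens somewhere in the middle of training
trainer = pl.Trainer(devices=2, accelerator="gpu", strategy="ddp", max_epochs=100)
trainer.fit(model)
```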

Running the nvidia-smi command after the crash results in the following: “Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error”.
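
For what it’s worth, I’m planning to leave a small telemetry logger running next to the training so that the temperature/power/clock state right before the failure gets recorded; something along these lines (the log path and the 30-second interval are just arbitrary choices on my side):

```python
import subprocess, time, datetime

LOG = "/tmp/gpu_telemetry.csv"  # placeholder path
FIELDS = ("timestamp,index,temperature.gpu,power.draw,"
          "memory.used,utilization.gpu,clocks.sm")

with open(LOG, "a") as f:
    while True:
        try:
            out = subprocess.run(
                ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10,
            )
            # If nvidia-smi itself starts failing (as it does once the GPU drops off),
            # record its stderr instead of silently stopping.
            f.write(out.stdout if out.returncode == 0
                    else f"{datetime.datetime.now()} ERROR: {out.stderr}\n")
        except subprocess.TimeoutExpired:
            f.write(f"{datetime.datetime.now()} nvidia-smi timed out\n")
        f.flush()
        time.sleep(30)
```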

The only way to recover is to reboot the system, which also gets stuck on startup, so I usually avoid doing it since I would need to be physically in the lab to solve the problem.

Here is the output of the nvidia-bug-report.sh script:
nvidia-bug-report.log (387.2 KB)

I appreciate your help.