GPU keeps shutting down when training with CUDA and NeMo

Hello there,

I’m experiencing an issue these days with a shared workstation in my lab running Ubuntu Server 22.04 (two RTX 3090s, 24 GB each). In particular, when a collaborator starts fine-tuning NeMo’s FastConformer, the training proceeds normally until about half of the epochs (sometimes more) and then suddenly stops without reporting any error.
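
I don’t have my colleague’s exact script at hand, but the workload is roughly of this shape (the checkpoint name, manifest paths, batch size, and trainer settings below are placeholders, not our real configuration):

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# Placeholder FastConformer checkpoint from NGC; not necessarily the one being used
model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_fastconformer_ctc_large")

# Our own manifests (paths and batch size are placeholders)
train_cfg = OmegaConf.create({
    "manifest_filepath": "/data/train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
val_cfg = OmegaConf.create({
    "manifest_filepath": "/data/val_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
})
model.setup_training_data(train_data_config=train_cfg)
model.setup_validation_data(val_data_config=val_cfg)

# Both 3090s with DDP; the crash happens somewhere in the middle of training
trainer = pl.Trainer(devices=2, accelerator="gpu", strategy="ddp", max_epochs=100)
trainer.fit(model)
```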

Running the nvidia-smi command after the crash results in the following: “Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error”.
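
For what it’s worth, I’m planning to leave a small telemetry logger running next to the training so that the temperature/power/clock state right before the failure gets recorded; something along these lines (the log path and the 30-second interval are just arbitrary choices on my side):

```python
import subprocess, time, datetime

LOG = "/tmp/gpu_telemetry.csv"  # placeholder path
FIELDS = ("timestamp,index,temperature.gpu,power.draw,"
          "memory.used,utilization.gpu,clocks.sm")

with open(LOG, "a") as f:
    while True:
        try:
            out = subprocess.run(
                ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10,
            )
            # If nvidia-smi itself starts failing (as it does once the GPU drops off),
            # record its stderr instead of silently stopping.
            f.write(out.stdout if out.returncode == 0
                    else f"{datetime.datetime.now()} ERROR: {out.stderr}\n")
        except subprocess.TimeoutExpired:
            f.write(f"{datetime.datetime.now()} nvidia-smi timed out\n")
        f.flush()
        time.sleep(30)
```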

The only way to recover is to reboot the system, which also gets stuck on startup, so I usually avoid doing it since I would need to be physically in the lab to solve the problem.

Here is the output of the nvidia-bug-report.sh script:
nvidia-bug-report.log (387.2 KB)

I appreciate your help.