Unable to determine the device handle for GPU0000:01:00.0: Unknown Error

syxuming · September 4, 2023, 7:00pm

OS: Ubuntu 22.04
Driver version: NVIDIA driver metapackage, nvidia-driver-535
GPUs: 2 × RTX 3090

When I train the llama2-13b model on both GPUs, the GPU in the pcie1 slot seems to drop out.

When I try to check it with nvidia-smi, I get
“Unable to determine the device handle for GPU0000:01:00.0: Unknown Error”

I’ve rerun the training many times since this afternoon, and it has failed every time. At first the model trained for over an hour before reporting a failure. As of this evening, training runs fail within a few minutes, and the fan spins up very loudly at the moment of failure.

What I have tried so far:

1. Rebooted; it didn’t help.
2. Reinstalled the graphics drivers; no help.
3. At night I tested each card individually:
   3.1 Training with a single card on pcie1 quickly reports an error (one second before the error, the GPU core temperature is only 40-50 °C).
   3.2 Training with a single card on pcie3 runs the full model normally.
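
The per-card test above can also be scripted without physically removing a card. Below is a minimal sketch, assuming both cards stay installed and that the training job honors the standard CUDA environment variables `CUDA_VISIBLE_DEVICES` and `CUDA_DEVICE_ORDER`. The helper names `single_gpu_env` and `run_on_gpu` are hypothetical, made up for illustration.

```python
import os
import subprocess


def single_gpu_env(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    # Enumerate devices in PCI bus order so the index follows the bus IDs
    # (by default CUDA orders devices "fastest first").
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Hide every GPU except the one under test.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_on_gpu(command, gpu_index):
    """Run a command (a list of strings) on one GPU and return its exit code."""
    result = subprocess.run(command, env=single_gpu_env(gpu_index))
    return result.returncode
```

For example, `run_on_gpu(["python", "train.py"], 0)` would launch a placeholder script `train.py` on the first GPU in bus order. Comparing the exit codes for index 0 and index 1 reproduces the single-card comparison.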

Here is my nvidia-bug-report.log
nvidia-bug-report.log.gz (643.1 KB)
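
For anyone reading a log like this, a quick first pass is to look for driver Xid events, which the NVIDIA kernel module writes to the kernel log. The sketch below assumes the usual `NVRM: Xid (PCI:<bus id>): <code>, ...` line format. Whether the attached log contains such lines is not known, and the function names `read_log` and `find_xid_events` are hypothetical.

```python
import gzip
import re
from pathlib import Path

# Assumed line format: "NVRM: Xid (PCI:0000:01:00): <code>, ..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")


def read_log(path):
    """Read a plain or gzip-compressed log file as text."""
    p = Path(path)
    opener = gzip.open if p.suffix == ".gz" else open
    with opener(p, "rt", errors="replace") as fh:
        return fh.read()


def find_xid_events(text):
    """Return a list of (pci_bus_id, xid_code) tuples found in the log text."""
    return [(m.group(1), int(m.group(2))) for m in XID_RE.finditer(text)]
```

Calling `find_xid_events(read_log("nvidia-bug-report.log.gz"))` lists which PCI device reported which Xid code. The codes can then be looked up in NVIDIA's Xid documentation.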

Could someone give me a hand? Thank you very much!

benrosmine · May 25, 2024, 11:24pm

Any luck solving this? I’m having the same issue with an RTX 6000 Ada: Unable to determine the device handle for GPU0000:42:00.0: Unknown Error
