I received an MSI WF7611UJ laptop from my workplace a few days ago and have been struggling to use CUDA on it. It has an NVIDIA RTX A2000 graphics card with 4 GB of dedicated memory.
On the Windows side, after each clean installation of CUDA and a restart, I am able to run a test program for a minute or so before the GPU becomes unresponsive, or rather ineffective. Running nvidia-smi -L then gives the famous response:
“Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU”
After the next restart the GPU seems to be completely gone. My own test code, which checks CUDA calls and throws thrust::system_error with the error code and thrust::cuda_category(), returns:
cudaErrorNoDevice: no CUDA-capable device is detected
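For reference, here is a minimal sketch of the error-checking pattern my test program uses (a standard Thrust idiom; the `check` helper and the `cudaGetDeviceCount` probe are just illustrative, not my exact code). It needs a CUDA toolkit and GPU to run:

```cpp
// Minimal sketch (assumes CUDA toolkit with Thrust installed).
#include <thrust/system_error.h>
#include <thrust/system/cuda/error.h>
#include <cstdio>

// Illustrative helper: wrap a CUDA runtime call and throw on failure.
void check(cudaError_t error)
{
    if (error != cudaSuccess)
        throw thrust::system_error(error, thrust::cuda_category());
}

int main()
{
    try
    {
        int count = 0;
        check(cudaGetDeviceCount(&count));
        std::printf("CUDA devices: %d\n", count);
    }
    catch (const thrust::system_error& e)
    {
        // After the card is lost, this prints something like:
        // "cudaErrorNoDevice: no CUDA-capable device is detected"
        std::fprintf(stderr, "%s\n", e.what());
        return 1;
    }
    return 0;
}
```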
and nvidia-smi returns:
Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error
I tried CUDA 11.4 through 11.7 and every driver version I could get my hands on, from 471.41 up to the latest 496.49, with the same result. I managed to capture the CUDA-Z specs in the 1–2 minutes the card was still working:
I tried the same approach, with less effort, on Ubuntu 20.04 LTS. The same thing happens: the driver/device works for less than 5 minutes and then dies for good. Here is the nvidia-bug-report:
nvidia-bug-report.log.gz (252.3 KB)
The log reports the GPU falling off the bus (Xid 79), and based on this discussion I suspect it could be an overheating or motherboard problem:
I tried to limit the GPU clocks, but I am not sure whether that is supported on this card, and on a couple of occasions the card was gone before I could apply any persistent clock limit. Is there anything that can be done about this?
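For completeness, this is roughly what I tried for the clock limiting, right after boot while the card was still alive (the 300,1200 MHz range is just an example; -lgc needs root, a recent driver, and may simply be unsupported on this mobile GPU):

```shell
sudo nvidia-smi -pm 1            # enable persistence mode so the driver stays loaded
sudo nvidia-smi -lgc 300,1200    # lock graphics clocks to a range (example values in MHz)
sudo nvidia-smi -q -d CLOCK      # verify which clocks were actually applied
# to undo the lock later:
sudo nvidia-smi -rgc
```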