RTX A2000 becomes unavailable less than 5 minutes after installation on both Linux and Windows ("Unable to determine the device handle for ...")

I received an MSI WF7611UJ laptop from my workplace a few days ago and have been struggling to use CUDA on it. It has an NVIDIA RTX A2000 graphics card with 4 GB of dedicated memory.

On the Windows side, after each clean installation of CUDA and a system restart, I can run a test program for a minute or so before the GPU becomes unresponsive, or rather ineffective. Running nvidia-smi -L
I receive the well-known response:
“Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU”
After the next restart the GPU seems to be completely gone. My own test code, which throws a thrust::system_error built from the error code and thrust::cuda_category(), returns:

cudaErrorNoDevice: no CUDA-capable device is detected

and nvidia-smi returns:

Unable to determine the device handle for gpu 0000:01:00.0: Unknown Error

I tried CUDA 11.4 through 11.7 and every driver version I could get my hands on, from 471.41 up to the latest 496.49, with the same results. I managed to capture the CUDA-Z specs in the 1-2 minutes the card keeps working:

I tried the same approach, with less effort, on Ubuntu 20.04 LTS. The same thing happens: the driver/device works for less than 5 minutes and then dies for good. Here is the nvidia-bug-report:

nvidia-bug-report.log.gz (252.3 KB)

That report shows the GPU falling off the bus (Xid 79), and based on this discussion I suspect an overheating or motherboard problem:

I tried to limit the GPU clock, but I am not sure that is supported on this card, and on a couple of occasions the card died before I could apply a persistent clock limit. Is there anything that can be done about this?
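For reference, the clock-limiting attempt described above would normally use the standard nvidia-smi flags below. This is a sketch only: it needs root and a still-responsive driver, and clock locking (-lgc) is not supported on every GPU, which may be exactly the uncertainty mentioned above. The 300,1200 MHz range is an illustrative value, not a recommendation for this card.

```shell
# Keep the driver state loaded between runs (helps apply settings early).
sudo nvidia-smi -pm 1

# Try to lock the graphics clock to a conservative range (MHz).
# Fails with "not supported" on GPUs that lack clock locking.
sudo nvidia-smi -lgc 300,1200

# Restore default clock behaviour later:
sudo nvidia-smi -rgc
```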

Please try reseating the card in its slot and check whether it works in another system.


Thank you. It is a laptop, so I cannot reseat anything. It turned out to be a defective card or mainboard; I sent it in for warranty repair. I recommend that anyone facing a similar problem first run nvidia-bug-report.sh and look for Xid 79 in the report, before wasting as much time as I did.
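The check recommended above can be sketched as follows. On the affected machine the gzipped log comes from `sudo nvidia-bug-report.sh`; here a single synthetic log line stands in for the real report so the search itself can be demonstrated:

```shell
# On a real system: sudo nvidia-bug-report.sh  (writes nvidia-bug-report.log.gz)
# Synthetic stand-in line, modelled on the kernel message format for Xid 79:
printf 'NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.\n' \
  | gzip > nvidia-bug-report.log.gz

# Search the gzipped report for Xid events.
# Xid 79 ("GPU has fallen off the bus") points strongly at hardware trouble.
zgrep -i 'xid' nvidia-bug-report.log.gz
```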