I received an MSI WF7611UJ laptop from my workplace a few days ago and have been struggling to use CUDA on it. It has an NVIDIA RTX A2000 graphics card with 4 GB of dedicated memory.
On the Windows side, after each clean installation of CUDA and a restart, I am able to run a test program for a minute or so before the GPU becomes unresponsive, or rather ineffective. Running nvidia-smi -L then gives the famous response:
“Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU”
After the next restart the GPU seems to be completely gone. My own test code, which checks CUDA calls and throws thrust::system_error with the error code and thrust::cuda_category(), returns:
cudaErrorNoDevice: no CUDA-capable device is detected
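For reference, here is a minimal sketch of the error-checking pattern my test program uses (a standard Thrust idiom; the `check` helper and the `cudaGetDeviceCount` probe are just illustrative, not my exact code). It needs a CUDA toolkit and GPU to run:

```cpp
// Minimal sketch (assumes CUDA toolkit with Thrust installed).
#include <thrust/system_error.h>
#include <thrust/system/cuda/error.h>
#include <cstdio>

// Illustrative helper: wrap a CUDA runtime call and throw on failure.
void check(cudaError_t error)
{
    if (error != cudaSuccess)
        throw thrust::system_error(error, thrust::cuda_category());
}

int main()
{
    try
    {
        int count = 0;
        check(cudaGetDeviceCount(&count));
        std::printf("CUDA devices: %d\n", count);
    }
    catch (const thrust::system_error& e)
    {
        // After the card is lost, this prints something like:
        // "cudaErrorNoDevice: no CUDA-capable device is detected"
        std::fprintf(stderr, "%s\n", e.what());
        return 1;
    }
    return 0;
}
```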
and nvidia-smi returns:
Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error
I tried CUDA 11.4 through 11.7 and every driver version I could get my hands on, from 471.41 up to the latest 496.49, with the same result. I managed to capture the CUDA-Z specs in the 1–2 minutes the card was still working:
I tried the same approach, with less effort, on Ubuntu 20.04 LTS. The same thing happens: the driver/device works for less than 5 minutes and then dies for good. Here is the nvidia-bug-report:
nvidia-bug-report.log.gz (252.3 KB)
The log reports the GPU falling off the bus (Xid 79), and based on this discussion I suspect it could be an overheating or motherboard problem:
I tried to limit the GPU clocks, but I am not sure whether that is supported on this card, and on a couple of occasions the card was gone before I could apply any persistent clock limit. Is there anything that can be done about this?
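For completeness, this is roughly what I tried for the clock limiting, right after boot while the card was still alive (the 300,1200 MHz range is just an example; -lgc needs root, a recent driver, and may simply be unsupported on this mobile GPU):

```shell
sudo nvidia-smi -pm 1            # enable persistence mode so the driver stays loaded
sudo nvidia-smi -lgc 300,1200    # lock graphics clocks to a range (example values in MHz)
sudo nvidia-smi -q -d CLOCK      # verify which clocks were actually applied
# to undo the lock later:
sudo nvidia-smi -rgc
```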