Hello everyone,
First of all, I apologize for double posting. I removed the previous one immediately.
During a CUDA job, I got the following message: ‘Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU’ and I rebooted the system as suggested. Unfortunately, after the reboot, nvidia-smi started to show ‘no devices were found’.
I see that this is a very common problem, but some of the proposed solutions or diagnoses (like ‘it is a hardware problem’) were stated without an explicit reason, so I figured I should ask the question again.
My OS is Ubuntu 18.04 and the GPU is an RTX 2080 Ti. I connect to the system via ssh, and I do not have physical access to the machine at the moment.
I tried removing the nvidia drivers and reinstalling them but the problem persists.
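In case it helps, these are the kinds of checks I can run over ssh to see whether the card is still visible at all (treat the commands as a sketch; 10de is the NVIDIA PCI vendor ID, and 01:00.0 is the address from the crash message):

# Is the GPU still enumerated on the PCI bus?
lspci -d 10de:
# Any NVRM / Xid messages around the failure?
dmesg | grep -iE 'nvrm|xid'
# Which NVIDIA kernel modules are loaded?
lsmod | grep nvidia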
I attach the nvidia-bug-report file, which I generated after the crash, along with the latest output of ‘dmesg’. (For some reason I couldn’t find the ‘paperclip icon’ after I created the topic, so I’m including a Google Drive folder with both files: NVIDIA attachments - Google Drive )
Please let me know if I need to provide any other information.
Thanks in advance for your help!
Given the symptoms you described, this really does sound like a hardware problem. It’s possible for software or system-level things to cause the GPU to fall off the bus, but it’s pretty unlikely to be a software problem if the GPU is working fine for a while and then suddenly stops responding in the middle of a heavy task.
You might want to consider increasing fan speeds and possibly improving system ventilation to make sure the GPU is not getting too hot. Another very common cause of instability, especially during heavy load, is an inadequate or failing power supply that is not able to keep up with the demands of the system as a whole.
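If it helps to rule temperature and power out, something along these lines will log both while a job runs (the fields are standard nvidia-smi query properties; the 5-second interval is just an example):

# Log temperature, fan speed, and power draw every 5 seconds
nvidia-smi --query-gpu=timestamp,temperature.gpu,fan.speed,power.draw --format=csv -l 5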
The instructions mentioning the paperclip icon are left over from the old forum software. The forum was switched to the new software just this past weekend, so some of the sticky posts are a bit stale.
Thanks aplattner for the fast response.
I had a similar problem a month ago, but I didn’t think of generating a bug report or posting here. The same GPU crashed during a CUDA job, and we were able to get it working again after a BIOS update. If it was the same problem, I assume it didn’t damage the hardware itself, because I was able to use the card for a week afterwards. I guess my BIOS is now up to date, and I wonder if there is a software solution (e.g. updating the drivers) that would make the computer recognize the GPU again.
As an extra piece of information, I was monitoring the temperature right before the crash, and it was around 70C. My impression is that this isn’t particularly high, so I’m not sure the issue is overheating-related. Also, my power supply is rated at 750W. I know the real output can be lower than the advertised numbers, but I once asked ASUS technical support whether that might have caused the first crash, and they said 750W should be more than enough.
Do you have any other suggestions about what the underlying issue might be? Would you disagree with my assumption that it’s not an overheating or power-supply-related problem?
Thanks again for the help!
I agree that it sounds like it’s not overheating, but it can’t hurt to try cranking up the fans (or lowering the power limit with nvidia-smi -pl) as an experiment.
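For example, something like the following; the 200 W value is only an illustration, so check the supported range first:

# Show the default, current, and allowed power limits
nvidia-smi -q -d POWER
# Lower the board power limit (needs root; the value must be inside the allowed range)
sudo nvidia-smi -pl 200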
It’s hard to say about the power supply. They seem really variable in general and have a tendency to run fine for a long time only to flake out when the system is under heavy stress for a long time. If you have a spare power supply you could swap out to see if it improves things, that might be worth trying.
It’s also possible that the GPU needs to be reseated in the socket, or have its contacts cleaned, or that your motherboard or memory is flaky. Running memtest86+ also couldn’t hurt.
I’m sorry I don’t have a better answer for you. It’s almost never the case that symptoms like this can be easily tracked down to a simple software bug, or even reproduced in-house. They’re almost always specific to one particular set of hardware.
I will try lowering the power limit as soon as the GPU is working again, thanks for the advice.
Weirdly, after trying to reinstall the drivers to see if that would solve the problem, I started getting Failed to initialize NVML: Driver/library version mismatch from nvidia-smi. This is the output of dmesg:
NVRM: API mismatch: the client has the version 440.59, but
NVRM: this kernel module has the version 440.64. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
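As far as I understand, this means the loaded kernel module and the user-space libraries come from different installs, which can be checked with something like this (the library path is the usual one on Ubuntu, so treat it as an assumption):

# Version of the NVIDIA kernel module that is currently loaded / installed
cat /proc/driver/nvidia/version
modinfo nvidia | grep ^version
# Version of the user-space libraries that nvidia-smi uses
ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.*
dpkg -l | grep nvidia-driver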
To give the complete picture: I had driver version 440.59, and then I installed 440.64 with a .run file. When my problems started, I removed the 440.64 version by running the .run file with the --uninstall option, and then installed the latest driver via the ppa and the ubuntu-drivers autoinstall method. Now I get the message shown above.
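My plan is to clean out both the .run install and the packaged drivers and then reinstall from a single source, roughly like this (the .run filename is only my guess at the 440.64 installer name, and the purge patterns are deliberately broad):

# Remove the .run-installed driver (filename assumed from version 440.64)
sudo ./NVIDIA-Linux-x86_64-440.64.run --uninstall
# Purge any packaged NVIDIA drivers
sudo apt-get purge '^nvidia-.*' '^libnvidia-.*'
# Reinstall from the repository only, then reboot so the matching kernel module loads
sudo ubuntu-drivers autoinstall
sudo reboot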
I see that other people have had similar problems when installing/using the CUDA libraries, but at the moment I only have the NVIDIA driver installed (I’ll reinstall CUDA later if it works). Do you have any suggestions about what I could do, or is there any other information I should provide?
Thanks again for the help!!