nvidia-smi reports "GPU is lost" on Ubuntu 16.04 with an AMD Ryzen 7 2700X CPU, X470 motherboard, and two GTX 1080 Ti GPUs

abir.das · July 20, 2020, 12:15pm

Hi, I am getting “Unable to determine the device handle for GPU 0000:0B:00.0: GPU is lost. Reboot the system to recover this GPU” when I run nvidia-smi. It's an Ubuntu 16.04 system with two GeForce GTX 1080 Ti cards and an AMD Ryzen 7 2700X eight-core CPU on an X470 motherboard.
I ran “nvidia-bug-report.sh” and am attaching the generated “nvidia-bug-report.log”. I am seeing an Xid 79. From searching the forum, this might indicate an overheating or power problem. I ran ‘inxi -b’ and both cards are detected, as follows.
–=====
Graphics:  Card-1: NVIDIA GP102 [GeForce GTX 1080 Ti]
           Card-2: NVIDIA GP102 [GeForce GTX 1080 Ti]
           Display Server: N/A driver: nvidia tty size: 158x47 Advanced Data: N/A out of X
–=====
If I run ‘nvidia-smi -i 0’, the first card's information is shown. That output is below.
–=====
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.27       Driver Version: 415.27       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:0A:00.0  On |                  N/A |
| 23%   37C    P8     9W / 250W |     36MiB / 11175MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1739      G   /usr/lib/xorg/Xorg                            33MiB |
+-----------------------------------------------------------------------------+
–=====
I just downgraded the driver to check if it resolves the issue.
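
To confirm the Xid code straight from the kernel log rather than reading the whole bug report, something like the following could work. It is only a sketch: the regular expression assumes the driver writes Xid messages as `NVRM: Xid (PCI:<bus id>): <code>, <message>`, and the function and script names (`find_xid_events`, `find_xid.py`) are made up for illustration.

```python
import re
import sys

# Assumed format of a driver Xid message in the kernel log:
#   NVRM: Xid (PCI:<bus id>): <code>, <message>
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]*)\): (?P<code>\d+)")


def find_xid_events(lines):
    """Return a (device, code) tuple for every line that looks like an Xid report."""
    events = []
    for line in lines:
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group("device"), int(match.group("code"))))
    return events


if __name__ == "__main__":
    # Reads log text from standard input and prints one line per Xid event.
    for device, code in find_xid_events(sys.stdin):
        print("{}: Xid {}".format(device, code))
```

It reads from standard input, so it can be fed either the live kernel log (`dmesg | python3 find_xid.py`) or the saved report (`python3 find_xid.py < nvidia-bug-report.log`).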

I will have access to the server room where the machine sits (it's not a headless server, though) tomorrow, as office hours are over for today. However, is there a way to check whether the card is dead? Even by physical examination, is it possible to tell whether the card is dead or damaged due to overheating? I was running some PyTorch code, and the temperature generally does not go beyond 87 °C; it mostly stays in the 84-85 °C range. Is there a way to check what the last recorded temperature was?
Any pointers or help would be appreciated.
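
On the last-recorded-temperature question, one option for future runs is to log readings to a file while the job is running, so the values just before a failure are still on disk afterwards. The sketch below polls `nvidia-smi --query-gpu` every 10 seconds. The log file name `gpu_temps.log`, the interval, and the helper names are assumptions chosen for illustration, and it sticks to Python 3.5 features since that is what Ubuntu 16.04 ships.

```python
import csv
import io
import subprocess
import time

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = ["timestamp", "index", "temperature.gpu", "power.draw"]


def parse_gpu_rows(text):
    """Turn headerless nvidia-smi CSV output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        # Skip blank lines and anything that is not a complete row.
        if len(record) == len(QUERY_FIELDS):
            rows.append(dict(zip(QUERY_FIELDS, record)))
    return rows


def sample_gpus():
    """Run one nvidia-smi query; return [] if the command is missing or fails."""
    command = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except FileNotFoundError:
        return []
    if result.returncode != 0:
        return []
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    # gpu_temps.log is a hypothetical file name; 10 s is an arbitrary interval.
    with open("gpu_temps.log", "a") as log:
        while True:
            for row in sample_gpus():
                log.write(",".join(row[name] for name in QUERY_FIELDS) + "\n")
            log.flush()  # keep the file current in case the machine hangs
            time.sleep(10)
```

Values are kept as strings on purpose, so a field that nvidia-smi cannot report for a given card does not crash the logger.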

Many thanks,
Abir

nvidia-bug-report.log (3.2 MB)

abir.das · July 22, 2020, 4:40am

Simply reseating the cards in their slots mitigated the issue.
