GPU is lost. Reboot the system to recover this GPU

Hello,

The four V100s usually run at close to 100% utilization all the time, and the machine has worked without issues for a long time. It runs PyTorch training jobs.

But since last week, I repeatedly get the following error after some time when running nvidia-smi:

GPU is lost. Reboot the system to recover this GPU.

The kernel logs the following:

Feb 24 03:51:40 DGX-Station kernel: [147935.988522] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010 
Feb 24 03:51:40 DGX-Station kernel: [147935.988528] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID) 
Feb 24 03:51:40 DGX-Station kernel: [147935.988532] NVRM: Xid (PCI:0000:07:00): 79, pid=3042, GPU has fallen off the bus. 
Feb 24 03:51:40 DGX-Station kernel: [147935.988535] NVRM: GPU 0000:07:00.0: GPU has fallen off the bus. 
Feb 24 03:51:40 DGX-Station kernel: [147935.988539] pcieport 0000:00:02.0:   device [8086:6f04] error status/mask=00004000/00000000 
Feb 24 03:51:40 DGX-Station kernel: [147935.988543] pcieport 0000:00:02.0:    [14] Completion Timeout     (First) 
Feb 24 03:51:40 DGX-Station kernel: [147935.988546] NVRM: GPU 0000:07:00.0: GPU is on Board 0330818002654. 
Feb 24 03:51:40 DGX-Station kernel: [147935.988550] pcieport 0000:00:02.0: broadcast error_detected message 
Feb 24 03:51:40 DGX-Station kernel: [147935.988554] pcieport 0000:00:02.0: AER: Device recovery failed 
Feb 24 03:51:40 DGX-Station kernel: [147935.988569] NVRM: A GPU crash dump has been created. If possible, please run 
Feb 24 03:51:40 DGX-Station kernel: [147935.988569] NVRM: nvidia-bug-report.sh as root to collect this data before 
Feb 24 03:51:40 DGX-Station kernel: [147935.988569] NVRM: the NVIDIA kernel module is unloaded.
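In case it is useful, the relevant kernel messages can be pulled out with something like the following (a minimal sketch; the grep keywords are simply the strings that appear in the log above):

# Look for Xid / AER / bus errors in the kernel ring buffer
sudo dmesg -T | grep -Ei 'xid|aer|fallen off the bus'

# Same search from the persistent journal, kernel messages only
journalctl -k | grep -Ei 'xid|aer|fallen off the bus'

# Collect the crash dump mentioned above before the NVIDIA module is unloaded
sudo nvidia-bug-report.sh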

The output of nvidia-bug-report.sh is here:
nvidia-bug-report.log.gz (618.2 KB)

The VBIOS of the Tesla V100-DGXS-32GB at 00000000:0F:00.0 is 88.00.48.00.05.

I have read that common causes are an insufficient power supply or overheating. We have checked the temperatures and the water cooling system, and everything looks normal. We have not had any power outage either.

I’ve just got it again; here is the output of nvidia-smi from a few minutes before the crash.

Thu Feb 24 15:13:24 2022       
+-----------------------------------------------------------------------------+ 
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     | 
|-------------------------------+----------------------+----------------------+ 
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC | 
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. | 
|                               |                      |               MIG M. | 
|===============================+======================+======================| 
|   0  Tesla V100-DGXS...  Off  | 00000000:07:00.0 Off |                    0 | 
| N/A   82C    P0    69W / 300W |   9576MiB / 32505MiB |    100%      Default | 
|                               |                      |                  N/A | 
+-------------------------------+----------------------+----------------------+ 
|   1  Tesla V100-DGXS...  Off  | 00000000:08:00.0 Off |                    0 | 
| N/A   81C    P0    78W / 300W |  22913MiB / 32508MiB |    100%      Default | 
|                               |                      |                  N/A | 
+-------------------------------+----------------------+----------------------+ 
|   2  Tesla V100-DGXS...  Off  | 00000000:0E:00.0 Off |                    0 | 
| N/A   91C    P0    61W / 300W |   8057MiB / 32508MiB |    100%      Default | 
|                               |                      |                  N/A | 
+-------------------------------+----------------------+----------------------+ 
|   3  Tesla V100-DGXS...  Off  | 00000000:0F:00.0 Off |                    0 | 
| N/A   83C    P0    92W / 300W |  27659MiB / 32508MiB |    100%      Default | 
|                               |                      |                  N/A | 
+-------------------------------+----------------------+----------------------+

And this is after moving our DGX to a different power supply.

Is 91C too hot for GPU 2? If so, shouldn’t it be detected and throttled before the system crashes? Or should the flow rate of the water cooling loop be increased?
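For reference, the slowdown and shutdown thresholds the driver enforces can be read directly from nvidia-smi, for example (a minimal sketch; the reported values depend on the board):

# Current temperature plus the slowdown / shutdown thresholds for GPU 2
nvidia-smi -i 2 -q -d TEMPERATURE

# Current performance state and any thermal throttle reasons being reported
nvidia-smi -i 2 -q -d PERFORMANCE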

EDIT:
The next morning, two hours after rebooting and with no task running, GPU 2 is still quite hot.


Is this temperature profile normal?
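To capture the temperature profile over time rather than a single snapshot, something like this can log the idle readings (a minimal sketch; the 60-second interval and the output file name are arbitrary choices):

# Append a timestamped CSV line for every GPU once a minute
nvidia-smi --query-gpu=timestamp,index,name,temperature.gpu,utilization.gpu,power.draw \
           --format=csv -l 60 >> gpu_temps.csv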

Hi Godeffroy,

Sorry to hear that your GPU is overheating and dropping off the PCIe bus. I’d recommend following up with NVIDIA Enterprise Support. Please log into your Enterprise Support Portal and enter a ticket.

Jacques

We have found that the issue is indeed overheating. The DGX documentation only recommends checking the water level visually and refilling when needed. In our case, the water level was fine.

But we learnt too late that the water cooling system needs regular maintenance, such as purging the loop and cleaning all the tubes. This is apparently recommended every 4 to 6 months, but nobody from NVIDIA told us, and nobody in the company was familiar with water cooling. After about four years of use, purging the tubes is no longer enough, because some of the GPU water blocks appear to be clogged as well.

Our company is now trying to get hardware support from NVIDIA to replace the damaged parts.