GPU is lost. Reboot the system to recover this GPU

Hello,

The four V100s usually run at close to 100% utilization all the time, and the machine has worked without issues for a long time. It runs PyTorch training jobs.

But since last week, I repeatedly get the following error after some time when running nvidia-smi:

GPU is lost. Reboot the system to recover this GPU.

The kernel logs the following:

Feb 24 03:51:40 DGX-Station kernel: [147935.988522] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010 
Feb 24 03:51:40 DGX-Station kernel: [147935.988528] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID) 
Feb 24 03:51:40 DGX-Station kernel: [147935.988532] NVRM: Xid (PCI:0000:07:00): 79, pid=3042, GPU has fallen off the bus. 
Feb 24 03:51:40 DGX-Station kernel: [147935.988535] NVRM: GPU 0000:07:00.0: GPU has fallen off the bus. 
Feb 24 03:51:40 DGX-Station kernel: [147935.988539] pcieport 0000:00:02.0:   device [8086:6f04] error status/mask=00004000/00000000 
Feb 24 03:51:40 DGX-Station kernel: [147935.988543] pcieport 0000:00:02.0:    [14] Completion Timeout     (First) 
Feb 24 03:51:40 DGX-Station kernel: [147935.988546] NVRM: GPU 0000:07:00.0: GPU is on Board 0330818002654. 
Feb 24 03:51:40 DGX-Station kernel: [147935.988550] pcieport 0000:00:02.0: broadcast error_detected message 
Feb 24 03:51:40 DGX-Station kernel: [147935.988554] pcieport 0000:00:02.0: AER: Device recovery failed 
Feb 24 03:51:40 DGX-Station kernel: [147935.988569] NVRM: A GPU crash dump has been created. If possible, please run 
Feb 24 03:51:40 DGX-Station kernel: [147935.988569] NVRM: nvidia-bug-report.sh as root to collect this data before 
Feb 24 03:51:40 DGX-Station kernel: [147935.988569] NVRM: the NVIDIA kernel module is unloaded.
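In case it is useful, the relevant kernel messages can be pulled out with something like the following (a minimal sketch; the grep keywords are simply the strings that appear in the log above):

# Look for Xid / AER / bus errors in the kernel ring buffer
sudo dmesg -T | grep -Ei 'xid|aer|fallen off the bus'

# Same search from the persistent journal, kernel messages only
journalctl -k | grep -Ei 'xid|aer|fallen off the bus'

# Collect the crash dump mentioned above before the NVIDIA module is unloaded
sudo nvidia-bug-report.sh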

The output of nvidia-bug-report.sh is here:
nvidia-bug-report.log.gz (618.2 KB)

The VBIOS of the Tesla V100-DGXS-32GB at 00000000:0F:00.0 is 88.00.48.00.05.

I have read that common causes are an insufficient power supply or overheating. We have checked the temperatures and the water cooling system, and everything looks normal. We have not had any power outage either.

I’ve just got it again; here is the output of nvidia-smi from a few minutes before the crash.

Thu Feb 24 15:13:24 2022       
+-----------------------------------------------------------------------------+ 
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     | 
|-------------------------------+----------------------+----------------------+ 
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC | 
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. | 
|                               |                      |               MIG M. | 
|===============================+======================+======================| 
|   0  Tesla V100-DGXS...  Off  | 00000000:07:00.0 Off |                    0 | 
| N/A   82C    P0    69W / 300W |   9576MiB / 32505MiB |    100%      Default | 
|                               |                      |                  N/A | 
+-------------------------------+----------------------+----------------------+ 
|   1  Tesla V100-DGXS...  Off  | 00000000:08:00.0 Off |                    0 | 
| N/A   81C    P0    78W / 300W |  22913MiB / 32508MiB |    100%      Default | 
|                               |                      |                  N/A | 
+-------------------------------+----------------------+----------------------+ 
|   2  Tesla V100-DGXS...  Off  | 00000000:0E:00.0 Off |                    0 | 
| N/A   91C    P0    61W / 300W |   8057MiB / 32508MiB |    100%      Default | 
|                               |                      |                  N/A | 
+-------------------------------+----------------------+----------------------+ 
|   3  Tesla V100-DGXS...  Off  | 00000000:0F:00.0 Off |                    0 | 
| N/A   83C    P0    92W / 300W |  27659MiB / 32508MiB |    100%      Default | 
|                               |                      |                  N/A | 
+-------------------------------+----------------------+----------------------+

And this is after moving our DGX to a different power supply.

Is 91C too hot for GPU 2? If so, shouldn’t it be detected and throttled before the system crashes? Or should the flow rate of the water cooling loop be increased?
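For reference, the slowdown and shutdown thresholds the driver enforces can be read directly from nvidia-smi, for example (a minimal sketch; the reported values depend on the board):

# Current temperature plus the slowdown / shutdown thresholds for GPU 2
nvidia-smi -i 2 -q -d TEMPERATURE

# Current performance state and any thermal throttle reasons being reported
nvidia-smi -i 2 -q -d PERFORMANCE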

EDIT:
The next morning, two hours after rebooting and with no task running, GPU 2 is still quite hot.


Is this temperature profile normal?
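To capture the temperature profile over time rather than a single snapshot, something like this can log the idle readings (a minimal sketch; the 60-second interval and the output file name are arbitrary choices):

# Append a timestamped CSV line for every GPU once a minute
nvidia-smi --query-gpu=timestamp,index,name,temperature.gpu,utilization.gpu,power.draw \
           --format=csv -l 60 >> gpu_temps.csv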

Hi Godeffroy,

Sorry to hear that your GPU is overheating and dropping off the PCIe bus. I’d recommend following up with NVIDIA Enterprise Support. Please log into your Enterprise Support Portal and enter a ticket.

Jacques

We have found that the issue is indeed overheating. The DGX documentation only recommends checking the water level visually and refilling when needed. In our case, the water level was fine.

But we learnt too late that the water cooling system needs regular maintenance, such as purging the loop and cleaning all the tubes. This is apparently recommended every 4 to 6 months, but nobody from NVIDIA told us, and nobody in the company was familiar with water cooling. After about four years of use, purging the tubes is no longer enough, because some of the GPU water blocks appear to be clogged as well.

Our company is now trying to get hardware support from NVIDIA to replace the damaged parts.